Data Engineering

Data Engineering Fundamentals: ETL, ELT, Batch vs Streaming

Master the core concepts of data engineering — ETL vs ELT, pipeline stages, batch vs streaming, data pipeline architecture, and the tools every data engineer uses daily.

dbt Fundamentals: Models, Sources, Refs, and Materializations

Master dbt (data build tool) from scratch — project setup, SQL models, sources, refs, materializations, seeds, and the analytics engineering workflow used in modern data teams.

PySpark Fundamentals: Architecture, DataFrames, and Your First Pipeline

Master Apache Spark from the ground up — understand the driver/executor model, RDDs vs DataFrames, schema handling, lazy evaluation, and build a complete CSV-to-Parquet pipeline.

pytest Fundamentals for Data Engineers

Master pytest from the ground up: configuration, test discovery, assert rewriting, markers, filtering, and testing real pandas pipelines and ETL classes with production-quality patterns.

16 min read · May 7, 2026

SQL vs NoSQL: When to Use Which (And Why It Matters)

A complete decision framework for choosing between relational and non-relational databases. Covers consistency models, data shapes, scalability trade-offs, and the 9 major database categories used in production.

7 min read · Apr 17, 2026

SQL Fundamentals: Complete Guide with Real-World Examples

Master SQL from scratch with practical, real-world examples. Covers SELECT, WHERE, JOINs, GROUP BY, subqueries, and indexing — everything you need to query databases confidently.

9 min read · Apr 13, 2026

Intermediate

BigQuery Python Pipelines: From Raw Data to Production

Build production-grade data pipelines with the BigQuery Python client — authentication, schema enforcement, cost estimation, Airflow operators, dbt adapter config, and a complete API-to-BigQuery ETL.

13 min read · May 7, 2026

BigQuery SQL Analytics: Advanced Patterns for Data Engineers

Go beyond basic SELECT — master BigQuery's SQL dialect including UNNEST, window functions, scripting, stored procedures, BigQuery ML, geospatial queries, and a complete cohort retention analysis.

12 min read · May 7, 2026

Apache Airflow: Orchestrating Data Pipelines at Scale

Master Apache Airflow from scratch — DAGs, operators, sensors, XComs, task dependencies, scheduling, retries, and production deployment patterns used in real data engineering teams.

Data Pipeline Monitoring: Quality, Alerting, and Observability

Build observability into every data pipeline — data quality checks, row count validation, freshness monitoring, alerting, Great Expectations, and the operational patterns that keep pipelines reliable in production.

Data Pipeline Architecture: Medallion, Ingestion Patterns, and Data Contracts

Design production data pipelines with the medallion architecture, understand ingestion strategies (full load, incremental, CDC), define data contracts, and apply patterns used in real analytics engineering teams.

9 min read · May 7, 2026

Data Vault: Hubs, Links, Satellites, and Scalable Warehouse Architecture

Master Data Vault 2.0 — the enterprise data warehouse methodology that scales across sources and time. Hubs, Links, Satellites, loading patterns, business vault, and when to choose Data Vault over Kimball.

Dimensional Data Modelling: Star Schema, Facts, and Dimensions

Master dimensional modelling — star and snowflake schemas, fact tables, dimension tables, grain, surrogate keys, conformed dimensions, and the design principles used in real data warehouses.

Slowly Changing Dimensions: SCD Types 1, 2, 3, and 4

Master all SCD types with full SQL implementations — Type 1 (overwrite), Type 2 (full history), Type 3 (current + previous), Type 4 (history table), and how to implement them in Snowflake, dbt snapshots, and production pipelines.

Delta Lake on Databricks: ACID Transactions, Time Travel & Medallion Architecture

A production-depth guide to Delta Lake — ACID guarantees, MERGE upserts, time travel, schema enforcement, table constraints, OPTIMIZE with Z-ORDER, VACUUM, and a complete Bronze → Silver → Gold pipeline.

dbt Macros, Jinja Templating, Snapshots, and Advanced Patterns

Master dbt's power features — Jinja2 templating, reusable macros, hooks, snapshots for SCD Type 2, analyses, and the advanced patterns used in large-scale analytics engineering projects.

dbt in Production: CI/CD, Airflow Integration, and Multi-Environment Deployment

Deploy dbt at scale — multi-environment strategy, slim CI with state:modified, GitHub Actions pipelines, Airflow + dbt integration, job scheduling, and production operational patterns.

dbt Testing & Documentation: Schema Tests, Custom Tests, and Data Docs

Write comprehensive dbt tests — generic schema tests, singular SQL tests, custom test macros, dbt-expectations, and generate living documentation that your whole team can use.

PySpark DataFrames & Spark SQL: Transformations, Joins, and Window Functions

Deep dive into PySpark DataFrame operations — UDFs, built-in functions, all join types, broadcast joins, window functions, and a real Silver-layer SCD Type 2 transformation.

PySpark Structured Streaming: Kafka, Delta Lake, and Real-Time Pipelines

Build production-grade streaming pipelines with PySpark Structured Streaming — Kafka sources, watermarking, trigger strategies, foreachBatch sinks, fault tolerance, and Delta Live Tables.

pytest Fixtures and Parametrization for Data Pipelines

Build reusable test infrastructure with fixtures at every scope level, conftest.py, yield fixtures for teardown, fixture factories, and parametrize for exhaustive edge-case coverage in data engineering pipelines.

pytest Mocking and Patching for Data Pipeline Tests

Master unittest.mock, pytest-mock, monkeypatch, freezegun, and HTTP mocking to isolate external dependencies in Snowflake, S3, REST API, and datetime-sensitive pipeline tests.

15 min read · May 7, 2026

Statistics Foundations for Data Engineers

The statistical concepts every data engineer must know — from descriptive stats and distributions to the central limit theorem and hypothesis testing, with real pipeline examples.

Cloud Database Services: Azure, AWS & GCP Complete Comparison

Every managed database service on Azure, AWS, and Google Cloud — SQL, NoSQL, time series, search, and graph. Pricing models, when to use each, and how to choose.

Databases Complete Guide: SQL, NoSQL & When to Use Each

The definitive database reference — PostgreSQL, MySQL, SQLite, MongoDB, Redis, DynamoDB, Cassandra, ClickHouse, and more. Understand every major database, when to choose it, and how to use it.

18 min read · Apr 17, 2026

MongoDB: Complete Guide to Document Databases

Master MongoDB from data modeling to production deployment — schema design, aggregation pipelines, indexing, Atlas, Cosmos DB MongoDB API, and AWS DocumentDB.

MySQL, MariaDB & SQL Server: Production Guide

Complete guide to MySQL 8, MariaDB, and Microsoft SQL Server — differences, indexing, replication, stored procedures, Azure SQL, AWS RDS, and when to choose each over PostgreSQL.

PostgreSQL: The Developer's Complete Guide

Master PostgreSQL from setup to production — data types, indexing, JSONB, window functions, partitioning, replication, and managed cloud options on Azure, AWS, and GCP.

Redis: Beyond Caching — Complete Production Guide

Master Redis as a full data platform — strings, hashes, sorted sets, streams, pub/sub, Lua scripting, persistence, clustering, and managed options on Azure, AWS, and GCP.

7 min read · Apr 17, 2026

SQL Interview Questions: Medium Level (Q1–Q100)

100 SQL interview questions with detailed answers — joins, aggregations, subqueries, GROUP BY, HAVING, NULL handling, duplicates, and ranking. Covers the most commonly asked questions in tech interviews.

24 min read · Apr 13, 2026

SQL Real-World Project: E-Commerce Analytics Database

Build a complete e-commerce analytics system from scratch. Design the schema, load data, write complex reporting queries, and create a dashboard data layer — exactly as done in production.

11 min read · Apr 13, 2026

Building ETL Pipelines with Azure Data Factory

Design and build production data pipelines using Azure Data Factory — from source extraction to data lake storage with real-world patterns.

4 min read · Mar 25, 2026

Advanced

MLflow, Unity Catalog & Feature Store on Databricks

End-to-end ML on Databricks — MLflow experiment tracking with autolog, model registry, REST endpoint serving, Unity Catalog governance with GRANT/REVOKE and row/column security, Delta Sharing, and Feature Store workflows.

Advanced PySpark on Databricks: Delta Live Tables, Auto Loader & Streaming

Production PySpark patterns on Databricks — dbutils, parameterized notebooks, Delta Live Tables with data quality expectations, Auto Loader for incremental ingestion, Kafka streaming, cluster tuning, and Photon.

13 min read · May 7, 2026

Data Engineering Python & Pipeline Interview Questions (30 Questions)

30 Python and pipeline interview questions for data engineering roles — generators, context managers, pandas patterns, retry logic, idempotency, config management, and testing pipelines.

17 min read · May 7, 2026

Data Engineering SQL Interview Questions (50 Hard Questions)

50 production-level SQL interview questions for data engineering roles — window functions, CTEs, performance, pipeline SQL, and Snowflake/BigQuery-specific syntax, each with a complete working answer.

16 min read · May 7, 2026

Data Engineering System Design Interview Questions (20 Complete Answers)

20 data engineering system design interview questions with complete answers, architecture diagrams, and the key trade-offs interviewers expect you to discuss.

PySpark Performance Optimization: Partitions, Skew, AQE, and Delta Tuning

Diagnose and fix slow Spark jobs — understand the Spark UI, tune partitions, eliminate skew with salting, use AQE, leverage Delta Lake optimizations, and read explain() plans like a pro.

pytest for Complete Data Pipeline Testing

Test pandas transformations with assert_frame_equal, handle exceptions with pytest.raises, test CLI scripts and FastAPI endpoints, run real database tests with testcontainers, measure coverage, and integrate with GitHub Actions CI.

19 min read · May 7, 2026

Statistical Data Quality Checks for Pipelines

Build production-grade statistical quality checks into your data pipelines — outlier detection, distribution drift, null rate monitoring, and a complete Python DataQualityChecker class.

Time Series Analysis and Anomaly Detection for Data Engineers

Master time series fundamentals, decomposition, moving averages, and anomaly detection techniques — 3-sigma, CUSUM, Isolation Forest, and Prophet — with Python examples for pipeline monitoring.

12 min read · May 7, 2026

PostgreSQL Advanced Features — Deep Dive

Expert-level PostgreSQL — window function frame specifications, recursive CTEs, JSONB operators and GIN indexing, partitioning strategies (range/list/hash), Row-Level Security for multi-tenant SaaS, pgvector with HNSW for AI, LISTEN/NOTIFY, and advisory locks.

18 min read · Apr 18, 2026

PostgreSQL Indexing & Query Performance — Deep Dive

Master PostgreSQL performance engineering — B-tree internals, every index type (GIN, GiST, BRIN, pg_trgm), covering indexes, EXPLAIN ANALYZE at depth, statistics, autovacuum, bloat, and production query tuning patterns.

15 min read · Apr 18, 2026

Cassandra & DynamoDB: Wide-Column Databases for Massive Scale

When billions of writes per day aren't optional — master Cassandra data modeling, CQL, AWS DynamoDB single-table design, and cloud-managed wide-column services on Azure, AWS, and GCP.

7 min read · Apr 17, 2026

Database Design, Indexing & Query Optimization

The skills that separate junior from senior engineers — normalization, schema anti-patterns, index internals (B-tree, hash, GIN), query execution plans, and N+1 fixes across SQL and NoSQL.

10 min read · Apr 17, 2026

Project: Multi-Database E-commerce Architecture

Build a production-grade e-commerce backend that combines PostgreSQL, MongoDB, Redis, and Elasticsearch — each database doing what it does best. Full schema, queries, and architecture decisions.

9 min read · Apr 17, 2026

Advanced SQL: Window Functions, CTEs, and Performance Tuning

Go beyond basic queries. Master window functions (RANK, LAG, NTILE), recursive CTEs, query optimization, execution plans, and advanced patterns used in production systems.

10 min read · Apr 13, 2026

SQL Interview Questions: Advanced Level (Q101–Q200)

100 advanced SQL interview questions with answers — window functions, CTEs, recursive queries, query optimization, execution plans, locking, and complex analytical patterns used in senior-level interviews.

25 min read · Apr 13, 2026