Data Engineering
Design and build data pipelines that power AI systems. ETL/ELT, data modeling, and cloud-scale storage.
Beginner
BigQuery Fundamentals for Data Engineers
Master Google BigQuery from the ground up — architecture, storage, partitioning, clustering, loading strategies, and the full SQL feature set data engineers use in production.
Data Engineering Fundamentals: ETL, ELT, Batch vs Streaming
Master the core concepts of data engineering — ETL vs ELT, pipeline stages, batch vs streaming, data pipeline architecture, and the tools every data engineer uses daily.
dbt Fundamentals: Models, Sources, Refs, and Materializations
Master dbt (data build tool) from scratch — project setup, SQL models, sources, refs, materializations, seeds, and the analytics engineering workflow used in modern data teams.
PySpark Fundamentals: Architecture, DataFrames, and Your First Pipeline
Master Apache Spark from the ground up — understand the driver/executor model, RDDs vs DataFrames, schema handling, lazy evaluation, and build a complete CSV-to-Parquet pipeline.
pytest Fundamentals for Data Engineers
Master pytest from the ground up: configuration, test discovery, assert rewriting, markers, filtering, and testing real pandas pipelines and ETL classes with production-quality patterns.
SQL vs NoSQL: When to Use Which (And Why It Matters)
A complete decision framework for choosing between relational and non-relational databases. Covers consistency models, data shapes, scalability trade-offs, and the 9 major database categories used in production.
SQL Fundamentals: Complete Guide with Real-World Examples
Master SQL from scratch with practical, real-world examples. Covers SELECT, WHERE, JOINs, GROUP BY, subqueries, and indexing — everything you need to query databases confidently.
Intermediate
BigQuery Python Pipelines: From Raw Data to Production
Build production-grade data pipelines with the BigQuery Python client — authentication, schema enforcement, cost estimation, Airflow operators, dbt adapter config, and a complete API-to-BigQuery ETL.
BigQuery SQL Analytics: Advanced Patterns for Data Engineers
Go beyond basic SELECT — master BigQuery's SQL dialect including UNNEST, window functions, scripting, stored procedures, BigQuery ML, geospatial queries, and a complete cohort retention analysis.
Apache Airflow: Orchestrating Data Pipelines at Scale
Master Apache Airflow from scratch — DAGs, operators, sensors, XComs, task dependencies, scheduling, retries, and production deployment patterns used in real data engineering teams.
Data Pipeline Monitoring: Quality, Alerting, and Observability
Build observability into every data pipeline — data quality checks, row count validation, freshness monitoring, alerting, Great Expectations, and the operational patterns that keep pipelines reliable in production.
Data Pipeline Architecture: Medallion, Ingestion Patterns, and Data Contracts
Design production data pipelines with the medallion architecture, understand ingestion strategies (full load, incremental, CDC), define data contracts, and apply patterns used in real analytics engineering teams.
Data Vault: Hubs, Links, Satellites, and Scalable Warehouse Architecture
Master Data Vault 2.0 — the enterprise data warehouse methodology that scales across sources and time. Hubs, Links, Satellites, loading patterns, business vault, and when to choose Data Vault over Kimball.
Dimensional Data Modelling: Star Schema, Facts, and Dimensions
Master dimensional modelling — star and snowflake schemas, fact tables, dimension tables, grain, surrogate keys, conformed dimensions, and the design principles used in real data warehouses.
Slowly Changing Dimensions: SCD Types 1, 2, 3, and 4
Master all SCD types with full SQL implementations — Type 1 (overwrite), Type 2 (full history), Type 3 (current + previous), Type 4 (history table), and how to implement them in Snowflake, dbt snapshots, and production pipelines.
Delta Lake on Databricks: ACID Transactions, Time Travel & Medallion Architecture
A production-depth guide to Delta Lake — ACID guarantees, MERGE upserts, time travel, schema enforcement, table constraints, OPTIMIZE with Z-ORDER, VACUUM, and a complete Bronze → Silver → Gold pipeline.
dbt Macros, Jinja Templating, Snapshots, and Advanced Patterns
Master dbt's power features — Jinja2 templating, reusable macros, hooks, snapshots for SCD Type 2, analyses, and the advanced patterns used in large-scale analytics engineering projects.
dbt in Production: CI/CD, Airflow Integration, and Multi-Environment Deployment
Deploy dbt at scale — multi-environment strategy, slim CI with state:modified, GitHub Actions pipelines, Airflow + dbt integration, job scheduling, and production operational patterns.
dbt Testing & Documentation: Schema Tests, Custom Tests, and Data Docs
Write comprehensive dbt tests — generic schema tests, singular SQL tests, custom test macros, dbt-expectations, and generate living documentation that your whole team can use.
PySpark DataFrames & Spark SQL: Transformations, Joins, and Window Functions
Deep dive into PySpark DataFrame operations — UDFs, built-in functions, all join types, broadcast joins, window functions, and a real Silver-layer SCD Type 2 transformation.
PySpark Structured Streaming: Kafka, Delta Lake, and Real-Time Pipelines
Build production-grade streaming pipelines with PySpark Structured Streaming — Kafka sources, watermarking, trigger strategies, foreachBatch sinks, fault tolerance, and Delta Live Tables.
pytest Fixtures and Parametrization for Data Pipelines
Build reusable test infrastructure with fixtures at every scope level, conftest.py, yield fixtures for teardown, fixture factories, and parametrize for exhaustive edge-case coverage in data engineering pipelines.
pytest Mocking and Patching for Data Pipeline Tests
Master unittest.mock, pytest-mock, monkeypatch, freezegun, and HTTP mocking to isolate external dependencies in Snowflake, S3, REST API, and datetime-sensitive pipeline tests.
Statistics Foundations for Data Engineers
The statistical concepts every data engineer must know — from descriptive stats and distributions to the central limit theorem and hypothesis testing, with real pipeline examples.
Cloud Database Services: Azure, AWS & GCP Complete Comparison
Every managed database service on Azure, AWS, and Google Cloud — SQL, NoSQL, time series, search, and graph. Pricing models, when to use each, and how to choose.
Databases Complete Guide: SQL, NoSQL & When to Use Each
The definitive database reference — PostgreSQL, MySQL, SQLite, MongoDB, Redis, DynamoDB, Cassandra, ClickHouse, and more. Understand every major database, when to choose it, and how to use it.
MongoDB: Complete Guide to Document Databases
Master MongoDB from data modeling to production deployment — schema design, aggregation pipelines, indexing, Atlas, Azure Cosmos DB for MongoDB, and Amazon DocumentDB.
MySQL, MariaDB & SQL Server: Production Guide
Complete guide to MySQL 8, MariaDB, and Microsoft SQL Server — differences, indexing, replication, stored procedures, Azure SQL, AWS RDS, and when to choose each over PostgreSQL.
PostgreSQL: The Developer's Complete Guide
Master PostgreSQL from setup to production — data types, indexing, JSONB, window functions, partitioning, replication, and managed cloud options on Azure, AWS, and GCP.
Redis: Beyond Caching — Complete Production Guide
Master Redis as a full data platform — strings, hashes, sorted sets, streams, pub/sub, Lua scripting, persistence, clustering, and managed options on Azure, AWS, and GCP.
SQL Interview Questions: Medium Level (Q1–Q100)
100 SQL interview questions with detailed answers — joins, aggregations, subqueries, GROUP BY, HAVING, NULL handling, duplicates, and ranking. Covers the most commonly asked questions in tech interviews.
SQL Real-World Project: E-Commerce Analytics Database
Build a complete e-commerce analytics system from scratch. Design the schema, load data, write complex reporting queries, and create a dashboard data layer — exactly as done in production.
Building ETL Pipelines with Azure Data Factory
Design and build production data pipelines using Azure Data Factory — from source extraction to data lake storage with real-world patterns.
Advanced
MLflow, Unity Catalog & Feature Store on Databricks
End-to-end ML on Databricks — MLflow experiment tracking with autolog, model registry, REST endpoint serving, Unity Catalog governance with GRANT/REVOKE and row/column security, Delta Sharing, and Feature Store workflows.
Advanced PySpark on Databricks: Delta Live Tables, Auto Loader & Streaming
Production PySpark patterns on Databricks — dbutils, parameterized notebooks, Delta Live Tables with data quality expectations, Auto Loader for incremental ingestion, Kafka streaming, cluster tuning, and Photon.
Data Engineering Python & Pipeline Interview Questions (30 Questions)
30 Python and pipeline interview questions for data engineering roles — generators, context managers, pandas patterns, retry logic, idempotency, config management, and testing pipelines.
Data Engineering SQL Interview Questions (50 Hard Questions)
50 production-level SQL interview questions for data engineering roles — window functions, CTEs, performance, pipeline SQL, and Snowflake/BigQuery-specific syntax, each with a complete working answer.
Data Engineering System Design Interview Questions (20 Complete Answers)
20 data engineering system design interview questions with complete answers, architecture diagrams, and the key trade-offs interviewers expect you to discuss.
PySpark Performance Optimization: Partitions, Skew, AQE, and Delta Tuning
Diagnose and fix slow Spark jobs — understand the Spark UI, tune partitions, eliminate skew with salting, use AQE, leverage Delta Lake optimizations, and read explain() plans like a pro.
pytest for Complete Data Pipeline Testing
Test pandas transformations with assert_frame_equal, handle exceptions with pytest.raises, test CLI scripts and FastAPI endpoints, run real database tests with testcontainers, measure coverage, and integrate with GitHub Actions CI.
Statistical Data Quality Checks for Pipelines
Build production-grade statistical quality checks into your data pipelines — outlier detection, distribution drift, null rate monitoring, and a complete Python DataQualityChecker class.
Time Series Analysis and Anomaly Detection for Data Engineers
Master time series fundamentals, decomposition, moving averages, and anomaly detection techniques — 3-sigma, CUSUM, Isolation Forest, and Prophet — with Python examples for pipeline monitoring.
PostgreSQL Advanced Features — Deep Dive
Expert-level PostgreSQL — window function frame specifications, recursive CTEs, JSONB operators and GIN indexing, partitioning strategies (range/list/hash), Row-Level Security for multi-tenant SaaS, pgvector with HNSW for AI, LISTEN/NOTIFY, and advisory locks.
PostgreSQL Indexing & Query Performance — Deep Dive
Master PostgreSQL performance engineering — B-tree internals, the other major index types (GIN, GiST, BRIN), trigram indexes with pg_trgm, covering indexes, EXPLAIN ANALYZE in depth, statistics, autovacuum, bloat, and production query tuning patterns.
Cassandra & DynamoDB: Wide-Column Databases for Massive Scale
When you need to sustain billions of writes per day — master Cassandra data modeling, CQL, AWS DynamoDB single-table design, and cloud-managed wide-column services on Azure, AWS, and GCP.
Database Design, Indexing & Query Optimization
The skills that separate junior from senior engineers — normalization, schema anti-patterns, index internals (B-tree, hash, GIN), query execution plans, and N+1 fixes across SQL and NoSQL.
Project: Multi-Database E-commerce Architecture
Build a production-grade e-commerce backend that combines PostgreSQL, MongoDB, Redis, and Elasticsearch — each database doing what it does best. Full schema, queries, and architecture decisions.
Advanced SQL: Window Functions, CTEs, and Performance Tuning
Go beyond basic queries. Master window functions (RANK, LAG, NTILE), recursive CTEs, query optimization, execution plans, and advanced patterns used in production systems.
SQL Interview Questions: Advanced Level (Q101–Q200)
100 advanced SQL interview questions with answers — window functions, CTEs, recursive queries, query optimization, execution plans, locking, and complex analytical patterns used in senior-level interviews.
SQL Interview Questions: Expert Level (Q201–Q300)
100 expert SQL interview questions with answers — database internals, query optimization, distributed databases, data warehousing, OLAP, partitioning, sharding, and system design. For senior engineer and staff-level interviews.