.NET & C# Development · Lesson 182 of 229

SQL Query Optimization — EXPLAIN ANALYZE, CTEs, and Query Plans

A slow query is rarely fixed by "add an index." You need to read the query plan, understand what PostgreSQL actually does, and rewrite the query to match the available access paths. This guide is a systematic approach.

Step 1: EXPLAIN ANALYZE — The Only Tool That Matters

SQL

-- Always use ANALYZE and BUFFERS — without them it's hypothetical
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT
    o.id,
    o.total,
    c.name AS customer_name,
    COUNT(oi.id) AS item_count
FROM orders o
JOIN customers c ON c.id = o.customer_id
JOIN order_items oi ON oi.order_id = o.id
WHERE o.created_at >= '2026-01-01'
  AND o.status = 'Paid'
GROUP BY o.id, o.total, c.name
ORDER BY o.total DESC
LIMIT 20;

-- Reading the output:
Hash Join  (cost=2145.21..4521.33 rows=423 width=72)
           (actual time=45.231..82.456 rows=423 loops=1)
  Hash Cond: (oi.order_id = o.id)
  Buffers: shared hit=2341 read=124
  ->  Seq Scan on order_items oi    ← No index on oi.order_id — bad
        (cost=0.00..1234.00 rows=50000 width=16)
        (actual time=0.012..18.234 rows=50423 loops=1)
  ->  Hash  (cost=1500.00..1500.00 rows=423 width=60)
        Buckets: 1024  Batches: 1
        ->  Hash Join
              Hash Cond: (o.customer_id = c.id)
              ->  Index Scan using idx_orders_date on orders o
                    Index Cond: (created_at >= '2026-01-01')
                    Filter: (status = 'Paid')
                    Rows Removed by Filter: 1204    ← many rows fetched then filtered

Key numbers to find:
  rows estimate vs actual: large gap = stale stats → ANALYZE
  Seq Scan on large table = missing index
  "Rows Removed by Filter" on index scan = index doesn't include the filter column
  "read" in Buffers = disk I/O (hit = from cache)

Fix 1: Index Covering the Filter

SQL

-- Before: index on created_at, then filter removes 74% of rows
-- After: include status in the index

CREATE INDEX idx_orders_date_status ON orders(created_at, status)
WHERE status IN ('Paid', 'Pending');   -- partial index — only rows we care about

-- Or without partial:
CREATE INDEX idx_orders_date_status ON orders(created_at) INCLUDE (status, total, customer_id);
-- Covering index: the index-only scan provides created_at, status, total, customer_id
-- without touching the heap

Fix 2: N+1 in SQL — Subquery to JOIN

SQL

-- Anti-pattern: correlated subquery — runs once per row
SELECT
    o.id,
    o.total,
    (SELECT COUNT(*) FROM order_items oi WHERE oi.order_id = o.id) AS item_count,
    (SELECT SUM(oi.price) FROM order_items oi WHERE oi.order_id = o.id) AS item_total
FROM orders o
WHERE o.status = 'Paid';
-- → 2 subqueries × N orders = 2N queries total — O(N) performance

-- Fix: aggregate in a subquery and JOIN once
SELECT
    o.id,
    o.total,
    agg.item_count,
    agg.item_total
FROM orders o
JOIN (
    SELECT order_id, COUNT(*) AS item_count, SUM(price) AS item_total
    FROM order_items
    GROUP BY order_id
) agg ON agg.order_id = o.id
WHERE o.status = 'Paid';
-- → 1 query, 1 aggregation pass, 1 join — O(N) data, O(1) queries

CTEs — Materialisation Gotcha

SQL

-- Pre-PostgreSQL 12: CTEs were always materialised (executed once, stored)
-- PostgreSQL 12+: CTE is inlined by default UNLESS you write MATERIALIZED

-- Inline (default PG12+) — planner can push predicates inside:
WITH active_orders AS (
    SELECT * FROM orders WHERE status != 'Cancelled'
)
SELECT * FROM active_orders WHERE customer_id = 42;
-- → planner may rewrite as: WHERE status != 'Cancelled' AND customer_id = 42
-- → can use a multicolumn index on (customer_id, status)

-- Forced materialization — execute once regardless of outer filters:
WITH MATERIALIZED expensive_aggregation AS (
    SELECT customer_id, SUM(total) AS lifetime_value
    FROM orders
    GROUP BY customer_id
)
SELECT c.name, ea.lifetime_value
FROM customers c
JOIN expensive_aggregation ea ON ea.customer_id = c.id
WHERE ea.lifetime_value > 10000;
-- → aggregation runs once, result stored, then joined
-- → use when the CTE result is referenced multiple times

-- Rule: don't use CTEs for performance unless you understand materialisation
-- For readability only: CTEs are fine
-- For performance: measure with EXPLAIN ANALYZE before and after

Window Functions — Avoid Self-Joins

SQL

-- Anti-pattern: self-join to get "previous row" value
SELECT
    a.order_id,
    a.created_at,
    a.total,
    b.total AS previous_total
FROM orders a
LEFT JOIN orders b ON b.customer_id = a.customer_id
    AND b.created_at = (
        SELECT MAX(created_at) FROM orders
        WHERE customer_id = a.customer_id AND created_at < a.created_at
    );
-- → Expensive: correlated subquery per row

-- Fix: window function
SELECT
    order_id,
    created_at,
    total,
    LAG(total) OVER (
        PARTITION BY customer_id
        ORDER BY created_at
    ) AS previous_total,
    total - LAG(total) OVER (
        PARTITION BY customer_id ORDER BY created_at
    ) AS change
FROM orders
WHERE status = 'Paid';
-- → Single pass, no self-join

SQL

-- Running totals
SELECT
    order_id,
    customer_id,
    total,
    SUM(total) OVER (
        PARTITION BY customer_id
        ORDER BY created_at
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) AS cumulative_spend
FROM orders;

-- Rank within group
SELECT
    product_id,
    category,
    revenue,
    RANK() OVER (PARTITION BY category ORDER BY revenue DESC) AS rank_in_category
FROM product_sales;

-- Top N per group (without LIMIT — works with any group count)
SELECT * FROM (
    SELECT
        *,
        ROW_NUMBER() OVER (PARTITION BY category ORDER BY revenue DESC) AS rn
    FROM product_sales
) ranked
WHERE rn <= 3;   -- top 3 per category

Common Anti-Patterns and Fixes

SQL

-- Anti-pattern: SELECT * in production queries
SELECT * FROM orders;                     -- fetches all columns, even unused ones
SELECT id, status, total FROM orders;    -- only what you need

-- Anti-pattern: OR conditions prevent index use
SELECT * FROM orders WHERE status = 'Pending' OR status = 'Processing';
-- Fix: UNION ALL (if statuses are indexed separately)
SELECT * FROM orders WHERE status = 'Pending'
UNION ALL
SELECT * FROM orders WHERE status = 'Processing';
-- Or: IN clause (planner handles this well)
SELECT * FROM orders WHERE status IN ('Pending', 'Processing');

-- Anti-pattern: NOT IN with nullable column
SELECT * FROM orders WHERE customer_id NOT IN (
    SELECT customer_id FROM blacklisted_customers
);
-- If blacklisted_customers.customer_id has any NULL, result is empty (SQL NULL semantics)
-- Fix:
SELECT * FROM orders o
WHERE NOT EXISTS (
    SELECT 1 FROM blacklisted_customers bc WHERE bc.customer_id = o.customer_id
);

-- Anti-pattern: DISTINCT to hide a bad join
SELECT DISTINCT o.id, o.total
FROM orders o
JOIN order_items oi ON oi.order_id = o.id;
-- DISTINCT is hiding that the JOIN produces duplicate rows (1-to-many)
-- Fix: use EXISTS or aggregate
SELECT o.id, o.total
FROM orders o
WHERE EXISTS (SELECT 1 FROM order_items oi WHERE oi.order_id = o.id);

-- Anti-pattern: implicit type cast kills index
SELECT * FROM orders WHERE customer_id = '42';  -- customer_id is int, '42' is text
-- PostgreSQL casts '42' to int but may not use the index on some plans
-- Fix: use the correct type
SELECT * FROM orders WHERE customer_id = 42;

Partitioning for Very Large Tables

SQL

-- Range partitioning by date — each partition can be indexed independently
CREATE TABLE orders (
    id         bigint GENERATED ALWAYS AS IDENTITY,
    created_at timestamptz NOT NULL,
    status     text,
    total      numeric(12,2),
    customer_id int
) PARTITION BY RANGE (created_at);

CREATE TABLE orders_2025 PARTITION OF orders
    FOR VALUES FROM ('2025-01-01') TO ('2026-01-01');

CREATE TABLE orders_2026 PARTITION OF orders
    FOR VALUES FROM ('2026-01-01') TO ('2027-01-01');

-- Queries with a created_at filter only scan the relevant partition
-- → 100M row table becomes 12 10M partitions
-- → DROP old partitions to archive data instantly (vs DELETE which is slow)

Interview Answer

"SQL query optimization starts with EXPLAIN (ANALYZE, BUFFERS) — without ANALYZE the costs are estimates, not measurements. Key red flags: Seq Scan on a table with millions of rows (missing index or wrong column), large gap between estimated rows and actual rows (stale statistics — run ANALYZE), and 'Rows Removed by Filter' after an index scan (the filter column is not in the index). N+1 in SQL is the correlated subquery anti-pattern — a subquery in SELECT that runs once per row. Fix: move the subquery into a JOIN with GROUP BY, or use a window function. CTEs in PostgreSQL 12+ are inlined by default — the planner can push filters inside them. Force MATERIALIZED only when the CTE is referenced multiple times and the aggregation is expensive. Window functions (LAG, RANK, ROW_NUMBER, SUM OVER) eliminate self-joins for running totals, rankings, and adjacent-row comparisons — single pass instead of O(N) subqueries. NOT IN with nullable columns silently returns empty results — use NOT EXISTS instead. Partitioning by range lets you drop old data instantly and constrains index and vacuum scope to one partition."

Redis Patterns in .NET — Pub/Sub, Streams, Sorted Sets, and Lua Scripts

Next Lesson

System Design Scenarios in .NET — Walkthroughs for Senior Interviews