.NET & C# Development · Lesson 180 of 229
PostgreSQL Indexing Strategies — B-tree, GIN, Partial, and Covering Indexes
PostgreSQL Indexing Strategies — B-tree, GIN, Partial, and Covering Indexes
The right index turns a sequential scan of 10M rows into a 0.1ms lookup. The wrong index slows writes, wastes storage, and doesn't even get used. Understanding when PostgreSQL picks an index — and when it doesn't — is essential for production database performance.
Index Types at a Glance
B-tree (default):
- Equality (=), range (<, >, BETWEEN), ORDER BY, LIKE 'prefix%'
- 95% of use cases
- CREATE INDEX idx ON orders(customer_id)
GIN (Generalized Inverted Index):
- Array contains, JSONB key/value lookup, full-text search
- CREATE INDEX idx ON products USING GIN(tags)
- CREATE INDEX idx ON docs USING GIN(body tsvector_ops)
GiST (Generalized Search Tree):
- Geometric data, range types, PostGIS geography
- CREATE INDEX idx ON reservations USING GiST(date_range)
BRIN (Block Range Index):
- Very large tables where data is physically ordered (e.g. time-series)
- Tiny index size (8KB vs GB for B-tree)
- CREATE INDEX idx ON sensor_data USING BRIN(recorded_at)
Hash:
- Equality only, slightly faster than B-tree for equality
- Not WAL-safe before PostgreSQL 10 — rarely usedB-tree Index Rules
-- Good: equality and range on indexed column
SELECT * FROM orders WHERE customer_id = 42;
SELECT * FROM orders WHERE created_at BETWEEN '2026-01-01' AND '2026-02-01';
-- Good: LIKE with prefix (% at the end only)
SELECT * FROM customers WHERE email LIKE 'john%';
-- Bad: LIKE with leading wildcard — cannot use B-tree
SELECT * FROM customers WHERE email LIKE '%@example.com';
-- Use GIN with pg_trgm extension instead:
-- CREATE INDEX idx_email_trgm ON customers USING GIN(email gin_trgm_ops);
-- Bad: function applied to the indexed column kills index use
SELECT * FROM orders WHERE DATE(created_at) = '2026-01-15';
-- Fix: expression index
CREATE INDEX idx_orders_date ON orders(DATE(created_at));
-- Or: use range instead of function
SELECT * FROM orders WHERE created_at >= '2026-01-15' AND created_at < '2026-01-16';Multicolumn Indexes — Column Order Matters
-- Index on (customer_id, created_at, status)
CREATE INDEX idx_orders_customer_date_status ON orders(customer_id, created_at, status);
-- USES the index (leading column present):
SELECT * FROM orders WHERE customer_id = 42;
SELECT * FROM orders WHERE customer_id = 42 AND created_at > '2026-01-01';
SELECT * FROM orders WHERE customer_id = 42 AND created_at > '2026-01-01' AND status = 'Paid';
-- DOES NOT use the index (leading column absent):
SELECT * FROM orders WHERE created_at > '2026-01-01';
SELECT * FROM orders WHERE status = 'Paid';
-- Rule: put the most selective equality column first,
-- then range columns, then additional equality columns
-- customer_id is high-cardinality → first
-- created_at is a range → second (can still use index for all preceding equality columns)
-- status is low-cardinality → lastPartial Indexes
-- Index only the rows you actually query — smaller, faster
-- Use case: soft-deleted rows — only active records matter for most queries
CREATE INDEX idx_orders_active ON orders(customer_id, created_at)
WHERE deleted_at IS NULL;
-- Query MUST include the WHERE clause to use this index
SELECT * FROM orders WHERE customer_id = 42 AND deleted_at IS NULL;
-- Another use case: only index pending orders (common in background workers)
CREATE INDEX idx_orders_pending ON orders(created_at, priority)
WHERE status = 'Pending';
SELECT * FROM orders WHERE status = 'Pending' ORDER BY priority DESC, created_at ASC;Covering Indexes (Index-Only Scans)
-- Include additional columns in the index to avoid table heap access
-- PostgreSQL can answer the query from the index alone
-- Without INCLUDE: index finds customer_id rows, then fetches table for total/status
CREATE INDEX idx_orders_customer ON orders(customer_id);
-- With INCLUDE: index stores total and status — no heap access needed
CREATE INDEX idx_orders_customer_covering ON orders(customer_id)
INCLUDE (total, status, created_at);
-- Query uses index-only scan (fastest):
SELECT total, status, created_at FROM orders WHERE customer_id = 42;
-- INCLUDE columns:
-- - Not usable in WHERE, ORDER BY, JOIN (they're not key columns)
-- - Add columns that frequently appear in SELECT lists for hot queries
-- - Reduces I/O at the cost of index storageGIN Index for JSONB
-- JSONB column with GIN index supports key existence, containment, path queries
CREATE TABLE products (
id int PRIMARY KEY,
data jsonb
);
CREATE INDEX idx_products_data ON products USING GIN(data);
-- Key existence
SELECT * FROM products WHERE data ? 'color';
-- Containment
SELECT * FROM products WHERE data @> '{"category": "Electronics"}';
-- Path value (requires jsonb_path_ops GIN for @? operator)
CREATE INDEX idx_products_jsonb_path ON products USING GIN(data jsonb_path_ops);
SELECT * FROM products WHERE data @? '$.specs.weight ? (@ < 500)';EXPLAIN ANALYZE — Reading Query Plans
-- Always use EXPLAIN (ANALYZE, BUFFERS) — without ANALYZE it's hypothetical
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT * FROM orders WHERE customer_id = 42 AND status = 'Pending';
-- Key plan nodes:
-- Index Scan: uses the index to find rows, then fetches from heap
-- Index Scan using idx_orders_customer on orders
-- Index Cond: (customer_id = 42)
-- Filter: (status = 'Pending') ← filter applied AFTER fetching rows
-- Index Only Scan: no heap access (covering index)
-- Index Only Scan using idx_orders_customer_covering on orders
-- Seq Scan: reads every row — ok for small tables or when fetching >10% of rows
-- Seq Scan on orders (cost=0.00..45231.00 rows=1 width=64)
-- Bitmap Index Scan + Bitmap Heap Scan: multiple indexes combined
-- good for queries with OR conditions or multiple non-leading columns
-- Nested Loop / Hash Join / Merge Join: join strategies
-- Cost numbers: (startup_cost..total_cost rows=estimated width=bytes_per_row)
-- Actual: (actual time=N..N rows=actual_rows loops=N)
-- Large discrepancy between estimated and actual rows = stale statistics
-- → RUN: ANALYZE orders;Index Maintenance
-- Bloat: deleted/updated rows leave dead tuples in the index
-- Check bloat:
SELECT schemaname, tablename, pg_size_pretty(pg_total_relation_size(quote_ident(tablename))) AS total_size
FROM pg_tables
WHERE schemaname = 'public'
ORDER BY pg_total_relation_size(quote_ident(tablename)) DESC;
-- REINDEX without locking (PostgreSQL 12+):
REINDEX INDEX CONCURRENTLY idx_orders_customer;
REINDEX TABLE CONCURRENTLY orders;
-- Find unused indexes (waste disk + slow writes):
SELECT
schemaname,
tablename,
indexname,
pg_size_pretty(pg_relation_size(indexname::regclass)) AS size,
idx_scan AS times_used
FROM pg_stat_user_indexes
WHERE idx_scan = 0
ORDER BY pg_relation_size(indexname::regclass) DESC;
-- Update statistics when auto-analyze is too slow:
ANALYZE orders;
-- Or target specific columns:
ANALYZE orders (customer_id, status, created_at);Index Anti-Patterns
Anti-pattern 1: Index every column
→ Each write updates ALL indexes on the table
→ For a table with 10 indexes, every INSERT touches 10 index pages
Anti-pattern 2: Index low-cardinality columns alone
→ status with 3 values (Pending/Paid/Shipped) on a large table
→ PostgreSQL may prefer Seq Scan anyway (bitmap scan overhead not worth it)
→ Fix: multicolumn index with a high-cardinality leading column
Anti-pattern 3: Duplicate indexes
→ (customer_id) and (customer_id, created_at) — the first is redundant
→ B-tree index on (a, b) can satisfy queries on (a) alone
Anti-pattern 4: NULLS in indexed columns
→ B-tree stores NULLs — queries for IS NULL use the index
→ But WHERE col IS NOT NULL with high NULL ratio may still seq scan
Anti-pattern 5: Not indexing foreign keys
→ JOIN performance suffers dramatically on large tables
→ PostgreSQL does not auto-index foreign keys (unlike MySQL)
→ Always index FK columnsInterview Answer
"PostgreSQL has five index types: B-tree (default, 95% of cases — equality, range, ORDER BY, LIKE prefix), GIN (arrays, JSONB, full-text), GiST (geometry, range types, PostGIS), BRIN (huge time-ordered tables — stores min/max per block, tiny size), and Hash (equality only). Multicolumn index column order: put the highest-cardinality equality column first, then range columns, then additional filters — PostgreSQL can use the index for any prefix of the columns. Partial indexes filter by a WHERE clause — index only active records (WHERE deleted_at IS NULL), small and fast. Covering indexes with INCLUDE add non-key columns that appear in SELECT — enable index-only scans that skip the table heap entirely. To diagnose: EXPLAIN (ANALYZE, BUFFERS) — look for Seq Scan on large tables, large discrepancies between estimated and actual rows (stale stats → run ANALYZE), and filter conditions applied after index access (means the index doesn't cover the filter column). Unused indexes waste write performance — query pg_stat_user_indexes where idx_scan = 0."