Learnixo

.NET & C# Development · Lesson 180 of 229

PostgreSQL Indexing Strategies — B-tree, GIN, Partial, and Covering Indexes

PostgreSQL Indexing Strategies — B-tree, GIN, Partial, and Covering Indexes

The right index turns a sequential scan of 10M rows into a 0.1ms lookup. The wrong index slows writes, wastes storage, and doesn't even get used. Understanding when PostgreSQL picks an index — and when it doesn't — is essential for production database performance.


Index Types at a Glance

B-tree (default):
  - Equality (=), range (<, >, BETWEEN), ORDER BY, LIKE 'prefix%'
  - 95% of use cases
  - CREATE INDEX idx ON orders(customer_id)

GIN (Generalized Inverted Index):
  - Array contains, JSONB key/value lookup, full-text search
  - CREATE INDEX idx ON products USING GIN(tags)
  - CREATE INDEX idx ON docs USING GIN(body tsvector_ops)

GiST (Generalized Search Tree):
  - Geometric data, range types, PostGIS geography
  - CREATE INDEX idx ON reservations USING GiST(date_range)

BRIN (Block Range Index):
  - Very large tables where data is physically ordered (e.g. time-series)
  - Tiny index size (8KB vs GB for B-tree)
  - CREATE INDEX idx ON sensor_data USING BRIN(recorded_at)

Hash:
  - Equality only, slightly faster than B-tree for equality
  - Not WAL-safe before PostgreSQL 10 — rarely used

B-tree Index Rules

SQL
-- Good: equality and range on indexed column
SELECT * FROM orders WHERE customer_id = 42;
SELECT * FROM orders WHERE created_at BETWEEN '2026-01-01' AND '2026-02-01';

-- Good: LIKE with prefix (% at the end only)
SELECT * FROM customers WHERE email LIKE 'john%';

-- Bad: LIKE with leading wildcard  cannot use B-tree
SELECT * FROM customers WHERE email LIKE '%@example.com';
-- Use GIN with pg_trgm extension instead:
-- CREATE INDEX idx_email_trgm ON customers USING GIN(email gin_trgm_ops);

-- Bad: function applied to the indexed column kills index use
SELECT * FROM orders WHERE DATE(created_at) = '2026-01-15';
-- Fix: expression index
CREATE INDEX idx_orders_date ON orders(DATE(created_at));
-- Or: use range instead of function
SELECT * FROM orders WHERE created_at >= '2026-01-15' AND created_at < '2026-01-16';

Multicolumn Indexes — Column Order Matters

SQL
-- Index on (customer_id, created_at, status)
CREATE INDEX idx_orders_customer_date_status ON orders(customer_id, created_at, status);

-- USES the index (leading column present):
SELECT * FROM orders WHERE customer_id = 42;
SELECT * FROM orders WHERE customer_id = 42 AND created_at > '2026-01-01';
SELECT * FROM orders WHERE customer_id = 42 AND created_at > '2026-01-01' AND status = 'Paid';

-- DOES NOT use the index (leading column absent):
SELECT * FROM orders WHERE created_at > '2026-01-01';
SELECT * FROM orders WHERE status = 'Paid';

-- Rule: put the most selective equality column first,
--       then range columns, then additional equality columns
-- customer_id is high-cardinality  first
-- created_at is a range  second (can still use index for all preceding equality columns)
-- status is low-cardinality  last

Partial Indexes

SQL
-- Index only the rows you actually query  smaller, faster
-- Use case: soft-deleted rows  only active records matter for most queries

CREATE INDEX idx_orders_active ON orders(customer_id, created_at)
WHERE deleted_at IS NULL;

-- Query MUST include the WHERE clause to use this index
SELECT * FROM orders WHERE customer_id = 42 AND deleted_at IS NULL;

-- Another use case: only index pending orders (common in background workers)
CREATE INDEX idx_orders_pending ON orders(created_at, priority)
WHERE status = 'Pending';

SELECT * FROM orders WHERE status = 'Pending' ORDER BY priority DESC, created_at ASC;

Covering Indexes (Index-Only Scans)

SQL
-- Include additional columns in the index to avoid table heap access
-- PostgreSQL can answer the query from the index alone

-- Without INCLUDE: index finds customer_id rows, then fetches table for total/status
CREATE INDEX idx_orders_customer ON orders(customer_id);

-- With INCLUDE: index stores total and status  no heap access needed
CREATE INDEX idx_orders_customer_covering ON orders(customer_id)
INCLUDE (total, status, created_at);

-- Query uses index-only scan (fastest):
SELECT total, status, created_at FROM orders WHERE customer_id = 42;

-- INCLUDE columns:
-- - Not usable in WHERE, ORDER BY, JOIN (they're not key columns)
-- - Add columns that frequently appear in SELECT lists for hot queries
-- - Reduces I/O at the cost of index storage

GIN Index for JSONB

SQL
-- JSONB column with GIN index supports key existence, containment, path queries
CREATE TABLE products (
    id   int PRIMARY KEY,
    data jsonb
);

CREATE INDEX idx_products_data ON products USING GIN(data);

-- Key existence
SELECT * FROM products WHERE data ? 'color';

-- Containment
SELECT * FROM products WHERE data @> '{"category": "Electronics"}';

-- Path value (requires jsonb_path_ops GIN for @? operator)
CREATE INDEX idx_products_jsonb_path ON products USING GIN(data jsonb_path_ops);
SELECT * FROM products WHERE data @? '$.specs.weight ? (@ < 500)';

EXPLAIN ANALYZE — Reading Query Plans

SQL
-- Always use EXPLAIN (ANALYZE, BUFFERS)  without ANALYZE it's hypothetical
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT * FROM orders WHERE customer_id = 42 AND status = 'Pending';

-- Key plan nodes:

-- Index Scan: uses the index to find rows, then fetches from heap
--   Index Scan using idx_orders_customer on orders
--   Index Cond: (customer_id = 42)
--   Filter: (status = 'Pending')   ← filter applied AFTER fetching rows

-- Index Only Scan: no heap access (covering index)
--   Index Only Scan using idx_orders_customer_covering on orders

-- Seq Scan: reads every row — ok for small tables or when fetching >10% of rows
--   Seq Scan on orders  (cost=0.00..45231.00 rows=1 width=64)

-- Bitmap Index Scan + Bitmap Heap Scan: multiple indexes combined
--   good for queries with OR conditions or multiple non-leading columns

-- Nested Loop / Hash Join / Merge Join: join strategies

-- Cost numbers: (startup_cost..total_cost rows=estimated width=bytes_per_row)
-- Actual: (actual time=N..N rows=actual_rows loops=N)
-- Large discrepancy between estimated and actual rows = stale statistics
-- → RUN: ANALYZE orders;

Index Maintenance

SQL
-- Bloat: deleted/updated rows leave dead tuples in the index
-- Check bloat:
SELECT schemaname, tablename, pg_size_pretty(pg_total_relation_size(quote_ident(tablename))) AS total_size
FROM pg_tables
WHERE schemaname = 'public'
ORDER BY pg_total_relation_size(quote_ident(tablename)) DESC;

-- REINDEX without locking (PostgreSQL 12+):
REINDEX INDEX CONCURRENTLY idx_orders_customer;
REINDEX TABLE CONCURRENTLY orders;

-- Find unused indexes (waste disk + slow writes):
SELECT
    schemaname,
    tablename,
    indexname,
    pg_size_pretty(pg_relation_size(indexname::regclass)) AS size,
    idx_scan AS times_used
FROM pg_stat_user_indexes
WHERE idx_scan = 0
ORDER BY pg_relation_size(indexname::regclass) DESC;

-- Update statistics when auto-analyze is too slow:
ANALYZE orders;
-- Or target specific columns:
ANALYZE orders (customer_id, status, created_at);

Index Anti-Patterns

Anti-pattern 1: Index every column
  → Each write updates ALL indexes on the table
  → For a table with 10 indexes, every INSERT touches 10 index pages

Anti-pattern 2: Index low-cardinality columns alone
  → status with 3 values (Pending/Paid/Shipped) on a large table
  → PostgreSQL may prefer Seq Scan anyway (bitmap scan overhead not worth it)
  → Fix: multicolumn index with a high-cardinality leading column

Anti-pattern 3: Duplicate indexes
  → (customer_id) and (customer_id, created_at) — the first is redundant
  → B-tree index on (a, b) can satisfy queries on (a) alone

Anti-pattern 4: NULLS in indexed columns
  → B-tree stores NULLs — queries for IS NULL use the index
  → But WHERE col IS NOT NULL with high NULL ratio may still seq scan

Anti-pattern 5: Not indexing foreign keys
  → JOIN performance suffers dramatically on large tables
  → PostgreSQL does not auto-index foreign keys (unlike MySQL)
  → Always index FK columns

Interview Answer

"PostgreSQL has five index types: B-tree (default, 95% of cases — equality, range, ORDER BY, LIKE prefix), GIN (arrays, JSONB, full-text), GiST (geometry, range types, PostGIS), BRIN (huge time-ordered tables — stores min/max per block, tiny size), and Hash (equality only). Multicolumn index column order: put the highest-cardinality equality column first, then range columns, then additional filters — PostgreSQL can use the index for any prefix of the columns. Partial indexes filter by a WHERE clause — index only active records (WHERE deleted_at IS NULL), small and fast. Covering indexes with INCLUDE add non-key columns that appear in SELECT — enable index-only scans that skip the table heap entirely. To diagnose: EXPLAIN (ANALYZE, BUFFERS) — look for Seq Scan on large tables, large discrepancies between estimated and actual rows (stale stats → run ANALYZE), and filter conditions applied after index access (means the index doesn't cover the filter column). Unused indexes waste write performance — query pg_stat_user_indexes where idx_scan = 0."