Database Design, Indexing & Query Optimization
The skills that separate junior from senior engineers: normalization, schema anti-patterns, index internals (B-tree, hash, GIN), query execution plans, and N+1 fixes across SQL and NoSQL.
Why This Is the Most Valuable Database Skill
Writing SQL that returns correct results is easy. Writing queries that stay fast at 100 million rows, and a schema that stays maintainable in year 5, is what separates senior engineers from the rest.
This guide covers the theory and practice: normal forms, index internals, execution plans, and the most common performance traps.
Normalization
Normalization reduces data redundancy and update anomalies. The practical target for most applications is 3NF; beyond that you're usually optimizing for analytics, not application correctness.
First Normal Form (1NF)
Each column stores a single atomic value. No repeating groups.
-- BAD: tags column stores multiple values
CREATE TABLE articles (
id INT,
title TEXT,
tags TEXT -- 'react,hooks,performance': violates 1NF
);
-- GOOD: separate table
CREATE TABLE article_tags (
article_id INT NOT NULL REFERENCES articles(id),
tag TEXT NOT NULL,
PRIMARY KEY (article_id, tag)
);

Second Normal Form (2NF)
Every non-key column depends on the whole primary key (no partial dependencies).
-- BAD: order_items has partial dependency (product_name depends only on product_id)
CREATE TABLE order_items (
order_id INT,
product_id INT,
product_name TEXT, -- depends only on product_id, not (order_id, product_id)
quantity INT,
PRIMARY KEY (order_id, product_id)
);
-- GOOD: product_name lives in products table
CREATE TABLE order_items (
order_id INT REFERENCES orders(id),
product_id INT REFERENCES products(id),
quantity INT NOT NULL,
unit_price NUMERIC(10,2) NOT NULL, -- capture price AT TIME OF ORDER
PRIMARY KEY (order_id, product_id)
);

Third Normal Form (3NF)
No transitive dependencies: non-key columns must depend only on the primary key.
-- BAD: zip_code → city, state (transitive dependency)
CREATE TABLE customers (
id INT PRIMARY KEY,
name TEXT,
zip_code CHAR(5),
city TEXT, -- depends on zip_code, not id
state CHAR(2) -- depends on zip_code, not id
);
-- GOOD: extract zip codes
CREATE TABLE zip_codes (
zip_code CHAR(5) PRIMARY KEY,
city TEXT,
state CHAR(2)
);
CREATE TABLE customers (
id INT PRIMARY KEY,
name TEXT,
zip_code CHAR(5) REFERENCES zip_codes(zip_code)
);

When to Denormalize
Denormalize intentionally for read performance when joins are too expensive:
-- Orders reporting table (denormalized for analytics)
CREATE TABLE order_facts (
order_id BIGINT PRIMARY KEY,
user_email TEXT, -- denormalized from users
user_plan TEXT, -- denormalized from users
product_names TEXT[], -- denormalized array
total_cents INT,
status TEXT,
created_at TIMESTAMPTZ,
shipped_at TIMESTAMPTZ,
delivery_days INT GENERATED ALWAYS AS (
EXTRACT(DAY FROM shipped_at - created_at)
) STORED
);
-- Kept in sync via triggers or event-driven ETL

Index Internals
B-Tree Indexes
The default index type: a balanced tree in which every leaf sits at the same depth.
            [50]
           /    \
       [25]      [75]
      /    \    /    \
   [10]  [30] [60]  [90]

Properties:
- O(log n) search, insert, and delete
- Supports: =, <, >, <=, >=, BETWEEN, LIKE 'prefix%'
- Does NOT support: LIKE '%suffix', LIKE '%middle%'
- Sorted: ORDER BY on an indexed column avoids a separate sort step
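The leading-column rule is easy to verify empirically. The sketch below uses SQLite (whose ordinary indexes are also B-trees) and EXPLAIN QUERY PLAN; table and index names are illustrative, and exact plan wording varies across SQLite versions, but SEARCH always means an index seek while SCAN means a full pass.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (user_id INT, status TEXT, created_at TEXT, total INT)")
db.execute("CREATE INDEX idx_user_status_created ON orders(user_id, status, created_at)")

def plan(sql: str) -> str:
    # EXPLAIN QUERY PLAN rows are (id, parent, notused, detail)
    return db.execute("EXPLAIN QUERY PLAN " + sql).fetchone()[3]

# Leading column present: the B-tree can seek into the index
seek = plan("SELECT total FROM orders WHERE user_id = 99 AND status = 'shipped'")

# Leading column skipped: nothing to seek on, so a full scan
scan = plan("SELECT total FROM orders WHERE status = 'shipped'")

print(seek)  # SEARCH ... USING INDEX idx_user_status_created (user_id=? AND status=?)
print(scan)  # SCAN ...
```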
-- These USE the B-tree index on (user_id, status, created_at):
WHERE user_id = 99
WHERE user_id = 99 AND status = 'shipped'
WHERE user_id = 99 AND status = 'shipped' AND created_at > '2026-01-01'
WHERE user_id = 99 AND created_at > '2026-01-01' -- partial use: seeks on user_id only (status is skipped)
-- These DO NOT use it efficiently:
WHERE status = 'shipped' -- skips leading column (user_id)
WHERE created_at > '2026-01-01' -- skips the first two columns

Hash Indexes
Only support exact equality (=). Faster than B-tree for pure equality lookups but can't do range scans. PostgreSQL supports explicit hash indexes; MySQL uses them for MEMORY engine and adaptive hash internally.
-- PostgreSQL hash index (equality only, faster than B-tree for =)
CREATE INDEX idx_sessions_token_hash ON sessions USING HASH(token);
-- Perfect for: WHERE token = $1

GIN (Generalized Inverted Index)
Used for full-text search, JSONB, and arrays. Stores a posting list per value.
-- Index every key/value pair inside JSONB
CREATE INDEX idx_products_attrs ON products USING GIN(attrs);
-- Index array elements
CREATE INDEX idx_articles_tags ON articles USING GIN(tags);
-- Full-text search vector
CREATE INDEX idx_articles_fts ON articles USING GIN(search_vector);

Partial Indexes: High ROI
Index only the rows you actually query. This can reduce index size by 80-95%.
-- Only index pending orders (5% of rows instead of 100%)
CREATE INDEX idx_orders_pending ON orders(user_id, created_at)
WHERE status = 'pending';
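SQLite supports the same partial-index syntax, so the planner's behavior can be checked locally (a sketch with assumed names; the optimizer uses a partial index only when the query's WHERE clause provably implies the index's WHERE clause):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (user_id INT, status TEXT, created_at TEXT)")
db.execute("""CREATE INDEX idx_orders_pending ON orders(user_id, created_at)
              WHERE status = 'pending'""")

def plan(sql: str) -> str:
    return db.execute("EXPLAIN QUERY PLAN " + sql).fetchone()[3]

# Query implies status = 'pending': the partial index is usable
hit = plan("SELECT * FROM orders WHERE status = 'pending' AND user_id = 99")

# Query does not imply it: the partial index is ignored
miss = plan("SELECT * FROM orders WHERE status = 'shipped' AND user_id = 99")
```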
-- Only index unverified emails
CREATE INDEX idx_users_unverified ON users(email, created_at)
WHERE email_verified = false;

Covering Indexes
Include all the columns the query needs and PostgreSQL can answer it from the index alone (an index-only scan), avoiding the heap fetch entirely.
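The same effect can be observed in SQLite, which lacks an INCLUDE clause but reports a covering-index scan whenever the index alone can answer the query (illustrative sketch; in SQLite the extra columns simply become trailing key columns):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id INT, user_id INT, total_cents INT, status TEXT, note TEXT)")
# No INCLUDE clause in SQLite: append the extra columns to the key instead
db.execute("CREATE INDEX idx_orders_covering ON orders(user_id, id, total_cents, status)")

detail = db.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT id, total_cents, status FROM orders WHERE user_id = 99"
).fetchone()[3]

print(detail)  # SEARCH ... USING COVERING INDEX idx_orders_covering (user_id=?)
```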
-- Query: SELECT id, total_cents, status FROM orders WHERE user_id = 99
CREATE INDEX idx_orders_covering ON orders(user_id)
INCLUDE (id, total_cents, status);
-- Result: index-only scan, no table access needed

Reading Execution Plans
PostgreSQL EXPLAIN ANALYZE
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT u.email, COUNT(o.id) AS order_count
FROM users u
LEFT JOIN orders o ON o.user_id = u.id
WHERE u.plan = 'pro'
GROUP BY u.id, u.email
HAVING COUNT(o.id) > 5
ORDER BY order_count DESC;

Key nodes to understand:
| Node | Meaning | Red flag? |
|---|---|---|
| Seq Scan | Full table scan | Yes, if the table is large |
| Index Scan | Uses index, fetches rows from heap | Good |
| Index Only Scan | Uses covering index, no heap | Best |
| Bitmap Heap Scan | Batches heap fetches | Good for many rows |
| Hash Join | Builds hash table from smaller side | Good for large joins |
| Nested Loop | Row-by-row join | Yes, if the outer side is large |
| Sort | Explicit sort (no supporting index) | Yes; add an index |
| Hash Aggregate | GROUP BY without index | Acceptable |
Seq Scan on users (cost=0.00..8432.00 rows=12000)   -- cost and rows are planner estimates
  Filter: (plan = 'pro')
  Rows Removed by Filter: 88000                     -- most scanned rows are discarded
  Buffers: shared hit=3200 read=200                 -- 200 blocks read from disk: add an index on plan

MySQL EXPLAIN
EXPLAIN FORMAT=JSON
SELECT * FROM orders WHERE user_id = 99 AND status = 'pending';

Key columns:
- type: const → eq_ref → ref → range → ALL (worst: full table scan)
- key: which index was used
- rows: estimated rows examined
- Extra: "Using index" (good), "Using filesort" (sort without an index), "Using temporary" (bad)
The N+1 Problem
The most common ORM-related performance issue.
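Stripped of any particular ORM, the anti-pattern and the manual batch fix look like this (a runnable SQLite sketch; the trace callback simply counts the SQL statements sent to the database):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users  (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INT, total_cents INT);
    INSERT INTO users  VALUES (1, 'Ada'), (2, 'Lin');
    INSERT INTO orders VALUES (10, 1, 500), (11, 2, 700), (12, 1, 900);
""")
queries = []
conn.set_trace_callback(queries.append)  # record every statement executed

# N+1: one query for the orders, then one per order for its user
orders = conn.execute("SELECT id, user_id, total_cents FROM orders").fetchall()
for _, user_id, _ in orders:
    conn.execute("SELECT name FROM users WHERE id = ?", (user_id,)).fetchone()
n_plus_one = len(queries)  # 1 + 3 orders = 4 statements

# Batch fix: collect the ids, then fetch all users with a single IN query
queries.clear()
orders = conn.execute("SELECT id, user_id, total_cents FROM orders").fetchall()
ids = sorted({user_id for _, user_id, _ in orders})
marks = ",".join("?" * len(ids))
users = dict(conn.execute(
    f"SELECT id, name FROM users WHERE id IN ({marks})", ids).fetchall())
batched = len(queries)  # always exactly 2 statements, however many orders
```

The batch variant's statement count is constant regardless of row count, which is the whole point.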
// BAD: 1 query for orders + 1 query per order for user = N+1
var orders = await db.Orders.ToListAsync(); // 1 query
foreach (var order in orders)
{
var user = await db.Users.FindAsync(order.UserId); // N queries
Console.WriteLine($"{user.Name}: {order.TotalCents}");
}
// GOOD: 1 query with JOIN (EF Core Include)
var orders = await db.Orders
.Include(o => o.User)
.ToListAsync(); // single JOIN query
// BETTER: project only what you need
var result = await db.Orders
.Select(o => new { o.Id, o.TotalCents, UserName = o.User.Name })
.ToListAsync();

// Mongoose N+1
const orders = await Order.find({ status: 'pending' });
for (const o of orders) {
const user = await User.findById(o.userId); // N queries!
}
// GOOD: populate (single $lookup)
const orders = await Order
.find({ status: 'pending' })
.populate('userId', 'name email');

Query Optimization Patterns
Pagination: OFFSET vs Keyset
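Both strategies return the same page; they differ in how much work the engine does to find it. A runnable sketch (SQLite; a row-value comparison `(created_at, id) < (?, ?)` gives a correct tiebreak when timestamps collide):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, created_at TEXT)")
db.executemany("INSERT INTO orders VALUES (?, ?)",
               [(i, f"2026-01-{i:02d}") for i in range(1, 10)])
db.execute("CREATE INDEX idx_orders_created ON orders(created_at, id)")

# Page 2 via OFFSET: the engine must still produce and discard page 1
offset_page = db.execute(
    "SELECT id FROM orders ORDER BY created_at DESC, id DESC LIMIT 3 OFFSET 3"
).fetchall()

# Page 2 via keyset: seek directly past the last row seen on page 1
last_created, last_id = "2026-01-07", 7
keyset_page = db.execute(
    "SELECT id FROM orders WHERE (created_at, id) < (?, ?) "
    "ORDER BY created_at DESC, id DESC LIMIT 3",
    (last_created, last_id),
).fetchall()

assert offset_page == keyset_page  # identical page, very different cost at depth
```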
-- BAD: OFFSET is slow at high page numbers (scans all skipped rows)
SELECT * FROM orders ORDER BY created_at DESC LIMIT 20 OFFSET 50000;
-- GOOD: keyset pagination (cursor-based); O(log n) regardless of page depth
SELECT * FROM orders
WHERE (created_at, id) < ($last_seen_created_at, $last_seen_id) -- row-value comparison; id breaks timestamp ties
ORDER BY created_at DESC, id DESC
LIMIT 20;

Avoid Functions on Indexed Columns
-- BAD: function prevents index use
SELECT * FROM orders WHERE DATE(created_at) = '2026-04-17';
SELECT * FROM users WHERE LOWER(email) = 'sarah@example.com';
-- GOOD: rewrite to range / functional index
SELECT * FROM orders
WHERE created_at >= '2026-04-17' AND created_at < '2026-04-18';
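The effect is reproducible in SQLite too (illustrative sketch; SEARCH means an index seek, SCAN a full pass):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, created_at TEXT)")
db.execute("CREATE INDEX idx_orders_created ON orders(created_at)")

def plan(sql: str) -> str:
    return db.execute("EXPLAIN QUERY PLAN " + sql).fetchone()[3]

# Function wraps the column: the index on created_at cannot be seeked
p_func = plan("SELECT * FROM orders WHERE DATE(created_at) = '2026-04-17'")

# Range rewrite of the same predicate: index seek
p_range = plan("SELECT * FROM orders "
               "WHERE created_at >= '2026-04-17' AND created_at < '2026-04-18'")
```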
-- OR: create a functional index
CREATE INDEX idx_users_email_lower ON users(LOWER(email));
SELECT * FROM users WHERE LOWER(email) = 'sarah@example.com';

COUNT(*) vs COUNT(column)
-- COUNT(*) counts every row (including rows with NULLs): the fastest form
SELECT COUNT(*) FROM orders;
-- COUNT(column) counts non-NULL values only
SELECT COUNT(shipped_at) FROM orders; -- only shipped orders
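The NULL-handling difference is easy to verify (SQLite sketch with two unshipped rows):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, shipped_at TEXT)")
db.executemany("INSERT INTO orders (shipped_at) VALUES (?)",
               [("2026-01-02",), (None,), ("2026-01-05",), (None,)])

total   = db.execute("SELECT COUNT(*) FROM orders").fetchone()[0]           # 4: every row
shipped = db.execute("SELECT COUNT(shipped_at) FROM orders").fetchone()[0]  # 2: non-NULL only
```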
-- COUNT(DISTINCT column)
SELECT COUNT(DISTINCT user_id) FROM orders; -- unique buyers

EXISTS vs IN vs JOIN for Subqueries
-- Checking existence: EXISTS is usually fastest
-- It short-circuits at first match
SELECT * FROM users u
WHERE EXISTS (
SELECT 1 FROM orders o WHERE o.user_id = u.id AND o.status = 'pending'
);
-- IN is fine for small lists; can be slow for large subqueries
SELECT * FROM users WHERE id IN (SELECT user_id FROM orders WHERE total_cents > 10000);
-- JOIN is best when you need columns from both tables
SELECT DISTINCT u.id, u.email
FROM users u
JOIN orders o ON o.user_id = u.id
WHERE o.status = 'pending';

Deadlock Prevention
-- Always acquire locks in the same order across transactions
-- If transaction A locks users then orders, transaction B must too
-- Use SELECT ... FOR UPDATE with SKIP LOCKED for job queues
BEGIN;
SELECT * FROM job_queue
WHERE status = 'pending'
ORDER BY created_at
LIMIT 1
FOR UPDATE SKIP LOCKED;
UPDATE job_queue SET status = 'processing', started_at = NOW()
WHERE id = $1;
COMMIT;
-- SKIP LOCKED: worker processes the first available row, skips locked ones
-- Multiple workers can run concurrently without deadlocking

Schema Migrations in Production
-- Safe column addition (non-blocking in PostgreSQL 11+, MySQL 8 InnoDB)
ALTER TABLE orders ADD COLUMN notes TEXT; -- fast, default NULL
ALTER TABLE orders ADD COLUMN priority INT DEFAULT 0 NOT NULL; -- fast with DEFAULT
-- UNSAFE on large tables (avoid in production without blue/green deploy):
ALTER TABLE orders ALTER COLUMN status TYPE VARCHAR(50); -- rewrites table
ALTER TABLE orders ADD CONSTRAINT chk_status CHECK (status IN (...)); -- scans the table (use NOT VALID, then VALIDATE CONSTRAINT later, to avoid a long lock)
-- Safe pattern: expand → migrate → contract
-- Step 1: add new column (non-breaking)
ALTER TABLE users ADD COLUMN plan_v2 TEXT;
-- Step 2: backfill in batches (avoid lock)
UPDATE users SET plan_v2 = plan WHERE id BETWEEN $start AND $end;
-- Step 3: switch app to write both columns
-- Step 4: switch app to read plan_v2
-- Step 5: drop old column
ALTER TABLE users DROP COLUMN plan;
ALTER TABLE users RENAME COLUMN plan_v2 TO plan;

Key Takeaways
- Normalize to 3NF by default; denormalize only when you have measured performance proof.
- B-tree index column order matters: put equality columns first, range columns last.
- Partial indexes are the highest-ROI optimization in most production databases.
- Covering indexes eliminate heap fetches; essential for hot read paths.
- Keyset pagination stays fast at any page depth, unlike OFFSET; use cursors in your API.
- Never apply functions to indexed columns in WHERE clauses; rewrite the predicate or create a functional index.
- Schema migrations on large tables must be done in expand-migrate-contract cycles to avoid locking production traffic.