Database Design, Indexing & Query Optimization
The skills that separate junior from senior engineers: normalization, schema anti-patterns, index internals (B-tree, hash, GIN), query execution plans, and N+1 fixes across SQL and NoSQL.
Why This Is the Most Valuable Database Skill
Writing SQL that returns correct results is easy. Writing queries that stay fast at 100 million rows, and a schema that stays maintainable in year 5, is what separates senior engineers from the rest.
This guide covers the theory and practice: normal forms, index internals, execution plans, and the most common performance traps.
Normalization
Normalization reduces data redundancy and update anomalies. The practical target for most applications is 3NF; beyond that you're usually optimizing for analytics, not application correctness.
First Normal Form (1NF)
Each column stores a single atomic value. No repeating groups.
-- BAD: tags column stores multiple values
CREATE TABLE articles (
id INT,
title TEXT,
tags TEXT -- 'react,hooks,performance': violates 1NF
);
-- GOOD: separate table
CREATE TABLE article_tags (
article_id INT NOT NULL REFERENCES articles(id),
tag TEXT NOT NULL,
PRIMARY KEY (article_id, tag)
);

Second Normal Form (2NF)
Every non-key column depends on the whole primary key (no partial dependencies).
-- BAD: order_items has partial dependency (product_name depends only on product_id)
CREATE TABLE order_items (
order_id INT,
product_id INT,
product_name TEXT, -- depends only on product_id, not (order_id, product_id)
quantity INT,
PRIMARY KEY (order_id, product_id)
);
-- GOOD: product_name lives in products table
CREATE TABLE order_items (
order_id INT REFERENCES orders(id),
product_id INT REFERENCES products(id),
quantity INT NOT NULL,
unit_price NUMERIC(10,2) NOT NULL, -- capture price AT TIME OF ORDER
PRIMARY KEY (order_id, product_id)
);

Third Normal Form (3NF)
No transitive dependencies: non-key columns must depend only on the primary key.
-- BAD: zip_code → city, state (transitive dependency)
CREATE TABLE customers (
id INT PRIMARY KEY,
name TEXT,
zip_code CHAR(5),
city TEXT, -- depends on zip_code, not id
state CHAR(2) -- depends on zip_code, not id
);
-- GOOD: extract zip codes
CREATE TABLE zip_codes (
zip_code CHAR(5) PRIMARY KEY,
city TEXT,
state CHAR(2)
);
CREATE TABLE customers (
id INT PRIMARY KEY,
name TEXT,
zip_code CHAR(5) REFERENCES zip_codes(zip_code)
);

When to Denormalize
Denormalize intentionally for read performance when joins are too expensive:
-- Orders reporting table (denormalized for analytics)
CREATE TABLE order_facts (
order_id BIGINT PRIMARY KEY,
user_email TEXT, -- denormalized from users
user_plan TEXT, -- denormalized from users
product_names TEXT[], -- denormalized array
total_cents INT,
status TEXT,
created_at TIMESTAMPTZ,
shipped_at TIMESTAMPTZ,
delivery_days INT GENERATED ALWAYS AS (
EXTRACT(DAY FROM shipped_at - created_at)
) STORED
);
-- Kept in sync via triggers or event-driven ETL

Index Internals
B-Tree Indexes
The default index type: a balanced tree in which every leaf sits at the same depth.
            [50]
           /    \
       [25]      [75]
      /    \    /    \
   [10]  [30] [60]  [90]

Properties:
- O(log n) search, insert, and delete
- Supports: =, <, >, <=, >=, BETWEEN, LIKE 'prefix%'
- Does NOT support: LIKE '%suffix', LIKE '%middle%'
- Sorted: ORDER BY on an indexed column avoids a separate sort step
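The leading-column rule is easy to verify empirically. The sketch below uses SQLite (whose ordinary indexes are also B-trees) and EXPLAIN QUERY PLAN; table and index names are illustrative, and exact plan wording varies across SQLite versions, but SEARCH always means an index seek while SCAN means a full pass.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (user_id INT, status TEXT, created_at TEXT, total INT)")
db.execute("CREATE INDEX idx_user_status_created ON orders(user_id, status, created_at)")

def plan(sql: str) -> str:
    # EXPLAIN QUERY PLAN rows are (id, parent, notused, detail)
    return db.execute("EXPLAIN QUERY PLAN " + sql).fetchone()[3]

# Leading column present: the B-tree can seek into the index
seek = plan("SELECT total FROM orders WHERE user_id = 99 AND status = 'shipped'")

# Leading column skipped: nothing to seek on, so a full scan
scan = plan("SELECT total FROM orders WHERE status = 'shipped'")

print(seek)  # SEARCH ... USING INDEX idx_user_status_created (user_id=? AND status=?)
print(scan)  # SCAN ...
```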
-- These USE the B-tree index on (user_id, status, created_at):
WHERE user_id = 99
WHERE user_id = 99 AND status = 'shipped'
WHERE user_id = 99 AND status = 'shipped' AND created_at > '2026-01-01'
WHERE user_id = 99 AND created_at > '2026-01-01' -- partial use: seeks on user_id only (status is skipped)
-- These DO NOT use it efficiently:
WHERE status = 'shipped' -- skips leading column (user_id)
WHERE created_at > '2026-01-01' -- skips the first two columns

Hash Indexes
Only support exact equality (=). Faster than B-tree for pure equality lookups but can't do range scans. PostgreSQL supports explicit hash indexes; MySQL uses them for MEMORY engine and adaptive hash internally.
-- PostgreSQL hash index (equality only, faster than B-tree for =)
CREATE INDEX idx_sessions_token_hash ON sessions USING HASH(token);
-- Perfect for: WHERE token = $1

GIN (Generalized Inverted Index)
Used for full-text search, JSONB, and arrays. Stores a posting list per value.
-- Index every key/value pair inside JSONB
CREATE INDEX idx_products_attrs ON products USING GIN(attrs);
-- Index array elements
CREATE INDEX idx_articles_tags ON articles USING GIN(tags);
-- Full-text search vector
CREATE INDEX idx_articles_fts ON articles USING GIN(search_vector);

Partial Indexes: High ROI
Index only the rows you actually query. This can reduce index size by 80-95%.
-- Only index pending orders (5% of rows instead of 100%)
CREATE INDEX idx_orders_pending ON orders(user_id, created_at)
WHERE status = 'pending';
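SQLite supports the same partial-index syntax, so the planner's behavior can be checked locally (a sketch with assumed names; the optimizer uses a partial index only when the query's WHERE clause provably implies the index's WHERE clause):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (user_id INT, status TEXT, created_at TEXT)")
db.execute("""CREATE INDEX idx_orders_pending ON orders(user_id, created_at)
              WHERE status = 'pending'""")

def plan(sql: str) -> str:
    return db.execute("EXPLAIN QUERY PLAN " + sql).fetchone()[3]

# Query implies status = 'pending': the partial index is usable
hit = plan("SELECT * FROM orders WHERE status = 'pending' AND user_id = 99")

# Query does not imply it: the partial index is ignored
miss = plan("SELECT * FROM orders WHERE status = 'shipped' AND user_id = 99")
```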
-- Only index unverified emails
CREATE INDEX idx_users_unverified ON users(email, created_at)
WHERE email_verified = false;

Covering Indexes
Include all the columns the query needs and PostgreSQL can answer it from the index alone (an index-only scan), avoiding the heap fetch entirely.
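The same effect can be observed in SQLite, which lacks an INCLUDE clause but reports a covering-index scan whenever the index alone can answer the query (illustrative sketch; in SQLite the extra columns simply become trailing key columns):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id INT, user_id INT, total_cents INT, status TEXT, note TEXT)")
# No INCLUDE clause in SQLite: append the extra columns to the key instead
db.execute("CREATE INDEX idx_orders_covering ON orders(user_id, id, total_cents, status)")

detail = db.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT id, total_cents, status FROM orders WHERE user_id = 99"
).fetchone()[3]

print(detail)  # SEARCH ... USING COVERING INDEX idx_orders_covering (user_id=?)
```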
-- Query: SELECT id, total_cents, status FROM orders WHERE user_id = 99
CREATE INDEX idx_orders_covering ON orders(user_id)
INCLUDE (id, total_cents, status);
-- Result: index-only scan, no table access needed

Reading Execution Plans
PostgreSQL EXPLAIN ANALYZE
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT u.email, COUNT(o.id) AS order_count
FROM users u
LEFT JOIN orders o ON o.user_id = u.id
WHERE u.plan = 'pro'
GROUP BY u.id, u.email
HAVING COUNT(o.id) > 5
ORDER BY order_count DESC;

Key nodes to understand:
| Node | Meaning | Red flag? |
|---|---|---|
| Seq Scan | Full table scan | Yes, if the table is large |
| Index Scan | Uses index, fetches rows from heap | Good |
| Index Only Scan | Uses covering index, no heap | Best |
| Bitmap Heap Scan | Batches heap fetches | Good for many rows |
| Hash Join | Builds hash table from smaller side | Good for large joins |
| Nested Loop | Row-by-row join | Yes, if the outer side is large |
| Sort | Explicit sort (no supporting index) | Yes; add an index |
| Hash Aggregate | GROUP BY without index | Acceptable |
Seq Scan on users (cost=0.00..8432.00 rows=12000)   -- cost and rows are planner estimates
  Filter: (plan = 'pro')
  Rows Removed by Filter: 88000                     -- most scanned rows are discarded
  Buffers: shared hit=3200 read=200                 -- 200 blocks read from disk: add an index on plan

MySQL EXPLAIN
EXPLAIN FORMAT=JSON
SELECT * FROM orders WHERE user_id = 99 AND status = 'pending';

Key columns:
- type: const → eq_ref → ref → range → ALL (worst: full table scan)
- key: which index was used
- rows: estimated rows examined
- Extra: "Using index" (good), "Using filesort" (sort without an index), "Using temporary" (bad)
The N+1 Problem
The most common ORM-related performance issue.
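Stripped of any particular ORM, the anti-pattern and the manual batch fix look like this (a runnable SQLite sketch; the trace callback simply counts the SQL statements sent to the database):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users  (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INT, total_cents INT);
    INSERT INTO users  VALUES (1, 'Ada'), (2, 'Lin');
    INSERT INTO orders VALUES (10, 1, 500), (11, 2, 700), (12, 1, 900);
""")
queries = []
conn.set_trace_callback(queries.append)  # record every statement executed

# N+1: one query for the orders, then one per order for its user
orders = conn.execute("SELECT id, user_id, total_cents FROM orders").fetchall()
for _, user_id, _ in orders:
    conn.execute("SELECT name FROM users WHERE id = ?", (user_id,)).fetchone()
n_plus_one = len(queries)  # 1 + 3 orders = 4 statements

# Batch fix: collect the ids, then fetch all users with a single IN query
queries.clear()
orders = conn.execute("SELECT id, user_id, total_cents FROM orders").fetchall()
ids = sorted({user_id for _, user_id, _ in orders})
marks = ",".join("?" * len(ids))
users = dict(conn.execute(
    f"SELECT id, name FROM users WHERE id IN ({marks})", ids).fetchall())
batched = len(queries)  # always exactly 2 statements, however many orders
```

The batch variant's statement count is constant regardless of row count, which is the whole point.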
// BAD: 1 query for orders + 1 query per order for user = N+1
var orders = await db.Orders.ToListAsync(); // 1 query
foreach (var order in orders)
{
var user = await db.Users.FindAsync(order.UserId); // N queries
Console.WriteLine($"{user.Name}: {order.TotalCents}");
}
// GOOD: 1 query with JOIN (EF Core Include)
var orders = await db.Orders
.Include(o => o.User)
.ToListAsync(); // single JOIN query
// BETTER: project only what you need
var result = await db.Orders
.Select(o => new { o.Id, o.TotalCents, UserName = o.User.Name })
.ToListAsync();

// Mongoose N+1
const orders = await Order.find({ status: 'pending' });
for (const o of orders) {
const user = await User.findById(o.userId); // N queries!
}
// GOOD: populate (single $lookup)
const orders = await Order
.find({ status: 'pending' })
.populate('userId', 'name email');

Query Optimization Patterns
Pagination: OFFSET vs Keyset
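Both strategies return the same page; they differ in how much work the engine does to find it. A runnable sketch (SQLite; a row-value comparison `(created_at, id) < (?, ?)` gives a correct tiebreak when timestamps collide):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, created_at TEXT)")
db.executemany("INSERT INTO orders VALUES (?, ?)",
               [(i, f"2026-01-{i:02d}") for i in range(1, 10)])
db.execute("CREATE INDEX idx_orders_created ON orders(created_at, id)")

# Page 2 via OFFSET: the engine must still produce and discard page 1
offset_page = db.execute(
    "SELECT id FROM orders ORDER BY created_at DESC, id DESC LIMIT 3 OFFSET 3"
).fetchall()

# Page 2 via keyset: seek directly past the last row seen on page 1
last_created, last_id = "2026-01-07", 7
keyset_page = db.execute(
    "SELECT id FROM orders WHERE (created_at, id) < (?, ?) "
    "ORDER BY created_at DESC, id DESC LIMIT 3",
    (last_created, last_id),
).fetchall()

assert offset_page == keyset_page  # identical page, very different cost at depth
```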
-- BAD: OFFSET is slow at high page numbers (scans all skipped rows)
SELECT * FROM orders ORDER BY created_at DESC LIMIT 20 OFFSET 50000;
-- GOOD: keyset pagination (cursor-based); O(log n) regardless of page depth
SELECT * FROM orders
WHERE (created_at, id) < ($last_seen_created_at, $last_seen_id) -- row-value comparison; id breaks timestamp ties
ORDER BY created_at DESC, id DESC
LIMIT 20;

Avoid Functions on Indexed Columns
-- BAD: function prevents index use
SELECT * FROM orders WHERE DATE(created_at) = '2026-04-17';
SELECT * FROM users WHERE LOWER(email) = 'sarah@example.com';
-- GOOD: rewrite to range / functional index
SELECT * FROM orders
WHERE created_at >= '2026-04-17' AND created_at < '2026-04-18';
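The effect is reproducible in SQLite too (illustrative sketch; SEARCH means an index seek, SCAN a full pass):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, created_at TEXT)")
db.execute("CREATE INDEX idx_orders_created ON orders(created_at)")

def plan(sql: str) -> str:
    return db.execute("EXPLAIN QUERY PLAN " + sql).fetchone()[3]

# Function wraps the column: the index on created_at cannot be seeked
p_func = plan("SELECT * FROM orders WHERE DATE(created_at) = '2026-04-17'")

# Range rewrite of the same predicate: index seek
p_range = plan("SELECT * FROM orders "
               "WHERE created_at >= '2026-04-17' AND created_at < '2026-04-18'")
```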
-- OR: create a functional index
CREATE INDEX idx_users_email_lower ON users(LOWER(email));
SELECT * FROM users WHERE LOWER(email) = 'sarah@example.com';

COUNT(*) vs COUNT(column)
-- COUNT(*) counts every row (including rows with NULLs): the fastest form
SELECT COUNT(*) FROM orders;
-- COUNT(column) counts non-NULL values only
SELECT COUNT(shipped_at) FROM orders; -- only shipped orders
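The NULL-handling difference is easy to verify (SQLite sketch with two unshipped rows):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, shipped_at TEXT)")
db.executemany("INSERT INTO orders (shipped_at) VALUES (?)",
               [("2026-01-02",), (None,), ("2026-01-05",), (None,)])

total   = db.execute("SELECT COUNT(*) FROM orders").fetchone()[0]           # 4: every row
shipped = db.execute("SELECT COUNT(shipped_at) FROM orders").fetchone()[0]  # 2: non-NULL only
```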
-- COUNT(DISTINCT column)
SELECT COUNT(DISTINCT user_id) FROM orders; -- unique buyers

EXISTS vs IN vs JOIN for Subqueries
-- Checking existence: EXISTS is usually fastest
-- It short-circuits at first match
SELECT * FROM users u
WHERE EXISTS (
SELECT 1 FROM orders o WHERE o.user_id = u.id AND o.status = 'pending'
);
-- IN is fine for small lists; can be slow for large subqueries
SELECT * FROM users WHERE id IN (SELECT user_id FROM orders WHERE total_cents > 10000);
-- JOIN is best when you need columns from both tables
SELECT DISTINCT u.id, u.email
FROM users u
JOIN orders o ON o.user_id = u.id
WHERE o.status = 'pending';

Deadlock Prevention
-- Always acquire locks in the same order across transactions
-- If transaction A locks users then orders, transaction B must too
-- Use SELECT ... FOR UPDATE with SKIP LOCKED for job queues
BEGIN;
SELECT * FROM job_queue
WHERE status = 'pending'
ORDER BY created_at
LIMIT 1
FOR UPDATE SKIP LOCKED;
UPDATE job_queue SET status = 'processing', started_at = NOW()
WHERE id = $1;
COMMIT;
-- SKIP LOCKED: worker processes the first available row, skips locked ones
-- Multiple workers can run concurrently without deadlocking

Schema Migrations in Production
-- Safe column addition (non-blocking in PostgreSQL 11+, MySQL 8 InnoDB)
ALTER TABLE orders ADD COLUMN notes TEXT; -- fast, default NULL
ALTER TABLE orders ADD COLUMN priority INT DEFAULT 0 NOT NULL; -- fast with DEFAULT
-- UNSAFE on large tables (avoid in production without blue/green deploy):
ALTER TABLE orders ALTER COLUMN status TYPE VARCHAR(50); -- rewrites table
ALTER TABLE orders ADD CONSTRAINT chk_status CHECK (status IN (...)); -- scans the table (use NOT VALID, then VALIDATE CONSTRAINT later, to avoid a long lock)
-- Safe pattern: expand → migrate → contract
-- Step 1: add new column (non-breaking)
ALTER TABLE users ADD COLUMN plan_v2 TEXT;
-- Step 2: backfill in batches (avoid lock)
UPDATE users SET plan_v2 = plan WHERE id BETWEEN $start AND $end;
-- Step 3: switch app to write both columns
-- Step 4: switch app to read plan_v2
-- Step 5: drop old column
ALTER TABLE users DROP COLUMN plan;
ALTER TABLE users RENAME COLUMN plan_v2 TO plan;

Key Takeaways
- Normalize to 3NF by default; denormalize only when you have measured performance proof.
- B-tree index column order matters: put equality columns first, range columns last.
- Partial indexes are the highest-ROI optimization in most production databases.
- Covering indexes eliminate heap fetches; essential for hot read paths.
- Keyset pagination stays fast at any page depth, unlike OFFSET; use cursors in your API.
- Never apply functions to indexed columns in WHERE clauses; rewrite the predicate or create a functional index.
- Schema migrations on large tables must be done in expand-migrate-contract cycles to avoid locking production traffic.