SQL Interview Questions: Advanced Level (Q101–Q200)
100 advanced SQL interview questions with answers — window functions, CTEs, recursive queries, query optimization, execution plans, locking, and complex analytical patterns used in senior-level interviews.
Advanced SQL Interview Prep
These questions are asked at senior developer, data engineer, and data scientist interviews at companies like Google, Amazon, Meta, Microsoft, and top-tier startups. They assume you've already mastered the fundamentals.
Window Functions Deep Dive
Q101: What is the OVER() clause with no arguments?
Applies the window function over the entire result set — no partition, no order.
-- Compare each order to the overall average
SELECT
order_id,
total_amount,
AVG(total_amount) OVER () AS overall_avg,
total_amount - AVG(total_amount) OVER () AS diff_from_avg
FROM orders;Q102: What is a window frame and how do you specify it?
Defines which rows relative to the current row are included in the window.
-- ROWS BETWEEN: physical row offsets
SUM(amount) OVER (
ORDER BY date
ROWS BETWEEN 2 PRECEDING AND CURRENT ROW -- current + 2 previous rows
)
-- RANGE BETWEEN: logical value range
SUM(amount) OVER (
ORDER BY date
RANGE BETWEEN INTERVAL '7 days' PRECEDING AND CURRENT ROW
)
-- Unbounded: entire partition
SUM(amount) OVER (
PARTITION BY customer_id
ORDER BY date
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
)Q103: How do you calculate a 30-day moving average?
WITH daily AS (
SELECT created_at::DATE AS day, SUM(total_amount) AS revenue
FROM orders WHERE status = 'delivered'
GROUP BY 1
),
date_spine AS (
SELECT generate_series(MIN(day), MAX(day), '1 day') AS day FROM daily
)
SELECT
ds.day,
COALESCE(d.revenue, 0) AS revenue,
AVG(COALESCE(d.revenue, 0)) OVER (
ORDER BY ds.day
ROWS BETWEEN 29 PRECEDING AND CURRENT ROW
) AS moving_avg_30d
FROM date_spine ds
LEFT JOIN daily d ON ds.day = d.day;Q104: Calculate the percentage contribution of each order to the customer's total.
SELECT
o.order_id,
o.customer_id,
o.total_amount,
SUM(o.total_amount) OVER (PARTITION BY o.customer_id) AS customer_total,
ROUND(100.0 * o.total_amount /
NULLIF(SUM(o.total_amount) OVER (PARTITION BY o.customer_id), 0), 2
) AS pct_of_customer_total
FROM orders o
WHERE status = 'delivered';Q105: Find customers whose most recent order was a downgrade from their previous order.
WITH ordered AS (
SELECT
customer_id,
order_id,
total_amount,
LAG(total_amount) OVER (PARTITION BY customer_id ORDER BY created_at) AS prev_amount,
ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY created_at DESC) AS rn
FROM orders
WHERE status = 'delivered'
)
SELECT customer_id, order_id, total_amount, prev_amount
FROM ordered
WHERE rn = 1
AND total_amount < prev_amount;Q106: Find "gaps" in a sequence of IDs.
-- Find missing order_ids between 1 and MAX
WITH series AS (
SELECT generate_series(1, (SELECT MAX(order_id) FROM orders)) AS expected_id
)
SELECT expected_id AS missing_order_id
FROM series
WHERE expected_id NOT IN (SELECT order_id FROM orders);
-- Or more efficiently with LAG:
SELECT order_id + 1 AS gap_start,
next_id - 1 AS gap_end
FROM (
SELECT order_id,
LEAD(order_id) OVER (ORDER BY order_id) AS next_id
FROM orders
) t
WHERE next_id > order_id + 1;Q107: Solve the "islands and gaps" problem — find consecutive date ranges for active subscriptions.
-- Subscriptions table: (customer_id, date, is_active)
-- Find continuous active periods
WITH numbered AS (
SELECT
customer_id,
date,
ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY date) -
ROW_NUMBER() OVER (PARTITION BY customer_id, is_active ORDER BY date) AS grp
FROM subscriptions
WHERE is_active = TRUE
)
SELECT
customer_id,
MIN(date) AS period_start,
MAX(date) AS period_end,
COUNT(*) AS active_days
FROM numbered
GROUP BY customer_id, grp
ORDER BY customer_id, period_start;Q108: Assign customers to cohorts by their first-order month and calculate retention.
WITH first_orders AS (
SELECT customer_id, DATE_TRUNC('month', MIN(created_at)) AS cohort_month
FROM orders GROUP BY customer_id
),
cohort_activity AS (
SELECT
fo.cohort_month,
DATE_TRUNC('month', o.created_at) AS activity_month,
COUNT(DISTINCT o.customer_id) AS active_customers
FROM orders o
JOIN first_orders fo ON o.customer_id = fo.customer_id
GROUP BY 1, 2
),
cohort_sizes AS (
SELECT cohort_month, COUNT(*) AS size FROM first_orders GROUP BY 1
)
SELECT
ca.cohort_month,
cs.size AS cohort_size,
EXTRACT(MONTH FROM AGE(ca.activity_month, ca.cohort_month)) AS months_later,
ca.active_customers,
ROUND(100.0 * ca.active_customers / cs.size, 1) AS retention_pct
FROM cohort_activity ca
JOIN cohort_sizes cs ON ca.cohort_month = cs.cohort_month
ORDER BY 1, 3;Recursive CTEs
Q109: Walk a bill-of-materials (product components) tree.
-- components(parent_product_id, child_product_id, quantity)
WITH RECURSIVE bom AS (
SELECT child_product_id, quantity, 1 AS level, child_product_id::TEXT AS path
FROM components WHERE parent_product_id = 1 -- root product
UNION ALL
SELECT c.child_product_id, c.quantity * bom.quantity, bom.level + 1,
bom.path || ' > ' || c.child_product_id
FROM components c
JOIN bom ON c.parent_product_id = bom.child_product_id
)
SELECT bom.child_product_id, p.name, bom.quantity AS total_needed, level, path
FROM bom
JOIN products p ON bom.child_product_id = p.product_id
ORDER BY path;Q110: Find all paths in a directed graph (detect cycles).
WITH RECURSIVE paths AS (
SELECT from_node, to_node, ARRAY[from_node, to_node] AS path, FALSE AS cycle
FROM edges WHERE from_node = 'A'
UNION ALL
SELECT e.from_node, e.to_node, p.path || e.to_node,
e.to_node = ANY(p.path) -- cycle detection
FROM edges e
JOIN paths p ON e.from_node = p.to_node
WHERE NOT p.cycle
)
SELECT path FROM paths WHERE NOT cycle;Query Optimization
Q111: What is the difference between a hash join, merge join, and nested loop join?
| Algorithm | Best for | Condition | |-----------|---------|-----------| | Nested Loop | Small table + large indexed table | Any join condition | | Hash Join | Large unsorted tables | Equality joins only | | Merge Join | Both tables sorted on join key | Equality joins only |
The optimizer chooses automatically. You can hint the join type in some databases:
-- PostgreSQL: disable hash join for testing
SET enable_hashjoin = OFF;
EXPLAIN SELECT ...;Q112: How does the query planner decide whether to use an index?
It estimates the cost of a full sequential scan vs an index scan, using:
- Table statistics (row count, column cardinality, value distribution histograms)
- Selectivity of the WHERE condition
- Cost of random I/O (index reads) vs sequential I/O (full scan)
For small tables or when returning >10-20% of rows, a sequential scan is often cheaper.
Q113: Write a query to detect stale statistics.
-- PostgreSQL: check last analyze time
SELECT schemaname, tablename, last_vacuum, last_analyze, last_autoanalyze, n_live_tup
FROM pg_stat_user_tables
WHERE last_analyze < NOW() - INTERVAL '7 days'
OR last_analyze IS NULL
ORDER BY n_live_tup DESC;
-- Force statistics update
ANALYZE orders;
ANALYZE customers;Q114: How would you optimize this slow query?
-- Original: slow
SELECT * FROM orders WHERE EXTRACT(YEAR FROM created_at) = 2025;
-- Optimized: range instead of function (uses index)
SELECT * FROM orders
WHERE created_at >= '2025-01-01' AND created_at < '2026-01-01';Key principle: Don't apply functions to indexed columns in WHERE — it prevents index use.
Q115: What is a parallel query and when does it help?
The database splits a query across multiple CPU cores. Helps for:
- Full table scans on large tables
- Aggregations over large datasets
- Hash joins of large tables
-- PostgreSQL: control parallelism
SET max_parallel_workers_per_gather = 4;
EXPLAIN ANALYZE SELECT SUM(total_amount) FROM orders;
-- Look for "Parallel Seq Scan" or "Parallel Hash Join"Complex Analytical Queries
Q116: Calculate session duration and pages per session from a page view log.
WITH sessions AS (
SELECT
session_id,
customer_id,
MIN(viewed_at) AS session_start,
MAX(viewed_at) AS session_end,
COUNT(*) AS page_count
FROM page_views
GROUP BY session_id, customer_id
)
SELECT
customer_id,
AVG(EXTRACT(EPOCH FROM (session_end - session_start)) / 60) AS avg_session_minutes,
AVG(page_count) AS avg_pages_per_session,
COUNT(DISTINCT session_id) AS total_sessions
FROM sessions
GROUP BY customer_id
ORDER BY avg_session_minutes DESC;Q117: Find the longest streak of consecutive days a customer placed an order.
WITH daily_orders AS (
SELECT DISTINCT customer_id, created_at::DATE AS order_date
FROM orders
),
grouped AS (
SELECT customer_id, order_date,
order_date - ROW_NUMBER() OVER (
PARTITION BY customer_id ORDER BY order_date
)::INTEGER AS grp
FROM daily_orders
),
streaks AS (
SELECT customer_id, COUNT(*) AS streak_length
FROM grouped
GROUP BY customer_id, grp
)
SELECT customer_id, MAX(streak_length) AS longest_streak
FROM streaks
GROUP BY customer_id
ORDER BY longest_streak DESC;Q118: Calculate the median order value per product category.
SELECT
c.name AS category,
PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY oi.line_total) AS median_line_total,
AVG(oi.line_total) AS mean_line_total,
COUNT(*) AS item_count
FROM order_items oi
JOIN products p ON oi.product_id = p.product_id
JOIN categories c ON p.category_id = c.category_id
GROUP BY c.category_id, c.name
ORDER BY median_line_total DESC;Q119: Find customers who purchased in two consecutive months but not the third.
WITH monthly_orders AS (
SELECT DISTINCT
customer_id,
DATE_TRUNC('month', created_at) AS month
FROM orders
),
with_lag AS (
SELECT
customer_id,
month,
LAG(month, 1) OVER (PARTITION BY customer_id ORDER BY month) AS prev1,
LAG(month, 2) OVER (PARTITION BY customer_id ORDER BY month) AS prev2
FROM monthly_orders
)
SELECT DISTINCT customer_id, month AS lapsed_month
FROM with_lag
WHERE prev1 = month - INTERVAL '1 month'
AND prev2 = month - INTERVAL '2 months'
AND month + INTERVAL '1 month' NOT IN (
SELECT month FROM monthly_orders mo WHERE mo.customer_id = with_lag.customer_id
);Q120: Implement a slowly changing dimension (SCD Type 2).
Track historical changes to customer data.
-- SCD Type 2 table: keep history with effective dates
CREATE TABLE customer_history (
history_id SERIAL PRIMARY KEY,
customer_id INT NOT NULL,
email VARCHAR(255),
tier VARCHAR(20),
effective_from TIMESTAMP NOT NULL,
effective_to TIMESTAMP, -- NULL = current record
is_current BOOLEAN DEFAULT TRUE
);
-- When a customer changes tier, expire the old record and insert new one
CREATE OR REPLACE PROCEDURE update_customer_scd2(
p_customer_id INT, p_new_tier VARCHAR
)
LANGUAGE plpgsql AS $$
BEGIN
-- Expire current record
UPDATE customer_history
SET effective_to = NOW(), is_current = FALSE
WHERE customer_id = p_customer_id AND is_current = TRUE;
-- Insert new current record
INSERT INTO customer_history (customer_id, email, tier, effective_from)
SELECT customer_id, email, p_new_tier, NOW()
FROM customers WHERE customer_id = p_customer_id;
END;
$$;
-- Query: what was a customer's tier on a specific date?
SELECT tier FROM customer_history
WHERE customer_id = 42
AND effective_from <= '2025-06-01'
AND (effective_to > '2025-06-01' OR effective_to IS NULL);Transactions and Concurrency
Q121: What is MVCC (Multi-Version Concurrency Control)?
PostgreSQL keeps multiple versions of each row. When a transaction reads data, it sees a snapshot from when the transaction started — not versions modified by concurrent transactions. This avoids read locks.
Key effects:
- Reads don't block writes, writes don't block reads
- Old row versions accumulate until
VACUUMreclaims them - Each transaction has a transaction ID (XID) that determines which versions it can see
Q122: What is a phantom read?
A transaction reads a set of rows, then another transaction inserts rows that match the first transaction's query. If the first re-reads, it sees "phantom" rows. Only SERIALIZABLE isolation prevents this.
-- Transaction 1 reads: SELECT * FROM orders WHERE amount > 100 → 5 rows
-- Transaction 2 inserts: INSERT INTO orders (amount) VALUES (500)
-- Transaction 1 re-reads: SELECT * FROM orders WHERE amount > 100 → 6 rows (phantom!)Q123: Explain SELECT FOR UPDATE vs SELECT FOR SHARE.
-- FOR UPDATE: locks rows exclusively. Other transactions can't read FOR UPDATE or modify.
SELECT * FROM orders WHERE order_id = 42 FOR UPDATE;
-- FOR SHARE: locks rows for reading. Others can read but not modify.
SELECT * FROM orders WHERE order_id = 42 FOR SHARE;
-- SKIP LOCKED: skip rows already locked (useful for job queues)
SELECT * FROM jobs WHERE status = 'pending' LIMIT 10 FOR UPDATE SKIP LOCKED;Q124: How would you implement a distributed counter without race conditions?
-- Using SELECT FOR UPDATE
BEGIN;
SELECT count FROM page_counters WHERE page_id = 1 FOR UPDATE;
UPDATE page_counters SET count = count + 1 WHERE page_id = 1;
COMMIT;
-- Better: use atomic UPDATE + RETURNING
UPDATE page_counters
SET count = count + 1
WHERE page_id = 1
RETURNING count;
-- This is atomic — no explicit lock neededSchema Design Questions
Q125: How would you model a many-to-many relationship?
-- Students and courses — a student can take many courses, a course has many students
CREATE TABLE students (student_id SERIAL PRIMARY KEY, name VARCHAR(100));
CREATE TABLE courses (course_id SERIAL PRIMARY KEY, name VARCHAR(100));
-- Junction table
CREATE TABLE enrollments (
student_id INT NOT NULL REFERENCES students(student_id),
course_id INT NOT NULL REFERENCES courses(course_id),
enrolled_at TIMESTAMP DEFAULT NOW(),
grade CHAR(2),
PRIMARY KEY (student_id, course_id) -- composite PK prevents duplicates
);Q126: How would you store hierarchical data (category trees)?
Option 1: Adjacency List — parent_id column
- Simple but slow for querying subtrees (requires recursive CTEs)
Option 2: Materialized Path — store full path as string
-- category_path: 'Electronics/Laptops/Gaming'
WHERE category_path LIKE 'Electronics/%' -- fast for simple lookupsOption 3: Nested Sets — lft, rgt values
- Very fast subtree queries, slow writes
Option 4: Closure Table — separate table with all ancestor-descendant pairs
- Best for complex hierarchies with frequent queries
Q127: Design a table for storing product variants (size, color, etc.).
CREATE TABLE attribute_types (
type_id SERIAL PRIMARY KEY,
name VARCHAR(50) -- 'Size', 'Color', 'Material'
);
CREATE TABLE attribute_values (
value_id SERIAL PRIMARY KEY,
type_id INT REFERENCES attribute_types(type_id),
value VARCHAR(100) -- 'Red', 'XL', 'Cotton'
);
CREATE TABLE product_variants (
variant_id SERIAL PRIMARY KEY,
product_id INT REFERENCES products(product_id),
sku VARCHAR(100) UNIQUE,
price_adj DECIMAL(10,2) DEFAULT 0 -- price adjustment from base
);
CREATE TABLE variant_attributes (
variant_id INT REFERENCES product_variants(variant_id),
value_id INT REFERENCES attribute_values(value_id),
PRIMARY KEY (variant_id, value_id)
);Q128: How would you audit every change to an orders table?
CREATE TABLE orders_audit (
audit_id SERIAL PRIMARY KEY,
operation CHAR(1), -- 'I', 'U', 'D'
changed_at TIMESTAMP DEFAULT NOW(),
changed_by VARCHAR(100) DEFAULT current_user,
order_id INT,
old_status VARCHAR(20),
new_status VARCHAR(20),
old_amount DECIMAL(10,2),
new_amount DECIMAL(10,2)
);
CREATE OR REPLACE FUNCTION audit_orders()
RETURNS TRIGGER LANGUAGE plpgsql AS $$
BEGIN
INSERT INTO orders_audit (operation, order_id, old_status, new_status, old_amount, new_amount)
VALUES (
TG_OP::CHAR(1),
COALESCE(NEW.order_id, OLD.order_id),
OLD.status, NEW.status,
OLD.total_amount, NEW.total_amount
);
RETURN NEW;
END;
$$;
CREATE TRIGGER trg_orders_audit
AFTER INSERT OR UPDATE OR DELETE ON orders
FOR EACH ROW EXECUTE FUNCTION audit_orders();Advanced Query Patterns
Q129: Write a query to transpose columns to rows (UNPIVOT).
-- Monthly sales in wide format: (year, jan, feb, mar, ...)
-- Unpivot to: (year, month, sales)
SELECT year, month, sales
FROM monthly_sales
CROSS JOIN LATERAL (VALUES
('January', jan_sales),
('February', feb_sales),
('March', mar_sales),
('April', apr_sales)
) AS unpivoted(month, sales);Q130: How do you do a fuzzy string match?
-- PostgreSQL: similarity (pg_trgm extension)
CREATE EXTENSION pg_trgm;
SELECT name, similarity(name, 'Jon Smith') AS sim
FROM customers
WHERE similarity(name, 'Jon Smith') > 0.4
ORDER BY sim DESC;
-- Levenshtein distance (fuzzystrmatch extension)
CREATE EXTENSION fuzzystrmatch;
SELECT name, levenshtein(name, 'John Smith') AS distance
FROM customers
ORDER BY distance LIMIT 5;Q131: What is the difference between EXISTS and COUNT for checking if a row exists?
-- SLOWER: counts all matching rows
IF (SELECT COUNT(*) FROM orders WHERE customer_id = 42) > 0 THEN ...
-- FASTER: stops at the first match
IF EXISTS (SELECT 1 FROM orders WHERE customer_id = 42) THEN ...EXISTS short-circuits — it returns as soon as any matching row is found.
Q132: How do you partition a table by date in PostgreSQL?
-- Declarative partitioning (PostgreSQL 10+)
CREATE TABLE orders (
order_id INT,
created_at TIMESTAMP NOT NULL,
total_amount DECIMAL(10,2)
) PARTITION BY RANGE (created_at);
-- Create yearly partitions
CREATE TABLE orders_2024 PARTITION OF orders
FOR VALUES FROM ('2024-01-01') TO ('2025-01-01');
CREATE TABLE orders_2025 PARTITION OF orders
FOR VALUES FROM ('2025-01-01') TO ('2026-01-01');
-- Queries automatically use only the relevant partition
EXPLAIN SELECT * FROM orders WHERE created_at >= '2025-01-01';
-- Shows only "orders_2025" being scannedQ133: What is the difference between CHAR(36) and UUID for storing UUIDs?
-- UUID native type (PostgreSQL, MySQL 8.0+)
CREATE TABLE events (
event_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
...
);
-- Advantages of native UUID:
-- 4 bytes less storage than CHAR(36) (16 bytes vs 36)
-- Proper type checking
-- Indexable without text comparison overheadQ134: How do you copy table structure without data?
-- PostgreSQL
CREATE TABLE orders_backup AS SELECT * FROM orders WHERE FALSE;
-- or
CREATE TABLE orders_backup (LIKE orders INCLUDING ALL); -- includes indexes and constraints
-- SQL Server
SELECT * INTO orders_backup FROM orders WHERE 1=0;Q135: Explain the difference between IN (subquery) and = ANY (subquery).
Functionally equivalent in most cases, but ANY can use operators other than =:
-- These are identical:
WHERE salary IN (SELECT salary FROM top_earners)
WHERE salary = ANY (SELECT salary FROM top_earners)
-- ANY with other operators:
WHERE salary > ANY (SELECT salary FROM managers) -- above any manager's salary
WHERE salary > ALL (SELECT salary FROM managers) -- above ALL managers' salariesQ136–Q200: Expert Pattern Questions
Q136: Find pairs of products that are always bought together.
SELECT p1.name, p2.name, COUNT(*) AS co_occurrences
FROM order_items oi1
JOIN order_items oi2 ON oi1.order_id = oi2.order_id AND oi1.product_id < oi2.product_id
JOIN products p1 ON oi1.product_id = p1.product_id
JOIN products p2 ON oi2.product_id = p2.product_id
GROUP BY p1.product_id, p1.name, p2.product_id, p2.name
HAVING COUNT(*) >= 10
ORDER BY co_occurrences DESC;Q137: How do you calculate week-over-week growth?
SELECT week, revenue,
LAG(revenue) OVER (ORDER BY week) AS prev_week,
ROUND(100.0 * (revenue - LAG(revenue) OVER (ORDER BY week)) /
NULLIF(LAG(revenue) OVER (ORDER BY week), 0), 1) AS wow_growth
FROM weekly_revenue;Q138: What is the difference between NOW() and CURRENT_TIMESTAMP?
Both return the current timestamp. NOW() is a PostgreSQL function; CURRENT_TIMESTAMP is SQL standard. They behave identically in PostgreSQL.
Q139: How do you find records created in the last complete week (Mon-Sun)?
WHERE created_at >= DATE_TRUNC('week', CURRENT_DATE - 7)
AND created_at < DATE_TRUNC('week', CURRENT_DATE);Q140: How would you find the top-selling product in each category?
WITH ranked AS (
SELECT p.category_id, p.product_id, p.name,
SUM(oi.line_total) AS revenue,
RANK() OVER (PARTITION BY p.category_id ORDER BY SUM(oi.line_total) DESC) AS rnk
FROM products p JOIN order_items oi ON p.product_id = oi.product_id
GROUP BY p.category_id, p.product_id, p.name
)
SELECT category_id, product_id, name, revenue FROM ranked WHERE rnk = 1;Q141: What is FILTER in aggregate functions?
-- Count delivered and cancelled orders in one scan
SELECT
COUNT(*) FILTER (WHERE status = 'delivered') AS delivered,
COUNT(*) FILTER (WHERE status = 'cancelled') AS cancelled
FROM orders;Q142: How do you implement rate limiting queries (e.g., max 5 orders per hour per customer)?
SELECT customer_id, COUNT(*) AS orders_last_hour
FROM orders
WHERE created_at >= NOW() - INTERVAL '1 hour'
GROUP BY customer_id
HAVING COUNT(*) > 5;Q143: How do you get the Nth row in a query?
SELECT * FROM products ORDER BY product_id LIMIT 1 OFFSET N-1;
-- Or with ROW_NUMBER:
SELECT * FROM (SELECT *, ROW_NUMBER() OVER (ORDER BY product_id) AS rn FROM products) t WHERE rn = N;Q144: What is tablespace in PostgreSQL? A location on disk where database objects (tables, indexes) are stored. Allows distributing storage across multiple disks.
Q145: How do you query JSON arrays in PostgreSQL?
-- JSON column: tags JSONB = '["sql","python","data"]'
SELECT * FROM articles WHERE tags @> '["sql"]'; -- contains
SELECT * FROM articles WHERE tags ? 'python'; -- has key/element
SELECT jsonb_array_elements_text(tags) FROM articles; -- unnest arrayQ146: What is EXCLUDE constraint?
A generalization of UNIQUE. Ensures no two rows overlap according to a specified operator.
-- No two events for the same room should overlap in time (requires btree_gist)
CREATE TABLE bookings (
room_id INT, period TSRANGE,
EXCLUDE USING GIST (room_id WITH =, period WITH &&)
);Q147: How do you implement full-text search with ranking?
SELECT title, ts_rank(to_tsvector('english', content), query) AS rank
FROM articles,
to_tsquery('english', 'machine & learning') AS query
WHERE to_tsvector('english', content) @@ query
ORDER BY rank DESC;Q148: What is pg_stat_statements?
A PostgreSQL extension that tracks statistics for all executed queries — total calls, total time, rows returned. Essential for identifying slow queries.
Q149: How do you prevent SQL injection? Always use parameterized queries / prepared statements. Never concatenate user input into SQL strings.
-- UNSAFE
query = "SELECT * FROM users WHERE email = '" + user_input + "'"
-- SAFE (parameterized)
query = "SELECT * FROM users WHERE email = $1"
execute(query, [user_input])Q150: What is connection pooling and why is it needed? Creating a database connection is expensive (TCP handshake, auth, memory allocation). Connection poolers (PgBouncer, HikariCP) maintain a pool of open connections and reuse them, allowing thousands of application requests to share a small pool of actual database connections.
Q151: How do you check the size of a table?
-- PostgreSQL
SELECT
pg_size_pretty(pg_total_relation_size('orders')) AS total_size,
pg_size_pretty(pg_relation_size('orders')) AS table_size,
pg_size_pretty(pg_indexes_size('orders')) AS indexes_size;Q152: How do you kill a long-running query?
-- Find long-running queries
SELECT pid, query, NOW() - query_start AS duration, state
FROM pg_stat_activity
WHERE state = 'active' AND NOW() - query_start > INTERVAL '1 minute'
ORDER BY duration DESC;
-- Kill it
SELECT pg_cancel_backend(pid); -- sends SIGINT (graceful)
SELECT pg_terminate_backend(pid); -- sends SIGTERM (force)Q153: What is pg_dump?
PostgreSQL backup tool. Creates a script or custom-format archive to restore a database.
pg_dump -Fc -d mydb -f backup.dump # custom format
pg_restore -d mydb backup.dump # restoreQ154: How do you rename a table?
ALTER TABLE old_name RENAME TO new_name;Q155: What does RETURNING do?
Returns the values of rows affected by INSERT, UPDATE, or DELETE — eliminates a follow-up SELECT.
INSERT INTO orders (customer_id, total_amount)
VALUES (42, 99.99)
RETURNING order_id, created_at;Q156: How do you create a read-only user?
CREATE USER analyst WITH PASSWORD 'password';
GRANT CONNECT ON DATABASE shopdb TO analyst;
GRANT USAGE ON SCHEMA public TO analyst;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO analyst;
-- For future tables:
ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT SELECT ON TABLES TO analyst;Q157: What is row-level security (RLS)? Restricts which rows a user can see/modify based on policies.
ALTER TABLE orders ENABLE ROW LEVEL SECURITY;
CREATE POLICY customer_own_orders ON orders
USING (customer_id = current_setting('app.current_customer_id')::INT);
-- Each customer now only sees their own ordersQ158: How do you use GENERATE_SERIES for testing data?
-- Generate 10,000 test orders
INSERT INTO orders (customer_id, total_amount, created_at, status)
SELECT
(RANDOM() * 999 + 1)::INT,
ROUND((RANDOM() * 1000)::NUMERIC, 2),
NOW() - (RANDOM() * INTERVAL '365 days'),
(ARRAY['pending','delivered','cancelled'])[1 + (RANDOM() * 2)::INT]
FROM generate_series(1, 10000);Q159: How do you detect hotspots in your index?
-- PostgreSQL: find most scanned indexes
SELECT schemaname, tablename, indexname, idx_scan, idx_tup_read, idx_tup_fetch
FROM pg_stat_user_indexes
ORDER BY idx_scan DESC;
-- Find unused indexes (idx_scan = 0)
SELECT indexrelname FROM pg_stat_user_indexes WHERE idx_scan = 0;Q160: What is vacuum bloat?
After many UPDATEs and DELETEs, PostgreSQL tables accumulate dead tuples (old row versions). This "bloat" wastes disk space and slows queries. VACUUM reclaims it. VACUUM FULL rewrites the table completely (requires exclusive lock).
Q161–Q200: Rapid-fire expert questions
Q161: What is the difference between REPEATABLE READ and SERIALIZABLE? Serializable prevents phantom reads; Repeatable Read does not.
Q162: Can you have a foreign key to a non-primary key column? Yes, if it has a unique constraint.
Q163: What is a composite primary key? A primary key made of multiple columns — their combination must be unique.
Q164: What is the difference between ON UPDATE CASCADE and ON DELETE CASCADE? ON UPDATE CASCADE propagates primary key changes to foreign key columns. ON DELETE CASCADE deletes child rows when parent is deleted.
Q165: What is an anti-join? A join that returns rows from the left table that have NO match in the right table. Implemented with LEFT JOIN + WHERE right.id IS NULL, or NOT EXISTS.
Q166: What is a lateral join? A join where each row of the left table is used as input to the right-side subquery — enables correlated per-row subqueries in the FROM clause.
Q167: What is DISTINCT ON in PostgreSQL? Returns one row per distinct value of the specified column(s), keeping the first row per group according to ORDER BY.
Q168: What is the difference between = ANY and IN? Functionally the same for equality. ANY supports other operators (> ANY, < ANY).
Q169: What are table statistics in PostgreSQL? Metadata about column value distributions (stored in pg_statistic) used by the query planner to estimate row counts and choose optimal plans.
Q170: What is predicate pushdown? The optimizer moves WHERE filters as early as possible in the execution plan — closer to table scans — to reduce rows processed at higher levels.
Q171: What is parallelism degree? The number of parallel workers assigned to a query. Controlled by max_parallel_workers_per_gather in PostgreSQL.
Q172: What is WAL (Write-Ahead Log)? A transaction log where all changes are written before they're applied to data files. Enables crash recovery and replication.
Q173: What is streaming replication? Continuously sending WAL records from primary to standby servers for high availability and read scaling.
Q174: What is a hot standby? A read-only replica that accepts SELECT queries while receiving replication data from the primary.
Q175: What is pg_bouncer? A lightweight PostgreSQL connection pooler that reduces database load by reusing connections.
Q176: What is EXPLAIN (BUFFERS)? Shows disk/memory buffer statistics — pages read from disk vs cache. Helps diagnose I/O issues.
Q177: What is a function index? An index on an expression or function: CREATE INDEX idx ON t (LOWER(email)). Makes WHERE LOWER(email) = 'x' use the index.
Q178: What is a GIN index? Generalized Inverted Index — best for arrays, JSONB, and full-text search. Indexes each element/word separately.
Q179: What is a GiST index? Generalized Search Tree — supports geometric data, ranges, nearest-neighbor searches.
Q180: What is BRIN index? Block Range INdex — stores min/max values per physical block. Very small, but only efficient for naturally ordered data (timestamps, auto-increment IDs).
Q181: What is the difference between SERIAL and IDENTITY in PostgreSQL? SERIAL is legacy sugar for a sequence. IDENTITY (PostgreSQL 10+) is SQL standard, attached directly to the column, and has GENERATED ALWAYS (prevents manual override) vs GENERATED BY DEFAULT.
Q182: What does GENERATED ALWAYS AS do? Defines a computed column — automatically calculated from other columns, cannot be inserted directly.
CREATE TABLE order_items (
quantity INT, unit_price DECIMAL(10,2),
line_total DECIMAL(10,2) GENERATED ALWAYS AS (quantity * unit_price) STORED
);Q183: What is LISTEN/NOTIFY? PostgreSQL async messaging: one session can NOTIFY channel, 'payload', another that called LISTEN channel receives it. Used for real-time updates.
Q184: What is COPY? Fast bulk import/export. COPY orders FROM '/tmp/orders.csv' CSV HEADER; — much faster than thousands of INSERTs.
Q185: What is \COPY in psql vs COPY? COPY runs server-side (file must be on server). \COPY runs client-side (file on your machine), handled by psql.
Q186: What is pg_restore? Restores a database from a custom-format pg_dump archive.
Q187: What is point-in-time recovery (PITR)? Restoring a database to any point in time using base backups + WAL archives.
Q188: What is logical replication vs physical replication? Physical: byte-level WAL streaming (same major version required). Logical: row-level changes in SQL format (works across major versions, can replicate specific tables).
Q189: What is a foreign data wrapper (FDW)? Allows querying external data sources (other databases, CSV files, REST APIs) as if they were local tables.
CREATE EXTENSION postgres_fdw;
CREATE SERVER remote_db FOREIGN DATA WRAPPER postgres_fdw OPTIONS (host 'remote', dbname 'db');
CREATE FOREIGN TABLE remote_orders (...) SERVER remote_db;
SELECT * FROM remote_orders;Q190: What is the difference between heap and index-only scan? Heap scan: index finds row location, then fetches from table. Index-only scan: all needed columns are in the index, no table access needed (requires covering index and recent visibility map).
Q191: What is a HOT update (Heap Only Tuple)? When an UPDATE doesn't change any indexed columns, PostgreSQL can store the new version in the same heap page without updating the index, reducing write amplification.
Q192: What is CLUSTER? Physically reorders a PostgreSQL table according to an index. One-time operation — subsequent inserts don't maintain the order.
Q193: What is pg_stat_bgwriter? Statistics about PostgreSQL background writer — shared buffer checkpoints, buffers written. Helps diagnose I/O performance.
Q194: What is autovacuum? A background process that automatically runs VACUUM and ANALYZE to reclaim dead tuple space and update statistics. Triggered by dead tuple count thresholds.
Q195: What is the transaction ID wraparound problem? PostgreSQL uses 32-bit transaction IDs. After ~2 billion transactions, IDs wrap around and old data could be seen as future. Avoided by regular VACUUM keeping transaction ID age in check.
Q196: What is logical decoding? Allows reading changes from WAL as a stream of row-level events (used by Debezium, logical replication, change data capture systems).
Q197: What is the difference between TRUNCATE and DELETE in terms of transactions? Both are transactional in PostgreSQL. TRUNCATE cannot be rolled back in MySQL (DDL statement). In PostgreSQL, both are fully transactional.
Q198: What is EXPLAIN (FORMAT JSON)? Returns the execution plan as JSON instead of text — useful for programmatic analysis or visualization tools.
Q199: What is auto_explain? A PostgreSQL module that automatically logs execution plans for slow queries above a configurable threshold.
Q200: How do you tune work_mem for sort/hash operations? work_mem is the memory available per sort or hash operation. Higher values reduce disk spills. Be careful — each operation can use this much, and a complex query with many sorts can multiply the impact. Set per-session for intensive queries rather than globally.
SET work_mem = '256MB'; -- for this session only
SELECT ... ORDER BY ... LIMIT ...;Enjoyed this article?
Explore the Data Engineering learning path for more.
Found this helpful?
Leave a comment
Have a question, correction, or just found this helpful? Leave a note below.