Concurrency Bugs That Cost Real Money: Race Conditions, N+1, Throttling & More
Real-world software bugs that silently drain performance, corrupt data, and bring down production systems, with war stories, root causes, and battle-tested fixes.
Some bugs crash your app immediately. You fix them and move on.
The dangerous bugs are the ones that don't crash anything. They quietly corrupt data, silently overspend money, or slowly strangle performance while your monitoring shows green. You only find them three months later, when a bank calls.
This guide covers the class of bugs that have caused real financial loss, real outages, and real incidents at companies you use every day.
1. Race Conditions: The Bug That Emptied a Bank Account
The Real Story
A customer at Nordea Bank in Norway had approximately 20,000 NOK in their account. They needed cash for a bathroom renovation. Their husband withdrew 19,000 NOK at an ATM. That same day, a pending bill of 20,000 NOK was scheduled for automatic payment.
Both transactions went through.
The account went roughly 19,000 NOK negative. Nordea didn't notice for three months.
This is a race condition in production, at a real bank, affecting a real family.
What Actually Happened
Account balance: 20,000 NOK

ATM withdrawal thread:        Direct debit thread:
READ balance  → 20,000        READ balance  → 20,000
CHECK: 19k ≤ 20k ✓            CHECK: 20k ≤ 20k ✓
WRITE balance → 1,000         WRITE balance → 0   ← last write wins
                              (ATM write is lost)

Both systems read the balance before either wrote back. Both saw 20,000 NOK. Both approved. The last write overwrote the first, and 39,000 NOK left a 20,000 NOK account.
Why It Happens
Race conditions occur when two operations depend on shared state, and the outcome depends on timing rather than logic. In distributed systems this is almost guaranteed unless you design against it explicitly.
Nordea's specific problem: the ATM network and the direct debit network were separate systems with separate ledger reads. Neither knew the other was running.
The Fix
Option 1: Pessimistic locking (lock first, act second):
BEGIN;
-- Lock this row → all other transactions must wait
SELECT balance FROM accounts
WHERE account_id = 123
FOR UPDATE;
-- Now safe to check and deduct
UPDATE accounts
SET balance = balance - 19000
WHERE account_id = 123;
COMMIT;
-- Lock released → next transaction can now proceed

Option 2: Optimistic locking (detect conflict, retry):
-- Read with version
SELECT balance, version FROM accounts WHERE account_id = 123;
-- returns: { balance: 20000, version: 5 }
-- Write only if nobody else changed it
UPDATE accounts
SET balance = balance - 19000,
version = version + 1
WHERE account_id = 123
AND version = 5; -- fails if ATM or bill already changed the row
-- rows_affected = 0 → conflict detected → retry or reject

Option 3: Available balance (simplest, most impactful):
-- Two fields: what you have vs what you can spend
ALTER TABLE accounts ADD COLUMN available_balance DECIMAL(15,2);
ALTER TABLE accounts ADD COLUMN ledger_balance DECIMAL(15,2);
-- When bill becomes "pending":
UPDATE accounts
SET available_balance = available_balance - 20000
WHERE account_id = 123;
-- available is now 0 → ATM sees 0, withdrawal rejected cleanly

This is why your bank app shows "Available: £180 / Balance: £200". The difference is pending transactions already deducted from what you can actually spend.
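In application code, the optimistic-locking option usually gets a small retry wrapper around the versioned UPDATE. A minimal sketch under assumptions: `withOptimisticRetry` and the `AttemptResult` shape are hypothetical names, and the callback stands in for "re-read the row, then run the UPDATE ... AND version = $v":

```typescript
// One optimistic attempt: either the versioned UPDATE applied
// (rows_affected = 1), or someone else bumped the version first.
type AttemptResult = { applied: boolean };

async function withOptimisticRetry(
  attempt: () => Promise<AttemptResult>,
  maxRetries = 3
): Promise<boolean> {
  for (let i = 0; i <= maxRetries; i++) {
    const { applied } = await attempt(); // runs the UPDATE ... AND version = $v
    if (applied) return true;            // success on this attempt
    // Conflict: loop again so the caller re-reads the fresh version
  }
  return false; // persistent contention; surface this to the caller
}
```

Each attempt should re-read the row at the top, so the UPDATE always runs against the freshest version.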
Real-world adoption:
- Monzo / Revolut / Starling: event-sourced immutable ledgers. The balance is recalculated from an append-only transaction history, so this specific bug is architecturally impossible.
- PayPal: optimistic locking with version columns on their internal ledger.
- Stripe: idempotency keys on every API mutation, preventing duplicate charges on network retries.
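The idempotency-key idea in the Stripe bullet can be sketched in a few lines. This is an in-memory illustration of the core mechanism, not Stripe's implementation; a production version needs a durable store, TTLs, and locking for concurrent in-flight requests:

```typescript
// Remember the result of each key: the first call with a key runs the
// operation, any retry with the same key replays the stored result.
const results = new Map<string, unknown>();

async function idempotent<T>(key: string, op: () => Promise<T>): Promise<T> {
  if (results.has(key)) {
    return results.get(key) as T; // replay; the charge does not run twice
  }
  const result = await op();
  results.set(key, result); // remember for any future retry
  return result;
}
```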
2. The N+1 Problem: Slow by Design
The Bug
You load a list of 100 orders. For each order, you load the customer. That's 1 query for orders + 100 queries for customers = 101 database round trips.
At 2ms per query, 100 orders take ~200ms and 1,000 orders take ~2,000ms. Your "fast" API is silently scaling O(n).
// This looks innocent
async function getOrdersWithCustomers() {
const orders = await db.query("SELECT * FROM orders LIMIT 100");
// 1 query ✓
for (const order of orders) {
order.customer = await db.query(
"SELECT * FROM customers WHERE id = $1",
[order.customer_id]
);
// 100 more queries ✗ → N+1
}
return orders;
}

Why It's Hard to Spot
ORMs make this invisible. The query looks like a property access:
// TypeORM: looks like a simple property read
const orders = await orderRepository.find();
for (const order of orders) {
console.log(order.customer.name); // ← triggers a SELECT behind the scenes
}

In development with 5 orders it feels instant. In production with 50,000 orders, your database is on fire.
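One way to catch this class of bug before production is to count queries per request in tests and fail when the count grows with the data. A sketch, assuming you can wrap whatever query function your code uses (the names here are hypothetical):

```typescript
// Wrap a query function so a test can assert "this endpoint runs a
// constant number of queries", which fails loudly when N+1 creeps in.
function countQueries<A extends unknown[], R>(
  queryFn: (...args: A) => Promise<R>
): { query: (...args: A) => Promise<R>; count: () => number } {
  let n = 0;
  return {
    query: (...args: A) => {
      n += 1; // every call through the wrapper is counted
      return queryFn(...args);
    },
    count: () => n,
  };
}
```

Run the endpoint against 5 rows and against 500 rows; if the counted queries differ, you have an N+1.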
The Fix
Eager loading (fetch everything in one JOIN):
SELECT
o.id, o.total, o.created_at,
c.id AS customer_id, c.name, c.email
FROM orders o
JOIN customers c ON c.id = o.customer_id
WHERE o.created_at > NOW() - INTERVAL '30 days';

In TypeORM:
const orders = await orderRepository.find({
relations: ["customer"], // ← single JOIN query, not 100 SELECTs
});

Batch loading (load related records in one IN query):
const orders = await db.query("SELECT * FROM orders LIMIT 100");
// Collect all customer IDs, load in one shot
const customerIds = orders.map(o => o.customer_id);
const customers = await db.query(
"SELECT * FROM customers WHERE id = ANY($1)",
[customerIds]
);
// Map back
const customerMap = Object.fromEntries(customers.map(c => [c.id, c]));
for (const order of orders) {
order.customer = customerMap[order.customer_id];
}
// Total: 2 queries regardless of N

DataLoader pattern (used by GraphQL servers):
import DataLoader from "dataloader";
const customerLoader = new DataLoader(async (ids: readonly string[]) => {
const customers = await db.query(
"SELECT * FROM customers WHERE id = ANY($1)", [ids]
);
return ids.map(id => customers.find(c => c.id === id));
});
// Now each customer lookup is batched automatically
const customer = await customerLoader.load(order.customer_id);

Real-world impact:
- GitHub famously fixed N+1 queries in their pull request timeline; page load dropped from 4s to 400ms
- Shopify tracks N+1 as a first-class metric in their performance budget
- Facebook built DataLoader specifically because GraphQL field resolvers create N+1 by default
3. Throttling Failures: When You Trust the Other Side
The Bug
You call a third-party API in a tight loop. The API returns 429 Too Many Requests. Your code crashes, retries immediately, gets throttled again, crashes again. Or worse, it silently drops data.
// Fetching prices for 10,000 products
async function syncAllPrices(productIds: string[]) {
for (const id of productIds) {
const price = await pricingApi.getPrice(id); // ← no rate limiting
await db.update("products", { price }, { id });
}
}
// At 100 products: fine
// At 10,000 products: 429 errors from minute 1
// API bans your key after sustained abuse

The Real Pattern: Thundering Herd
A related bug: your cache expires at midnight. 50,000 users hit your site at 00:00:01. Every request misses the cache. Every request hits the database simultaneously. Database falls over.
00:00:00 → cache valid, 50k users → cache hits → database quiet
00:00:01 → cache expires
00:00:01 → 50,000 simultaneous requests → cache miss → 50,000 DB queries
00:00:01 → database CPU: 100%, connections exhausted, timeouts begin

The Fix
Exponential backoff with jitter:
async function callWithRetry<T>(
fn: () => Promise<T>,
maxRetries = 5
): Promise<T> {
for (let attempt = 0; attempt <= maxRetries; attempt++) {
try {
return await fn();
} catch (err: any) {
if (err.status !== 429 || attempt === maxRetries) throw err;
// Exponential backoff: 100ms, 200ms, 400ms, 800ms, 1600ms
// + random jitter to prevent synchronized retries
const base = Math.pow(2, attempt) * 100;
const jitter = Math.random() * 100;
await sleep(base + jitter);
}
}
throw new Error("Max retries exceeded");
}

Token bucket rate limiter:
class RateLimiter {
private tokens: number;
private lastRefill: number;
constructor(
private maxTokens: number,
private refillRate: number // tokens per second
) {
this.tokens = maxTokens;
this.lastRefill = Date.now();
}
async acquire(): Promise<void> {
// Refill tokens based on elapsed time
const now = Date.now();
const elapsed = (now - this.lastRefill) / 1000;
this.tokens = Math.min(
this.maxTokens,
this.tokens + elapsed * this.refillRate
);
this.lastRefill = now;
if (this.tokens >= 1) {
this.tokens--;
return;
}
// Wait until a token is available, then consume it
const waitMs = ((1 - this.tokens) / this.refillRate) * 1000;
await sleep(waitMs);
this.lastRefill = Date.now(); // the slept-for refill is consumed here; don't count it again
this.tokens = 0;
}
}
const limiter = new RateLimiter(10, 10); // 10 req/sec max
for (const id of productIds) {
await limiter.acquire();
const price = await pricingApi.getPrice(id);
await db.update("products", { price }, { id });
}

Cache stampede prevention (stale-while-revalidate):
async function getWithSWR(key: string, fetchFn: () => Promise<any>) {
const cached = await redis.get(key);
if (cached) {
const { value, expiresAt } = JSON.parse(cached);
// If expiring soon, refresh in the background and serve stale immediately
if (expiresAt - Date.now() < 30_000) {
fetchFn()
  .then(fresh => {
    redis.set(key, JSON.stringify({ value: fresh, expiresAt: Date.now() + 300_000 }));
  })
  .catch(() => { /* background refresh failed; serve stale until the next attempt */ });
}
return value; // ← always return immediately, never stampede
}
// True miss → one request fetches, others wait on a lock
const lock = await redis.set(`lock:${key}`, "1", "NX", "EX", 10);
if (!lock) {
await sleep(100);
return getWithSWR(key, fetchFn); // retry after lock holder fills cache
}
const value = await fetchFn();
await redis.set(key, JSON.stringify({ value, expiresAt: Date.now() + 300_000 }));
await redis.del(`lock:${key}`);
return value;
}

Real-world impact:
- Reddit experienced a cascading throttle failure in 2023 when their API rate limit changes caused third-party apps to hammer retries simultaneously
- Twitter/X API throttling during high-profile events caused downstream app failures that looked like the apps were broken, not the API
- AWS SDK builds exponential backoff with jitter into all clients by default after internal incidents
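One refinement to the backoff loop shown earlier: many APIs include a Retry-After header on 429 responses, and honoring it beats guessing. A sketch of the delay calculation only; the `retryAfterSeconds` field is an assumption about how your HTTP client surfaces the header, not a standard error shape:

```typescript
// Pick the next retry delay: use the server's Retry-After hint when the
// 429 carries one, otherwise fall back to exponential backoff with jitter.
function retryDelayMs(
  err: { retryAfterSeconds?: number },
  attempt: number
): number {
  if (err.retryAfterSeconds !== undefined) {
    return err.retryAfterSeconds * 1000; // the server knows its own window
  }
  const base = Math.pow(2, attempt) * 100; // 100ms, 200ms, 400ms, ...
  return base + Math.random() * 100;       // jitter de-synchronizes retriers
}
```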
4. Double-Write / Split-Brain: Two Sources of Truth
The Bug
You write to a database and a cache. One succeeds, one fails. Now they disagree forever.
async function updateUserProfile(userId: string, data: Partial<User>) {
await db.update("users", data, { id: userId }); // ✓ succeeds
await redis.set(`user:${userId}`, JSON.stringify(data)); // ✗ network timeout
// DB has new email, cache has old email
// Every cached read returns stale data
// Cache TTL is 24 hours
// User is confused for 24 hours
}

The same bug at a larger scale: you write to two databases for redundancy. A network partition means writes reach one but not the other. Now you have two databases that disagree on the truth. This is called split-brain.
The Fix
Cache-aside (read-through, write-invalidate):
async function updateUserProfile(userId: string, data: Partial<User>) {
await db.update("users", data, { id: userId });
// Don't update the cache: DELETE it
// Next read will fetch from DB and repopulate
await redis.del(`user:${userId}`);
// If redis.del fails, stale cache eventually expires
// DB is always authoritative
}
async function getUserProfile(userId: string) {
const cached = await redis.get(`user:${userId}`);
if (cached) return JSON.parse(cached);
// Cache miss → fetch from DB, store with TTL
const user = await db.findOne("users", { id: userId });
await redis.set(`user:${userId}`, JSON.stringify(user), "EX", 3600);
return user;
}

Outbox pattern (guaranteed dual writes):
// Both writes in a single DB transaction
await db.transaction(async (trx) => {
await trx.update("users", data, { id: userId });
// Write the cache update as a pending event in the DB
await trx.insert("outbox", {
type: "USER_UPDATED",
payload: JSON.stringify({ userId, data }),
processed: false,
});
// Transaction commits both or neither
});
// Separate background process reads outbox and updates cache/other systems
// If it fails, it retries → the outbox row stays until processed

Real-world impact:
- LinkedIn had a split-brain incident in 2011 where two data centres diverged. Users saw different profile data depending on which DC served their request
- Slack uses the outbox pattern for all cross-service writes to guarantee eventual consistency without distributed transactions
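The background process mentioned in the outbox snippet can be sketched as a poll loop. The three callbacks are hypothetical stand-ins for the actual SQL and cache/bus calls; the key property is that a row is marked processed only after its publish succeeds:

```typescript
type OutboxRow = { id: number; type: string; payload: string };

// One relay pass: publish each pending row, then mark it processed.
// A crash or publish failure leaves the row pending, so the next
// pass retries it.
async function drainOutbox(
  fetchPending: () => Promise<OutboxRow[]>,     // SELECT ... WHERE processed = false
  publish: (row: OutboxRow) => Promise<void>,   // cache update / bus event
  markProcessed: (id: number) => Promise<void>  // UPDATE outbox SET processed = true
): Promise<number> {
  let delivered = 0;
  for (const row of await fetchPending()) {
    try {
      await publish(row);
      await markProcessed(row.id); // only after publish succeeds
      delivered += 1;
    } catch {
      // Leave the row pending; the next poll retries it.
    }
  }
  return delivered;
}
```

This gives at-least-once delivery, so downstream consumers should be idempotent.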
5. Memory Leaks: The Slow Suffocation
The Bug
Something holds a reference to memory that should have been freed. The process grows slowly, eventually consuming all available RAM, then crashes or is OOM-killed.
// Classic Node.js leak: event listener never removed
class OrderProcessor extends EventEmitter {
processOrder(orderId: string) {
// This listener is added every time processOrder is called
// It's never removed
this.on("orderComplete", (id) => {
if (id === orderId) {
// handle completion
}
});
}
}
const processor = new OrderProcessor();
// Called 100,000 times → 100,000 listeners accumulate → heap grows forever

// Another classic: closures capturing large objects
const cache = new Map();
function processImage(imageId: string, imageData: Buffer) {
const processed = heavyTransform(imageData); // 50MB
cache.set(imageId, {
result: processed,
cleanup: () => { /* captures imageData → 50MB never freed */ }
});
}
// After 100 images: 5GB in cache, never evicted

The Fix
// Always remove listeners
class OrderProcessor extends EventEmitter {
processOrder(orderId: string) {
const handler = (id: string) => {
if (id === orderId) {
this.off("orderComplete", handler); // ← remove after use
}
};
this.on("orderComplete", handler);
}
}
// Bounded cache with LRU eviction
import { LRUCache } from "lru-cache";
const cache = new LRUCache<string, Buffer>({
max: 100, // max 100 items
maxSize: 500_000_000, // max 500MB total
sizeCalculation: (value) => value.length,
ttl: 1000 * 60 * 10, // 10 minutes
});

Real-world impact:
- Firefox had persistent memory leak bugs in early versions that became memes: browsers that consumed gigabytes of RAM after a few hours
- Node.js production services commonly leak through uncleaned intervals, global Maps, and event listener accumulation
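For the listener leak specifically, Node's EventEmitter has a built-in one-shot registration. If you scope the event name per order (a design choice, not what the earlier snippet does), `once` removes the listener automatically after it fires:

```typescript
import { EventEmitter } from "node:events";

// Scoping the event name per order lets `once` do the cleanup:
// the listener removes itself after its single matching event.
class OrderProcessor extends EventEmitter {
  processOrder(orderId: string, onComplete: () => void) {
    this.once(`orderComplete:${orderId}`, onComplete); // auto-removed after firing
  }
  complete(orderId: string) {
    this.emit(`orderComplete:${orderId}`);
  }
}
```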
6. Time-of-Check to Time-of-Use (TOCTOU): The Check That Lies
The Bug
You check a condition. By the time you act on it, the condition has changed.
// File system TOCTOU: a classic security vulnerability
import fs from "node:fs/promises";

async function saveUpload(filename: string, content: Buffer) {
// CHECK: does file exist?
const exists = await fs.access(filename).then(() => true).catch(() => false);
if (!exists) {
// USE: create file
// ← an attacker can create a symlink HERE, between check and write
await fs.writeFile(filename, content);
// Now writing to wherever the symlink points ā /etc/passwd?
}
}

// Database TOCTOU: the exact Nordea pattern
async function withdrawMoney(accountId: string, amount: number) {
// CHECK
const { balance } = await db.query(
"SELECT balance FROM accounts WHERE id = $1", [accountId]
);
if (balance >= amount) {
// ← another withdrawal can happen HERE
// USE
await db.query(
"UPDATE accounts SET balance = balance - $1 WHERE id = $2",
[amount, accountId]
);
}
}

The Fix
Collapse check and act into a single atomic operation:
// Atomic check-and-update: no window for a race
const result = await db.query(`
UPDATE accounts
SET balance = balance - $1
WHERE id = $2 AND balance >= $1
RETURNING balance
`, [amount, accountId]);
if (result.rows.length === 0) {
throw new Error("Insufficient funds or account not found");
}
// If it returned a row, the deduction succeeded and balance was sufficient
// These two facts are atomically guaranteed

7. Deadlocks: Two Threads Waiting for Each Other Forever
The Bug
Thread A holds lock on Resource 1, wants Resource 2. Thread B holds lock on Resource 2, wants Resource 1. Both wait forever.
-- Transaction A                      -- Transaction B
BEGIN;                                BEGIN;
UPDATE accounts                       UPDATE accounts
SET balance = balance - 100           SET balance = balance - 200
WHERE id = 1;  -- locks row 1         WHERE id = 2;  -- locks row 2
UPDATE accounts                       UPDATE accounts
SET balance = balance + 100           SET balance = balance + 200
WHERE id = 2;  -- BLOCKS              WHERE id = 1;  -- BLOCKS
-- waiting for B to release row 2     -- waiting for A to release row 1
-- DEADLOCK

The Fix
Always acquire locks in the same order:
async function transferMoney(fromId: string, toId: string, amount: number) {
// Always lock lower ID first ā consistent ordering across all callers
const [firstId, secondId] = [fromId, toId].sort();
await db.transaction(async (trx) => {
// Both transactions always lock accounts in ascending ID order
// Deadlock is now impossible ā no circular wait
await trx.query("SELECT 1 FROM accounts WHERE id = $1 FOR UPDATE", [firstId]);
await trx.query("SELECT 1 FROM accounts WHERE id = $1 FOR UPDATE", [secondId]);
await trx.query("UPDATE accounts SET balance = balance - $1 WHERE id = $2", [amount, fromId]);
await trx.query("UPDATE accounts SET balance = balance + $1 WHERE id = $2", [amount, toId]);
});
}

Real-world impact:
- MySQL deadlock logs are one of the most common things DBAs investigate in e-commerce systems
- PostgreSQL detects deadlocks automatically and kills one transaction with "ERROR: deadlock detected", but the business operation is lost
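Since PostgreSQL resolves a deadlock by killing one transaction, the losing side can simply re-run. A retry wrapper sketch, assuming the driver exposes the SQLSTATE on `err.code` the way node-postgres does:

```typescript
// PostgreSQL reports the killed transaction with SQLSTATE 40P01 (deadlock
// detected); 40001 (serialization failure) is equally safe to re-run.
const RETRYABLE = new Set(["40001", "40P01"]);

async function retryOnDeadlock<T>(
  runTx: () => Promise<T>, // the WHOLE transaction, so a retry starts clean
  maxRetries = 3
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await runTx();
    } catch (err: any) {
      if (!RETRYABLE.has(err?.code) || attempt >= maxRetries) throw err;
      // Brief, growing pause so the surviving transaction can finish
      await new Promise((r) => setTimeout(r, 50 * (attempt + 1)));
    }
  }
}
```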
Summary: The Pattern They All Share
Every bug in this article follows the same root cause:
You assumed a condition would remain true between when you checked it and when you acted on it.
| Bug | Assumption That Fails |
|---|---|
| Race condition | "Balance hasn't changed since I read it" |
| N+1 | "Loading each record separately is fine at scale" |
| Throttling | "The API will keep accepting my requests" |
| Double-write | "Both systems will accept my write" |
| Memory leak | "This object will be freed when I'm done with it" |
| TOCTOU | "The state hasn't changed between check and act" |
| Deadlock | "The other thread will finish before I need its lock" |
The fix in every case: make the check and the act a single atomic operation, or design so the check is unnecessary.
Checklist Before You Ship
☐ Any place two operations share state without a transaction? → Race condition
☐ Loading a list then querying each item individually? → N+1
☐ Calling external APIs without rate limiting or retry backoff? → Throttling
☐ Writing to DB + cache/queue without rollback on partial failure? → Double-write
☐ Event listeners or timers without cleanup on component teardown? → Memory leak
☐ Checking a condition then acting on it in separate statements? → TOCTOU
☐ Multiple locks acquired in different orders across code paths? → Deadlock

These bugs don't announce themselves; they're the ones that call you at 3am, or show up three months later in a bank statement. Write this checklist into your PR template.
Enjoyed this article?
Explore the Backend Systems learning path for more.