Concurrency Bugs That Cost Real Money: Race Conditions, N+1, Throttling & More
Real-world software bugs that silently drain performance, corrupt data, and bring down production systems, with war stories, root causes, and battle-tested fixes.
Some bugs crash your app immediately. You fix them and move on.
The dangerous bugs are the ones that don't crash anything. They quietly corrupt data, silently overspend money, or slowly strangle performance while your monitoring shows green. You only find them three months later, when a bank calls.
This guide covers the class of bugs that have caused real financial loss, real outages, and real incidents at companies you use every day.
1. Race Conditions: The Bug That Emptied a Bank Account
The Real Story
A customer at Nordea Bank in Norway had approximately 20,000 NOK in their account. They needed cash for a bathroom renovation. Their husband withdrew 19,000 NOK at an ATM. That same day, a pending bill of 20,000 NOK was scheduled for automatic payment.
Both transactions went through.
The account went roughly 19,000 NOK negative. Nordea didn't notice for three months.
This is a race condition in production, at a real bank, affecting a real family.
What Actually Happened
Account balance: 20,000 NOK

ATM withdrawal thread:        Direct debit thread:
READ balance  → 20,000        READ balance  → 20,000
CHECK: 19k ≤ 20k ✓            CHECK: 20k ≤ 20k ✓
WRITE balance → 1,000         WRITE balance → 0   ← last write wins
                              (ATM write is lost)

Both systems read the balance before either wrote back. Both saw 20,000 NOK. Both approved. The last write overwrote the first, and 39,000 NOK left a 20,000 NOK account.
Why It Happens
Race conditions occur when two operations depend on shared state, and the outcome depends on timing rather than logic. In distributed systems this is almost guaranteed unless you design against it explicitly.
Nordea's specific problem: the ATM network and the direct debit network were separate systems with separate ledger reads. Neither knew the other was running.
The Fix
Option 1: Pessimistic locking (lock first, act second):
BEGIN;
-- Lock this row → all other transactions must wait
SELECT balance FROM accounts
WHERE account_id = 123
FOR UPDATE;
-- Now safe to check and deduct
UPDATE accounts
SET balance = balance - 19000
WHERE account_id = 123;
COMMIT;
-- Lock released → next transaction can now proceed

Option 2: Optimistic locking (detect conflict, retry):
-- Read with version
SELECT balance, version FROM accounts WHERE account_id = 123;
-- returns: { balance: 20000, version: 5 }
-- Write only if nobody else changed it
UPDATE accounts
SET balance = balance - 19000,
version = version + 1
WHERE account_id = 123
AND version = 5; -- fails if ATM or bill already changed the row
-- rows_affected = 0 → conflict detected → retry or reject

Option 3: Available balance (simplest, most impactful):
-- Two fields: what you have vs what you can spend
ALTER TABLE accounts ADD COLUMN available_balance DECIMAL(15,2);
ALTER TABLE accounts ADD COLUMN ledger_balance DECIMAL(15,2);
-- When bill becomes "pending":
UPDATE accounts
SET available_balance = available_balance - 20000
WHERE account_id = 123;
-- available is now 0 → ATM sees 0, withdrawal rejected cleanly

This is why your bank app shows "Available: £180 / Balance: £200". The difference is pending transactions already deducted from what you can actually spend.
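In application code, the optimistic-locking option usually gets a small retry wrapper around the versioned UPDATE. A minimal sketch under assumptions: `withOptimisticRetry` and the `AttemptResult` shape are hypothetical names, and the callback stands in for "re-read the row, then run the UPDATE ... AND version = $v":

```typescript
// One optimistic attempt: either the versioned UPDATE applied
// (rows_affected = 1), or someone else bumped the version first.
type AttemptResult = { applied: boolean };

async function withOptimisticRetry(
  attempt: () => Promise<AttemptResult>,
  maxRetries = 3
): Promise<boolean> {
  for (let i = 0; i <= maxRetries; i++) {
    const { applied } = await attempt(); // runs the UPDATE ... AND version = $v
    if (applied) return true;            // success on this attempt
    // Conflict: loop again so the caller re-reads the fresh version
  }
  return false; // persistent contention; surface this to the caller
}
```

Each attempt should re-read the row at the top, so the UPDATE always runs against the freshest version.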
Real-world adoption:
- Monzo / Revolut / Starling: event-sourced immutable ledgers. The balance is recalculated from an append-only transaction history, so this specific bug is architecturally impossible.
- PayPal: optimistic locking with version columns on their internal ledger.
- Stripe: idempotency keys on every API mutation, preventing duplicate charges on network retries.
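The idempotency-key idea in the Stripe bullet can be sketched in a few lines. This is an in-memory illustration of the core mechanism, not Stripe's implementation; a production version needs a durable store, TTLs, and locking for concurrent in-flight requests:

```typescript
// Remember the result of each key: the first call with a key runs the
// operation, any retry with the same key replays the stored result.
const results = new Map<string, unknown>();

async function idempotent<T>(key: string, op: () => Promise<T>): Promise<T> {
  if (results.has(key)) {
    return results.get(key) as T; // replay; the charge does not run twice
  }
  const result = await op();
  results.set(key, result); // remember for any future retry
  return result;
}
```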
2. The N+1 Problem: Slow by Design
The Bug
You load a list of 100 orders. For each order, you load the customer. That's 1 query for orders + 100 queries for customers = 101 database round trips.
At 2ms per query, 100 orders take ~200ms and 1,000 orders take ~2,000ms. Your "fast" API is silently scaling O(n).
// This looks innocent
async function getOrdersWithCustomers() {
const orders = await db.query("SELECT * FROM orders LIMIT 100");
// 1 query ✓
for (const order of orders) {
order.customer = await db.query(
"SELECT * FROM customers WHERE id = $1",
[order.customer_id]
);
// 100 more queries ✗ → N+1
}
return orders;
}

Why It's Hard to Spot
ORMs make this invisible. The query looks like a property access:
// TypeORM: looks like a simple property read
const orders = await orderRepository.find();
for (const order of orders) {
console.log(order.customer.name); // ← triggers a SELECT behind the scenes
}

In development with 5 orders it feels instant. In production with 50,000 orders, your database is on fire.
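One way to catch this class of bug before production is to count queries per request in tests and fail when the count grows with the data. A sketch, assuming you can wrap whatever query function your code uses (the names here are hypothetical):

```typescript
// Wrap a query function so a test can assert "this endpoint runs a
// constant number of queries", which fails loudly when N+1 creeps in.
function countQueries<A extends unknown[], R>(
  queryFn: (...args: A) => Promise<R>
): { query: (...args: A) => Promise<R>; count: () => number } {
  let n = 0;
  return {
    query: (...args: A) => {
      n += 1; // every call through the wrapper is counted
      return queryFn(...args);
    },
    count: () => n,
  };
}
```

Run the endpoint against 5 rows and against 500 rows; if the counted queries differ, you have an N+1.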
The Fix
Eager loading (fetch everything in one JOIN):
SELECT
o.id, o.total, o.created_at,
c.id AS customer_id, c.name, c.email
FROM orders o
JOIN customers c ON c.id = o.customer_id
WHERE o.created_at > NOW() - INTERVAL '30 days';

In TypeORM:
const orders = await orderRepository.find({
relations: ["customer"], // ← single JOIN query, not 100 SELECTs
});

Batch loading (load related records in one IN query):
const orders = await db.query("SELECT * FROM orders LIMIT 100");
// Collect all customer IDs, load in one shot
const customerIds = orders.map(o => o.customer_id);
const customers = await db.query(
"SELECT * FROM customers WHERE id = ANY($1)",
[customerIds]
);
// Map back
const customerMap = Object.fromEntries(customers.map(c => [c.id, c]));
for (const order of orders) {
order.customer = customerMap[order.customer_id];
}
// Total: 2 queries regardless of N

DataLoader pattern (used by GraphQL servers):
import DataLoader from "dataloader";
const customerLoader = new DataLoader(async (ids: readonly string[]) => {
const customers = await db.query(
"SELECT * FROM customers WHERE id = ANY($1)", [ids]
);
return ids.map(id => customers.find(c => c.id === id));
});
// Now each customer lookup is batched automatically
const customer = await customerLoader.load(order.customer_id);

Real-world impact:
- GitHub famously fixed N+1 queries in their pull request timeline; page load dropped from 4s to 400ms
- Shopify tracks N+1 as a first-class metric in their performance budget
- Facebook built DataLoader specifically because GraphQL field resolvers create N+1 by default
3. Throttling Failures: When You Trust the Other Side
The Bug
You call a third-party API in a tight loop. The API returns 429 Too Many Requests. Your code crashes, retries immediately, gets throttled again, crashes again. Or worse, it silently drops data.
// Fetching prices for 10,000 products
async function syncAllPrices(productIds: string[]) {
for (const id of productIds) {
const price = await pricingApi.getPrice(id); // ← no rate limiting
await db.update("products", { price }, { id });
}
}
// At 100 products: fine
// At 10,000 products: 429 errors from minute 1
// API bans your key after sustained abuse

The Real Pattern: Thundering Herd
A related bug: your cache expires at midnight. 50,000 users hit your site at 00:00:01. Every request misses the cache. Every request hits the database simultaneously. Database falls over.
00:00:00 → cache valid, 50k users → cache hits → database quiet
00:00:01 → cache expires
00:00:01 → 50,000 simultaneous requests → cache miss → 50,000 DB queries
00:00:01 → database CPU: 100%, connections exhausted, timeouts begin

The Fix
Exponential backoff with jitter:
async function callWithRetry<T>(
fn: () => Promise<T>,
maxRetries = 5
): Promise<T> {
for (let attempt = 0; attempt <= maxRetries; attempt++) {
try {
return await fn();
} catch (err: any) {
if (err.status !== 429 || attempt === maxRetries) throw err;
// Exponential backoff: 100ms, 200ms, 400ms, 800ms, 1600ms
// + random jitter to prevent synchronized retries
const base = Math.pow(2, attempt) * 100;
const jitter = Math.random() * 100;
await sleep(base + jitter);
}
}
throw new Error("Max retries exceeded");
}

Token bucket rate limiter:
class RateLimiter {
private tokens: number;
private lastRefill: number;
constructor(
private maxTokens: number,
private refillRate: number // tokens per second
) {
this.tokens = maxTokens;
this.lastRefill = Date.now();
}
async acquire(): Promise<void> {
// Refill tokens based on elapsed time
const now = Date.now();
const elapsed = (now - this.lastRefill) / 1000;
this.tokens = Math.min(
this.maxTokens,
this.tokens + elapsed * this.refillRate
);
this.lastRefill = now;
if (this.tokens >= 1) {
this.tokens--;
return;
}
// Wait until a token is available, then consume it
const waitMs = ((1 - this.tokens) / this.refillRate) * 1000;
await sleep(waitMs);
this.lastRefill = Date.now(); // the slept-for refill is consumed here; don't count it again
this.tokens = 0;
}
}
const limiter = new RateLimiter(10, 10); // 10 req/sec max
for (const id of productIds) {
await limiter.acquire();
const price = await pricingApi.getPrice(id);
await db.update("products", { price }, { id });
}

Cache stampede prevention (stale-while-revalidate):
async function getWithSWR(key: string, fetchFn: () => Promise<any>) {
const cached = await redis.get(key);
if (cached) {
const { value, expiresAt } = JSON.parse(cached);
// If expiring soon, refresh in the background and serve stale immediately
if (expiresAt - Date.now() < 30_000) {
fetchFn()
  .then(fresh => {
    redis.set(key, JSON.stringify({ value: fresh, expiresAt: Date.now() + 300_000 }));
  })
  .catch(() => { /* background refresh failed; serve stale until the next attempt */ });
}
return value; // ← always return immediately, never stampede
}
// True miss → one request fetches, others wait on a lock
const lock = await redis.set(`lock:${key}`, "1", "NX", "EX", 10);
if (!lock) {
await sleep(100);
return getWithSWR(key, fetchFn); // retry after lock holder fills cache
}
const value = await fetchFn();
await redis.set(key, JSON.stringify({ value, expiresAt: Date.now() + 300_000 }));
await redis.del(`lock:${key}`);
return value;
}

Real-world impact:
- Reddit experienced a cascading throttle failure in 2023 when their API rate limit changes caused third-party apps to hammer retries simultaneously
- Twitter/X API throttling during high-profile events caused downstream app failures that looked like the apps were broken, not the API
- AWS SDK builds exponential backoff with jitter into all clients by default after internal incidents
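One refinement to the backoff loop shown earlier: many APIs include a Retry-After header on 429 responses, and honoring it beats guessing. A sketch of the delay calculation only; the `retryAfterSeconds` field is an assumption about how your HTTP client surfaces the header, not a standard error shape:

```typescript
// Pick the next retry delay: use the server's Retry-After hint when the
// 429 carries one, otherwise fall back to exponential backoff with jitter.
function retryDelayMs(
  err: { retryAfterSeconds?: number },
  attempt: number
): number {
  if (err.retryAfterSeconds !== undefined) {
    return err.retryAfterSeconds * 1000; // the server knows its own window
  }
  const base = Math.pow(2, attempt) * 100; // 100ms, 200ms, 400ms, ...
  return base + Math.random() * 100;       // jitter de-synchronizes retriers
}
```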
4. Double-Write / Split-Brain: Two Sources of Truth
The Bug
You write to a database and a cache. One succeeds, one fails. Now they disagree forever.
async function updateUserProfile(userId: string, data: Partial<User>) {
await db.update("users", data, { id: userId }); // ✓ succeeds
await redis.set(`user:${userId}`, JSON.stringify(data)); // ✗ network timeout
// DB has new email, cache has old email
// Every cached read returns stale data
// Cache TTL is 24 hours
// User is confused for 24 hours
}

The same bug at a larger scale: you write to two databases for redundancy. A network partition means writes reach one but not the other. Now you have two databases that disagree on the truth. This is called split-brain.
The Fix
Cache-aside (read-through, write-invalidate):
async function updateUserProfile(userId: string, data: Partial<User>) {
await db.update("users", data, { id: userId });
// Don't update the cache: DELETE it
// Next read will fetch from DB and repopulate
await redis.del(`user:${userId}`);
// If redis.del fails, stale cache eventually expires
// DB is always authoritative
}
async function getUserProfile(userId: string) {
const cached = await redis.get(`user:${userId}`);
if (cached) return JSON.parse(cached);
// Cache miss → fetch from DB, store with TTL
const user = await db.findOne("users", { id: userId });
await redis.set(`user:${userId}`, JSON.stringify(user), "EX", 3600);
return user;
}

Outbox pattern (guaranteed dual writes):
// Both writes in a single DB transaction
await db.transaction(async (trx) => {
await trx.update("users", data, { id: userId });
// Write the cache update as a pending event in the DB
await trx.insert("outbox", {
type: "USER_UPDATED",
payload: JSON.stringify({ userId, data }),
processed: false,
});
// Transaction commits both or neither
});
// Separate background process reads outbox and updates cache/other systems
// If it fails, it retries → the outbox row stays until processed

Real-world impact:
- LinkedIn had a split-brain incident in 2011 where two data centres diverged. Users saw different profile data depending on which DC served their request
- Slack uses the outbox pattern for all cross-service writes to guarantee eventual consistency without distributed transactions
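The background process mentioned in the outbox snippet can be sketched as a poll loop. The three callbacks are hypothetical stand-ins for the actual SQL and cache/bus calls; the key property is that a row is marked processed only after its publish succeeds:

```typescript
type OutboxRow = { id: number; type: string; payload: string };

// One relay pass: publish each pending row, then mark it processed.
// A crash or publish failure leaves the row pending, so the next
// pass retries it.
async function drainOutbox(
  fetchPending: () => Promise<OutboxRow[]>,     // SELECT ... WHERE processed = false
  publish: (row: OutboxRow) => Promise<void>,   // cache update / bus event
  markProcessed: (id: number) => Promise<void>  // UPDATE outbox SET processed = true
): Promise<number> {
  let delivered = 0;
  for (const row of await fetchPending()) {
    try {
      await publish(row);
      await markProcessed(row.id); // only after publish succeeds
      delivered += 1;
    } catch {
      // Leave the row pending; the next poll retries it.
    }
  }
  return delivered;
}
```

This gives at-least-once delivery, so downstream consumers should be idempotent.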
5. Memory Leaks: The Slow Suffocation
The Bug
Something holds a reference to memory that should have been freed. The process grows slowly, eventually consuming all available RAM, then crashes or is OOM-killed.
// Classic Node.js leak: event listener never removed
class OrderProcessor extends EventEmitter {
processOrder(orderId: string) {
// This listener is added every time processOrder is called
// It's never removed
this.on("orderComplete", (id) => {
if (id === orderId) {
// handle completion
}
});
}
}
const processor = new OrderProcessor();
// Called 100,000 times → 100,000 listeners accumulate → heap grows forever

// Another classic: closures capturing large objects
const cache = new Map();
function processImage(imageId: string, imageData: Buffer) {
const processed = heavyTransform(imageData); // 50MB
cache.set(imageId, {
result: processed,
cleanup: () => { /* captures imageData → 50MB never freed */ }
});
}
// After 100 images: 5GB in cache, never evicted

The Fix
// Always remove listeners
class OrderProcessor extends EventEmitter {
processOrder(orderId: string) {
const handler = (id: string) => {
if (id === orderId) {
this.off("orderComplete", handler); // ← remove after use
}
};
this.on("orderComplete", handler);
}
}
// Bounded cache with LRU eviction
import { LRUCache } from "lru-cache";
const cache = new LRUCache<string, Buffer>({
max: 100, // max 100 items
maxSize: 500_000_000, // max 500MB total
sizeCalculation: (value) => value.length,
ttl: 1000 * 60 * 10, // 10 minutes
});

Real-world impact:
- Firefox had persistent memory leak bugs in early versions that became memes: browsers that consumed gigabytes of RAM after a few hours
- Node.js production services commonly leak through uncleaned intervals, global Maps, and event listener accumulation
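For the listener leak specifically, Node's EventEmitter has a built-in one-shot registration. If you scope the event name per order (a design choice, not what the earlier snippet does), `once` removes the listener automatically after it fires:

```typescript
import { EventEmitter } from "node:events";

// Scoping the event name per order lets `once` do the cleanup:
// the listener removes itself after its single matching event.
class OrderProcessor extends EventEmitter {
  processOrder(orderId: string, onComplete: () => void) {
    this.once(`orderComplete:${orderId}`, onComplete); // auto-removed after firing
  }
  complete(orderId: string) {
    this.emit(`orderComplete:${orderId}`);
  }
}
```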
6. Time-of-Check to Time-of-Use (TOCTOU): The Check That Lies
The Bug
You check a condition. By the time you act on it, the condition has changed.
// File system TOCTOU: a classic security vulnerability
import fs from "node:fs/promises";

async function saveUpload(filename: string, content: Buffer) {
// CHECK: does file exist?
const exists = await fs.access(filename).then(() => true).catch(() => false);
if (!exists) {
// USE: create file
// ← an attacker can create a symlink HERE, between check and write
await fs.writeFile(filename, content);
// Now writing to wherever the symlink points ā /etc/passwd?
}
}

// Database TOCTOU: the exact Nordea pattern
async function withdrawMoney(accountId: string, amount: number) {
// CHECK
const { balance } = await db.query(
"SELECT balance FROM accounts WHERE id = $1", [accountId]
);
if (balance >= amount) {
// ← another withdrawal can happen HERE
// USE
await db.query(
"UPDATE accounts SET balance = balance - $1 WHERE id = $2",
[amount, accountId]
);
}
}

The Fix
Collapse check and act into a single atomic operation:
// Atomic check-and-update: no window for a race
const result = await db.query(`
UPDATE accounts
SET balance = balance - $1
WHERE id = $2 AND balance >= $1
RETURNING balance
`, [amount, accountId]);
if (result.rows.length === 0) {
throw new Error("Insufficient funds or account not found");
}
// If it returned a row, the deduction succeeded and balance was sufficient
// These two facts are atomically guaranteed

7. Deadlocks: Two Threads Waiting for Each Other Forever
The Bug
Thread A holds lock on Resource 1, wants Resource 2. Thread B holds lock on Resource 2, wants Resource 1. Both wait forever.
-- Transaction A                      -- Transaction B
BEGIN;                                BEGIN;
UPDATE accounts                       UPDATE accounts
SET balance = balance - 100           SET balance = balance - 200
WHERE id = 1;  -- locks row 1         WHERE id = 2;  -- locks row 2
UPDATE accounts                       UPDATE accounts
SET balance = balance + 100           SET balance = balance + 200
WHERE id = 2;  -- BLOCKS              WHERE id = 1;  -- BLOCKS
-- waiting for B to release row 2     -- waiting for A to release row 1
-- DEADLOCK

The Fix
Always acquire locks in the same order:
async function transferMoney(fromId: string, toId: string, amount: number) {
// Always lock lower ID first ā consistent ordering across all callers
const [firstId, secondId] = [fromId, toId].sort();
await db.transaction(async (trx) => {
// Both transactions always lock accounts in ascending ID order
// Deadlock is now impossible ā no circular wait
await trx.query("SELECT 1 FROM accounts WHERE id = $1 FOR UPDATE", [firstId]);
await trx.query("SELECT 1 FROM accounts WHERE id = $1 FOR UPDATE", [secondId]);
await trx.query("UPDATE accounts SET balance = balance - $1 WHERE id = $2", [amount, fromId]);
await trx.query("UPDATE accounts SET balance = balance + $1 WHERE id = $2", [amount, toId]);
});
}

Real-world impact:
- MySQL deadlock logs are one of the most common things DBAs investigate in e-commerce systems
- PostgreSQL detects deadlocks automatically and kills one transaction with "ERROR: deadlock detected", but the business operation is lost
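Since PostgreSQL resolves a deadlock by killing one transaction, the losing side can simply re-run. A retry wrapper sketch, assuming the driver exposes the SQLSTATE on `err.code` the way node-postgres does:

```typescript
// PostgreSQL reports the killed transaction with SQLSTATE 40P01 (deadlock
// detected); 40001 (serialization failure) is equally safe to re-run.
const RETRYABLE = new Set(["40001", "40P01"]);

async function retryOnDeadlock<T>(
  runTx: () => Promise<T>, // the WHOLE transaction, so a retry starts clean
  maxRetries = 3
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await runTx();
    } catch (err: any) {
      if (!RETRYABLE.has(err?.code) || attempt >= maxRetries) throw err;
      // Brief, growing pause so the surviving transaction can finish
      await new Promise((r) => setTimeout(r, 50 * (attempt + 1)));
    }
  }
}
```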
Summary: The Pattern They All Share
Every bug in this article follows the same root cause:
You assumed a condition would remain true between when you checked it and when you acted on it.
| Bug | Assumption That Fails |
|---|---|
| Race condition | "Balance hasn't changed since I read it" |
| N+1 | "Loading each record separately is fine at scale" |
| Throttling | "The API will keep accepting my requests" |
| Double-write | "Both systems will accept my write" |
| Memory leak | "This object will be freed when I'm done with it" |
| TOCTOU | "The state hasn't changed between check and act" |
| Deadlock | "The other thread will finish before I need its lock" |
The fix in every case: make the check and the act a single atomic operation, or design so the check is unnecessary.
Checklist Before You Ship
☐ Any place two operations share state without a transaction? → Race condition
☐ Loading a list then querying each item individually? → N+1
☐ Calling external APIs without rate limiting or retry backoff? → Throttling
☐ Writing to DB + cache/queue without rollback on partial failure? → Double-write
☐ Event listeners or timers without cleanup on component teardown? → Memory leak
☐ Checking a condition then acting on it in separate statements? → TOCTOU
☐ Multiple locks acquired in different orders across code paths? → Deadlock

These bugs don't announce themselves; they're the ones that call you at 3am, or show up three months later in a bank statement. Write this checklist into your PR template.
Enjoyed this article?
Explore the Backend Systems learning path for more.