Learnixo
Back to blog
Backend Systemsintermediate

Case Study: The N+1 Query That Took Down Production

A real production outage caused by an EF Core N+1 query: timeline, root cause analysis, the fix, and the monitoring that now catches it before it happens again.

Asma Hafeez KhanMay 25, 20266 min read
.NETC#EF CoreN+1performancepostmortemproduction
Share:𝕏

Case Study: The N+1 Query That Took Down Production

System: B2B order management API — 50 customers, ~5,000 orders/day
Stack: ASP.NET Core 8, EF Core 9, PostgreSQL 16, Azure
Incident duration: 47 minutes
Root cause: N+1 query triggered by a new report endpoint
Impact: All API endpoints degraded; database CPU at 100%


Timeline

08:42  New "monthly report" endpoint deployed to production
       (included in a larger release, not flagged as high-risk)

09:15  First alert: API P95 latency > 5 seconds (threshold: 500ms)
       On-call engineer paged

09:18  Second alert: database CPU > 90%
       Database auto-scaling triggered — max instance size already reached

09:23  On-call checks dashboards: all endpoints degraded, not just the report
       Hypothesis 1: database overloaded — find the hot query

09:31  pg_stat_activity shows 847 active connections
       Maximum pool size: 100 — connection pool exhausted
       All API endpoints queuing for a database connection

09:41  Suspect the new report endpoint — it was the only change
       Roll back the deployment

09:47  Database CPU drops from 98% to 12%
       API latency returns to 40ms P95
       Incident closed

09:48  Post-mortem started

The Code That Caused It

C#
// The new monthly report endpoint — looked innocent in code review
app.MapGet("/api/reports/monthly", async (
    int month,
    int year,
    OrderFlowDbContext db) =>
{
    var orders = await db.Orders
        .Where(o => o.CreatedAt.Month == month && o.CreatedAt.Year == year)
        .ToListAsync();   // fetches all orders for the month

    // ← N+1 starts here
    var report = orders.Select(o => new OrderReportLine(
        OrderId:      o.Id,
        CustomerName: db.Customers.Find(o.CustomerId)!.Name,  // ← 1 query PER ORDER
        Total:        o.Total,
        ItemCount:    o.Lines.Count   // ← navigation property — lazy loading
    )).ToList();

    return Results.Ok(report);
});
For a month with 200 orders:
  1 query:  SELECT * FROM orders WHERE month=... AND year=...     → 200 rows
  200 queries: SELECT * FROM customers WHERE id=1
               SELECT * FROM customers WHERE id=2
               ... (one per order)
  200 queries: SELECT * FROM order_items WHERE order_id=1
               SELECT * FROM order_items WHERE order_id=2
               ... (lazy loading Lines)

Total: 401 database queries per report request

During the month-end reporting window:
  15 finance users opened the report simultaneously
  401 queries × 15 users = 6,015 queries per second
  Normal baseline: ~50 queries per second
  → 120x the normal database load

Why It Passed Code Review

The code review checklist didn't include:
  □ Does this endpoint load an unbounded collection?
  □ Does it access navigation properties in a loop?
  □ Was it tested with production-sized data?

The reviewer saw:
  - Clean LINQ
  - No obvious bugs
  - Correct results on the dev database (12 orders that month)

What they missed:
  - db.Customers.Find() inside Select() runs synchronously,
    blocking the thread and issuing a synchronous DB call
  - o.Lines.Count triggers lazy loading (Lines was not eagerly loaded)
  - The endpoint was never load tested

Root Cause Analysis

Contributing factors:

1. N+1 query pattern
   Primary cause. EF Core lazy loading + synchronous Find() inside LINQ Select().

2. Unbounded result set
   No pagination — returns ALL orders for the month.
   200 orders × 401 queries = linear scaling to disaster.

3. No connection pool limit per tenant
   All 15 users shared the same pool, starving other endpoints.

4. No query timeout
   The long-running report queries held connections for 30+ seconds.

5. No load testing
   The endpoint was only tested on a dev DB with 12 orders.

6. Lazy loading enabled globally
   db.UseLazyLoadingProxies() was in the DbContext configuration —
   a team decision from 2 years ago that nobody questioned.

The Fix

C#
// Fixed endpoint — one query, no N+1
app.MapGet("/api/reports/monthly", async (
    int month,
    int year,
    int page,
    int pageSize,
    OrderFlowDbContext db) =>
{
    var report = await db.Orders
        .AsNoTracking()   // no change tracking needed for reports
        .Where(o => o.CreatedAt.Month == month && o.CreatedAt.Year == year)
        .OrderByDescending(o => o.CreatedAt)
        .Skip((page - 1) * pageSize)
        .Take(pageSize)
        .Select(o => new OrderReportLine(
            OrderId:      o.Id,
            CustomerName: o.Customer.Name,    // JOIN — 0 extra queries
            Total:        o.Total,
            ItemCount:    o.Lines.Count))     // COUNT in SQL — 0 extra queries
        .ToListAsync();

    return Results.Ok(report);
});
SQL
-- EF Core now generates ONE query:
SELECT o.id, c.name, o.total, COUNT(l.id)
FROM orders o
INNER JOIN customers c ON c.id = o.customer_id
LEFT JOIN order_items l ON l.order_id = o.id
WHERE EXTRACT(MONTH FROM o.created_at) = 1
  AND EXTRACT(YEAR FROM o.created_at) = 2026
GROUP BY o.id, c.name, o.total
ORDER BY o.created_at DESC
LIMIT 50 OFFSET 0

Process Changes After the Incident

1. Disabled lazy loading globally

C#
// Before
builder.Services.AddDbContext<AppDbContext>(opts =>
    opts.UseLazyLoadingProxies()       // removed
        .UseNpgsql(connectionString));

// After — explicit eager loading only
builder.Services.AddDbContext<AppDbContext>(opts =>
    opts.UseNpgsql(connectionString));

2. Added query tag tracing

C#
// Tag every query with its handler name — visible in pg_stat_activity
public override async Task<List<T>> ToListAsync(...)
{
    return await base.TagWith(typeof(T).Name).ToListAsync(...);
}

3. Added a N+1 detection integration test

C#
[Fact]
public async Task MonthlyReport_ExecutesExactlyOneQuery()
{
    var queryCount = 0;
    var listener   = new QueryCountListener(ref queryCount);

    // Intercept EF Core commands
    await using var scope = factory.Services.CreateAsyncScope();
    var db = scope.ServiceProvider.GetRequiredService<AppDbContext>();
    db.Database.GetDbConnection().StateChange += (_, _) => queryCount++;

    var response = await client.GetAsync("/api/reports/monthly?month=1&year=2026&page=1&pageSize=50");

    queryCount.Should().Be(1, "report endpoint must execute exactly one query");
}

4. Added slow query alerting

SQL
-- Postgres slow query logging
ALTER SYSTEM SET log_min_duration_statement = '500';   -- log queries > 500ms
SELECT pg_reload_conf();

5. Added connection pool monitoring

C#
// Alert when pool utilisation > 80%
builder.Services.AddHealthChecks()
    .AddCheck<DatabaseConnectionPoolHealthCheck>("db-pool");

Lessons

1. Lazy loading is a footgun at scale.
   Disable it globally. Use .Include() explicitly where needed.

2. "It works on dev" is not "it works in production."
   Load test every endpoint that touches an unbounded collection.

3. Pagination is not optional.
   Any endpoint returning a list needs LIMIT/OFFSET or keyset pagination.

4. Find() inside LINQ Select() is always N+1.
   If you're calling db.Entity.Find() in a loop or in Select(), rewrite as a JOIN.

5. Monitor what matters: query count per request, P99 latency, connection pool usage.
   The outage was detectable 30 minutes before it became critical — we just weren't looking.

Metrics Before and After

                Before fix    After fix
Queries/request    401           1
P95 latency      28,000ms      120ms
DB CPU (peak)      98%          14%
Report load time  30+ sec       0.8 sec
Concurrent users   5 (limit)   100+

Enjoyed this article?

Explore the Backend Systems learning path for more.

Found this helpful?

Share:𝕏

Leave a comment

Have a question, correction, or just found this helpful? Leave a note below.