Case Study: Memory Leak in Production — Finding It with dotnet-dump and dotMemory

Service: B2B document processing API, ~200 req/s peak
Stack: .NET 9, ASP.NET Core, EF Core, Redis, Azure Container Apps
Timeline: Noticed Week 1, diagnosed Week 2, fixed in 4 hours

This is the investigation from beginning to fix — including the wrong leads followed before finding the real causes.

The Symptoms

Week 1 of our new container deployment, Grafana shows the working set climbing:

Monday 09:00  — 180 MB
Monday 18:00  — 340 MB
Tuesday 09:00 — 520 MB   ← restart triggered by OOM killer
Wednesday 09:00 — 180 MB (fresh start)
Wednesday 18:00 — 360 MB ← same pattern

The container restarts at ~600 MB (our memory limit). After restart, the app responds normally — no data loss because we use PostgreSQL for everything. But restart takes 8–12 seconds and drops ~20 requests.

CPU stays flat. Request latency stays normal until about 2 hours before OOM, when GC starts running constantly trying to reclaim space.

First wrong assumption: "It's probably the EF Core DbContext being held too long." We checked DbContext lifetime — it was correctly scoped. Not the cause.

Second wrong assumption: "Redis client is accumulating connections." We checked connection pool — healthy. Not the cause.

Getting a Memory Dump

We needed to see what was actually consuming memory before it OOMed. Two options:

Option 1: dotnet-dump (CLI, no restart needed)

Bash

# Install on the container
dotnet tool install --global dotnet-dump

# Find the process ID
dotnet-dump ps

# Capture a full dump (takes ~30 seconds, process pauses briefly)
dotnet-dump collect --process-id 1 --output /tmp/memdump.dmp

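Run the capture from outside the container, then copy the dump to your local machine:

Bash

# Azure Container Apps: run the capture inside the running container
az containerapp exec --resource-group myapp --name myapp-api \
  --command "dotnet-dump collect -p 1 -o /tmp/dump.dmp"

# On Kubernetes, copy the file out with kubectl
kubectl cp myapp-pod:/tmp/dump.dmp ./dump.dmp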

Option 2: dotMemory (JetBrains, GUI-based)

Attach dotMemory to a running process on your local dev machine that reproduces the leak. Easier to use for the analysis step — the GUI shows type counts and reference trees clearly.

We took the dump at 500 MB (about 90 minutes before OOM) and analysed it with both tools.

Analysis: What's Eating the Memory?

With dotnet-dump

Bash

dotnet-dump analyze dump.dmp

# Summarize the heap per type (instance count and total size)
> dumpheap -stat

       MT    Count    TotalSize Class Name
...
7f3a1bc4  128,432   12,203,040 System.String
7f3b2d44   89,211   35,684,400 MyApp.Domain.DocumentMetadata
7f2c4512   89,210    7,136,800 MyApp.Infrastructure.Cache.DocumentCacheEntry
7f4d1230   89,208    8,028,720 System.Collections.Generic.Dictionary`2
...

89,000+ DocumentMetadata objects alive on the heap is immediately suspicious. Our document cache should hold at most a few hundred recent documents.

# List all instances of DocumentCacheEntry (the addresses feed into gcroot below)
> dumpheap -type DocumentCacheEntry

         Address MT  Size
7f200001000 7f2c4512   80
7f200001050 7f2c4512   80
...

# Follow the reference chain for one instance
> gcroot 7f200001000

Thread 1:
  MyApp.Infrastructure.Cache.DocumentCache._entries
    -> Dictionary
      -> DocumentCacheEntry

The DocumentCache._entries dictionary is the root. 89,000 entries in a cache that should have at most 500. Found it.

With dotMemory

The dotMemory snapshot view shows the same picture visually: DocumentCacheEntry objects grouped under DocumentCache._entries, with a "Dominators" view showing the cache is holding 34% of all live memory.

Root Cause 1: Unbounded Static Cache

The cache was implemented as a static ConcurrentDictionary with no eviction:

// Infrastructure/Cache/DocumentCache.cs — THE BUG
using System.Collections.Concurrent;

public class DocumentCache
{
    // Static — lives for the process lifetime, never evicted
    private static readonly ConcurrentDictionary<string, DocumentCacheEntry> _entries = new();

    public void Store(string documentId, DocumentMetadata metadata)
    {
        _entries[documentId] = new DocumentCacheEntry(metadata, DateTime.UtcNow);
        // No size limit, no expiry, no eviction
    }

    public bool TryGet(string documentId, out DocumentMetadata? metadata)
    {
        if (_entries.TryGetValue(documentId, out var entry))
        {
            metadata = entry.Metadata;
            return true;
        }
        metadata = null;
        return false;
    }
}

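At 200 req/s with every request touching a new document, the dictionary grows by ~200 entries per second: 200 × 3,600 × 7 ≈ 5 million entries after 7 hours. We saw ~89K because many document IDs in our test set repeated, and a repeated ID overwrites its existing entry instead of adding a new one.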
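The upper bound and the per-entry cost can be checked with a few lines of arithmetic. The sketch below is illustrative only: LeakEstimate and its method names are hypothetical, and it assumes each cache entry retains one DocumentMetadata, one DocumentCacheEntry, and one Dictionary (the three types with ~89K instances in the dumpheap -stat output above).

```csharp
using System;

// Hypothetical helper, not part of the service: back-of-envelope leak sizing.
public static class LeakEstimate
{
    // Worst case: every request stores a document ID not seen before.
    public static long WorstCaseEntries(int requestsPerSecond, int hours) =>
        (long)requestsPerSecond * 3600 * hours;

    // Average bytes retained per cache entry: sum of TotalSize / Count
    // for each type row taken from dumpheap -stat.
    public static long BytesPerEntry(params (long TotalSize, long Count)[] types)
    {
        long sum = 0;
        foreach (var (totalSize, count) in types)
            sum += totalSize / count;
        return sum;
    }
}
```

Feeding in the rows from the dump, `BytesPerEntry((35_684_400, 89_211), (7_136_800, 89_210), (8_028_720, 89_208))` gives 570 bytes per entry, and `WorstCaseEntries(200, 7)` gives 5,040,000 entries. Under these assumptions the worst case would need about 2,740 MB, far beyond the 600 MB limit, so the container would be killed long before reaching it.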

Fix:

using Microsoft.Extensions.Caching.Memory;

// Use MemoryCache with a size limit and expiry
public class DocumentCache
{
    private readonly IMemoryCache _cache;

    public DocumentCache(IMemoryCache cache) => _cache = cache;

    public void Store(string documentId, DocumentMetadata metadata)
    {
        var options = new MemoryCacheEntryOptions()
            .SetSlidingExpiration(TimeSpan.FromMinutes(30))
            .SetAbsoluteExpiration(TimeSpan.FromHours(2))
            .SetSize(1);  // requires cache.SizeLimit to be set

        _cache.Set(documentId, metadata, options);
    }

    public bool TryGet(string documentId, out DocumentMetadata? metadata) =>
        _cache.TryGetValue(documentId, out metadata);
}

// Program.cs
builder.Services.AddMemoryCache(options =>
{
    options.SizeLimit = 10_000;  // unitless; with SetSize(1) per entry this caps the cache at 10K entries
});
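
One caveat with a size-limited shared cache: once SizeLimit is set, every entry written by any consumer of that IMemoryCache must specify a size, otherwise the write throws an InvalidOperationException. A common way around this is to give the document cache its own MemoryCache instance. The following is a minimal sketch under that assumption; DocumentMemoryCache and the demo class are hypothetical names, the string value stands in for DocumentMetadata, and the limit of 10,000 simply mirrors the value above. Outside an ASP.NET Core project it needs the Microsoft.Extensions.Caching.Memory package.

```csharp
using System;
using Microsoft.Extensions.Caching.Memory;

// Hypothetical wrapper: owns a private MemoryCache so the size limit applies
// only to document metadata, not to other IMemoryCache users in the app.
public sealed class DocumentMemoryCache : IDisposable
{
    public IMemoryCache Cache { get; } = new MemoryCache(new MemoryCacheOptions
    {
        SizeLimit = 10_000 // unitless; entries stored with SetSize(1) count as 1 each
    });

    public void Dispose() => Cache.Dispose();
}

public static class DocumentMemoryCacheDemo
{
    // Stores one entry with an explicit size and reads it back.
    public static bool RoundTrip()
    {
        using var holder = new DocumentMemoryCache();
        var options = new MemoryCacheEntryOptions()
            .SetSlidingExpiration(TimeSpan.FromMinutes(30))
            .SetSize(1);

        holder.Cache.Set("doc-1", "metadata placeholder", options);

        return holder.Cache.TryGetValue("doc-1", out string? value)
            && value == "metadata placeholder";
    }
}
```

Registered as a singleton (for example `builder.Services.AddSingleton<DocumentMemoryCache>()`), the wrapper can be injected into DocumentCache in place of the shared IMemoryCache.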

Root Cause 2: Event Handler Never Unsubscribed

The memory dump showed a second leak hidden by the cache one. After fixing the cache and re-running:

> dumpheap -stat

       MT    Count    TotalSize Class Name
...
7f8a3c44   12,441    3,981,120 MyApp.Api.Middleware.RequestTracingMiddleware

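12,000+ RequestTracingMiddleware instances alive. Middleware instances should not accumulate.

# Follow the reference chain for one instance (gcroot takes an object address from dumpheap -type, not the MT)
> gcroot 7f8a3c44a

MyApp.Infrastructure.Events.DocumentEventBus._handlers
  -> List<EventHandler<DocumentProcessedEvent>>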
    -> RequestTracingMiddleware.OnDocumentProcessed   ← event handler
      -> RequestTracingMiddleware

The middleware subscribed to a static event bus when created, but never unsubscribed. Since the event bus is a singleton (static _handlers), it holds a reference to every middleware instance created — preventing GC:

// Api/Middleware/RequestTracingMiddleware.cs — THE BUG
public class RequestTracingMiddleware : IMiddleware
{
    private readonly IDocumentEventBus _bus;

    public RequestTracingMiddleware(IDocumentEventBus bus)
    {
        _bus = bus;
        // Subscribe — but where does the unsubscribe go?
        _bus.DocumentProcessed += OnDocumentProcessed;
    }

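    // IMiddleware entry point (tracing body omitted for brevity)
    public Task InvokeAsync(HttpContext context, RequestDelegate next) => next(context);

    private void OnDocumentProcessed(object? sender, DocumentProcessedEvent e)
    {
        // Log the event for request tracing
    }

    // No Dispose method, so the subscription is never released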
}

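With factory-based middleware (IMiddleware), ASP.NET Core resolves the middleware from DI on every request, so an AddScoped registration yields a new instance per request. The class does not implement IDisposable, so nothing ever removes the subscription, and the event bus (singleton) holds a reference to every middleware instance ever created. After 8 hours of traffic: tens of thousands of middleware objects alive.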

Fix:

public class RequestTracingMiddleware : IMiddleware, IDisposable
{
    private readonly IDocumentEventBus _bus;
    private bool _disposed;

    public RequestTracingMiddleware(IDocumentEventBus bus)
    {
        _bus = bus;
        _bus.DocumentProcessed += OnDocumentProcessed;
    }

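    // IMiddleware entry point (tracing body omitted for brevity)
    public Task InvokeAsync(HttpContext context, RequestDelegate next) => next(context);

    private void OnDocumentProcessed(object? sender, DocumentProcessedEvent e)
    {
        // ... tracing logic ...
    }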

    public void Dispose()
    {
        if (!_disposed)
        {
            _bus.DocumentProcessed -= OnDocumentProcessed;
            _disposed = true;
        }
    }
}

Keep the AddScoped registration: the DI container disposes scoped IDisposable services when the request scope ends, which now runs the unsubscribe:

builder.Services.AddScoped<RequestTracingMiddleware>();

Alternatively, restructure to avoid instance-level subscriptions entirely — use a static handler or a mediator pattern where the middleware is stateless.
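
To see why the unsubscribe matters, here is a minimal self-contained sketch of the mechanism outside ASP.NET Core. All type names (DemoEvent, DemoEventBus, DemoSubscriber, SubscriptionLeakDemo) are hypothetical stand-ins for the real event bus and middleware; the bus's invocation list plays the role of _handlers.

```csharp
using System;

public sealed class DemoEvent : EventArgs { }

// Stand-in for the long-lived event bus.
public sealed class DemoEventBus
{
    public event EventHandler<DemoEvent>? DocumentProcessed;

    // Each subscribed delegate holds a strong reference to its target object.
    public int SubscriberCount => DocumentProcessed?.GetInvocationList().Length ?? 0;
}

// Stand-in for the per-request middleware.
public sealed class DemoSubscriber : IDisposable
{
    private readonly DemoEventBus _bus;

    public DemoSubscriber(DemoEventBus bus)
    {
        _bus = bus;
        _bus.DocumentProcessed += OnDocumentProcessed;
    }

    private void OnDocumentProcessed(object? sender, DemoEvent e) { }

    public void Dispose() => _bus.DocumentProcessed -= OnDocumentProcessed;
}

public static class SubscriptionLeakDemo
{
    // Simulates N requests against a bus without and with disposal.
    public static (int Leaked, int AfterDispose) Run(int requests)
    {
        var leakyBus = new DemoEventBus();
        for (var i = 0; i < requests; i++)
            _ = new DemoSubscriber(leakyBus);            // never disposed: stays subscribed

        var cleanBus = new DemoEventBus();
        for (var i = 0; i < requests; i++)
        {
            using var subscriber = new DemoSubscriber(cleanBus); // disposed at end of each iteration
        }

        return (leakyBus.SubscriberCount, cleanBus.SubscriberCount);
    }
}
```

`SubscriptionLeakDemo.Run(1000)` returns `(1000, 0)`: without Dispose the bus keeps all 1,000 subscribers reachable, with Dispose it keeps none.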

Verifying the Fix

After deploying the fix:

Day 1  09:00 — 180 MB
Day 1  18:00 — 195 MB   ← barely moved (was 340 MB before)
Day 2  09:00 — 188 MB
Day 2  18:00 — 198 MB   ← stable
Week 2 09:00 — 190 MB   ← no growth after a week

We added metrics to confirm:

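// Observable gauges (System.Diagnostics.Metrics) for cache and memory, exported to Prometheus
builder.Services.AddMetrics();  // registers IMeterFactory; there is no public MeterFactory type to register

// In the cache service (the concrete MemoryCache exposes Count; IMemoryCache does not)
_meter.CreateObservableGauge("app.cache.entries",
    () => _cache is MemoryCache mc ? mc.Count : 0,
    description: "Number of items in document cache");

// Memory total (Environment.WorkingSet avoids allocating a Process object per observation)
_meter.CreateObservableGauge("app.memory.working_set_mb",
    () => Environment.WorkingSet / 1_048_576.0,
    description: "Process working set in MB");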
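
A gauge like this can be checked locally without a Prometheus scrape by reading it through a MeterListener. This is a minimal sketch, not the service's code: the meter name Demo.DocumentCache, the GaugeDemo class, and the two-entry dictionary standing in for the cache are all assumptions for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics.Metrics;

public static class GaugeDemo
{
    // Creates an observable gauge over a collection and reads it once.
    public static int ReadGaugeOnce()
    {
        // Stand-in for the real cache: two entries.
        var entries = new Dictionary<string, string> { ["a"] = "x", ["b"] = "y" };

        using var meter = new Meter("Demo.DocumentCache");
        meter.CreateObservableGauge<int>("app.cache.entries",
            () => entries.Count,
            description: "Number of items in document cache");

        var observed = -1;
        using var listener = new MeterListener();

        // Subscribe only to instruments from our demo meter.
        listener.InstrumentPublished = (instrument, l) =>
        {
            if (instrument.Meter.Name == "Demo.DocumentCache")
                l.EnableMeasurementEvents(instrument);
        };
        listener.SetMeasurementEventCallback<int>(
            (instrument, value, tags, state) => observed = value);

        listener.Start();
        listener.RecordObservableInstruments(); // polls the gauge callback once
        return observed;
    }
}
```

`GaugeDemo.ReadGaugeOnce()` returns 2, the current entry count. An exporter polls observable instruments in the same way, so the callback should stay cheap.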

Prevention: What We Changed

1. Memory alert before OOM:

YAML

# Azure Monitor alert
condition: avg(app.memory.working_set_mb) > 450  # alert at 75% of 600 MB limit

Gives ~2 hours of warning before restart — enough time to take a dump during the leak rather than after a restart.

2. Static cache audit:

We searched the codebase for static Dictionary, ConcurrentDictionary, and List fields. Any unbounded static collection is a leak candidate:

Bash

grep -rn "static.*Dictionary\|static.*ConcurrentDictionary\|static.*List<" src/

Found three more static caches with no eviction — fixed them all.
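
A text search misses collections declared through aliases or interfaces, so a reflection pass can complement it. The sketch below is one possible approach, not what we ran: StaticCollectionAudit and SampleLeakyHolder are hypothetical names, and it flags every static field whose type implements IEnumerable (strings excluded), which also reports harmless fixed-size arrays.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;
using System.Linq;
using System.Reflection;

public static class StaticCollectionAudit
{
    // Lists static fields holding a collection type in the given assembly.
    // Note: Assembly.GetTypes() can throw ReflectionTypeLoadException
    // when a dependency of the assembly cannot be loaded.
    public static List<string> FindStaticCollections(Assembly assembly) =>
        assembly.GetTypes()
            .SelectMany(t => t.GetFields(
                BindingFlags.Static | BindingFlags.Public |
                BindingFlags.NonPublic | BindingFlags.DeclaredOnly))
            .Where(f => f.FieldType != typeof(string)
                        && typeof(IEnumerable).IsAssignableFrom(f.FieldType))
            .Select(f => $"{f.DeclaringType?.FullName}.{f.Name} ({f.FieldType.Name})")
            .ToList();
}

// Sample target so the audit has something to find.
internal static class SampleLeakyHolder
{
    private static readonly Dictionary<string, int> Cache = new();
}
```

Calling `StaticCollectionAudit.FindStaticCollections(typeof(SampleLeakyHolder).Assembly)` reports `SampleLeakyHolder.Cache` among its results. Each hit still needs a manual check for a size bound or eviction policy.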

3. Event subscription convention:

Added a convention, enforced by a custom Roslyn analyser rule (or at minimum a PR checklist item): any class that subscribes to an event in its constructor must implement IDisposable and unsubscribe in Dispose. Alternatively, use weak event patterns (WeakReference-based) or mediator/observer patterns that don't create strong reference chains.

4. Weekly memory baseline:

Added a scheduled Grafana snapshot comparing working set at 09:00 Monday each week. A steady week-over-week increase is the earliest sign of a new leak.

Tools Reference

| Tool | When to use |
|---|---|
| dotnet-dump collect | Capture dump from a live container without stopping it |
| dotnet-dump analyze | CLI analysis: dumpheap -stat, gcroot |
| JetBrains dotMemory | GUI analysis: dominators view, type diff between two snapshots |
| PerfView | Windows only, excellent for GC and allocation tracing |
| dotnet-counters | Live metrics: gc-heap-size, alloc-rate, working-set (no dump needed) |
| dotnet-gcdump | Lighter than a full dump: captures the managed object graph (types, sizes, references, roots) without object contents; faster to collect |

Bash

# Live monitoring without a dump
dotnet-counters monitor --process-id 1 \
  --counters System.Runtime[gc-heap-size,alloc-rate,working-set,gen-0-gc-count,gen-2-gc-count]

A rising gen-2-gc-count with flat alloc-rate but rising gc-heap-size is the signature of a leak — objects are surviving to Gen 2 instead of being collected.
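
The same signals can be sampled in-process when attaching dotnet-counters is not convenient. This is a hedged sketch: GcSample is a hypothetical type, and it relies only on the System.GC APIs that back these counters (GC.GetTotalMemory for heap size, GC.GetTotalAllocatedBytes for allocations, GC.CollectionCount for per-generation GC counts).

```csharp
using System;

// Hypothetical in-process sampler mirroring the dotnet-counters columns above.
public readonly record struct GcSample(
    long HeapBytes, long TotalAllocatedBytes, int Gen0Collections, int Gen2Collections)
{
    public static GcSample Take() => new(
        GC.GetTotalMemory(forceFullCollection: false), // managed heap size right now
        GC.GetTotalAllocatedBytes(),                   // cumulative; differences over time give the allocation rate
        GC.CollectionCount(0),
        GC.CollectionCount(2));

    // Change between two samples; HeapBytes may be negative after a collection.
    public GcSample DeltaFrom(GcSample earlier) => new(
        HeapBytes - earlier.HeapBytes,
        TotalAllocatedBytes - earlier.TotalAllocatedBytes,
        Gen0Collections - earlier.Gen0Collections,
        Gen2Collections - earlier.Gen2Collections);
}
```

Taking a sample every few minutes and logging `later.DeltaFrom(earlier)` shows whether the heap keeps growing across Gen 2 collections, which is the pattern described above.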
