Case Study: Memory Leak in Production — Finding It with dotnet-dump and dotMemory
Service: B2B document processing API, ~200 req/s peak
Stack: .NET 9, ASP.NET Core, EF Core, Redis, Azure Container Apps
Timeline: Noticed Week 1, diagnosed Week 2, fixed in 4 hours
This is the investigation from beginning to fix — including the wrong leads followed before finding the real causes.
The Symptoms
Week 1 of our new container deployment, Grafana shows the working set climbing:
Monday 09:00 — 180 MB
Monday 18:00 — 340 MB
Tuesday 09:00 — 520 MB ← restart triggered by OOM killer
Wednesday 09:00 — 180 MB (fresh start)
Wednesday 18:00 — 360 MB ← same pattern
The container restarts at ~600 MB (our memory limit). After restart, the app responds normally, with no data loss because we use PostgreSQL for everything. But restart takes 8–12 seconds and drops ~20 requests.
CPU stays flat. Request latency stays normal until about 2 hours before OOM, when GC starts running constantly trying to reclaim space.
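The readings above are enough for a rough time-to-limit estimate. The following is a minimal sketch, assuming growth is linear between two Grafana readings; the helper names `GrowthMbPerHour` and `HoursUntilLimit` are hypothetical and not part of the service:

```csharp
using System;

// Growth rate in MB per hour between two working-set readings.
static double GrowthMbPerHour(double startMb, double endMb, double hours) =>
    (endMb - startMb) / hours;

// Hours until the memory limit is reached, assuming the rate stays constant.
static double HoursUntilLimit(double currentMb, double limitMb, double mbPerHour) =>
    mbPerHour <= 0 ? double.PositiveInfinity : (limitMb - currentMb) / mbPerHour;

// Monday 09:00 (180 MB) to Monday 18:00 (340 MB), with a 600 MB container limit.
double rate = GrowthMbPerHour(180, 340, 9);
double hoursLeft = HoursUntilLimit(340, 600, rate);

Console.WriteLine($"{rate:F1} MB/h, about {hoursLeft:F1} h until the 600 MB limit");
```

For the Monday readings this works out to roughly 17.8 MB/h and about 14.6 hours of headroom from 18:00, which points at early Tuesday morning. Real growth follows traffic rather than a straight line, so treat the result as an order-of-magnitude guide only.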
First wrong assumption: "It's probably the EF Core DbContext being held too long." We checked DbContext lifetime — it was correctly scoped. Not the cause.
Second wrong assumption: "Redis client is accumulating connections." We checked connection pool — healthy. Not the cause.
Getting a Memory Dump
We needed to see what was actually consuming memory before it OOMed. Two options:
Option 1: dotnet-dump (CLI, no restart needed)
# Install on the container
dotnet tool install --global dotnet-dump
# Find the process ID
dotnet-dump ps
# Capture a full dump (takes ~30 seconds, process pauses briefly)
dotnet-dump collect --process-id 1 --output /tmp/dump.dmp
If you have no shell inside the container, run the capture remotely, then copy the dump to your local machine:
# Azure Container Apps
az containerapp exec --resource-group myapp --name myapp-api \
--command "dotnet-dump collect -p 1 -o /tmp/dump.dmp"
# Or, on Kubernetes, copy the file out via kubectl
kubectl cp myapp-pod:/tmp/dump.dmp ./dump.dmp
Option 2: dotMemory (JetBrains, GUI-based)
Attach dotMemory to a running process on your local dev machine that reproduces the leak. Easier to use for the analysis step — the GUI shows type counts and reference trees clearly.
We took the dump at 500 MB (about 90 minutes before OOM) and analysed it with both tools.
Analysis: What's Eating the Memory?
With dotnet-dump
dotnet-dump analyze dump.dmp
# Summarise the heap by type: instance count and total size
> dumpheap -stat
MT Count TotalSize Class Name
...
7f3a1bc4 128,432 12,203,040 System.String
7f3b2d44 89,211 35,684,400 MyApp.Domain.DocumentMetadata
7f2c4512 89,210 7,136,800 MyApp.Infrastructure.Cache.DocumentCacheEntry
7f4d1230 89,208 8,028,720 System.Collections.Generic.Dictionary`2
...
89,000+ DocumentMetadata objects alive on the heap is immediately suspicious. Our document cache should hold at most a few hundred recent documents.
# Show all instances of DocumentCacheEntry and who's holding them
> dumpheap -type DocumentCacheEntry
Address MT Size
7f200001000 7f2c4512 80
7f200001050 7f2c4512 80
...
# Follow the reference chain for one instance
> gcroot 7f200001000
Thread 1:
MyApp.Infrastructure.Cache.DocumentCache._entries
-> Dictionary
-> DocumentCacheEntry
The DocumentCache._entries dictionary is the root. 89,000 entries in a cache that should have at most 500. Found it.
With dotMemory
The dotMemory snapshot view shows the same picture visually: DocumentCacheEntry objects grouped under DocumentCache._entries, with a "Dominators" view showing the cache is holding 34% of all live memory.
Root Cause 1: Unbounded Static Cache
The cache was implemented as a static ConcurrentDictionary with no eviction:
// Infrastructure/Cache/DocumentCache.cs — THE BUG
public class DocumentCache
{
// Static — lives for the process lifetime, never evicted
private static readonly ConcurrentDictionary<string, DocumentCacheEntry> _entries = new();
public void Store(string documentId, DocumentMetadata metadata)
{
_entries[documentId] = new DocumentCacheEntry(metadata, DateTime.UtcNow);
// No size limit, no expiry, no eviction
}
public bool TryGet(string documentId, out DocumentMetadata? metadata)
{
if (_entries.TryGetValue(documentId, out var entry))
{
metadata = entry.Metadata;
return true;
}
metadata = null;
return false;
}
}
At the 200 req/s peak, if every request carried a new document ID, this would grow by ~200 entries per second. After 7 hours at that rate: 200 × 3,600 × 7 ≈ 5 million entries (we saw ~89K because not all documents had unique IDs in our test set).
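The worst-case arithmetic can be checked directly. This is a rough sketch, assuming the per-object shallow sizes implied by the dumpheap -stat output (total size divided by count); it ignores strings, dictionary buckets, and anything else each entry references, so the real footprint would be larger. The helper name `WorstCaseEntries` is hypothetical:

```csharp
using System;

// Worst case: every request at peak stores a previously unseen document ID.
static long WorstCaseEntries(int requestsPerSecond, int hours) =>
    (long)requestsPerSecond * 3600 * hours;

// Shallow per-object sizes derived from the dumpheap -stat totals.
const long metadataBytes = 35_684_400 / 89_211; // DocumentMetadata: 400 B each
const long entryBytes = 7_136_800 / 89_210;     // DocumentCacheEntry: 80 B each

long entries = WorstCaseEntries(requestsPerSecond: 200, hours: 7);
long shallowBytes = entries * (metadataBytes + entryBytes);

// Prints: 5040000 entries, about 2307 MiB of shallow object size
Console.WriteLine($"{entries} entries, about {shallowBytes / 1_048_576} MiB of shallow object size");
```

Even on shallow sizes alone, the worst case is several times the 600 MB container limit, which is why an unbounded cache cannot be left to sort itself out.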
Fix:
// Option A: Use MemoryCache with size limit and expiry
public class DocumentCache
{
private readonly IMemoryCache _cache;
public DocumentCache(IMemoryCache cache) => _cache = cache;
public void Store(string documentId, DocumentMetadata metadata)
{
var options = new MemoryCacheEntryOptions()
.SetSlidingExpiration(TimeSpan.FromMinutes(30))
.SetAbsoluteExpiration(TimeSpan.FromHours(2))
.SetSize(1); // each entry counts as 1 against SizeLimit; a size is required once SizeLimit is set
_cache.Set(documentId, metadata, options);
}
public bool TryGet(string documentId, out DocumentMetadata? metadata) =>
_cache.TryGetValue(documentId, out metadata);
}
// Program.cs
builder.Services.AddMemoryCache(options =>
{
options.SizeLimit = 10_000; // max 10K entries
});
Root Cause 2: Event Handler Never Unsubscribed
The memory dump showed a second leak hidden by the cache one. After fixing the cache and re-running:
> dumpheap -stat
MT Count TotalSize Class Name
...
7f8a3c44 12,441 3,981,120 MyApp.Api.Middleware.RequestTracingMiddleware
12,000+ RequestTracingMiddleware instances alive. Middleware instances should not accumulate.
> gcroot 7f8a3c44a
MyApp.Infrastructure.Events.DocumentEventBus._handlers
-> List<EventHandler<DocumentProcessedEvent>>
-> RequestTracingMiddleware.OnDocumentProcessed ← event handler
-> RequestTracingMiddleware
The middleware subscribed to a static event bus when created, but never unsubscribed. Since the event bus is a singleton (static _handlers), it holds a reference to every middleware instance created, which prevents the GC from collecting them:
// Api/Middleware/RequestTracingMiddleware.cs — THE BUG
public class RequestTracingMiddleware : IMiddleware
{
private readonly IDocumentEventBus _bus;
public RequestTracingMiddleware(IDocumentEventBus bus)
{
_bus = bus;
// Subscribe — but where does the unsubscribe go?
_bus.DocumentProcessed += OnDocumentProcessed;
}
private void OnDocumentProcessed(object? sender, DocumentProcessedEvent e)
{
// Log the event for request tracing
}
// No Dispose method — subscription never released
}
ASP.NET Core creates a new IMiddleware instance per request when registered with AddScoped. The event bus (singleton) holds a reference to every single middleware instance ever created. After 8 hours of traffic: tens of thousands of middleware objects alive.
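The mechanism is easy to reproduce outside ASP.NET Core. The following is a minimal console sketch, not the service code: `Bus` and `Subscriber` are hypothetical stand-ins for the event bus and the middleware, and a WeakReference is used to observe whether the garbage collector can reclaim each subscriber:

```csharp
using System;
using System.Runtime.CompilerServices;

// Create the subscriber in a separate, non-inlined method so that
// no local variable in the caller keeps it alive.
[MethodImpl(MethodImplOptions.NoInlining)]
static WeakReference CreateSubscriber(bool unsubscribe)
{
    var subscriber = new Subscriber();
    if (unsubscribe) subscriber.Dispose();
    return new WeakReference(subscriber);
}

static bool IsAliveAfterGc(WeakReference reference)
{
    GC.Collect();
    GC.WaitForPendingFinalizers();
    GC.Collect();
    return reference.IsAlive;
}

WeakReference leaked = CreateSubscriber(unsubscribe: false);
WeakReference released = CreateSubscriber(unsubscribe: true);

bool leakedAlive = IsAliveAfterGc(leaked);
bool releasedAlive = IsAliveAfterGc(released);
Console.WriteLine($"never unsubscribed: alive = {leakedAlive}");
Console.WriteLine($"unsubscribed:       alive = {releasedAlive}");

// Hypothetical stand-in for the static event bus.
static class Bus
{
    public static event EventHandler<string>? DocumentProcessed;
    public static void Raise(string documentId) => DocumentProcessed?.Invoke(null, documentId);
}

// Hypothetical stand-in for the middleware: subscribes in its constructor.
sealed class Subscriber : IDisposable
{
    public Subscriber() => Bus.DocumentProcessed += OnDocumentProcessed;
    private void OnDocumentProcessed(object? sender, string documentId) { }
    public void Dispose() => Bus.DocumentProcessed -= OnDocumentProcessed;
}
```

The subscriber that never unsubscribes stays alive because the static event's delegate list references it. The one that unsubscribed is normally collected on the forced GC, although the runtime makes no hard promise about collection timing.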
Fix:
public class RequestTracingMiddleware : IMiddleware, IDisposable
{
private readonly IDocumentEventBus _bus;
private bool _disposed;
public RequestTracingMiddleware(IDocumentEventBus bus)
{
_bus = bus;
_bus.DocumentProcessed += OnDocumentProcessed;
}
private void OnDocumentProcessed(object? sender, DocumentProcessedEvent e)
{
// ... tracing logic ...
}
public void Dispose()
{
if (!_disposed)
{
_bus.DocumentProcessed -= OnDocumentProcessed;
_disposed = true;
}
}
}
And register with AddScoped so DI calls Dispose at end of request scope:
builder.Services.AddScoped<RequestTracingMiddleware>();
Alternatively, restructure to avoid instance-level subscriptions entirely: use a static handler or a mediator pattern where the middleware is stateless.
Verifying the Fix
After deploying the fix:
Day 1 09:00 — 180 MB
Day 1 18:00 — 195 MB ← barely moved (was 340 MB before)
Day 2 09:00 — 188 MB
Day 2 18:00 — 198 MB ← stable
Week 2 09:00 — 190 MB ← no growth after a week
We added metrics to confirm:
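// Gauges for cache size and process memory (System.Diagnostics.Metrics, scraped by Prometheus through the configured exporter)
builder.Services.AddMetrics(); // registers IMeterFactory; ASP.NET Core hosts already register it by default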
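// In the cache service (after the fix the old _entries dictionary is gone;
// this assumes the injected IMemoryCache is the built-in MemoryCache)
_meter.CreateObservableGauge("app.cache.entries",
() => ((MemoryCache)_cache).Count,
description: "Number of items in document cache");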
// Memory total
_meter.CreateObservableGauge("app.memory.working_set_mb",
() => Process.GetCurrentProcess().WorkingSet64 / 1_048_576.0,
description: "Process working set in MB");
Prevention: What We Changed
1. Memory alert before OOM:
# Azure Monitor alert
condition: avg(app.memory.working_set_mb) > 450 # alert at 75% of 600 MB limit
Gives ~2 hours of warning before restart, enough time to take a dump during the leak rather than after a restart.
2. Static cache audit:
We ran a code search for static.*Dictionary and static.*List and static.*ConcurrentDictionary — any unbounded static collection is a leak candidate:
grep -rn "static.*Dictionary\|static.*ConcurrentDictionary\|static.*List<" src/
Found three more static caches with no eviction and fixed them all.
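When replacing a static collection with a size-limited MemoryCache, it helps to know how the limit behaves in practice. Below is a minimal sketch, assuming the Microsoft.Extensions.Caching.Memory package is referenced; the keys, values, and the limit of 2 are illustrative only:

```csharp
using System;
using Microsoft.Extensions.Caching.Memory;

// A dedicated, size-limited cache instance (limit of 2 only to keep the demo short).
using var cache = new MemoryCache(new MemoryCacheOptions { SizeLimit = 2 });

var options = new MemoryCacheEntryOptions()
    .SetAbsoluteExpiration(TimeSpan.FromHours(2))
    .SetSize(1); // each entry counts as 1 unit against SizeLimit

cache.Set("doc-1", "metadata-1", options);
cache.Set("doc-2", "metadata-2", options);
cache.Set("doc-3", "metadata-3", options); // would exceed the limit, so it is not stored

bool thirdStored = cache.TryGetValue<string>("doc-3", out _);
Console.WriteLine($"doc-3 stored: {thirdStored}, count within limit: {cache.Count <= 2}");

bool unsizedRejected = false;
try
{
    cache.Set("doc-4", "metadata-4"); // no size specified while SizeLimit is set
}
catch (InvalidOperationException)
{
    unsizedRejected = true;
}
Console.WriteLine($"unsized entry rejected: {unsizedRejected}");
```

Two behaviours are worth noting. An entry that would push the cache over SizeLimit is dropped rather than stored, with compaction of older entries happening in the background. And once SizeLimit is set, every entry must declare a size. Because of the second point, a SizeLimit on the shared cache registered by AddMemoryCache can break other components that add entries without sizes, so a dedicated MemoryCache instance per bounded cache may be the safer choice.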
3. Event subscription convention:
Added a custom Roslyn analyser rule (or at minimum a PR checklist item): any class that subscribes to an event in its constructor must implement IDisposable and unsubscribe. Alternatively, use WeakReference event patterns or mediator/observer patterns that don't create strong reference chains.
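One way to avoid strong reference chains is for the bus to hold its subscribers weakly. The sketch below is an illustration under stated assumptions, not the pattern used in this service: `IHandle<TEvent>`, `WeakEventBus<TEvent>`, and `Tracer` are hypothetical names. It keeps a WeakReference to the subscriber object itself rather than to a delegate, because a weakly held delegate would be collected while its target is still alive:

```csharp
using System;
using System.Collections.Generic;

var bus = new WeakEventBus<string>();
var tracer = new Tracer();
bus.Subscribe(tracer);

int delivered = bus.Publish("doc-42");

// Prints: delivered to 1 subscriber(s), tracer saw 1 event(s)
Console.WriteLine($"delivered to {delivered} subscriber(s), tracer saw {tracer.Seen} event(s)");

// Subscribers implement this instead of attaching a delegate.
public interface IHandle<in TEvent>
{
    void Handle(TEvent e);
}

// The bus never keeps a subscriber alive: it only stores weak references
// and prunes the dead ones whenever an event is published.
public sealed class WeakEventBus<TEvent>
{
    private readonly object _gate = new();
    private readonly List<WeakReference<IHandle<TEvent>>> _subscribers = new();

    public void Subscribe(IHandle<TEvent> subscriber)
    {
        lock (_gate) _subscribers.Add(new WeakReference<IHandle<TEvent>>(subscriber));
    }

    // Returns the number of live subscribers the event was delivered to.
    public int Publish(TEvent e)
    {
        var live = new List<IHandle<TEvent>>();
        lock (_gate)
        {
            _subscribers.RemoveAll(w => !w.TryGetTarget(out _));
            foreach (var weak in _subscribers)
                if (weak.TryGetTarget(out var target)) live.Add(target);
        }
        // Invoke outside the lock so handlers cannot block other publishers.
        foreach (var target in live) target.Handle(e);
        return live.Count;
    }
}

sealed class Tracer : IHandle<string>
{
    public int Seen;
    public void Handle(string e) => Seen++;
}
```

The trade-off is that delivery stops silently once a subscriber is collected, so this suits best-effort consumers such as tracing. Explicit unsubscription through IDisposable remains the more predictable option.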
4. Weekly memory baseline:
Added a scheduled Grafana snapshot comparing working set at 09:00 Monday each week. A steady week-over-week increase is the earliest sign of a new leak.
Tools Reference
| Tool | When to use |
|---|---|
| dotnet-dump collect | Capture dump from a live container without stopping it |
| dotnet-dump analyze | CLI analysis: dumpheap -stat, gcroot |
| JetBrains dotMemory | GUI analysis: dominators view, type diff between two snapshots |
| PerfView | Windows only, excellent for GC and allocation tracing |
| dotnet-counters | Live metrics: gc-heap-size, alloc-rate, working-set — no dump needed |
| dotnet-gcdump | Lighter than a full dump: captures the managed heap graph (types, sizes, references, roots) without object contents, so it is faster to collect |
# Live monitoring without a dump
dotnet-counters monitor --process-id 1 \
--counters System.Runtime[gc-heap-size,alloc-rate,working-set,gen-0-gc-count,gen-2-gc-count]
A rising gen-2-gc-count with flat alloc-rate but rising gc-heap-size is the signature of a leak: objects are surviving to Gen 2 instead of being collected.
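That signature can be written down as a simple check over a series of counter readings. The following is a hypothetical heuristic for illustration only: the function name `LooksLikeLeak`, the 20% tolerance, and the sample values are all assumptions. In-process, comparable readings could be taken from GC.GetTotalMemory, GC.GetTotalAllocatedBytes, and GC.CollectionCount(2):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// True when heap size keeps rising and Gen 2 collections keep happening
// while the allocation rate stays roughly flat.
static bool LooksLikeLeak(
    IReadOnlyList<(double HeapMb, double AllocMbPerSec, int Gen2Count)> samples,
    double allocTolerance = 0.2)
{
    if (samples.Count < 3) return false;

    // Heap grows between every pair of consecutive samples.
    bool heapRising = true;
    for (int i = 1; i < samples.Count; i++)
        heapRising &= samples[i].HeapMb > samples[i - 1].HeapMb;

    // Gen 2 collections are running, yet they are not bringing the heap down.
    bool gen2Rising = samples[samples.Count - 1].Gen2Count > samples[0].Gen2Count;

    // Allocation rate stays within the tolerance band around its mean.
    double meanAlloc = samples.Average(s => s.AllocMbPerSec);
    bool allocFlat = samples.All(
        s => Math.Abs(s.AllocMbPerSec - meanAlloc) <= allocTolerance * meanAlloc);

    return heapRising && gen2Rising && allocFlat;
}

// Illustrative readings, e.g. one per minute: (heap MB, alloc MB/s, gen-2 count).
var leaking = new List<(double, double, int)>
    { (200, 5.0, 10), (230, 5.2, 14), (260, 4.9, 19), (300, 5.1, 25) };
var healthy = new List<(double, double, int)>
    { (200, 5.0, 10), (210, 5.2, 14), (195, 4.9, 19), (205, 5.1, 25) };

Console.WriteLine(LooksLikeLeak(leaking)); // True
Console.WriteLine(LooksLikeLeak(healthy)); // False
```

A real heap sawtooths between collections, so requiring growth in every sample is a simplification. Sampling over longer intervals, or comparing the heap size right after Gen 2 collections, would make the check less noisy.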