Production Debugging in .NET: Mindset, Tools, and Techniques
How to diagnose and fix production issues in .NET. Covers structured logging, distributed tracing, memory dumps, dotnet-trace, dotnet-counters, mini-profiler, common failure patterns, and the production debugging mindset.
The Production Debugging Mindset
The worst thing you can do in a production incident is start guessing. The best engineers follow a structured approach:
- Understand the symptom: what is broken? (latency spike, errors, OOM crash?)
- Narrow the scope: when did it start? Which environment? Which endpoints?
- Gather evidence (logs, metrics, traces) before touching anything
- Form a hypothesis: one specific, falsifiable theory
- Test the hypothesis: one change at a time
- Fix and verify: confirm the symptom is gone, not just assumed
Debugging production is about reducing uncertainty, not heroics.
Structured Logging with Serilog
Unstructured logs are grep-able. Structured logs are queryable. The difference matters at scale.
// BAD: unstructured
logger.LogInformation($"Order {orderId} placed by {userId}");
// GOOD: structured (message template with named properties)
logger.LogInformation("Order {OrderId} placed by {UserId}", orderId, userId);
The second form creates a log event with OrderId and UserId as searchable properties in Seq, Elastic, or Application Insights, not just a formatted string.
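The calls above go through the standard ILogger interface, so Serilog still has to be registered as the provider behind it. A minimal Program.cs sketch of that wiring follows. It assumes the Serilog.AspNetCore package (which brings in the console sink and the compact JSON formatter); the minimum levels and the choice of sink are illustrative, not taken from the article.

```csharp
using Microsoft.AspNetCore.Builder;
using Serilog;
using Serilog.Events;
using Serilog.Formatting.Compact;

var builder = WebApplication.CreateBuilder(args);

// Assumed setup: JSON events on the console, with framework noise turned down.
Log.Logger = new LoggerConfiguration()
    .MinimumLevel.Information()
    .MinimumLevel.Override("Microsoft.AspNetCore", LogEventLevel.Warning)
    .Enrich.FromLogContext() // picks up properties pushed via ILogger.BeginScope
    .WriteTo.Console(new CompactJsonFormatter())
    .CreateLogger();

// Route ILogger<T> calls through Serilog.
builder.Host.UseSerilog();

var app = builder.Build();
app.Run();
```

With a JSON formatter, OrderId and UserId are emitted as separate fields of each event instead of being baked into the message text.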
Log Levels as Signal
| Level | When to use |
|---|---|
| Trace | Extremely detailed, dev only |
| Debug | Diagnostic info, typically disabled in prod |
| Information | Normal business events (order placed, user logged in) |
| Warning | Something unexpected but recoverable |
| Error | Operation failed, needs attention |
| Critical | System is unusable, immediate action required |
// Log at the right level
logger.LogInformation("Payment processed for OrderId {OrderId}", orderId);
logger.LogWarning("Payment retry {Attempt} for OrderId {OrderId}", attempt, orderId);
logger.LogError(ex, "Payment failed for OrderId {OrderId}", orderId);
Correlation IDs for Request Tracing
// Middleware: assign or forward a correlation ID
app.Use(async (ctx, next) =>
{
    var correlationId = ctx.Request.Headers["X-Correlation-Id"]
        .FirstOrDefault() ?? Guid.NewGuid().ToString();
    ctx.Response.Headers["X-Correlation-Id"] = correlationId;
    using (logger.BeginScope(new Dictionary<string, object>
    {
        ["CorrelationId"] = correlationId
    }))
    {
        await next(ctx);
    }
});
Now every log line within that request includes CorrelationId, so you can find all logs for a specific request across services, provided each service forwards the header on its outbound calls.
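For the ID to follow a request across services, outbound HTTP calls have to carry the same header. One way to sketch that is a DelegatingHandler. The class name and the delegate that supplies the current ID are hypothetical; in ASP.NET Core the delegate could read the value from IHttpContextAccessor.

```csharp
using System;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical handler: copies the current correlation ID onto outbound requests.
public sealed class CorrelationIdHandler : DelegatingHandler
{
    public const string HeaderName = "X-Correlation-Id";
    private readonly Func<string?> _getCorrelationId;

    public CorrelationIdHandler(Func<string?> getCorrelationId) =>
        _getCorrelationId = getCorrelationId;

    // Adds the header unless the caller already set one or no ID is available.
    public static void Apply(HttpRequestMessage request, string? correlationId)
    {
        if (!string.IsNullOrEmpty(correlationId) && !request.Headers.Contains(HeaderName))
        {
            request.Headers.TryAddWithoutValidation(HeaderName, correlationId);
        }
    }

    protected override Task<HttpResponseMessage> SendAsync(
        HttpRequestMessage request, CancellationToken cancellationToken)
    {
        Apply(request, _getCorrelationId());
        return base.SendAsync(request, cancellationToken);
    }
}
```

With IHttpClientFactory, a handler like this would typically be attached through AddHttpMessageHandler on the client registration, so individual call sites do not need to remember the header.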
Distributed Tracing with OpenTelemetry
Logs tell you what happened on one service. Traces tell you what happened across all services for a single request.
dotnet add package OpenTelemetry.Extensions.Hosting
dotnet add package OpenTelemetry.Instrumentation.AspNetCore
dotnet add package OpenTelemetry.Instrumentation.Http
dotnet add package OpenTelemetry.Instrumentation.EntityFrameworkCore
dotnet add package OpenTelemetry.Exporter.Jaeger
// Note: the OpenTelemetry.Exporter.Jaeger package has been deprecated upstream. Jaeger accepts OTLP,
// so newer setups typically use OpenTelemetry.Exporter.OpenTelemetryProtocol and AddOtlpExporter() instead.
builder.Services.AddOpenTelemetry()
    .WithTracing(tracing => tracing
        .AddAspNetCoreInstrumentation()
        .AddHttpClientInstrumentation()
        .AddEntityFrameworkCoreInstrumentation()
        .AddSource("MyApp.*") // custom activity sources
        .AddJaegerExporter());
Adding Custom Spans
private static readonly ActivitySource _activitySource =
    new ActivitySource("MyApp.OrderService");
public async Task ProcessOrderAsync(Guid orderId, CancellationToken ct)
{
    using var activity = _activitySource.StartActivity("ProcessOrder");
    activity?.SetTag("order.id", orderId);
    try
    {
        await ValidateAsync(orderId, ct);
        using var paymentSpan = _activitySource.StartActivity("ChargePayment");
        await _paymentService.ChargeAsync(orderId, ct);
        paymentSpan?.SetTag("payment.status", "success");
        activity?.SetStatus(ActivityStatusCode.Ok);
    }
    catch (Exception ex)
    {
        activity?.SetStatus(ActivityStatusCode.Error, ex.Message);
        throw;
    }
}
Metrics with dotnet-counters
dotnet-counters is a real-time CLI dashboard for runtime and custom metrics. It attaches to a running process, so no code change or redeployment is needed.
# Install
dotnet tool install --global dotnet-counters
# Watch a running process
dotnet-counters monitor --process-id <PID> --counters \
    System.Runtime,Microsoft.AspNetCore.Hosting
# Key metrics to watch:
# - cpu-usage
# - gc-heap-size
# - threadpool-queue-length
# - current-requests (ASP.NET Core)
# - requests-per-second
Custom Metrics
using System.Diagnostics.Metrics;
public class OrderMetrics
{
    private readonly Counter<long> _ordersCreated;
    private readonly Histogram<double> _processingTime;
    public OrderMetrics(IMeterFactory meterFactory)
    {
        var meter = meterFactory.Create("MyApp.Orders");
        _ordersCreated = meter.CreateCounter<long>("orders.created");
        _processingTime = meter.CreateHistogram<double>("orders.processing_ms");
    }
    public void RecordOrderCreated(string region) =>
        _ordersCreated.Add(1, new TagList { { "region", region } });
    public void RecordProcessingTime(double ms) =>
        _processingTime.Record(ms);
}
Performance Profiling with dotnet-trace
Capture a CPU profile or allocation trace from a live process without attaching a debugger.
# Install
dotnet tool install --global dotnet-trace
# Capture 30 seconds of CPU profile
dotnet-trace collect --process-id <PID> --duration 00:00:30 \
    --profile cpu-sampling
# Capture GC events + sampled allocations
# (--clrevents takes runtime keywords joined with '+')
dotnet-trace collect --process-id <PID> --duration 00:00:30 \
    --clrevents gc+gcsampledobjectallocationhigh
# Open the .nettrace file in Visual Studio or PerfView
Memory Dumps
When a process crashes, hangs, or has a suspected memory leak, capture a dump.
# Install
dotnet tool install --global dotnet-dump
# Capture a dump from a running process
dotnet-dump collect --process-id <PID>
# Analyze
dotnet-dump analyze <dump-file>
# Useful commands inside the analyzer
> gcroot <object-address> # what's keeping this object alive?
> dumpheap -stat # heap size by type
> dumpheap -type OrderService # all instances of a type
> threads # thread list
> clrstack # call stack of current thread
Mini-Profiler (SQL and HTTP Profiling in Dev)
MiniProfiler shows per-request timing for SQL queries, HTTP calls, and custom steps, visible in the browser.
dotnet add package MiniProfiler.AspNetCore.Mvc
dotnet add package MiniProfiler.EntityFrameworkCore
builder.Services.AddMiniProfiler(options =>
{
    options.RouteBasePath = "/profiler";
    options.SqlFormatter = new InlineFormatter();
}).AddEntityFramework();
// Register the middleware early in the pipeline
app.UseMiniProfiler();
// In your Razor layout (with @addTagHelper *, MiniProfiler.AspNetCore.Mvc in _ViewImports.cshtml): shows the profiler widget
<mini-profiler />
Now every request shows SQL query count, timing, and duplicates inline in the browser. This is how you catch N+1 queries in development before they hit production.
Common Production Failure Patterns
Memory Leak
Symptoms: heap size grows continuously, eventually OOM.
Common causes:
- Event handlers not unsubscribed (+= without -=)
- Static collections growing forever
- IDisposable objects not disposed
- Captured closures holding large objects
// LEAK: event handler never removed
_bus.OrderCreated += HandleOrderCreated;
// FIX: remove in Dispose
public void Dispose() => _bus.OrderCreated -= HandleOrderCreated;
Diagnose: dotnet-dump + dumpheap -stat to find the growing type.
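The event-handler leak can be reproduced in a few lines, which helps when explaining what gcroot will later show in a dump. The sketch below uses hypothetical OrderBus and OrderListener types and a WeakReference to check whether a listener is still reachable after a forced collection. Forcing GC like this is for demonstration only.

```csharp
using System;
using System.Runtime.CompilerServices;

public sealed class OrderBus
{
    public event EventHandler? OrderCreated;
    public void Publish() => OrderCreated?.Invoke(this, EventArgs.Empty);
}

public sealed class OrderListener : IDisposable
{
    private readonly OrderBus _bus;

    public OrderListener(OrderBus bus)
    {
        _bus = bus;
        _bus.OrderCreated += HandleOrderCreated; // the bus now holds a reference to this listener
    }

    private void HandleOrderCreated(object? sender, EventArgs e) { }

    public void Dispose() => _bus.OrderCreated -= HandleOrderCreated;
}

public static class LeakDemo
{
    // Runs in its own non-inlined frame so no local variable keeps the listener alive afterwards.
    [MethodImpl(MethodImplOptions.NoInlining)]
    private static WeakReference CreateListener(OrderBus bus, bool dispose)
    {
        var listener = new OrderListener(bus);
        if (dispose) listener.Dispose();
        return new WeakReference(listener);
    }

    // Returns true when the listener is still reachable after a full collection.
    public static bool SurvivesCollection(bool dispose)
    {
        var bus = new OrderBus();
        var weak = CreateListener(bus, dispose);
        GC.Collect();
        GC.WaitForPendingFinalizers();
        GC.Collect();
        bool alive = weak.IsAlive;
        GC.KeepAlive(bus); // the long-lived publisher stays reachable the whole time
        return alive;
    }
}
```

Without Dispose the publisher's delegate list keeps the listener reachable, so SurvivesCollection(false) reports it as alive. After unsubscribing, the listener should become collectable. In a real dump, gcroot on a leaked instance would show a similar chain ending at the publisher's event delegate.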
Thread Pool Starvation
Symptoms: requests queue up, latency climbs, but CPU is low.
Cause: sync-over-async (.Result, .Wait()) or long-running CPU-bound work ties up ThreadPool threads, so queued work items and async continuations have to wait for a free thread.
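The cause above comes down to how a method waits. A minimal sketch, with a hypothetical InventoryClient and Task.Delay standing in for real I/O, contrasts the blocking and the non-blocking form. It also reads the thread pool's own counters in-process.

```csharp
using System.Threading;
using System.Threading.Tasks;

public static class InventoryClient
{
    // Stand-in for real I/O such as a database call or HTTP request.
    private static async Task<int> FetchStockAsync(int productId)
    {
        await Task.Delay(50);
        return productId * 2;
    }

    // Sync-over-async: the calling thread is parked until the task completes.
    public static int GetStockBlocking(int productId) =>
        FetchStockAsync(productId).Result;

    // Async end-to-end: the thread returns to the pool while the I/O is in flight.
    public static async Task<int> GetStockAsync(int productId) =>
        await FetchStockAsync(productId);

    // In-process view of the thread pool: current thread count and queued work items.
    public static string DescribeThreadPool() =>
        $"threads={ThreadPool.ThreadCount}, queued={ThreadPool.PendingWorkItemCount}";
}
```

One blocking call is harmless. The problem appears when many requests block at once and the pool adds threads more slowly than work arrives.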
Diagnose: use dotnet-counters and watch threadpool-queue-length. If it climbs, you have starvation.
dotnet-counters monitor --process-id <PID> \
    --counters System.Runtime[threadpool-queue-length,threadpool-thread-count]
N+1 Query
Symptoms: requests that load a list are slow; SQL profiler shows 1 query per item.
// N+1: loads all orders, then queries customer for EACH
var orders = await dbContext.Orders.ToListAsync();
foreach (var order in orders)
{
    var customer = await dbContext.Customers.FindAsync(order.CustomerId); // N queries
}
// Fix: eager load with Include
var orders = await dbContext.Orders
    .Include(o => o.Customer)
    .ToListAsync();
Diagnose: MiniProfiler, Serilog with EF Core slow query logging, Application Insights dependency tracking.
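When Include is not an option (for example, the related data lives behind another repository), the lookups can be batched into one query and joined in memory through a Dictionary. The sketch below uses a hypothetical in-memory store that counts round trips in place of a real DbContext, so the effect is visible without a database.

```csharp
using System.Collections.Generic;
using System.Linq;

public record Customer(int Id, string Name);
public record Order(int Id, int CustomerId);

// Hypothetical stand-in for a database that counts round trips.
public sealed class FakeCustomerStore
{
    private readonly Dictionary<int, Customer> _rows;
    public int RoundTrips { get; private set; }

    public FakeCustomerStore(IEnumerable<Customer> customers) =>
        _rows = customers.ToDictionary(c => c.Id);

    // One "query" per call, like FindAsync inside a loop.
    public Customer? FindById(int id)
    {
        RoundTrips++;
        return _rows.TryGetValue(id, out var customer) ? customer : null;
    }

    // One "query" for many keys, like WHERE Id IN (...).
    public List<Customer> FindByIds(IReadOnlyCollection<int> ids)
    {
        RoundTrips++;
        return ids.Where(_rows.ContainsKey).Select(id => _rows[id]).ToList();
    }
}

public static class OrderReport
{
    // Collects the distinct keys, fetches them in one round trip, then joins in memory.
    public static List<string> BuildBatched(IReadOnlyList<Order> orders, FakeCustomerStore store)
    {
        var ids = orders.Select(o => o.CustomerId).Distinct().ToList();
        var byId = store.FindByIds(ids).ToDictionary(c => c.Id);
        return orders
            .Select(o => byId.TryGetValue(o.CustomerId, out var customer)
                ? $"Order {o.Id}: {customer.Name}"
                : $"Order {o.Id}: (unknown customer)")
            .ToList();
    }
}
```

With EF Core, the batched fetch would roughly correspond to a Where(c => ids.Contains(c.Id)) query followed by ToDictionaryAsync, which translates to a single SQL statement.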
Connection Pool Exhaustion
Symptoms: SqlException: Timeout expired. The timeout period elapsed prior to obtaining a connection from the pool.
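Cause: connections are held longer than needed (DbContext or SqlConnection instances not disposed, long-running transactions), or there are more concurrent requests than the pool allows. HttpClient has a related failure mode: creating a new instance per call exhausts sockets rather than SQL connections.
// FIX for HttpClient: use IHttpClientFactory, never new HttpClient() per call
builder.Services.AddHttpClient<IProductService, ProductService>();
// FIX for DbContext: use scoped lifetime (default for EF Core + DI)
builder.Services.AddDbContext<AppDbContext>(...); // scoped by default
High GC Pressure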
Symptoms: latency spikes every few seconds, high % Time in GC in counters.
Cause: excessive short-lived allocations (strings, byte arrays, LINQ chains in hot paths).
dotnet-trace collect --process-id <PID> --clrevents gc
# Look for Gen2 GCs: blocking Gen2 collections pause all managed threads
Fix: ArrayPool<T>, Span<T>, StringBuilder, pre-allocated buffers.
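As an illustration of the first two fixes, the sketch below reads a stream through a buffer rented from ArrayPool<byte> and inspects the filled portion through a span, so the hot path allocates no new arrays per call. The PooledReader class and its SumBytes method are hypothetical stand-ins for real per-request processing.

```csharp
using System;
using System.Buffers;
using System.IO;

public static class PooledReader
{
    // Sums all bytes of a stream using a rented buffer instead of a fresh array per call.
    public static long SumBytes(Stream source, int bufferSize = 4096)
    {
        byte[] buffer = ArrayPool<byte>.Shared.Rent(bufferSize); // may be larger than requested
        try
        {
            long total = 0;
            int read;
            while ((read = source.Read(buffer, 0, buffer.Length)) > 0)
            {
                // A span views the filled part of the buffer without copying it.
                ReadOnlySpan<byte> filled = buffer.AsSpan(0, read);
                foreach (byte b in filled)
                {
                    total += b;
                }
            }
            return total;
        }
        finally
        {
            // Always hand the buffer back, even if reading throws.
            ArrayPool<byte>.Shared.Return(buffer);
        }
    }
}
```

Rented arrays can contain data from a previous user and can be longer than requested, which is why the code only reads the first `read` bytes.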
Application Insights / Azure Monitor
In Azure, Application Insights gives you logs, traces, metrics, and exceptions in one place.
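// Reads ApplicationInsights:ConnectionString from configuration
// (or the APPLICATIONINSIGHTS_CONNECTION_STRING environment variable).
// The instrumentation-key overload is obsolete in current SDK versions.
builder.Services.AddApplicationInsightsTelemetry();
Key queries (Kusto):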
// Slowest requests in last hour
requests
| where timestamp > ago(1h)
| summarize avg(duration), percentile(duration, 95), count() by name
| order by avg_duration desc
// Exceptions by type
exceptions
| where timestamp > ago(1h)
| summarize count() by type
| order by count_ desc
// Failed dependencies
dependencies
| where success == false
| summarize count() by name, type
Incident Response Checklist
When something breaks in production:
- Check recent deployments: git log --since="2 hours ago"
- Check error rates: are they new or ongoing?
- Check logs around the time of first occurrence
- Check infrastructure metrics (CPU, memory, disk I/O)
- Reproduce in staging if possible; don't debug blind in prod
- Roll back if a recent deployment is the cause
- Fix forward if rollback is worse than the bug
- Write a post-mortem: what happened, why, what we'd do differently
Interview Questions
Q: How do you find a memory leak in a .NET production service?
Capture a memory dump with dotnet-dump collect, then analyze with dotnet-dump analyze. Run dumpheap -stat to find types with unexpectedly high instance counts or total size. Use gcroot <address> to find what's keeping objects alive. Common culprits: event subscriptions that are never removed, static caches with unbounded growth, undisposed resources.
Q: What causes thread pool starvation and how do you detect it?
Blocking on async calls (.Result, .Wait()) on ThreadPool threads, or long CPU-bound operations. Threads sit blocked instead of returning to the pool, so new work items and async continuations queue up faster than the pool adds threads. Detect with dotnet-counters watching threadpool-queue-length. Fix by making code truly async end-to-end.
Q: How would you diagnose N+1 query problems?
Enable EF Core slow query logging or use MiniProfiler in development. Application Insights dependency tracking works in production. Look for many near-identical SQL statements in a single request trace. Fix with Include(), explicit joins, or batching lookups with a Dictionary.
Q: What is distributed tracing and why does it matter?
A trace follows a request across multiple services, capturing timing for each hop. It answers "why is this request slow?" when the answer spans services. OpenTelemetry propagates a TraceId in HTTP headers; each service adds spans, and a backend such as Jaeger or Zipkin shows the full waterfall.
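The relationship between a trace and its spans can be observed in-process with System.Diagnostics alone. In the sketch below the source name "MyApp.Demo" is hypothetical, and the ActivityListener stands in for what the OpenTelemetry SDK registers. Without any listener, StartActivity returns null, which is why instrumented code uses the null-conditional operator.

```csharp
using System.Diagnostics;

public static class TraceDemo
{
    private static readonly ActivitySource Source = new ActivitySource("MyApp.Demo"); // hypothetical name

    public static (string ParentTraceId, string ChildTraceId, bool ChildLinksToParent) Run()
    {
        // Register a listener that samples everything from our source.
        using var listener = new ActivityListener
        {
            ShouldListenTo = source => source.Name == "MyApp.Demo",
            Sample = (ref ActivityCreationOptions<ActivityContext> options) =>
                ActivitySamplingResult.AllData
        };
        ActivitySource.AddActivityListener(listener);

        // The second activity starts while the first is current, so it becomes its child.
        using var parent = Source.StartActivity("HandleRequest");
        using var child = Source.StartActivity("CallDownstream");

        return (
            parent!.TraceId.ToString(),
            child!.TraceId.ToString(),
            child.ParentSpanId == parent.SpanId);
    }
}
```

Both spans share one TraceId, and the child records the parent's SpanId. Across services the same two values travel in the traceparent header, which is what lets a backend stitch the spans into one waterfall.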
Q: What is the difference between logging, metrics, and traces?
Logs are timestamped events with context (what happened). Metrics are aggregated measurements over time (how many, how fast). Traces follow a request through multiple services (where did the time go). Together they form the three pillars of observability, and you need all three for effective production debugging.