Production Debugging in .NET: Mindset, Tools, and Techniques

The Production Debugging Mindset

The worst thing you can do in a production incident is start guessing. The best engineers follow a structured approach:

Understand the symptom — what is broken? (latency spike, errors, OOM crash?)
Narrow the scope — when did it start? which environment? which endpoints?
Gather evidence — logs, metrics, traces — before touching anything
Form a hypothesis — one specific theory, falsifiable
Test the hypothesis — one change at a time
Fix and verify — confirm the symptom is gone, not just assumed

Debugging production is about reducing uncertainty, not heroics.

Structured Logging with Serilog

Unstructured logs are grep-able. Structured logs are queryable. The difference matters at scale.

// BAD — unstructured
logger.LogInformation($"Order {orderId} placed by {userId}");

// GOOD — structured (message template with named properties)
logger.LogInformation("Order {OrderId} placed by {UserId}", orderId, userId);

The second form creates a log event with OrderId and UserId as searchable properties in Seq, Elastic, or Application Insights — not just a formatted string.

Log Levels as Signal

| Level | When to use |
|---|---|
| Trace | Extremely detailed, dev only |
| Debug | Diagnostic info, typically disabled in prod |
| Information | Normal business events (order placed, user logged in) |
| Warning | Something unexpected but recoverable |
| Error | Operation failed, needs attention |
| Critical | System is unusable, immediate action required |

// Log at the right level
logger.LogInformation("Payment processed for OrderId {OrderId}", orderId);
logger.LogWarning("Payment retry {Attempt} for OrderId {OrderId}", attempt, orderId);
logger.LogError(ex, "Payment failed for OrderId {OrderId}", orderId);

Correlation IDs for Request Tracing

// Middleware — assign or forward a correlation ID
app.Use(async (ctx, next) =>
{
    var correlationId = ctx.Request.Headers["X-Correlation-Id"]
        .FirstOrDefault() ?? Guid.NewGuid().ToString();

    ctx.Response.Headers["X-Correlation-Id"] = correlationId;

    // app.Logger is the application's ILogger; any injected ILogger works the same way
    using (app.Logger.BeginScope(new Dictionary<string, object>
    {
        ["CorrelationId"] = correlationId
    }))
    {
        await next(ctx);
    }
});

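Now every log line within that request includes CorrelationId, so you can find all logs for a specific request across services, provided each service forwards the header on its outgoing calls. With Serilog as the logging provider, scope properties typically appear only when the logger configuration includes Enrich.FromLogContext().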
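The middleware above covers the inbound side only. A minimal sketch of the outbound side follows, assuming a hypothetical CorrelationIdHandler class and a caller-supplied currentId accessor (neither is part of the article or of ASP.NET Core); it copies the ID onto each outgoing HttpClient request so the next service can log it too.

```csharp
using System;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical handler: stamps the correlation ID on every outgoing request.
public sealed class CorrelationIdHandler : DelegatingHandler
{
    public const string HeaderName = "X-Correlation-Id";
    private readonly Func<string?> _currentId;

    // currentId is an assumed accessor for the ID captured by the inbound
    // middleware (in a real app it might read from IHttpContextAccessor).
    public CorrelationIdHandler(Func<string?> currentId) => _currentId = currentId;

    protected override Task<HttpResponseMessage> SendAsync(
        HttpRequestMessage request, CancellationToken cancellationToken)
    {
        // Reuse the inbound ID when there is one; otherwise start a new chain.
        var id = _currentId() ?? Guid.NewGuid().ToString();

        // Do not overwrite an ID the caller set explicitly.
        if (!request.Headers.Contains(HeaderName))
            request.Headers.TryAddWithoutValidation(HeaderName, id);

        return base.SendAsync(request, cancellationToken);
    }
}
```

With IHttpClientFactory, a handler like this would usually be attached through AddHttpMessageHandler on the client registration, so individual call sites do not need to remember the header.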

Distributed Tracing with OpenTelemetry

Logs tell you what happened on one service. Traces tell you what happened across all services for a single request.

dotnet add package OpenTelemetry.Extensions.Hosting
dotnet add package OpenTelemetry.Instrumentation.AspNetCore
dotnet add package OpenTelemetry.Instrumentation.Http
dotnet add package OpenTelemetry.Instrumentation.EntityFrameworkCore
dotnet add package OpenTelemetry.Exporter.OpenTelemetryProtocol

builder.Services.AddOpenTelemetry()
    .WithTracing(tracing => tracing
        .AddAspNetCoreInstrumentation()
        .AddHttpClientInstrumentation()
        .AddEntityFrameworkCoreInstrumentation()
        .AddSource("MyApp.*")         // custom activity sources
        // The dedicated Jaeger exporter package is deprecated;
        // Jaeger ingests OTLP directly, so export over OTLP instead.
        .AddOtlpExporter());

Adding Custom Spans

private static readonly ActivitySource _activitySource =
    new ActivitySource("MyApp.OrderService");

public async Task ProcessOrderAsync(Guid orderId, CancellationToken ct)
{
    using var activity = _activitySource.StartActivity("ProcessOrder");
    activity?.SetTag("order.id", orderId);

    try
    {
        await ValidateAsync(orderId, ct);

        using var paymentSpan = _activitySource.StartActivity("ChargePayment");
        await _paymentService.ChargeAsync(orderId, ct);
        paymentSpan?.SetTag("payment.status", "success");

        activity?.SetStatus(ActivityStatusCode.Ok);
    }
    catch (Exception ex)
    {
        activity?.SetStatus(ActivityStatusCode.Error, ex.Message);
        throw;
    }
}
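The null-conditional calls (activity?.SetTag) in the method above matter because StartActivity returns null when nothing is listening to the source. The sketch below illustrates that with an in-process ActivityListener, which is also a convenient way to check custom spans in a test without an exporter. SpanDemo and its method names are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

public static class SpanDemo
{
    // Same source name as the OrderService example above.
    private static readonly ActivitySource Source =
        new ActivitySource("MyApp.OrderService");

    // Runs `work` with a listener attached and returns the names of the
    // spans that finished while it ran.
    public static List<string> Capture(Action work)
    {
        var stopped = new List<string>();
        using var listener = new ActivityListener
        {
            // Listen only to this application's sources.
            ShouldListenTo = source => source.Name.StartsWith("MyApp."),
            // Record everything; a real exporter applies a sampling policy here.
            Sample = (ref ActivityCreationOptions<ActivityContext> options) =>
                ActivitySamplingResult.AllDataAndRecorded,
            ActivityStopped = activity => stopped.Add(activity.OperationName)
        };
        ActivitySource.AddActivityListener(listener);
        work();
        return stopped;
    }

    // Returns true only when a span was actually created.
    public static bool StartsSpan()
    {
        using var activity = Source.StartActivity("ProcessOrder");
        activity?.SetTag("order.id", 42); // no-op when activity is null
        return activity != null;
    }
}
```

Calling StartsSpan on its own (no listener registered in the process) yields no span, while calling it inside Capture records one named ProcessOrder. In production the OpenTelemetry SDK plays the listener role once AddSource matches the source name.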

Metrics with dotnet-counters

dotnet-counters is a real-time CLI dashboard for runtime and custom metrics. No deployment needed.

# Install
dotnet tool install --global dotnet-counters

# Watch a running process
dotnet-counters monitor --process-id <PID> --counters \
  System.Runtime,Microsoft.AspNetCore.Hosting

# Key metrics to watch:
# - cpu-usage
# - gc-heap-size
# - threadpool-queue-length
# - current-requests (ASP.NET Core)
# - requests-per-second

Custom Metrics

using System.Diagnostics.Metrics;

public class OrderMetrics
{
    private readonly Counter<long> _ordersCreated;
    private readonly Histogram<double> _processingTime;

    public OrderMetrics(IMeterFactory meterFactory)
    {
        var meter = meterFactory.Create("MyApp.Orders");
        _ordersCreated  = meter.CreateCounter<long>("orders.created");
        _processingTime = meter.CreateHistogram<double>("orders.processing_ms");
    }

    public void RecordOrderCreated(string region) =>
        _ordersCreated.Add(1, new TagList { { "region", region } });

    public void RecordProcessingTime(double ms) =>
        _processingTime.Record(ms);
}
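Custom instruments are easy to mistype or forget to record, and dotnet-counters only helps once the process is running. As a rough sketch, a MeterListener can capture measurements in-process, for example in a unit test. MeterProbe is a hypothetical helper, and the example creates a Meter directly rather than through IMeterFactory to stay dependency-free.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics.Metrics;

public static class MeterProbe
{
    // Runs `work` and returns every long-valued measurement that instruments
    // of the meter named `meterName` recorded while it ran.
    public static List<(string Instrument, long Value)> Capture(
        string meterName, Action work)
    {
        var seen = new List<(string Instrument, long Value)>();
        using var listener = new MeterListener();

        // Subscribe only to instruments that belong to the meter under test.
        listener.InstrumentPublished = (instrument, l) =>
        {
            if (instrument.Meter.Name == meterName)
                l.EnableMeasurementEvents(instrument);
        };

        // Counter<long> measurements arrive through the long callback;
        // a Histogram<double> would need a separate double callback.
        listener.SetMeasurementEventCallback<long>(
            (instrument, value, tags, state) => seen.Add((instrument.Name, value)));

        listener.Start();
        work();
        return seen;
    }
}
```

A test could create a Meter named "MyApp.Orders", add 1 to an "orders.created" counter inside the work delegate, and assert that exactly one measurement with that name and value comes back.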

Performance Profiling with dotnet-trace

Capture a CPU profile or allocation trace from a live process — without attaching a debugger.

# Install
dotnet tool install --global dotnet-trace

# Capture 30 seconds of CPU profile
dotnet-trace collect --process-id <PID> --duration 00:00:30 \
  --profile cpu-sampling

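# Capture GC events + sampled allocations
# (--clrevents takes '+'-separated keywords such as gc+gchandle;
#  the gc-verbose profile bundles GC collections with allocation sampling)
dotnet-trace collect --process-id <PID> --duration 00:00:30 \
  --profile gc-verbose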

# Open the .nettrace file in Visual Studio or PerfView

Memory Dumps

When a process crashes, hangs, or has a suspected memory leak, capture a dump.

# Install
dotnet tool install --global dotnet-dump

# Capture a dump from a running process
dotnet-dump collect --process-id <PID>

# Analyze
dotnet-dump analyze <dump-file>

# Useful commands inside the analyzer
> gcroot <object-address>     # what's keeping this object alive?
> dumpheap -stat              # heap size by type
> dumpheap -type OrderService # all instances of a type
> threads                     # thread list
> clrstack                    # call stack of current thread

Mini-Profiler (SQL and HTTP Profiling in Dev)

MiniProfiler shows per-request timing for SQL queries, HTTP calls, and custom steps — visible in the browser.

dotnet add package MiniProfiler.AspNetCore.Mvc
dotnet add package MiniProfiler.EntityFrameworkCore

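builder.Services.AddMiniProfiler(options =>
{
    options.RouteBasePath = "/profiler";
    options.SqlFormatter = new InlineFormatter();
}).AddEntityFramework();

// After building the app, ahead of the endpoints you want profiled
app.UseMiniProfiler();

// In _ViewImports.cshtml: register the MiniProfiler tag helper
@addTagHelper *, MiniProfiler.AspNetCore.Mvc

// In your Razor layout: renders the profiler widget
<mini-profiler />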

Now every request shows SQL query count, timing, and duplicates inline in the browser. This is how you catch N+1 queries in development before they hit production.

Common Production Failure Patterns

Memory Leak

Symptoms: heap size grows continuously, eventually OOM.

Common causes:

Event handlers not unsubscribed (+= without -=)
Static collections growing forever
IDisposable objects not disposed
Captured closures holding large objects

// LEAK — event handler never removed
_bus.OrderCreated += HandleOrderCreated;

// FIX — remove in Dispose
public void Dispose() => _bus.OrderCreated -= HandleOrderCreated;

Diagnose: dotnet-dump + dumpheap -stat to find the growing type.
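To make the event-handler leak concrete, here is a small self-contained sketch. OrderBus and OrderAuditor are illustrative names, not types from the article. The publisher's invocation list holds a reference to every subscriber, so a long-lived bus keeps short-lived subscribers alive until they unsubscribe.

```csharp
using System;

public sealed class OrderBus
{
    public event EventHandler<Guid>? OrderCreated;

    public void Publish(Guid orderId) => OrderCreated?.Invoke(this, orderId);

    // Number of handlers (and therefore subscriber objects) the bus keeps alive.
    public int SubscriberCount => OrderCreated?.GetInvocationList().Length ?? 0;
}

public sealed class OrderAuditor : IDisposable
{
    private readonly OrderBus _bus;

    public int Seen { get; private set; }

    public OrderAuditor(OrderBus bus)
    {
        _bus = bus;
        // From here on the bus holds a reference to this auditor.
        _bus.OrderCreated += HandleOrderCreated;
    }

    private void HandleOrderCreated(object? sender, Guid orderId) => Seen++;

    // Unsubscribing releases the reference, so the GC can reclaim the auditor.
    public void Dispose() => _bus.OrderCreated -= HandleOrderCreated;
}
```

If the bus is a singleton and auditors are created per request without being disposed, SubscriberCount grows with every request. In a dump that shows up as a steadily rising OrderAuditor count in dumpheap -stat, rooted through the bus's delegate.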

Thread Pool Starvation

Symptoms: requests queue up, latency climbs, but CPU is low.

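Cause: sync-over-async (.Result, .Wait()) or other long-running blocking work on ThreadPool threads. Each blocked thread waits idle instead of picking up queued work items, and the pool adds new threads only slowly.

Diagnose: dotnet-counters, watching threadpool-queue-length and threadpool-thread-count. A queue length that keeps climbing while the thread count creeps upward and CPU stays low indicates starvation.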

# Quote the counter list so the shell does not treat [...] as a glob pattern
dotnet-counters monitor --process-id <PID> \
  --counters 'System.Runtime[threadpool-queue-length,threadpool-thread-count]'
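For reference, the blocking and non-blocking shapes look like this side by side. This is a simplified sketch: FetchAsync stands in for real I/O, and StarvationDemo is an illustrative name. PoolSnapshot reads thread pool figures from inside the process, which can be handy for a health endpoint when attaching dotnet-counters is not an option.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class StarvationDemo
{
    // Stand-in for real I/O such as a database or HTTP call.
    private static async Task<int> FetchAsync()
    {
        await Task.Delay(50);
        return 1;
    }

    // BAD: sync-over-async. The calling pool thread is blocked for the
    // entire wait and cannot serve any other request in the meantime.
    public static int HandleBlocking() => FetchAsync().Result;

    // GOOD: the thread goes back to the pool while the I/O is pending.
    public static async Task<int> HandleAsync() => await FetchAsync();

    // Current pool size and queued work items, read in-process.
    public static (int Threads, long Queued) PoolSnapshot() =>
        (ThreadPool.ThreadCount, ThreadPool.PendingWorkItemCount);
}
```

Both handlers return the same value; the difference only shows under load, when many concurrent HandleBlocking calls pin one thread each and the Queued figure starts to rise.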

N+1 Query

Symptoms: requests that load a list are slow; SQL profiler shows 1 query per item.

// N+1 — loads all orders, then queries customer for EACH
var orders = await dbContext.Orders.ToListAsync();
foreach (var order in orders)
{
    var customer = await dbContext.Customers.FindAsync(order.CustomerId); // N queries
}

// Fix — eager load with Include
var orders = await dbContext.Orders
    .Include(o => o.Customer)
    .ToListAsync();

Diagnose: MiniProfiler, Serilog with EF Core slow query logging, Application Insights dependency tracking.
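When Include is not an option (for example, the related data comes from another context or service), batching the lookups gives the same effect. The sketch below shows the shape with in-memory data: loadCustomers is a hypothetical delegate standing in for one query that fetches all needed customers at once, and the record types are invented for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public record Customer(int Id, string Name);
public record Order(int Id, int CustomerId);

public static class OrderReport
{
    // loadCustomers stands in for ONE round trip that fetches every customer
    // whose ID is in the given set; it is not an EF Core API.
    public static List<string> Describe(
        IReadOnlyList<Order> orders,
        Func<IReadOnlyCollection<int>, IEnumerable<Customer>> loadCustomers)
    {
        // 1. Collect the distinct keys the orders need.
        var ids = orders.Select(o => o.CustomerId).Distinct().ToList();

        // 2. Fetch them in a single call and index the result by key.
        var byId = loadCustomers(ids).ToDictionary(c => c.Id);

        // 3. Join in memory; no further queries happen inside the loop.
        return orders
            .Select(o => byId.TryGetValue(o.CustomerId, out var customer)
                ? $"Order {o.Id}: {customer.Name}"
                : $"Order {o.Id}: (unknown customer)")
            .ToList();
    }
}
```

With EF Core, the delegate would typically be a single filtered query on Customers using ids.Contains(c.Id), which translates to a SQL IN clause. The total is then two queries regardless of how many orders there are.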

Connection Pool Exhaustion

Symptoms: SqlException: Timeout expired. The timeout period elapsed prior to obtaining a connection from the pool.

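Cause: connections that are held too long or never returned to the pool (an undisposed DbContext or SqlConnection, long-running transactions), or more concurrent requests than the pool allows (Max Pool Size defaults to 100 in SqlClient). HttpClient has an analogous failure mode: creating a new HttpClient per request exhausts sockets rather than the SQL pool.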

// FIX for HttpClient — use IHttpClientFactory, never new HttpClient()
builder.Services.AddHttpClient<IProductService, ProductService>();

// FIX for DbContext — use scoped lifetime (default for EF Core + DI)
builder.Services.AddDbContext<AppDbContext>(...); // scoped by default

High GC Pressure

Symptoms: latency spikes every few seconds, high % Time in GC in counters.

Cause: excessive short-lived allocations (strings, byte arrays, LINQ chains in hot paths).

dotnet-trace collect --process-id <PID> --clrevents GC
# Look for frequent or long blocking Gen2 GCs: they suspend managed threads for the whole collection

Fix: ArrayPool<T>, Span<T>, StringBuilder, pre-allocated buffers.
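As one illustration of the pooled-buffer approach, the sketch below reads a stream through a single rented buffer instead of allocating a new byte[] per call. Checksum.Sum is a made-up example workload; the Rent/Return pattern is the part that carries over to real hot paths.

```csharp
using System;
using System.Buffers;
using System.IO;

public static class Checksum
{
    // Sums all bytes of a stream using one pooled buffer.
    public static long Sum(Stream input)
    {
        // Rent may hand back an array larger than requested, with stale contents.
        byte[] buffer = ArrayPool<byte>.Shared.Rent(4096);
        try
        {
            long total = 0;
            int read;
            while ((read = input.Read(buffer, 0, buffer.Length)) > 0)
            {
                // Only the first `read` bytes belong to this call.
                foreach (byte b in buffer.AsSpan(0, read))
                    total += b;
            }
            return total;
        }
        finally
        {
            // Always return the buffer, even on exceptions, or the pool drains.
            ArrayPool<byte>.Shared.Return(buffer);
        }
    }
}
```

The two details that tend to cause bugs are visible here: the rented array's length is not the requested length, and its contents are not cleared, so the code slices to the number of bytes actually read.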

Application Insights / Azure Monitor

In Azure, Application Insights gives you logs, traces, metrics, and exceptions in one place.

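// Instrumentation keys are deprecated in favor of connection strings
builder.Services.AddApplicationInsightsTelemetry(options =>
    options.ConnectionString =
        builder.Configuration["ApplicationInsights:ConnectionString"]);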

Key queries (Kusto):

// Slowest requests in last hour
requests
| where timestamp > ago(1h)
| summarize avg(duration), percentile(duration, 95), count() by name
| order by avg_duration desc

// Exceptions by type
exceptions
| where timestamp > ago(1h)
| summarize count() by type
| order by count_ desc

// Failed dependencies
dependencies
| where success == false
| summarize count() by name, type

Incident Response Checklist

When something breaks in production:

Check recent deployments: release pipeline history, plus git log --since="2 hours ago" to see which commits went out
Check error rates — are they new or ongoing?
Check logs around the time of first occurrence
Check infrastructure metrics (CPU, memory, disk I/O)
Reproduce in staging if possible — don't debug blind in prod
Roll back if a recent deployment is the cause
Fix forward if rollback is worse than the bug
Write a post-mortem — what happened, why, what we'd do differently

Interview Questions

Q: How do you find a memory leak in a .NET production service? Capture a memory dump with dotnet-dump collect, then analyze with dotnet-dump analyze. Run dumpheap -stat to find types with unexpectedly high instance counts or total size. Use gcroot <address> to find what's keeping objects alive. Common culprits: event subscriptions that are never removed, static caches with unbounded growth, undisposed resources.

Q: What causes thread pool starvation and how do you detect it? Blocking on async calls (.Result, .Wait()) or other long-running blocking work on ThreadPool threads. Blocked threads sit idle waiting instead of picking up queued work, so requests queue while CPU stays low. Detect with dotnet-counters watching threadpool-queue-length. Fix by making code truly async end-to-end.

Q: How would you diagnose N+1 query problems? Enable EF Core slow query logging or use MiniProfiler in development. Application Insights dependency tracking works in production. Look for many near-identical SQL statements in a single request trace. Fix with Include(), explicit joins, or batching lookups with a Dictionary.

Q: What is distributed tracing and why does it matter? A trace follows a request across multiple services, capturing timing for each hop. It answers "why is this request slow?" when the answer spans services. OpenTelemetry with Jaeger or Zipkin propagates a TraceId in HTTP headers — each service adds spans, and you see the full waterfall.

Q: What is the difference between logging, metrics, and traces? Logs are timestamped events with context (what happened). Metrics are aggregated measurements over time (how many, how fast). Traces follow a request through multiple services (where did the time go). Together they form the three pillars of observability — you need all three for effective production debugging.
