
Production Debugging in .NET: Mindset, Tools, and Techniques

How to diagnose and fix production issues in .NET. Covers structured logging, distributed tracing, memory dumps, dotnet-trace, dotnet-counters, mini-profiler, common failure patterns, and the production debugging mindset.

Learnixo | June 3, 2026 | 8 min read
Tags: .NET, C#, Debugging, Observability, Performance, Production, Interview

The Production Debugging Mindset

The worst thing you can do in a production incident is start guessing. The best engineers follow a structured approach:

  1. Understand the symptom — what is broken? (latency spike, errors, OOM crash?)
  2. Narrow the scope — when did it start? which environment? which endpoints?
  3. Gather evidence — logs, metrics, traces — before touching anything
  4. Form a hypothesis — one specific theory, falsifiable
  5. Test the hypothesis — one change at a time
  6. Fix and verify — confirm the symptom is gone, not just assumed

Debugging production is about reducing uncertainty, not heroics.


Structured Logging with Serilog

Unstructured logs are grep-able. Structured logs are queryable. The difference matters at scale.

C#
// BAD — unstructured
logger.LogInformation($"Order {orderId} placed by {userId}");

// GOOD — structured (message template with named properties)
logger.LogInformation("Order {OrderId} placed by {UserId}", orderId, userId);

The second form creates a log event with OrderId and UserId as searchable properties in Seq, Elastic, or Application Insights — not just a formatted string.

Log Levels as Signal

| Level | When to use |
|---|---|
| Trace | Extremely detailed, dev only |
| Debug | Diagnostic info, typically disabled in prod |
| Information | Normal business events (order placed, user logged in) |
| Warning | Something unexpected but recoverable |
| Error | Operation failed, needs attention |
| Critical | System is unusable, immediate action required |

C#
// Log at the right level
logger.LogInformation("Payment processed for OrderId {OrderId}", orderId);
logger.LogWarning("Payment retry {Attempt} for OrderId {OrderId}", attempt, orderId);
logger.LogError(ex, "Payment failed for OrderId {OrderId}", orderId);

Correlation IDs for Request Tracing

C#
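// Middleware: assign or forward a correlation ID.
// app.Logger is the WebApplication's built-in ILogger. Scopes are ambient, so the
// property flows to logs written by any logger during the request (when the
// logging provider has scopes enabled).
app.Use(async (ctx, next) =>
{
    var correlationId = ctx.Request.Headers["X-Correlation-Id"]
        .FirstOrDefault() ?? Guid.NewGuid().ToString();

    ctx.Response.Headers["X-Correlation-Id"] = correlationId;

    using (app.Logger.BeginScope(new Dictionary<string, object>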
    {
        ["CorrelationId"] = correlationId
    }))
    {
        await next(ctx);
    }
});

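Now every log line written during that request carries CorrelationId. To follow one request across services, each outgoing call must also forward the same X-Correlation-Id header so the next service reuses it instead of generating a new one.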
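
Forwarding the header on outgoing calls can be done with a DelegatingHandler. The following is a minimal sketch, not the article's implementation: CorrelationIdHandler and StubHandler are hypothetical names, the ID is supplied through a plain delegate, and the URL is a placeholder that is never contacted because the stub answers the request.

```csharp
using System;
using System.Linq;
using System.Net;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

// Demo: a stub "downstream service" records the header it receives.
string? seenByDownstream = null;
var stub = new StubHandler(request =>
{
    seenByDownstream = request.Headers.GetValues("X-Correlation-Id").Single();
    return new HttpResponseMessage(HttpStatusCode.OK);
});

var handler = new CorrelationIdHandler(() => "abc-123") { InnerHandler = stub };
using var client = new HttpClient(handler);
await client.GetAsync("http://orders.internal/health"); // hypothetical URL
Console.WriteLine(seenByDownstream); // abc-123

// Adds the current correlation ID to every outgoing request that lacks one.
sealed class CorrelationIdHandler : DelegatingHandler
{
    private const string HeaderName = "X-Correlation-Id";
    private readonly Func<string?> _currentId;

    public CorrelationIdHandler(Func<string?> currentId) => _currentId = currentId;

    protected override Task<HttpResponseMessage> SendAsync(
        HttpRequestMessage request, CancellationToken cancellationToken)
    {
        var id = _currentId();
        if (!string.IsNullOrEmpty(id) && !request.Headers.Contains(HeaderName))
        {
            request.Headers.TryAddWithoutValidation(HeaderName, id);
        }
        return base.SendAsync(request, cancellationToken);
    }
}

// Test double standing in for the network.
sealed class StubHandler : HttpMessageHandler
{
    private readonly Func<HttpRequestMessage, HttpResponseMessage> _respond;

    public StubHandler(Func<HttpRequestMessage, HttpResponseMessage> respond) => _respond = respond;

    protected override Task<HttpResponseMessage> SendAsync(
        HttpRequestMessage request, CancellationToken cancellationToken)
        => Task.FromResult(_respond(request));
}
```

In an ASP.NET Core app the delegate would typically read the ID from the current HttpContext, and the handler would be attached to a typed client with AddHttpMessageHandler.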


Distributed Tracing with OpenTelemetry

Logs tell you what happened on one service. Traces tell you what happened across all services for a single request.

Bash
dotnet add package OpenTelemetry.Extensions.Hosting
dotnet add package OpenTelemetry.Instrumentation.AspNetCore
dotnet add package OpenTelemetry.Instrumentation.Http
# may require --prerelease if only preview versions are published
dotnet add package OpenTelemetry.Instrumentation.EntityFrameworkCore
# OTLP replaces the deprecated OpenTelemetry.Exporter.Jaeger package; Jaeger ingests OTLP natively
dotnet add package OpenTelemetry.Exporter.OpenTelemetryProtocol

C#
builder.Services.AddOpenTelemetry()
    .WithTracing(tracing => tracing
        .AddAspNetCoreInstrumentation()
        .AddHttpClientInstrumentation()
        .AddEntityFrameworkCoreInstrumentation()
        .AddSource("MyApp.*")         // custom activity sources
        .AddOtlpExporter());          // defaults to http://localhost:4317 (gRPC)

Adding Custom Spans

C#
private static readonly ActivitySource _activitySource =
    new ActivitySource("MyApp.OrderService");

public async Task ProcessOrderAsync(Guid orderId, CancellationToken ct)
{
    using var activity = _activitySource.StartActivity("ProcessOrder");
    activity?.SetTag("order.id", orderId);

    try
    {
        await ValidateAsync(orderId, ct);

        // Child span: scoped so it ends as soon as the charge completes
        using (var paymentSpan = _activitySource.StartActivity("ChargePayment"))
        {
            await _paymentService.ChargeAsync(orderId, ct);
            paymentSpan?.SetTag("payment.status", "success");
        }

        activity?.SetStatus(ActivityStatusCode.Ok);
    }
    catch (Exception ex)
    {
        activity?.SetStatus(ActivityStatusCode.Error, ex.Message);
        throw;
    }
}
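
The null-conditional calls (activity?.SetTag) above matter because StartActivity returns null when nothing is listening to the source. The sketch below illustrates this with an in-process ActivityListener, which is roughly the role the OpenTelemetry SDK plays once AddSource is configured. It uses only System.Diagnostics; the tag values are made up for the demo.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

var source = new ActivitySource("MyApp.OrderService");

// No listener yet: StartActivity returns null, so an unobserved span costs almost nothing.
using (var unobserved = source.StartActivity("ProcessOrder"))
{
    Console.WriteLine(unobserved is null); // True
}

var finished = new List<string>();
using var listener = new ActivityListener
{
    ShouldListenTo = s => s.Name.StartsWith("MyApp."),
    Sample = (ref ActivityCreationOptions<ActivityContext> _) =>
        ActivitySamplingResult.AllDataAndRecorded,
    ActivityStopped = a =>
        finished.Add($"{a.OperationName} parent={a.Parent?.OperationName ?? "none"}")
};
ActivitySource.AddActivityListener(listener);

using (var parent = source.StartActivity("ProcessOrder"))
{
    parent?.SetTag("order.id", Guid.NewGuid());
    using (var child = source.StartActivity("ChargePayment"))
    {
        child?.SetTag("payment.status", "success");
    } // child span stops here
} // parent span stops here

foreach (var line in finished) Console.WriteLine(line);
// ChargePayment parent=ProcessOrder
// ProcessOrder parent=none
```

The child stops first, which is why it appears first in the output; the nesting comes from Activity.Current, not from anything passed explicitly.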

Metrics with dotnet-counters

dotnet-counters is a real-time CLI dashboard for runtime and custom metrics. No deployment needed.

Bash
# Install
dotnet tool install --global dotnet-counters

# Watch a running process
dotnet-counters monitor --process-id <PID> --counters \
  System.Runtime,Microsoft.AspNetCore.Hosting

# Key metrics to watch (exact names vary by runtime and tool version):
# - cpu-usage
# - gc-heap-size
# - threadpool-queue-length
# - current-requests (ASP.NET Core)
# - requests-per-second

Custom Metrics

C#
using System.Diagnostics.Metrics;

public class OrderMetrics
{
    private readonly Counter<long> _ordersCreated;
    private readonly Histogram<double> _processingTime;

    public OrderMetrics(IMeterFactory meterFactory)
    {
        var meter = meterFactory.Create("MyApp.Orders");
        _ordersCreated  = meter.CreateCounter<long>("orders.created");
        _processingTime = meter.CreateHistogram<double>("orders.processing_ms");
    }

    public void RecordOrderCreated(string region) =>
        _ordersCreated.Add(1, new TagList { { "region", region } });

    public void RecordProcessingTime(double ms) =>
        _processingTime.Record(ms);
}
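
To check that instruments like these actually emit values, for example in a unit test, you can subscribe in-process with a MeterListener. This is a sketch under simplifying assumptions: the Meter is created directly instead of through IMeterFactory, and the region names are invented.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics.Metrics;

// Created directly for the demo; the class above gets its Meter from IMeterFactory.
using var meter = new Meter("MyApp.Orders");
var ordersCreated = meter.CreateCounter<long>("orders.created");
var processingTime = meter.CreateHistogram<double>("orders.processing_ms");

long totalOrders = 0;
var timings = new List<double>();

using var listener = new MeterListener();
listener.InstrumentPublished = (instrument, l) =>
{
    // Subscribe only to instruments from our own meter.
    if (instrument.Meter.Name == "MyApp.Orders") l.EnableMeasurementEvents(instrument);
};
listener.SetMeasurementEventCallback<long>(
    (instrument, value, tags, state) => totalOrders += value);
listener.SetMeasurementEventCallback<double>(
    (instrument, value, tags, state) => timings.Add(value));
listener.Start();

ordersCreated.Add(1, new KeyValuePair<string, object?>("region", "eu-west"));
ordersCreated.Add(1, new KeyValuePair<string, object?>("region", "us-east"));
processingTime.Record(42.5);

Console.WriteLine($"orders={totalOrders}, timings={timings.Count}"); // orders=2, timings=1
```

Callbacks run synchronously on the thread that records the measurement, so keep them cheap.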

Performance Profiling with dotnet-trace

Capture a CPU profile or allocation trace from a live process — without attaching a debugger.

Bash
# Install
dotnet tool install --global dotnet-trace

# Capture 30 seconds of CPU profile
dotnet-trace collect --process-id <PID> --duration 00:00:30 \
  --profile cpu-sampling

# Capture GC events + sampled allocations (keywords are joined with '+')
dotnet-trace collect --process-id <PID> --duration 00:00:30 \
  --clrevents gc+gcsampledobjectallocationhigh

# Open the .nettrace file in Visual Studio or PerfView

Memory Dumps

When a process crashes, hangs, or has a suspected memory leak, capture a dump.

Bash
# Install
dotnet tool install --global dotnet-dump

# Capture a dump from a running process
dotnet-dump collect --process-id <PID>

# Analyze
dotnet-dump analyze <dump-file>

# Useful commands inside the analyzer
> gcroot <object-address>     # what's keeping this object alive?
> dumpheap -stat              # heap size by type
> dumpheap -type OrderService # all instances of a type
> threads                     # thread list
> clrstack                    # call stack of current thread

Mini-Profiler (SQL and HTTP Profiling in Dev)

MiniProfiler shows per-request timing for SQL queries, HTTP calls, and custom steps — visible in the browser.

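Bash
dotnet add package MiniProfiler.AspNetCore.Mvc
dotnet add package MiniProfiler.EntityFrameworkCore

C#
builder.Services.AddMiniProfiler(options =>
{
    options.RouteBasePath = "/profiler";
    options.SqlFormatter = new StackExchange.Profiling.SqlFormatters.InlineFormatter();
}).AddEntityFramework();

// After builder.Build(): enable the middleware ahead of the endpoints you want profiled
app.UseMiniProfiler();

// In _ViewImports.cshtml: register the tag helper
@addTagHelper *, MiniProfiler.AspNetCore.Mvc

// In your Razor layout, just before </body>: renders the profiler widget
<mini-profiler />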

Now every request shows SQL query count, timing, and duplicates inline in the browser. This is how you catch N+1 queries in development before they hit production.


Common Production Failure Patterns

Memory Leak

Symptoms: heap size grows continuously, eventually OOM.

Common causes:

  • Event handlers not unsubscribed (+= without -=)
  • Static collections growing forever
  • IDisposable objects not disposed
  • Captured closures holding large objects

C#
// LEAK — event handler never removed
_bus.OrderCreated += HandleOrderCreated;

// FIX — remove in Dispose
public void Dispose() => _bus.OrderCreated -= HandleOrderCreated;

Diagnose: dotnet-dump + dumpheap -stat to find the growing type.

Thread Pool Starvation

Symptoms: requests queue up, latency climbs, but CPU is low.

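Cause: sync-over-async (.Result, .Wait(), .GetAwaiter().GetResult()) or long blocking work on ThreadPool threads. Blocked threads cannot pick up queued requests or run the continuations that would unblock them, and the pool adds new threads only slowly.

Diagnose: dotnet-counters. Watch threadpool-queue-length and threadpool-thread-count. A queue that keeps growing while the thread count creeps upward and CPU stays low points to starvation.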

Bash
dotnet-counters monitor --process-id <PID> \
  --counters System.Runtime[threadpool-queue-length,threadpool-thread-count]
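
The difference between blocking and awaiting can be reproduced in a few lines. In this sketch FetchAsync is a hypothetical stand-in for a database or HTTP call; both variants return the same results, but the blocking one parks a pool thread for the full duration of every call.

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in for an I/O call (database, HTTP).
async Task<int> FetchAsync(int id)
{
    await Task.Delay(50);
    return id * 2;
}

// ANTI-PATTERN: .Result holds a pool thread for the whole 50 ms.
int FetchBlocking(int id) => FetchAsync(id).Result;

// Eight blocking calls occupy up to eight pool threads at once.
var blocking = await Task.WhenAll(
    Enumerable.Range(1, 8).Select(i => Task.Run(() => FetchBlocking(i))));

// Truly async: no thread is held while the "I/O" is pending.
var nonBlocking = await Task.WhenAll(
    Enumerable.Range(1, 8).Select(i => FetchAsync(i)));

// The same numbers dotnet-counters reports, read in-process.
Console.WriteLine(
    $"pool threads={ThreadPool.ThreadCount}, queued={ThreadPool.PendingWorkItemCount}");
Console.WriteLine(blocking.Sum() == nonBlocking.Sum()); // True
```

With eight calls the pool copes. Under production load the blocked threads outnumber what the pool can inject, and the queue length starts to climb.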

N+1 Query

Symptoms: requests that load a list are slow; SQL profiler shows 1 query per item.

C#
// N+1 — loads all orders, then queries customer for EACH
var orders = await dbContext.Orders.ToListAsync();
foreach (var order in orders)
{
    var customer = await dbContext.Customers.FindAsync(order.CustomerId); // N queries
}

// Fix: eager load with Include (one query with a JOIN)
var ordersWithCustomers = await dbContext.Orders
    .Include(o => o.Customer)
    .ToListAsync();

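Diagnose: MiniProfiler in development; EF Core command logging shipped through Serilog (the Microsoft.EntityFrameworkCore.Database.Command category logs each SQL statement with its elapsed time); Application Insights dependency tracking in production.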
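
When Include does not fit, for example when the related data comes from another store, batching the lookups into a dictionary gives the same reduction in round trips. The sketch below uses in-memory collections as a stand-in for the database; the Query* functions and the sample data are hypothetical, and each call counts as one round trip.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// In-memory stand-ins for two tables.
var customersTable = new Dictionary<int, string> { [1] = "Ada", [2] = "Linus", [3] = "Grace" };
var ordersTable = new List<(int OrderId, int CustomerId)> { (100, 1), (101, 2), (102, 1), (103, 3) };
var roundTrips = 0;

List<(int OrderId, int CustomerId)> QueryOrders() { roundTrips++; return ordersTable.ToList(); }
string QueryCustomerById(int id) { roundTrips++; return customersTable[id]; }
Dictionary<int, string> QueryCustomersByIds(IEnumerable<int> ids)
{
    roundTrips++; // one "WHERE Id IN (...)" style query
    return ids.Distinct().ToDictionary(id => id, id => customersTable[id]);
}

// N+1: one query for the orders, then one more per order.
roundTrips = 0;
var slow = QueryOrders()
    .Select(o => (o.OrderId, Name: QueryCustomerById(o.CustomerId)))
    .ToList();
var nPlusOneTrips = roundTrips;

// Batched: one query for the orders, one for all referenced customers.
roundTrips = 0;
var orders = QueryOrders();
var customersById = QueryCustomersByIds(orders.Select(o => o.CustomerId));
var fast = orders
    .Select(o => (o.OrderId, Name: customersById[o.CustomerId]))
    .ToList();
var batchedTrips = roundTrips;

Console.WriteLine($"N+1: {nPlusOneTrips} round trips, batched: {batchedTrips}");
// N+1: 5 round trips, batched: 2
```

With EF Core the batched lookup would usually be a Where(c => ids.Contains(c.Id)) query materialized into a dictionary keyed by Id.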

Connection Pool Exhaustion

Symptoms: SqlException: Timeout expired. The timeout period elapsed prior to obtaining a connection from the pool.

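Cause: connections that are never returned to the pool (a DbContext or SqlConnection that is not disposed, or a long-lived context held by a singleton), or more concurrent requests than the pool size allows. HttpClient has a related failure mode: creating a new instance per request exhausts sockets rather than SQL connections.

C#
// FIX for HttpClient: use IHttpClientFactory instead of new HttpClient() per request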
builder.Services.AddHttpClient<IProductService, ProductService>();

// FIX for DbContext — use scoped lifetime (default for EF Core + DI)
builder.Services.AddDbContext<AppDbContext>(...); // scoped by default

High GC Pressure

Symptoms: latency spikes every few seconds and a high time-in-gc percentage in dotnet-counters.

Cause: excessive short-lived allocations (strings, byte arrays, LINQ chains in hot paths).

Bash
dotnet-trace collect --process-id <PID> --clrevents GC
# Look for Gen2 GCs — they pause all threads

Fix: ArrayPool<T>, Span<T>, StringBuilder, pre-allocated buffers.
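
As an illustration of the pooling approach, the sketch below rents a buffer from ArrayPool<byte> instead of allocating a new array per call and reads it through a span. CountDigits is a hypothetical hot-path helper invented for the example.

```csharp
using System;
using System.Buffers;
using System.Text;

// Hypothetical hot-path helper: counts ASCII digits in the UTF-8 form of a message.
int CountDigits(string message)
{
    // Rent instead of `new byte[...]`; the rented array may be larger than requested.
    byte[] buffer = ArrayPool<byte>.Shared.Rent(Encoding.UTF8.GetMaxByteCount(message.Length));
    try
    {
        int written = Encoding.UTF8.GetBytes(message, 0, message.Length, buffer, 0);
        ReadOnlySpan<byte> bytes = buffer.AsSpan(0, written); // a view over the buffer, no copy

        int digits = 0;
        foreach (byte b in bytes)
        {
            if (b >= (byte)'0' && b <= (byte)'9') digits++;
        }
        return digits;
    }
    finally
    {
        // Always hand the buffer back, even if encoding throws.
        ArrayPool<byte>.Shared.Return(buffer);
    }
}

Console.WriteLine(CountDigits("Order 12345 placed by user 42")); // 7
```

Only the first `written` bytes are meaningful: rented arrays are not cleared and may contain data from a previous user.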


Application Insights / Azure Monitor

In Azure, Application Insights gives you logs, traces, metrics, and exceptions in one place.

C#
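// Connection strings replace the older instrumentation key setting
builder.Services.AddApplicationInsightsTelemetry(options =>
    options.ConnectionString = builder.Configuration["ApplicationInsights:ConnectionString"]);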

Key queries (Kusto):

KUSTO
// Slowest requests in last hour
requests
| where timestamp > ago(1h)
| summarize avg(duration), percentile(duration, 95), count() by name
| order by avg_duration desc

// Exceptions by type
exceptions
| where timestamp > ago(1h)
| summarize count() by type
| order by count_ desc

// Failed dependencies
dependencies
| where success == false
| summarize count() by name, type

Incident Response Checklist

When something breaks in production:

  1. Check recent deployments: start with your pipeline's release history; git log --since="2 hours ago" shows what was committed, not what was deployed
  2. Check error rates — are they new or ongoing?
  3. Check logs around the time of first occurrence
  4. Check infrastructure metrics (CPU, memory, disk I/O)
  5. Reproduce in staging if possible — don't debug blind in prod
  6. Roll back if a recent deployment is the cause
  7. Fix forward if rollback is worse than the bug
  8. Write a post-mortem — what happened, why, what we'd do differently

Interview Questions

Q: How do you find a memory leak in a .NET production service?
Capture a memory dump with dotnet-dump collect, then analyze it with dotnet-dump analyze. Run dumpheap -stat to find types with unexpectedly high instance counts or total size, ideally comparing two dumps taken a few minutes apart. Use gcroot <address> to find what is keeping objects alive. Common culprits: event subscriptions that are never removed, static caches with unbounded growth, undisposed resources.

Q: What causes thread pool starvation and how do you detect it?
Blocking on async calls (.Result, .Wait()) on ThreadPool threads, or long CPU-bound operations. Threads sit blocked instead of returning to the pool, so incoming requests and pending continuations queue up. Detect it with dotnet-counters by watching threadpool-queue-length. Fix it by making the code truly async end-to-end.

Q: How would you diagnose N+1 query problems?
Enable EF Core command logging or use MiniProfiler in development. Application Insights dependency tracking works in production. Look for many near-identical SQL statements in a single request trace. Fix with Include(), explicit joins, or batching lookups with a Dictionary.

Q: What is distributed tracing and why does it matter?
A trace follows a request across multiple services, capturing timing for each hop. It answers "why is this request slow?" when the answer spans services. OpenTelemetry propagates the trace context (a TraceId and parent span ID, carried in the W3C traceparent HTTP header). Each service adds its own spans, and a backend such as Jaeger or Zipkin shows the full waterfall.

Q: What is the difference between logging, metrics, and traces?
Logs are timestamped events with context (what happened). Metrics are aggregated measurements over time (how many, how fast). Traces follow a request through multiple services (where did the time go). Together they form the three pillars of observability. You need all three for effective production debugging.
