Learnixo

Incident Management for .NET Teams — On-Call, Runbooks, and Postmortems

The Cost of Incident Chaos

Without a defined incident process, the same failure costs you twice: once in downtime and once in the hours engineers spend frantically Slack-messaging each other trying to find the right person, the right logs, and the right playbook. The second cost is often larger than the first.

A mature incident practice is not bureaucracy — it is a forcing function that makes every future incident cheaper to resolve than the last.

This article covers the full lifecycle of incident management for .NET-based backend systems:

  1. Incident detection and alerting setup
  2. Triage and communication patterns
  3. C# patterns that make incidents faster to diagnose
  4. Writing runbooks that actually get used
  5. Postmortems that produce lasting improvements

The Incident Lifecycle

Detection → Triage → Mitigation → Resolution → Postmortem

| Phase | Goal | Key Output |
|---|---|---|
| Detection | Know before users tell you | Alert firing in < 5 min of failure |
| Triage | Understand scope and severity | Incident severity declared (P1–P4) |
| Mitigation | Stop the bleeding | Users no longer impacted |
| Resolution | Fix root cause | System fully restored, monitoring green |
| Postmortem | Learn and improve | Written document + action items |

Mitigation ≠ Resolution. Restarting a crashing pod mitigates the incident (users can proceed) but the root cause is still unknown. Resolution happens when you understand and fix why it crashed.


Severity Definitions

Agree on severity levels before incidents happen. A common P1–P4 scale for .NET API services:

| Severity | Definition | Response | Example |
|---|---|---|---|
| P1 | Service completely unavailable or data loss | Immediate, page on-call | All POST /orders returning 500 |
| P2 | Significant degradation affecting many users | Page on-call, 15-min SLA | p99 latency > 10s, auth broken for subset |
| P3 | Partial degradation, workaround exists | Ticket, fix within 24 h | PDF export failing, other flows work |
| P4 | Minor, cosmetic, low impact | Ticket, fix in next sprint | Wrong timezone on email timestamps |
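
One way to keep these definitions from drifting is to encode them next to the code that routes alerts. The sketch below is illustrative only: `Severity`, `ResponsePolicy`, and `SeverityPolicies.For` are hypothetical names, and the values are copied from the table above rather than taken from any library.

```csharp
using System;

public enum Severity { P1 = 1, P2, P3, P4 }

/// <summary>Response expectations for one severity level.</summary>
/// <param name="PageOnCall">Whether the on-call engineer is paged.</param>
/// <param name="TargetWindow">Time window stated in the table; null where the table gives none.</param>
/// <param name="Response">The response text from the table, for display.</param>
public sealed record ResponsePolicy(bool PageOnCall, TimeSpan? TargetWindow, string Response);

public static class SeverityPolicies
{
    // Maps each severity to the response agreed in the severity table.
    public static ResponsePolicy For(Severity severity) => severity switch
    {
        Severity.P1 => new ResponsePolicy(true,  TimeSpan.Zero,            "Immediate, page on-call"),
        Severity.P2 => new ResponsePolicy(true,  TimeSpan.FromMinutes(15), "Page on-call, 15-min SLA"),
        Severity.P3 => new ResponsePolicy(false, TimeSpan.FromHours(24),   "Ticket, fix within 24 h"),
        Severity.P4 => new ResponsePolicy(false, null,                     "Ticket, fix in next sprint"),
        _ => throw new ArgumentOutOfRangeException(nameof(severity), severity, "Unknown severity"),
    };
}
```

A router could then call `SeverityPolicies.For(Severity.P2).PageOnCall` to decide whether to page, instead of hard-coding the rule in several places.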


On-Call Setup for .NET Services

Which alerting stack you choose matters less than how you configure it. A well-configured Azure Monitor or Grafana OnCall setup beats a poorly configured PagerDuty.

Key Alerting Rules for ASP.NET Core Services

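These Prometheus alert rules cover the most common failure modes. The metric names (http_requests_total, http_requests_errors_total, http_request_duration_seconds) depend on the exporter you use, so adjust them to match what your service actually emits: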

YAML
# incident-alerts.yaml
groups:
  - name: dotnet-api-incidents
    rules:

      # ── P1: service down ────────────────────────────────────────────────
      - alert: ServiceDown
        expr: up{job="dotnet-api"} == 0
        for: 1m
        labels:
          severity: critical
          page:     "true"
        annotations:
          summary:  "{{ $labels.instance }} is unreachable"
          runbook:  "https://wiki.internal/runbooks/service-down"

      # ── P1: error rate spike ─────────────────────────────────────────────
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_errors_total[5m])) by (service)
          /
          sum(rate(http_requests_total[5m])) by (service)
          > 0.05
        for: 2m
        labels:
          severity: critical
          page:     "true"
        annotations:
          summary:  "Error rate > 5% on {{ $labels.service }}"
          runbook:  "https://wiki.internal/runbooks/high-error-rate"

      # ── P2: latency degradation ──────────────────────────────────────────
      - alert: LatencyDegradation
        expr: |
          histogram_quantile(0.99,
            sum by (le, service) (
              rate(http_request_duration_seconds_bucket[5m])
            )
          ) > 2.0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary:  "p99 latency > 2s on {{ $labels.service }}"
          runbook:  "https://wiki.internal/runbooks/latency-degradation"
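
The HighErrorRate expression divides the error rate by the request rate over the same 5-minute window, so the window length cancels out and the result is a plain ratio of counter increases. The sketch below mirrors that arithmetic so a threshold can be sanity-checked in a unit test. `ErrorRateRule.ShouldFire` is a hypothetical helper, not part of Prometheus or any client library, and it does not model the `for: 2m` hold time.

```csharp
using System;

public static class ErrorRateRule
{
    /// <summary>
    /// Mirrors "errors / total > threshold" for counter increases
    /// observed over the same window.
    /// </summary>
    public static bool ShouldFire(long errorIncrease, long totalIncrease, double threshold = 0.05)
    {
        if (errorIncrease < 0) throw new ArgumentOutOfRangeException(nameof(errorIncrease));
        if (totalIncrease < 0) throw new ArgumentOutOfRangeException(nameof(totalIncrease));

        // With no traffic the PromQL ratio is NaN or absent, and the
        // comparison does not match, so the alert stays silent.
        if (totalIncrease == 0) return false;

        // Strictly greater than, like the ">" in the alert expression.
        return (double)errorIncrease / totalIncrease > threshold;
    }
}
```

For example, 30 errors out of 500 requests is 6% and fires, while 25 out of 500 is exactly 5% and does not.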

Azure Monitor Equivalent

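If you are on Azure Monitor rather than Prometheus, a metric alert on failed requests in Application Insights plays a similar role. Note that requests/failed is a count, not a ratio, so the threshold is an absolute number of failed requests per window (the 50 below is a placeholder; derive yours from normal traffic). A true percentage threshold needs a log-based alert that divides failed requests by total requests. The JSON is a simplified sketch of the rule, not a complete ARM template:

JSON
{
  "name": "HighServerErrorRate",
  "criteria": {
    "metricName": "requests/failed",
    "operator": "GreaterThan",
    "threshold": 50,
    "timeAggregation": "Count",
    "evaluationFrequency": "PT1M",
    "windowSize": "PT5M"
  },
  "actions": [{ "actionGroupId": "/subscriptions/.../actionGroups/on-call" }]
}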

C# Pattern: IncidentCorrelationMiddleware

The most powerful debugging tool during an incident is a single correlation ID that ties together:

  • The inbound HTTP request
  • All downstream HTTP calls made by that request
  • All log lines emitted during that request
  • The response returned to the caller

Without this, you are hunting through logs by timestamp — slow, error-prone, and ineffective under pressure.

C#
using System.Diagnostics;

namespace YourApp.Incident;

/// <summary>
/// Stamps every request with a correlation ID.
/// The ID propagates to downstream services via the X-Correlation-Id header
/// and appears in every log line and the response header.
///
/// If the caller provides X-Correlation-Id (from their own system),
/// we honour it to preserve end-to-end tracing across system boundaries.
/// </summary>
public sealed class IncidentCorrelationMiddleware(RequestDelegate next)
{
    private const string HeaderName = "X-Correlation-Id";

    public async Task InvokeAsync(HttpContext ctx, ILogger<IncidentCorrelationMiddleware> logger)
    {
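        // Accept an upstream correlation ID, fall back to the current trace ID,
        // and only then generate a new one. An empty header counts as missing.
        var upstream = ctx.Request.Headers[HeaderName].FirstOrDefault();
        var correlationId = !string.IsNullOrWhiteSpace(upstream)
            ? upstream
            : Activity.Current?.TraceId.ToString() ?? Guid.NewGuid().ToString("N");

        // Tag the current span so the ID is searchable in traces, and add it
        // to baggage so it travels with the trace context
        Activity.Current?.SetTag("correlation.id", correlationId);
        Activity.Current?.SetBaggage("correlation.id", correlationId);

        // Make available to later middleware and handlers in this request
        ctx.Items["CorrelationId"] = correlationId;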

        // Echo back in the response — callers use this to file bug reports
        ctx.Response.OnStarting(() =>
        {
            ctx.Response.Headers[HeaderName] = correlationId;
            return Task.CompletedTask;
        });

        // Push into the logging scope so every log line in this request
        // automatically carries the correlation ID
        using var scope = logger.BeginScope(new Dictionary<string, object>
        {
            ["CorrelationId"] = correlationId,
            ["RequestPath"]   = ctx.Request.Path.Value ?? string.Empty,
            ["RequestMethod"] = ctx.Request.Method,
        });

        await next(ctx);
    }
}

Register it as the first middleware (before routing, before auth):

C#
// Program.cs — order matters: correlation goes first
app.UseMiddleware<IncidentCorrelationMiddleware>();
app.UseRouting();
app.UseAuthentication();
app.UseAuthorization();
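
Because the middleware accepts whatever the caller sends in X-Correlation-Id, a buggy or hostile client could push very long values or control characters into every log line. A minimal sketch of a guard follows. `CorrelationIdPolicy` is a hypothetical helper, and the 64-character limit and allowed character set are assumptions to adapt to your own ID formats.

```csharp
using System;
using System.Linq;

public static class CorrelationIdPolicy
{
    // Assumption: long enough for GUIDs and W3C trace IDs.
    private const int MaxLength = 64;

    /// <summary>
    /// Returns the caller's ID if it is safe to log; otherwise a new one.
    /// </summary>
    public static string Normalize(string? candidate)
    {
        if (!string.IsNullOrWhiteSpace(candidate)
            && candidate.Length <= MaxLength
            && candidate.All(c => char.IsAsciiLetterOrDigit(c) || c is '-' or '_' or '.'))
        {
            return candidate;
        }

        // Reject anything else (too long, whitespace, control characters)
        // and fall back to a fresh 32-character hex ID.
        return Guid.NewGuid().ToString("N");
    }
}
```

The middleware could pass the header value through `CorrelationIdPolicy.Normalize` before storing it. The trade-off is that a rejected upstream ID breaks end-to-end tracing for that one request, which is usually preferable to polluted logs.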

Propagating the Correlation ID to Downstream Services

Use a DelegatingHandler on HttpClient so the ID flows automatically:

C#
using System.Diagnostics;

namespace YourApp.Incident;

public sealed class CorrelationIdPropagationHandler(IHttpContextAccessor accessor)
    : DelegatingHandler
{
    protected override Task<HttpResponseMessage> SendAsync(
        HttpRequestMessage request,
        CancellationToken cancellationToken)
    {
        var correlationId = accessor.HttpContext?.Items["CorrelationId"]?.ToString()
                            ?? Activity.Current?.GetBaggageItem("correlation.id")
                            ?? Guid.NewGuid().ToString("N");

        request.Headers.TryAddWithoutValidation("X-Correlation-Id", correlationId);
        return base.SendAsync(request, cancellationToken);
    }
}

// Program.cs: register the handler and attach it to each typed client
builder.Services.AddHttpContextAccessor();
builder.Services.AddTransient<CorrelationIdPropagationHandler>();
builder.Services.AddHttpClient<IPaymentClient, PaymentClient>()
    .AddHttpMessageHandler<CorrelationIdPropagationHandler>();
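
To check that propagation works without calling a real service, one option is to put a recording handler at the end of the HttpClient pipeline. The sketch below is a simplified stand-in rather than the handler above: `RecordingHandler`, `StaticCorrelationHandler`, and `PropagationCheck` are hypothetical names, the ID comes from a delegate instead of IHttpContextAccessor, and payments.invalid is a placeholder host that is never contacted.

```csharp
using System;
using System.Net;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

/// <summary>Terminal handler that records the header instead of using the network.</summary>
public sealed class RecordingHandler : HttpMessageHandler
{
    public string? LastCorrelationId { get; private set; }

    protected override Task<HttpResponseMessage> SendAsync(
        HttpRequestMessage request, CancellationToken cancellationToken)
    {
        LastCorrelationId = request.Headers.TryGetValues("X-Correlation-Id", out var values)
            ? string.Join(",", values)
            : null;
        return Task.FromResult(new HttpResponseMessage(HttpStatusCode.OK));
    }
}

/// <summary>Adds the header from a delegate so the test needs no HttpContext.</summary>
public sealed class StaticCorrelationHandler(Func<string> getId) : DelegatingHandler
{
    protected override Task<HttpResponseMessage> SendAsync(
        HttpRequestMessage request, CancellationToken cancellationToken)
    {
        request.Headers.TryAddWithoutValidation("X-Correlation-Id", getId());
        return base.SendAsync(request, cancellationToken);
    }
}

public static class PropagationCheck
{
    /// <summary>Sends one request through the pipeline and returns the header seen downstream.</summary>
    public static async Task<string?> SentHeaderAsync(string id)
    {
        var recorder = new RecordingHandler();
        var pipeline = new StaticCorrelationHandler(() => id) { InnerHandler = recorder };
        using var client = new HttpClient(pipeline);

        using var response = await client.GetAsync("http://payments.invalid/ping");
        return recorder.LastCorrelationId;
    }
}
```

In a test, `await PropagationCheck.SentHeaderAsync("abc123")` should return the same ID that went in.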

Actionable Health Check Endpoints

By default, the ASP.NET Core health check endpoint returns only a plain-text status ("Healthy", "Degraded", or "Unhealthy"), which tells you little during an incident. You need structured JSON that tells on-call engineers exactly what is wrong without requiring them to log into multiple systems first.

C#
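using System.Diagnostics;
using Microsoft.EntityFrameworkCore;
using Microsoft.Extensions.Diagnostics.HealthChecks;

namespace YourApp.Incident;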

/// <summary>
/// Health check that exposes structured diagnostic data.
/// Useful during incidents: the on-call engineer hits /health/detail
/// and gets a clear picture without needing additional access.
///
/// Runbook: https://wiki.internal/runbooks/health-check-failures
/// </summary>
public sealed class DatabaseHealthCheck(
    AppDbContext db,
    ILogger<DatabaseHealthCheck> logger) : IHealthCheck
{
    public async Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context,
        CancellationToken cancellationToken = default)
    {
        var sw = Stopwatch.StartNew();
        try
        {
            // A real round trip, not just "can we open a connection".
            // ExecuteSqlRawAsync returns rows affected, which is -1 for a SELECT
            // on most providers, so success means "did not throw".
            await db.Database.ExecuteSqlRawAsync("SELECT 1", cancellationToken);
            sw.Stop();

            var data = new Dictionary<string, object>
            {
                ["latency_ms"]    = sw.ElapsedMilliseconds,
                ["database_name"] = db.Database.GetDbConnection().Database,
                ["server"]        = db.Database.GetDbConnection().DataSource,
                ["checked_at"]    = DateTimeOffset.UtcNow,
            };

            return sw.ElapsedMilliseconds > 200
                ? HealthCheckResult.Degraded($"Database responding but slow ({sw.ElapsedMilliseconds} ms)", data: data)
                : HealthCheckResult.Healthy("Database healthy", data);
        }
        catch (Exception ex)
        {
            sw.Stop();
            logger.LogError(ex, "Database health check failed after {Ms} ms", sw.ElapsedMilliseconds);

            return HealthCheckResult.Unhealthy(
                description: ex.Message,
                exception: ex,
                data: new Dictionary<string, object>
                {
                    ["error"]      = ex.GetType().Name,
                    ["message"]    = ex.Message,
                    ["latency_ms"] = sw.ElapsedMilliseconds,
                    ["checked_at"] = DateTimeOffset.UtcNow,
                });
        }
    }
}

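Expose as structured JSON (not the basic liveness format). UIResponseWriter comes from the AspNetCore.HealthChecks.UI.Client NuGet package. The response includes server and database names, so expose this endpoint only on an internal network or behind authorization: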

C#
app.MapHealthChecks("/health/detail", new HealthCheckOptions
{
    ResponseWriter = UIResponseWriter.WriteHealthCheckUIResponse,
    AllowCachingResponses = false,
});

// Liveness: runs no registered checks, so it only proves the process can serve requests
app.MapHealthChecks("/health/live", new HealthCheckOptions
{
    Predicate = _ => false   // Exclude every check; always fast
});
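
These endpoints only report checks that have been registered with the health check service. A minimal registration sketch follows, as a Program.cs fragment rather than a complete program. It assumes the check is named "database", matching the sample response in this section, and that AppDbContext is already registered.

```csharp
// Program.cs (fragment): register the check before calling builder.Build()
builder.Services.AddHealthChecks()
    .AddCheck<DatabaseHealthCheck>("database");
```

The name passed to AddCheck is what appears in the JSON output, so pick names an on-call engineer will recognise.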

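The /health/detail response during a database outage carries information like the following. The shape is simplified for readability; WriteHealthCheckUIResponse nests each check under an entries object keyed by the check name rather than a checks array: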

JSON
{
  "status": "Unhealthy",
  "checks": [{
    "name": "database",
    "status": "Unhealthy",
    "description": "Connection refused (localhost:5432)",
    "data": {
      "error": "NpgsqlException",
      "message": "Connection refused (localhost:5432)",
      "latency_ms": 5003,
      "checked_at": "2026-05-26T14:32:10Z"
    }
  }]
}

Your on-call engineer sees "NpgsqlException, connection refused, localhost:5432" from a single request. Without structured health checks, reaching the same conclusion can easily take ten minutes of checking dashboards.
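
The latency_ms value of 5003 in the sample suggests the check waited on a driver timeout before failing. If the detail endpoint needs to answer quickly even when a dependency hangs, one option is to give each probe its own time budget. This is a sketch under assumptions: `BoundedProbe`, `ProbeResult`, and `ProbeStatus` are hypothetical names, and the right budget depends on your service.

```csharp
using System;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;

public enum ProbeStatus { Healthy, TimedOut, Failed }

public sealed record ProbeResult(ProbeStatus Status, long LatencyMs, string? Error);

public static class BoundedProbe
{
    /// <summary>Runs a probe and reports TimedOut if it exceeds the budget.</summary>
    public static async Task<ProbeResult> RunAsync(
        Func<CancellationToken, Task> probe,
        TimeSpan budget,
        CancellationToken outer = default)
    {
        using var cts = CancellationTokenSource.CreateLinkedTokenSource(outer);
        cts.CancelAfter(budget);
        var sw = Stopwatch.StartNew();
        try
        {
            // WaitAsync stops waiting at the deadline even if the probe
            // ignores its cancellation token.
            await probe(cts.Token).WaitAsync(budget, outer);
            return new ProbeResult(ProbeStatus.Healthy, sw.ElapsedMilliseconds, null);
        }
        catch (TimeoutException)
        {
            return new ProbeResult(ProbeStatus.TimedOut, sw.ElapsedMilliseconds, "Budget exceeded");
        }
        catch (OperationCanceledException) when (!outer.IsCancellationRequested)
        {
            // The probe honoured our deadline token: also a timeout.
            return new ProbeResult(ProbeStatus.TimedOut, sw.ElapsedMilliseconds, "Budget exceeded");
        }
        catch (Exception ex) when (ex is not OperationCanceledException)
        {
            return new ProbeResult(ProbeStatus.Failed, sw.ElapsedMilliseconds, ex.GetType().Name);
        }
    }
}
```

A health check could wrap its database call in `BoundedProbe.RunAsync` and map TimedOut to an Unhealthy result. Cancellation requested by the caller is deliberately not swallowed and propagates as usual.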


Writing Runbooks That Get Used

A runbook that on-call engineers skip is not a runbook — it is documentation theater. Good runbooks share four properties:

1. They link from the alert, not from a wiki index. The alert annotation carries the URL. Engineers follow it immediately without searching.

2. They start with the diagnosis, not the background. No one reads three paragraphs of architecture history at 2 AM. Start with "Check this first."

3. They are command-first. Every action is a command to copy-paste, not a description of a command.

4. They include expected output. Engineers need to know when they have succeeded.
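
The first property can be enforced in CI so that no alert ships without a runbook link. The sketch below assumes the alert rules have already been loaded into a hypothetical `AlertRule` record (YAML parsing is not shown), and `RunbookLint` is likewise an illustrative name.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public sealed record AlertRule(string Name, IReadOnlyDictionary<string, string> Annotations);

public static class RunbookLint
{
    /// <summary>
    /// Returns the names of alerts whose "runbook" annotation is missing
    /// or is not an absolute https URL.
    /// </summary>
    public static IReadOnlyList<string> MissingRunbooks(IEnumerable<AlertRule> rules) =>
        rules.Where(r => !r.Annotations.TryGetValue("runbook", out var url)
                         || !Uri.TryCreate(url, UriKind.Absolute, out var uri)
                         || uri.Scheme != Uri.UriSchemeHttps)
             .Select(r => r.Name)
             .ToList();
}
```

A pipeline step could fail the build whenever `MissingRunbooks` returns a non-empty list.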

Runbook Template

MARKDOWN
# Runbook: High Error Rate on Orders Service

**Alert**: HighErrorRate  
**Severity**: P2  
**Owner**: Platform Team  
**Last reviewed**: 2026-05-01

## What triggered this alert
Error rate > 5% on the orders-api for more than 2 minutes.

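## Step 1: Check which endpoints are failing

kubectl logs -l app=orders-api --since=10m --tail=-1 | grep '"@l":"Error"' | \
  jq -r '"\(.RequestPath) \(.["@x"] | tostring | split("\n")[0])"' | sort | uniq -c | sort -rn

Field names assume Serilog's CompactJsonFormatter (level in "@l", exception in "@x"). --tail=-1 is needed because kubectl returns only the last 10 lines per pod when a label selector is used.
Expected: you see a handful of paths with repeated exceptions.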

## Step 2: Check database connectivity

kubectl exec -it deploy/orders-api -- dotnet-health-cli /health/detail

Expected: database check shows "Healthy". If it shows "Unhealthy":
→ Follow runbook: https://wiki.internal/runbooks/database-connectivity

## Step 3: Check recent deployments

kubectl rollout history deploy/orders-api

If a deployment happened within the last 30 minutes:
kubectl rollout undo deploy/orders-api
→ Watch error rate drop in Grafana within 2 minutes.

## Step 4: Escalate

If none of the above resolves the alert within 20 minutes, escalate to on-call lead.
Page: https://app.pagerduty.com/incidents/new?service=orders-api

Runbook Comments in Code

Link runbooks directly from the code that could trigger the issue. Future on-call engineers (and your future self) will thank you.

C#
using System.Diagnostics;

public sealed class OrderService(IOrderRepository repo, ILogger<OrderService> logger)
{
    /// <summary>
    /// Creates a new order and reserves inventory.
    ///
    /// INCIDENT NOTE: If this method starts throwing TimeoutException at scale,
    /// the likely cause is database connection pool exhaustion.
    /// Runbook: https://wiki.internal/runbooks/db-pool-exhaustion
    ///
    /// Known causes:
    ///  - Long-running transactions holding connections (check EF Core interceptor logs)
    ///  - Sudden traffic spike without connection pool scaling
    ///  - Deadlock in inventory reservation — see issue #1234
    /// </summary>
    public async Task<OrderResult> CreateOrderAsync(CreateOrderCommand cmd, CancellationToken ct)
    {
        using var activity = ActivitySource.StartActivity("OrderService.CreateOrder");
        activity?.SetTag("order.customer_id", cmd.CustomerId);
        activity?.SetTag("order.item_count", cmd.Items.Count);

        try
        {
            var order = await repo.CreateAsync(cmd, ct);
            logger.LogInformation("Order {OrderId} created for customer {CustomerId}",
                order.Id, cmd.CustomerId);
            return OrderResult.Success(order);
        }
        catch (TimeoutException ex)
        {
            // This is the most common P2 failure mode for this service.
            // See runbook: https://wiki.internal/runbooks/db-pool-exhaustion
            logger.LogError(ex,
                "Timeout creating order for customer {CustomerId} — likely pool exhaustion",
                cmd.CustomerId);
            activity?.SetStatus(ActivityStatusCode.Error, ex.Message);
            throw;
        }
    }

    private static readonly ActivitySource ActivitySource =
        new("YourApp.Orders", "1.0.0");
}

Structured Logging for Incident Correlation

During an incident you need to find all logs for a failing user, request, or operation within seconds. This requires structured logging with consistent field names.

Using Serilog with structured properties:

C#
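// Program.cs
// Packages: Serilog.AspNetCore, Serilog.Enrichers.Environment, Serilog.Formatting.Compact
using Serilog;
using Serilog.Formatting.Compact;
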
builder.Host.UseSerilog((ctx, cfg) =>
{
    cfg.ReadFrom.Configuration(ctx.Configuration)
       .Enrich.FromLogContext()
       .Enrich.WithMachineName()
       .Enrich.WithEnvironmentName()
       .WriteTo.Console(new CompactJsonFormatter());  // JSON in prod for log aggregation
});

In your services, always log the key identifiers that let you reconstruct an incident timeline:

C#
// Good: structured, searchable
logger.LogError(ex,
    "Payment failed for order {OrderId}, customer {CustomerId}, amount {Amount:C}",
    orderId, customerId, amount);

// Bad: string interpolation loses the structure, and this call also drops the exception
logger.LogError($"Payment failed: {orderId} {customerId} {amount}");

The first form lets you search Seq / Elasticsearch / Azure Monitor for OrderId = "abc123" and get every log line for that order across every service that logged it — including the correlation ID.


The Postmortem — Blameless and Actionable

A blameless postmortem assumes that the engineers involved acted with the information and tools available to them at the time. The goal is not to assign blame but to improve the system.

Postmortem Structure

MARKDOWN
# Postmortem: Orders API P1 (2026-05-20)

**Severity**: P1  
**Duration**: 47 minutes (14:22–15:09 UTC)  
**Impact**: ~3,200 users unable to place orders; ~$140,000 GMV at risk  
**Incident Commander**: [Name]  
**Scribe**: [Name]

## Summary
A misconfigured connection string deployed in release v2.14.1 caused
the orders-api to fail all database writes starting at 14:22. The
issue was detected by the HighErrorRate alert at 14:27 (5-minute lag).
Rollback was completed at 15:09.

## Timeline

| Time (UTC) | Event |
|---|---|
| 14:15 | v2.14.1 deployed to production (automated canary, 10% traffic) |
| 14:22 | First 5xx errors appear in Grafana |
| 14:27 | HighErrorRate alert fires; on-call paged |
| 14:32 | On-call acknowledges, begins triage |
| 14:45 | Root cause identified (connection string) |
| 15:02 | Rollback initiated |
| 15:09 | All services healthy, alert resolved |

## Root Cause
The staging connection string was included in the production deployment
because the CI pipeline environment variable substitution was not validated
before artifact promotion. The staging DB has different credentials; all
writes failed with authentication errors.

## Contributing Factors
1. No automated smoke test verifying DB connectivity before canary promotion
2. Canary traffic was 10%: enough to trigger the alert but slow to detect
3. Health check did not check write access, only read (`SELECT 1`)

## What Went Well
- Alert fired within 5 minutes of first errors
- Runbook was followed accurately; no improvisation needed
- Rollback was clean and took 7 minutes

## Action Items

| Item | Owner | Due |
|---|---|---|
| Add DB write smoke test to CI pipeline pre-promotion gate | @platform | 2026-06-03 |
| Update health check to verify write access with a no-op write | @platform | 2026-05-28 |
| Add env var validation step to deployment pipeline | @devops | 2026-06-03 |
| Reduce canary window from 10 min to 3 min for write-path changes | @platform | 2026-06-10 |

## Lessons Learned
- Write-path issues need faster canary promotion gates than read-path issues.
- Our health check gave false confidence: "Healthy" does not mean "can write".
- Environment variable substitution validation is a standard CI step we were missing.
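
The timeline in a postmortem like this one yields the durations most teams track across incidents. The sketch below shows the arithmetic. `IncidentTimeline` and its property names are hypothetical, and which timestamps count as "detected" or "resolved" is a convention your team has to agree on.

```csharp
using System;

/// <summary>Key timestamps from a postmortem timeline, all in UTC.</summary>
public sealed record IncidentTimeline(
    DateTimeOffset FirstFailure,
    DateTimeOffset Detected,
    DateTimeOffset Acknowledged,
    DateTimeOffset Resolved)
{
    // First error to alert firing.
    public TimeSpan TimeToDetect => Detected - FirstFailure;

    // Alert firing to a human acknowledging the page.
    public TimeSpan TimeToAcknowledge => Acknowledged - Detected;

    // First error to all services healthy.
    public TimeSpan TotalDuration => Resolved - FirstFailure;
}
```

With the sample timeline (14:22 first errors, 14:27 alert, 14:32 acknowledgement, 15:09 resolved), this gives 5 minutes to detect, 5 minutes to acknowledge, and a total duration of 47 minutes, matching the header of the postmortem.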

The Most Important Rule: Follow Up on Action Items

A postmortem with no closed action items is worse than no postmortem — it erodes trust in the process. Assign every item to a named person with a specific date. Review open items weekly until closed.
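
The weekly review is easier to sustain when the overdue list is generated rather than assembled by hand. A minimal sketch, assuming action items are exported from your tracker into a hypothetical `ActionItem` record (`ActionItemReview` is also an illustrative name):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public sealed record ActionItem(string Title, string Owner, DateOnly Due, bool Closed = false);

public static class ActionItemReview
{
    /// <summary>Open items past their due date, oldest first.</summary>
    public static IReadOnlyList<ActionItem> Overdue(IEnumerable<ActionItem> items, DateOnly today) =>
        items.Where(i => !i.Closed && i.Due < today)
             .OrderBy(i => i.Due)
             .ToList();
}
```

Run against the four action items from the sample postmortem on 2026-06-04 with nothing closed, this would return the health check item due 2026-05-28 followed by the two items due 2026-06-03.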


Incident Communication

During a P1, communication clarity prevents secondary chaos:

Initial notice (within 5 minutes of detection):

"We are investigating elevated error rates on the orders API. Users may experience failures when placing orders. We will update in 15 minutes."

Updates every 15 minutes:

"Update: root cause identified as a database configuration issue. Mitigation in progress. ETA 20 minutes."

Resolution:

"Resolved at 15:09 UTC. Root cause: misconfigured connection string in release v2.14.1. Rollback complete. All services healthy. Postmortem will be published within 48 hours."

Keep the channel (Slack, Teams, status page) updated even when you have nothing new to say. "Still investigating" every 15 minutes is better than silence.


Key Takeaways

  • Correlation IDs in every request are the single highest-leverage investment for incident diagnosis. Implement IncidentCorrelationMiddleware before anything else.
  • Actionable health checks cut triage time from minutes to seconds. Return structured JSON with last error and timestamp, not just "Unhealthy".
  • Runbooks linked from alerts are used; runbooks in a wiki index are ignored.
  • Postmortems are only valuable if action items close. Track them in your sprint board, not a separate document.
  • Blameless culture requires explicit reinforcement in every postmortem. Start the meeting with "we are here to improve the system, not assign fault."