Incident Management for .NET Teams — On-Call, Runbooks, and Postmortems
The Cost of Incident Chaos
Without a defined incident process, the same failure costs you twice: once in downtime and once in the hours engineers spend frantically Slack-messaging each other trying to find the right person, the right logs, and the right playbook. The second cost is often larger than the first.
A mature incident practice is not bureaucracy — it is a forcing function that makes every future incident cheaper to resolve than the last.
This article covers the full lifecycle of incident management for .NET-based backend systems:
- Incident detection and alerting setup
- Triage and communication patterns
- C# patterns that make incidents faster to diagnose
- Writing runbooks that actually get used
- Postmortems that produce lasting improvements
The Incident Lifecycle
Detection → Triage → Mitigation → Resolution → Postmortem
| Phase | Goal | Key Output |
|---|---|---|
| Detection | Know before users tell you | Alert firing in < 5 min of failure |
| Triage | Understand scope and severity | Incident severity declared (P1–P4) |
| Mitigation | Stop the bleeding | Users no longer impacted |
| Resolution | Fix root cause | System fully restored, monitoring green |
| Postmortem | Learn and improve | Written document + action items |
Mitigation ≠ Resolution. Restarting a crashing pod mitigates the incident (users can proceed) but the root cause is still unknown. Resolution happens when you understand and fix why it crashed.
Severity Definitions
Agree on severity levels before incidents happen. A common P1–P4 scale for .NET API services:
| Severity | Definition | Response | Example |
|---|---|---|---|
| P1 | Service completely unavailable or data loss | Immediate, page on-call | All POST /orders returning 500 |
| P2 | Significant degradation affecting many users | Page on-call, 15-min SLA | p99 latency > 10s, auth broken for subset |
| P3 | Partial degradation, workaround exists | Ticket, fix within 24 h | PDF export failing, other flows work |
| P4 | Minor, cosmetic, low impact | Ticket, fix in next sprint | Wrong timezone on email timestamps |
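One way to keep these definitions from drifting is to encode the table as data that alert routing and tooling can share. The following is a minimal sketch, not part of any library: `Severity`, `ResponsePolicy`, and `SeverityPolicies` are hypothetical names, a deadline of zero stands in for "immediate", and the 14-day value for P4 is an assumed sprint length.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical types for illustration; none of these names come from a library.
public enum Severity { P1, P2, P3, P4 }

// PageOnCall: whether the on-call engineer is paged.
// Deadline:   how quickly the response in the table is expected.
public sealed record ResponsePolicy(bool PageOnCall, TimeSpan Deadline, string Action);

public static class SeverityPolicies
{
    // Mirrors the Response column of the severity table.
    private static readonly IReadOnlyDictionary<Severity, ResponsePolicy> Table =
        new Dictionary<Severity, ResponsePolicy>
        {
            [Severity.P1] = new(true, TimeSpan.Zero, "Immediate, page on-call"),
            [Severity.P2] = new(true, TimeSpan.FromMinutes(15), "Page on-call, 15-min SLA"),
            [Severity.P3] = new(false, TimeSpan.FromHours(24), "Ticket, fix within 24 h"),
            // "Next sprint" has no fixed duration; 14 days is an assumption.
            [Severity.P4] = new(false, TimeSpan.FromDays(14), "Ticket, fix in next sprint"),
        };

    public static ResponsePolicy For(Severity severity) => Table[severity];
}
```

A router can then call `SeverityPolicies.For(severity).PageOnCall` instead of hard-coding the rule in several places.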
On-Call Setup for .NET Services
Which alerting tool you choose matters less than how you configure it. A well-configured Azure Monitor or Grafana OnCall beats a poorly configured PagerDuty.
Key Alerting Rules for ASP.NET Core Services
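These Prometheus alert rules cover the most common failure modes. The metric names depend on which exporter your service uses, so adjust them to match what it actually emits: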
# incident-alerts.yaml
groups:
  - name: dotnet-api-incidents
    rules:
      # ── P1: service down ────────────────────────────────────────────────
      - alert: ServiceDown
        expr: up{job="dotnet-api"} == 0
        for: 1m
        labels:
          severity: critical
          page: "true"
        annotations:
          summary: "{{ $labels.instance }} is unreachable"
          runbook: "https://wiki.internal/runbooks/service-down"
      # ── P1: error rate spike ─────────────────────────────────────────────
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_errors_total[5m])) by (service)
            /
          sum(rate(http_requests_total[5m])) by (service)
            > 0.05
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Error rate > 5% on {{ $labels.service }}"
          runbook: "https://wiki.internal/runbooks/high-error-rate"
      # ── P2: latency degradation ──────────────────────────────────────────
      - alert: LatencyDegradation
        expr: |
          histogram_quantile(0.99,
            sum by (le, service) (
              rate(http_request_duration_seconds_bucket[5m])
            )
          ) > 2.0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "p99 latency > 2s on {{ $labels.service }}"
          runbook: "https://wiki.internal/runbooks/latency-degradation"
Azure Monitor Equivalent
If you are on Azure Monitor rather than Prometheus, a roughly equivalent metric alert on failed requests in Application Insights looks like the following. This is an abbreviated sketch rather than the full alert-rule schema, and because requests/failed is a count rather than a ratio, the threshold needs tuning to your traffic:
{
  "name": "HighServerErrorRate",
  "criteria": {
    "metricName": "requests/failed",
    "operator": "GreaterThan",
    "threshold": 0.05,
    "timeAggregation": "Average",
    "evaluationFrequency": "PT1M",
    "windowSize": "PT5M"
  },
  "actions": [{ "actionGroupId": "/subscriptions/.../actionGroups/on-call" }]
}
C# Pattern: IncidentCorrelationMiddleware
The most powerful debugging tool during an incident is a single correlation ID that ties together:
- The inbound HTTP request
- All downstream HTTP calls made by that request
- All log lines emitted during that request
- The response returned to the caller
Without this, you are hunting through logs by timestamp — slow, error-prone, and ineffective under pressure.
using System.Diagnostics;
namespace YourApp.Incident;
/// <summary>
/// Stamps every request with a correlation ID.
/// The ID propagates to downstream services via the X-Correlation-Id header
/// and appears in every log line and the response header.
///
/// If the caller provides X-Correlation-Id (from their own system),
/// we honour it to preserve end-to-end tracing across system boundaries.
/// </summary>
public sealed class IncidentCorrelationMiddleware(RequestDelegate next)
{
    private const string HeaderName = "X-Correlation-Id";
    public async Task InvokeAsync(HttpContext ctx, ILogger<IncidentCorrelationMiddleware> logger)
    {
        // Accept the upstream correlation ID, fall back to the current trace ID,
        // or generate a new one. Caller-supplied values are untrusted input:
        // consider length-limiting or validating them before they reach your logs.
        var correlationId = ctx.Request.Headers[HeaderName].FirstOrDefault()
            ?? Activity.Current?.TraceId.ToString()
            ?? Guid.NewGuid().ToString("N");
        // Attach to the current Activity as baggage so it travels with the trace
        Activity.Current?.SetBaggage("correlation.id", correlationId);
        // Store on HttpContext.Items so later components in this request can read it
        ctx.Items["CorrelationId"] = correlationId;
        // Echo back in the response; callers use this to file bug reports
        ctx.Response.OnStarting(() =>
        {
            ctx.Response.Headers[HeaderName] = correlationId;
            return Task.CompletedTask;
        });
        // Push into the logging scope so every log line in this request
        // automatically carries the correlation ID
        using var scope = logger.BeginScope(new Dictionary<string, object>
        {
            ["CorrelationId"] = correlationId,
            ["RequestPath"] = ctx.Request.Path.Value ?? string.Empty,
            ["RequestMethod"] = ctx.Request.Method,
        });
        await next(ctx);
    }
}
Register it as the first middleware (before routing, before auth):
// Program.cs: order matters, correlation goes first
app.UseMiddleware<IncidentCorrelationMiddleware>();
app.UseRouting();
app.UseAuthentication();
app.UseAuthorization();
Propagating the Correlation ID to Downstream Services
Use a DelegatingHandler on HttpClient so the ID flows automatically:
using System.Diagnostics;
namespace YourApp.Incident;
public sealed class CorrelationIdPropagationHandler(IHttpContextAccessor accessor)
    : DelegatingHandler
{
    protected override Task<HttpResponseMessage> SendAsync(
        HttpRequestMessage request,
        CancellationToken cancellationToken)
    {
        // Prefer the ID stored by the middleware, then Activity baggage, then a fresh ID
        var correlationId = accessor.HttpContext?.Items["CorrelationId"]?.ToString()
            ?? Activity.Current?.GetBaggageItem("correlation.id")
            ?? Guid.NewGuid().ToString("N");
        request.Headers.TryAddWithoutValidation("X-Correlation-Id", correlationId);
        return base.SendAsync(request, cancellationToken);
    }
}
// Registration (Program.cs)
builder.Services.AddHttpContextAccessor();
builder.Services.AddTransient<CorrelationIdPropagationHandler>();
builder.Services.AddHttpClient<IPaymentClient, PaymentClient>()
    .AddHttpMessageHandler<CorrelationIdPropagationHandler>();
Actionable Health Check Endpoints
The default ASP.NET Core health check response is a bare "Healthy", "Degraded", or "Unhealthy" string, which tells you almost nothing during an incident. You need structured JSON that tells on-call engineers exactly what is wrong without requiring them to log into multiple systems first.
using System.Diagnostics;
using Microsoft.EntityFrameworkCore;
using Microsoft.Extensions.Diagnostics.HealthChecks;
namespace YourApp.Incident;
/// <summary>
/// Health check that exposes structured diagnostic data.
/// Useful during incidents: the on-call engineer hits /health/detail
/// and gets a clear picture without needing additional access.
///
/// Runbook: https://wiki.internal/runbooks/health-check-failures
/// </summary>
public sealed class DatabaseHealthCheck(
    AppDbContext db,
    ILogger<DatabaseHealthCheck> logger) : IHealthCheck
{
    public async Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context,
        CancellationToken cancellationToken = default)
    {
        var sw = Stopwatch.StartNew();
        try
        {
            // A real query, not just "can we open a connection".
            // ExecuteSqlRawAsync returns the rows-affected count, which is
            // typically -1 for a SELECT, so success means "did not throw".
            await db.Database.ExecuteSqlRawAsync("SELECT 1", cancellationToken);
            sw.Stop();
            var data = new Dictionary<string, object>
            {
                ["latency_ms"] = sw.ElapsedMilliseconds,
                ["database_name"] = db.Database.GetDbConnection().Database,
                ["server"] = db.Database.GetDbConnection().DataSource,
                ["checked_at"] = DateTimeOffset.UtcNow,
            };
            return sw.ElapsedMilliseconds > 200
                ? HealthCheckResult.Degraded($"Database responding but slow ({sw.ElapsedMilliseconds} ms)", data: data)
                : HealthCheckResult.Healthy("Database healthy", data);
        }
        catch (Exception ex)
        {
            sw.Stop();
            logger.LogError(ex, "Database health check failed after {Ms} ms", sw.ElapsedMilliseconds);
            return HealthCheckResult.Unhealthy(
                description: ex.Message,
                exception: ex,
                data: new Dictionary<string, object>
                {
                    ["error"] = ex.GetType().Name,
                    ["message"] = ex.Message,
                    ["latency_ms"] = sw.ElapsedMilliseconds,
                    ["checked_at"] = DateTimeOffset.UtcNow,
                });
        }
    }
}
Expose as structured JSON (not the basic liveness format):
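// UIResponseWriter ships in the AspNetCore.HealthChecks.UI.Client package
app.MapHealthChecks("/health/detail", new HealthCheckOptions
{
    ResponseWriter = UIResponseWriter.WriteHealthCheckUIResponse,
    AllowCachingResponses = false,
});
// Liveness: fast, no external dependencies
app.MapHealthChecks("/health/live", new HealthCheckOptions
{
    Predicate = _ => false // Run no registered checks, so this only confirms the process responds
});
The /health/detail response during a database outage looks roughly like this (abbreviated; the exact field names depend on the response writer you use):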
{
  "status": "Unhealthy",
  "checks": [{
    "name": "database",
    "status": "Unhealthy",
    "description": "Connection refused (localhost:5432)",
    "data": {
      "error": "NpgsqlException",
      "message": "Connection refused (localhost:5432)",
      "latency_ms": 5003,
      "checked_at": "2026-05-26T14:32:10Z"
    }
  }]
}
Your on-call engineer sees "NpgsqlException, connection refused, localhost:5432" within seconds of hitting the endpoint. Without structured health checks they would spend 10 minutes checking dashboards to arrive at the same conclusion.
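If you want to control the JSON shape yourself, a small custom response writer is one option. The sketch below assumes an ASP.NET Core project. `HealthJsonWriter` is a hypothetical name, while `HealthReport` and the `ResponseWriter` delegate signature are the framework's own. It emits only `status` and, per check, `name`, `status`, `description`, and `data`.

```csharp
using System.Linq;
using System.Text.Json;
using System.Threading.Tasks;
using Microsoft.AspNetCore.Http;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Hypothetical helper: turns a HealthReport into a flat, incident-friendly JSON document.
public static class HealthJsonWriter
{
    // Pure function, kept separate from HTTP so it can be unit tested.
    public static string Serialize(HealthReport report) =>
        JsonSerializer.Serialize(new
        {
            status = report.Status.ToString(),
            checks = report.Entries.Select(entry => new
            {
                name = entry.Key,
                status = entry.Value.Status.ToString(),
                description = entry.Value.Description,
                data = entry.Value.Data,
            }),
        });

    // Matches the Func<HttpContext, HealthReport, Task> shape of HealthCheckOptions.ResponseWriter.
    public static Task WriteAsync(HttpContext context, HealthReport report)
    {
        context.Response.ContentType = "application/json";
        return context.Response.WriteAsync(Serialize(report));
    }
}
```

It would be wired in by setting `ResponseWriter = HealthJsonWriter.WriteAsync` on the `/health/detail` mapping. Because the payload can include server names and exception messages, it is worth restricting that endpoint to internal callers.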
Writing Runbooks That Get Used
A runbook that on-call engineers skip is not a runbook — it is documentation theater. Good runbooks share four properties:
1. They link from the alert, not from a wiki index. The alert annotation carries the URL. Engineers follow it immediately without searching.
2. They start with the diagnosis, not the background. No one reads three paragraphs of architecture history at 2 AM. Start with "Check this first."
3. They are command-first. Every action is a command to copy-paste, not a description of a command.
4. They include expected output. Engineers need to know when they have succeeded.
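The first property can be enforced mechanically. As a rough sketch, assuming alert rules have already been loaded into a simple in-memory model (YAML parsing is omitted, and `AlertRule` and `RunbookLint` are hypothetical names), a CI step could fail the build when an alert has no usable runbook link:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical in-memory model of one alert rule; loading it from YAML is out of scope here.
public sealed record AlertRule(string Name, IReadOnlyDictionary<string, string> Annotations);

public static class RunbookLint
{
    // Returns the names of alerts whose "runbook" annotation is missing
    // or is not an absolute http(s) URL.
    public static IReadOnlyList<string> FindAlertsWithoutRunbook(IEnumerable<AlertRule> rules) =>
        rules.Where(rule => !HasRunbookUrl(rule))
             .Select(rule => rule.Name)
             .ToList();

    private static bool HasRunbookUrl(AlertRule rule) =>
        rule.Annotations.TryGetValue("runbook", out var url)
        && Uri.TryCreate(url, UriKind.Absolute, out var parsed)
        && (parsed.Scheme == Uri.UriSchemeHttps || parsed.Scheme == Uri.UriSchemeHttp);
}
```

This only checks that a link exists and is well formed. Whether the runbook behind it is any good still needs a human review.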
Runbook Template
# Runbook: High Error Rate on Orders Service
**Alert**: HighErrorRate
**Severity**: P1
**Owner**: Platform Team
**Last reviewed**: 2026-05-01
## What triggered this alert
Error rate > 5% on the orders-api for more than 2 minutes.
## Step 1 — Check which endpoints are failing
kubectl logs -l app=orders-api --since=10m | grep '"level":"error"' | \
jq -r '"\(.RequestPath) \(.StatusCode) \(.Exception)"' | sort | uniq -c | sort -rn
Expected: you see a handful of paths with repeated exceptions.
## Step 2 — Check database connectivity
kubectl exec -it deploy/orders-api -- dotnet-health-cli /health/detail
Expected: database check shows "Healthy". If it shows "Unhealthy":
→ Follow runbook: https://wiki.internal/runbooks/database-connectivity
## Step 3 — Check recent deployments
kubectl rollout history deploy/orders-api
If a deployment happened within the last 30 minutes:
kubectl rollout undo deploy/orders-api
→ Watch error rate drop in Grafana within 2 minutes.
## Step 4 — Escalate
If none of the above resolves the alert within 20 minutes, escalate to on-call lead.
Page: https://app.pagerduty.com/incidents/new?service=orders-api
Runbook Comments in Code
Link runbooks directly from the code that could trigger the issue. Future on-call engineers (and your future self) will thank you.
using System.Diagnostics;
public sealed class OrderService(IOrderRepository repo, ILogger<OrderService> logger)
{
    /// <summary>
    /// Creates a new order and reserves inventory.
    ///
    /// INCIDENT NOTE: If this method starts throwing TimeoutException at scale,
    /// the likely cause is database connection pool exhaustion.
    /// Runbook: https://wiki.internal/runbooks/db-pool-exhaustion
    ///
    /// Known causes:
    /// - Long-running transactions holding connections (check EF Core interceptor logs)
    /// - Sudden traffic spike without connection pool scaling
    /// - Deadlock in inventory reservation (see issue #1234)
    /// </summary>
    public async Task<OrderResult> CreateOrderAsync(CreateOrderCommand cmd, CancellationToken ct)
    {
        using var activity = ActivitySource.StartActivity("OrderService.CreateOrder");
        activity?.SetTag("order.customer_id", cmd.CustomerId);
        activity?.SetTag("order.item_count", cmd.Items.Count);
        try
        {
            var order = await repo.CreateAsync(cmd, ct);
            logger.LogInformation("Order {OrderId} created for customer {CustomerId}",
                order.Id, cmd.CustomerId);
            return OrderResult.Success(order);
        }
        catch (TimeoutException ex)
        {
            // This is the most common P2 failure mode for this service.
            // See runbook: https://wiki.internal/runbooks/db-pool-exhaustion
            logger.LogError(ex,
                "Timeout creating order for customer {CustomerId} (likely pool exhaustion)",
                cmd.CustomerId);
            activity?.SetStatus(ActivityStatusCode.Error, ex.Message);
            throw;
        }
    }
    private static readonly ActivitySource ActivitySource =
        new("YourApp.Orders", "1.0.0");
}
Structured Logging for Incident Correlation
During an incident you need to find all logs for a failing user, request, or operation within seconds. This requires structured logging with consistent field names.
Using Serilog with structured properties:
// Program.cs
using Serilog;
using Serilog.Formatting.Compact;
builder.Host.UseSerilog((ctx, cfg) =>
{
    cfg.ReadFrom.Configuration(ctx.Configuration)
        .Enrich.FromLogContext()
        .Enrich.WithMachineName()
        .Enrich.WithEnvironmentName()
        .WriteTo.Console(new CompactJsonFormatter()); // JSON in prod for log aggregation
});
In your services, always log the key identifiers that let you reconstruct an incident timeline:
// Good: structured, searchable
logger.LogError(ex,
    "Payment failed for order {OrderId}, customer {CustomerId}, amount {Amount:C}",
    orderId, customerId, amount);
// Bad: string interpolation loses the structure
logger.LogError($"Payment failed: {orderId} {customerId} {amount}");
The first form lets you search Seq / Elasticsearch / Azure Monitor for OrderId = "abc123" and get every log line for that order across every service that logged it, each carrying its correlation ID.
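Consistent field names are easier to keep when the message templates live in one place. One option, shown here as a sketch and not a prescription, is the `[LoggerMessage]` source generator that ships with Microsoft.Extensions.Logging. The `PaymentLog` class, the event ID, and the parameter types are assumptions made for illustration.

```csharp
using System;
using Microsoft.Extensions.Logging;

// Hypothetical log-definition class: every call site shares the same template,
// so OrderId, CustomerId, and Amount are always emitted under the same names.
public static partial class PaymentLog
{
    [LoggerMessage(
        EventId = 4001, // assumed ID; pick one that fits your own numbering scheme
        Level = LogLevel.Error,
        Message = "Payment failed for order {OrderId}, customer {CustomerId}, amount {Amount}")]
    public static partial void PaymentFailed(
        this ILogger logger, Exception exception, string orderId, string customerId, decimal amount);
}
```

A call site then becomes `logger.PaymentFailed(ex, orderId, customerId, amount);`, which leaves no room for one service to log `OrderId` while another logs `order_id`.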
The Postmortem — Blameless and Actionable
A blameless postmortem assumes that the engineers involved acted with the information and tools available to them at the time. The goal is not to assign blame but to improve the system.
Postmortem Structure
# Postmortem: Orders API P1 — 2026-05-20
**Severity**: P1
**Duration**: 47 minutes (14:22 – 15:09 UTC)
**Impact**: ~3,200 users unable to place orders; ~$140,000 GMV at risk
**Incident Commander**: [Name]
**Scribe**: [Name]
## Summary
A misconfigured connection string deployed in release v2.14.1 caused
the orders-api to fail all database writes starting at 14:22. The
issue was detected by the HighErrorRate alert at 14:27 (5-minute lag).
Rollback was completed at 15:09.
## Timeline
| Time (UTC) | Event |
|---|---|
| 14:15 | v2.14.1 deployed to production (automated canary, 10% traffic) |
| 14:22 | First 5xx errors appear in Grafana |
| 14:27 | HighErrorRate alert fires; on-call paged |
| 14:32 | On-call acknowledges, begins triage |
| 14:45 | Root cause identified (connection string) |
| 15:02 | Rollback initiated |
| 15:09 | All services healthy, alert resolved |
## Root Cause
The staging connection string was included in the production deployment
because the CI pipeline environment variable substitution was not validated
before artifact promotion. The staging DB has different credentials; all
writes failed with authentication errors.
## Contributing Factors
1. No automated smoke test verifying DB connectivity before canary promotion
2. Canary traffic was 10% — enough to trigger the alert but slow to detect
3. Health check did not check write access, only read (`SELECT 1`)
## What Went Well
- Alert fired within 5 minutes of first errors
- Runbook was followed accurately; no improvisation needed
- Rollback was clean and took 7 minutes
## Action Items
| Item | Owner | Due |
|---|---|---|
| Add DB write smoke test to CI pipeline pre-promotion gate | @platform | 2026-06-03 |
| Update health check to verify write access with a no-op write | @platform | 2026-05-28 |
| Add env var validation step to deployment pipeline | @devops | 2026-06-03 |
| Reduce canary window from 10 min to 3 min for write-path changes | @platform | 2026-06-10 |
## Lessons Learned
- Write-path issues need faster canary promotion gates than read-path issues.
- Our health check gave false confidence — "Healthy" does not mean "can write".
- Environment variable substitution validation is a standard CI step we were missing.
The Most Important Rule: Follow Up on Action Items
A postmortem with no closed action items is worse than no postmortem — it erodes trust in the process. Assign every item to a named person with a specific date. Review open items weekly until closed.
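The weekly review is simple enough to automate. A minimal sketch, assuming action items can be exported from your tracker into a plain record (`ActionItem` and `ActionItemReview` are hypothetical names):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical shape for one postmortem action item.
public sealed record ActionItem(string Title, string Owner, DateOnly Due, bool Closed);

public static class ActionItemReview
{
    // Open items whose due date is strictly before 'today', oldest first.
    public static IReadOnlyList<ActionItem> Overdue(IEnumerable<ActionItem> items, DateOnly today) =>
        items.Where(item => !item.Closed && item.Due < today)
             .OrderBy(item => item.Due)
             .ToList();
}
```

Running `ActionItemReview.Overdue(items, today)` from a scheduled job and posting the result to the team channel keeps open items visible without relying on anyone remembering to check.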
Incident Communication
During a P1, communication clarity prevents secondary chaos:
Initial notice (within 5 minutes of detection):
"We are investigating elevated error rates on the orders API. Users may experience failures when placing orders. We will update in 15 minutes."
Updates every 15 minutes:
"Update: root cause identified as a database configuration issue. Mitigation in progress. ETA 20 minutes."
Resolution:
"Resolved at 15:09 UTC. Root cause: misconfigured connection string in release v2.14.1. Rollback complete. All services healthy. Postmortem will be published within 48 hours."
Keep the channel (Slack, Teams, status page) updated even when you have nothing new to say. "Still investigating" every 15 minutes is better than silence.
Key Takeaways
- Correlation IDs in every request are the single highest-leverage investment for incident diagnosis. Implement IncidentCorrelationMiddleware before anything else.
- Actionable health checks cut triage time from minutes to seconds. Return structured JSON with last error and timestamp, not just "Unhealthy".
- Runbooks linked from alerts are used; runbooks in a wiki index are ignored.
- Postmortems are only valuable if action items close. Track them in your sprint board, not a separate document.
- Blameless culture requires explicit reinforcement in every postmortem. Start the meeting with "we are here to improve the system, not assign fault."