Distributed Observability — Tracing Across Microservices
Implement observability in .NET microservices: distributed tracing with OpenTelemetry, centralized structured logging with correlation IDs, health checks, metrics with Prometheus, and a production monitoring stack.
The Three Pillars of Observability
Logs: what happened — structured text records of events
Metrics: how the system is performing — counters, gauges, histograms
Traces: how a request flowed — end-to-end journey across services
In a monolith: one log file, one process to debug.
In microservices: 8 services, 8 log streams, 8 dashboards.
Without observability: you know something is broken, but not where.
With observability: "request abc-123 spent 400ms in LabService.GetResults()"
OpenTelemetry in .NET
// NuGet: OpenTelemetry.Extensions.Hosting
//        OpenTelemetry.Instrumentation.AspNetCore
//        OpenTelemetry.Instrumentation.Http
//        OpenTelemetry.Instrumentation.SqlClient
//        OpenTelemetry.Instrumentation.Runtime
//        OpenTelemetry.Exporter.OpenTelemetryProtocol
// Namespaces: OpenTelemetry.Resources, OpenTelemetry.Trace, OpenTelemetry.Metrics
builder.Services.AddOpenTelemetry()
    .ConfigureResource(resource =>
        resource.AddService(
            serviceName: "prescription-service",
            serviceVersion: "1.2.0",
            serviceInstanceId: Environment.MachineName))
    .WithTracing(tracing =>
    {
        tracing
            .AddAspNetCoreInstrumentation(options =>
            {
                // Keep health probe requests out of the traces
                options.Filter = ctx => !ctx.Request.Path.StartsWithSegments("/health");
            })
            .AddHttpClientInstrumentation()
            .AddSqlClientInstrumentation()
            // Custom ActivitySource names; the trailing * is a wildcard
            .AddSource("SystemForge.Prescriptions.*")
            .AddOtlpExporter(opts =>
                opts.Endpoint = new Uri("http://otel-collector:4317"));
    })
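    .WithMetrics(metrics =>
    {
        metrics
            .AddAspNetCoreInstrumentation()
            .AddRuntimeInstrumentation()
            // Custom meters must be registered by name or their instruments are not exported.
            // Assumes the custom Meter name shares the prefix used for the ActivitySource names.
            .AddMeter("SystemForge.Prescriptions.*")
            // Same collector as tracing; without an endpoint the exporter defaults to localhost
            .AddOtlpExporter(opts =>
                opts.Endpoint = new Uri("http://otel-collector:4317"));
    });
Correlation ID Propagation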
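// Middleware: ensure every request has a correlation ID that flows across services
// Requires: using System.Diagnostics; (Activity) and using Serilog.Context; (LogContext)
// LogContext properties only reach the logs if Serilog is configured with .Enrich.FromLogContext()
public sealed class CorrelationIdMiddleware : IMiddleware
{
    private const string Header = "X-Correlation-Id";

    public async Task InvokeAsync(HttpContext context, RequestDelegate next)
    {
        // Prefer the caller's ID, then the current trace ID, then a fresh GUID
        var correlationId = context.Request.Headers[Header].FirstOrDefault()
            ?? Activity.Current?.TraceId.ToString()
            ?? Guid.NewGuid().ToString("N");

        // Set before calling next: headers cannot be changed once the response has started
        context.Response.Headers[Header] = correlationId;

        using (LogContext.PushProperty("CorrelationId", correlationId))
        {
            await next(context);
        }
    }
}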
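Because the middleware implements IMiddleware, ASP.NET Core resolves it from the DI container, so it has to be registered as a service as well as added to the pipeline. The following Program.cs fragment is a minimal wiring sketch, not the author's actual setup: the named client "patient-service" and its base address are assumptions for illustration, and CorrelationIdDelegatingHandler is the handler defined right after this snippet.

```csharp
// Program.cs (sketch): wiring for the correlation ID middleware and handler.
var builder = WebApplication.CreateBuilder(args);

// IMiddleware implementations are activated through DI, so the type must be registered.
builder.Services.AddTransient<CorrelationIdMiddleware>();

// Delegating handlers are attached per HttpClient registration.
builder.Services.AddHttpContextAccessor();
builder.Services.AddTransient<CorrelationIdDelegatingHandler>();
builder.Services
    .AddHttpClient("patient-service", client =>
        client.BaseAddress = new Uri("http://patient-service")) // assumed name and URL
    .AddHttpMessageHandler<CorrelationIdDelegatingHandler>();

var app = builder.Build();

// Register early so that later middleware and endpoints log with the correlation ID.
app.UseMiddleware<CorrelationIdMiddleware>();

app.Run();
```

Without the AddTransient registration, the first request fails at runtime because the middleware factory cannot resolve the type.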
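// HttpClient: forward the correlation ID to downstream services.
// W3C traceparent is already propagated by the OpenTelemetry HttpClient instrumentation;
// this handler adds only the custom X-Correlation-Id header.
public sealed class CorrelationIdDelegatingHandler : DelegatingHandler
{
    private const string Header = "X-Correlation-Id";
    private readonly IHttpContextAccessor _accessor;

    public CorrelationIdDelegatingHandler(IHttpContextAccessor accessor) => _accessor = accessor;

    protected override Task<HttpResponseMessage> SendAsync(
        HttpRequestMessage request, CancellationToken ct)
    {
        // CorrelationIdMiddleware has already written the ID to the inbound response headers
        var correlationId = _accessor.HttpContext?.Response.Headers[Header].FirstOrDefault();
        if (!string.IsNullOrEmpty(correlationId))
            request.Headers.TryAddWithoutValidation(Header, correlationId);

        return base.SendAsync(request, ct);
    }
}
Health Checks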
// NuGet: AspNetCore.HealthChecks.SqlServer, AspNetCore.HealthChecks.Redis,
//        AspNetCore.HealthChecks.Uris (AddUrlGroup), AspNetCore.HealthChecks.UI.Client (UIResponseWriter)
builder.Services.AddHealthChecks()
    .AddSqlServer(
        connectionString: config.GetConnectionString("Default")!,
        name: "sql-server",
        tags: new[] { "ready" })
    .AddRedis(
        // Passed positionally: this parameter is named redisConnectionString, not connectionString
        config.GetConnectionString("Redis")!,
        name: "redis",
        tags: new[] { "ready" })
    .AddUrlGroup(
        uri: new Uri("http://patient-service/health/live"),
        name: "patient-service-upstream",
        tags: new[] { "ready" });

// Two endpoints:
app.MapHealthChecks("/health/live", new HealthCheckOptions
{
    // Liveness: is the process up? No registered check carries the "live" tag, so no
    // dependency checks run and the endpoint reports Healthy while the process responds.
    Predicate = check => check.Tags.Contains("live")
});
app.MapHealthChecks("/health/ready", new HealthCheckOptions
{
    // Readiness: can the service handle traffic? (all dependencies checked)
    Predicate = check => check.Tags.Contains("ready"),
    ResponseWriter = UIResponseWriter.WriteHealthCheckUIResponse
});
Custom Business Metrics
// Track domain-specific metrics, beyond just HTTP latency
// Requires: using System.Diagnostics; (TagList) and using System.Diagnostics.Metrics;
// The meter name is an assumption; whatever name is used must also be passed to AddMeter(...)
private static readonly Meter Meter = new("SystemForge.Prescriptions.Metrics");

private static readonly Counter<long> PrescriptionsCreated =
    Meter.CreateCounter<long>(
        "prescriptions.created.total",
        description: "Total prescriptions created");

private static readonly Histogram<double> InrValueDistribution =
    Meter.CreateHistogram<double>(
        "clinical.inr.value",
        unit: "INR units",
        description: "Distribution of INR values recorded");

// In handler (TagList has no string indexer, so tags are added as key/value pairs).
// Keep tag values low-cardinality: every distinct value creates a new time series.
PrescriptionsCreated.Add(1, new TagList
{
    { "medication_name", prescription.MedicationName },
    { "ward_id", prescription.WardId?.ToString() ?? "unassigned" },
});
InrValueDistribution.Record(inrValue, new TagList
{
    { "patient_ward", wardCode },
    { "in_range", (inrValue >= 2.0 && inrValue <= 3.0).ToString() },
});
// Dashboard: "INR values out of range in the last hour by ward", visible in Prometheus/Grafana
Production issue I've seen: A clinical system had 9 microservices with no distributed tracing. When a ward nurse reported "prescriptions are taking too long," the on-call engineer had to check 9 separate log streams and manually correlate timestamps. It took 40 minutes to find that one upstream API (the patient demographics service) had a slow query degrading every downstream call. After adding OpenTelemetry, a single trace spanning all 9 services made the same root cause visible in Jaeger in 30 seconds.
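In a trace like the one from that incident, the slow step stands out only if it has its own span. Automatic instrumentation covers HTTP and SQL calls; for business operations, a custom span can be created with ActivitySource. The sketch below is one way to do that, not the author's code: the source name "SystemForge.Prescriptions.Handlers", the helper TraceAsync, and the tag "patient.ward" are hypothetical, chosen so the source name matches the "SystemForge.Prescriptions.*" pattern passed to AddSource.

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;

public static class PrescriptionTracing
{
    // Hypothetical source name; it matches the "SystemForge.Prescriptions.*" AddSource pattern.
    public static readonly ActivitySource Source = new("SystemForge.Prescriptions.Handlers");

    // Runs a unit of work inside its own span so its duration is visible in the trace.
    public static async Task<T> TraceAsync<T>(string spanName, string patientWard, Func<Task<T>> work)
    {
        // StartActivity returns null when no listener is interested (tracing off or not sampled).
        using var activity = Source.StartActivity(spanName, ActivityKind.Internal);
        activity?.SetTag("patient.ward", patientWard); // hypothetical tag name

        try
        {
            return await work();
        }
        catch (Exception ex)
        {
            // Mark the span as failed so it is easy to find in the trace viewer.
            activity?.SetStatus(ActivityStatusCode.Error, ex.Message);
            throw;
        }
    }
}

// Usage inside a handler (patientClient and patientId are placeholders):
// var demographics = await PrescriptionTracing.TraceAsync(
//     "LoadPatientDemographics", wardCode, () => patientClient.GetDemographicsAsync(patientId));
```

The span becomes a child of whatever activity is current, typically the incoming ASP.NET Core request span, so it appears nested under that request in the trace.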
Key Takeaway
Observability = logs + metrics + traces. OpenTelemetry is the standard for distributed tracing in .NET — instrument once, export to Jaeger, Grafana Tempo, or Azure Monitor. Propagate correlation IDs across all service boundaries. Implement liveness and readiness health checks separately. Add business-domain metrics (prescriptions created, INR values out of range) — not just HTTP latency. Without distributed tracing in microservices, debugging is guesswork.