Distributed Observability — Tracing Across Microservices

The Three Pillars of Observability

Logs:    what happened — structured text records of events
Metrics: how the system is performing — counters, gauges, histograms
Traces:  how a request flowed — end-to-end journey across services

In a monolith: one log file, one process to debug.
In microservices: 8 services, 8 log streams, 8 dashboards.
Without observability: you know something is broken, but not where.
With observability: "request abc-123 spent 400ms in LabService.GetResults()"

OpenTelemetry in .NET

// NuGet: OpenTelemetry.Extensions.Hosting
//         OpenTelemetry.Instrumentation.AspNetCore
//         OpenTelemetry.Instrumentation.Http
//         OpenTelemetry.Instrumentation.SqlClient
//         OpenTelemetry.Instrumentation.Runtime   (needed for AddRuntimeInstrumentation below)
//         OpenTelemetry.Exporter.OpenTelemetryProtocol

builder.Services.AddOpenTelemetry()
    .ConfigureResource(resource =>
        resource.AddService(
            serviceName:    "prescription-service",
            serviceVersion: "1.2.0",
            serviceInstanceId: Environment.MachineName))
    .WithTracing(tracing =>
    {
        tracing
            .AddAspNetCoreInstrumentation(options =>
            {
                options.Filter = ctx => !ctx.Request.Path.StartsWithSegments("/health");
            })
            .AddHttpClientInstrumentation()
            .AddSqlClientInstrumentation()
            .AddSource("SystemForge.Prescriptions.*")
            .AddOtlpExporter(opts =>
                opts.Endpoint = new Uri("http://otel-collector:4317"));
    })
    .WithMetrics(metrics =>
    {
        metrics
            .AddAspNetCoreInstrumentation()
            .AddRuntimeInstrumentation()
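            // Without an explicit endpoint the exporter defaults to http://localhost:4317
            .AddOtlpExporter(opts =>
                opts.Endpoint = new Uri("http://otel-collector:4317"));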
    });
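
The AddSource call above only subscribes to sources whose names match the pattern; the spans themselves still have to be created in application code. A minimal sketch of that, assuming a hypothetical source name SystemForge.Prescriptions.Handlers and a hypothetical CreatePrescription method (neither appears in the original configuration):

```csharp
using System;
using System.Diagnostics;

public static class PrescriptionTracing
{
    // Hypothetical source name; it has to match the pattern passed to AddSource(...)
    public static readonly ActivitySource Source =
        new("SystemForge.Prescriptions.Handlers");

    public static string? CreatePrescription(string wardCode)
    {
        // StartActivity returns null when nothing is listening to this source,
        // so every use of the activity is null-conditional
        using var activity = Source.StartActivity("CreatePrescription", ActivityKind.Internal);
        activity?.SetTag("ward.code", wardCode);

        // ... domain logic would run here; its duration becomes the span duration ...

        return activity?.TraceId.ToString();
    }
}
```

When a listener such as the OpenTelemetry SDK is attached, this span is nested under the incoming ASP.NET Core request span, which is what makes "time spent in one method" visible in the trace view.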

Correlation ID Propagation

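// Middleware: ensure every request has a correlation ID that flows across services
// Requires: using System.Diagnostics; using Serilog.Context;
// (LogContext is Serilog's; the logger must be configured with Enrich.FromLogContext())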
public sealed class CorrelationIdMiddleware : IMiddleware
{
    private const string Header = "X-Correlation-Id";

    public async Task InvokeAsync(HttpContext context, RequestDelegate next)
    {
        var correlationId = context.Request.Headers[Header].FirstOrDefault()
            ?? Activity.Current?.TraceId.ToString()
            ?? Guid.NewGuid().ToString("N");

        context.Response.Headers[Header] = correlationId;

        using (LogContext.PushProperty("CorrelationId", correlationId))
        {
            await next(context);
        }
    }
}

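// HttpClient: forward correlation ID to downstream services. W3C traceparent is already
// propagated by the OpenTelemetry HttpClient instrumentation; this adds the custom header.
public sealed class CorrelationIdDelegatingHandler : DelegatingHandler
{
    private const string Header = "X-Correlation-Id";
    private readonly IHttpContextAccessor _accessor;
    public CorrelationIdDelegatingHandler(IHttpContextAccessor accessor) => _accessor = accessor;

    protected override Task<HttpResponseMessage> SendAsync(
        HttpRequestMessage request, CancellationToken ct)
    {
        // The middleware writes the resolved ID to the response headers; reuse it, or fall back to the trace ID
        var correlationId = _accessor.HttpContext?.Response.Headers[Header].FirstOrDefault()
            ?? Activity.Current?.TraceId.ToString();
        if (correlationId is not null)
            request.Headers.TryAddWithoutValidation(Header, correlationId);
        return base.SendAsync(request, ct);
    }
}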
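
Neither class does anything until it is registered. Because the middleware implements IMiddleware, it is resolved from the container and has to be added to DI as well as to the pipeline. A sketch of the wiring in Program.cs, assuming a named HttpClient called "patient-service" (the client name and base address are illustrative, not taken from the original):

```csharp
// Program.cs (fragment): register the middleware and the outgoing handler
builder.Services.AddHttpContextAccessor();                        // only needed if the handler reads HttpContext
builder.Services.AddTransient<CorrelationIdMiddleware>();         // IMiddleware instances come from DI
builder.Services.AddTransient<CorrelationIdDelegatingHandler>();

// Hypothetical named client; every request it sends passes through the handler
builder.Services.AddHttpClient("patient-service", client =>
        client.BaseAddress = new Uri("http://patient-service"))
    .AddHttpMessageHandler<CorrelationIdDelegatingHandler>();

var app = builder.Build();

// Place this early so later middleware and endpoints log with the CorrelationId property
app.UseMiddleware<CorrelationIdMiddleware>();
```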

Health Checks

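// NuGet: AspNetCore.HealthChecks.SqlServer, AspNetCore.HealthChecks.Redis,
//        AspNetCore.HealthChecks.Uris (AddUrlGroup), AspNetCore.HealthChecks.UI.Client (UIResponseWriter)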

builder.Services.AddHealthChecks()
    .AddSqlServer(
        connectionString: config.GetConnectionString("Default")!,
        name:             "sql-server",
        tags:             new[] { "ready" })
    .AddRedis(
        connectionString: config.GetConnectionString("Redis")!,
        name:             "redis",
        tags:             new[] { "ready" })
    .AddUrlGroup(
        uri:  new Uri("http://patient-service/health/live"),
        name: "patient-service-upstream",
        tags: new[] { "ready" });

// Two endpoints:
app.MapHealthChecks("/health/live", new HealthCheckOptions
{
    // Liveness: is the process up? (no dependency checks)
    Predicate = check => check.Tags.Contains("live")
});

app.MapHealthChecks("/health/ready", new HealthCheckOptions
{
    // Readiness: can the service handle traffic? (all dependencies checked)
    Predicate = check => check.Tags.Contains("ready"),
    ResponseWriter = UIResponseWriter.WriteHealthCheckUIResponse
});
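
Note that none of the registrations shown carries the "live" tag, so /health/live runs zero checks and reports Healthy whenever the process can answer the request. That is a reasonable liveness probe as it stands. If an explicit entry in the liveness report is preferred, one option is a trivial self-check; the name "self" here is an arbitrary choice:

```csharp
using Microsoft.Extensions.Diagnostics.HealthChecks;

// AddHealthChecks() can be called more than once; registrations accumulate
builder.Services.AddHealthChecks()
    .AddCheck("self", () => HealthCheckResult.Healthy(), tags: new[] { "live" });
```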

Custom Business Metrics

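// Track domain-specific metrics, beyond just HTTP latency
// Requires: using System.Diagnostics; using System.Diagnostics.Metrics;
// The meter name below is illustrative. Whatever name is used must also be registered with
// metrics.AddMeter("SystemForge.Prescriptions") in WithMetrics, or nothing is exported.
private static readonly Meter Meter = new("SystemForge.Prescriptions", "1.2.0");
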
private static readonly Counter<long> PrescriptionsCreated =
    Meter.CreateCounter<long>(
        "prescriptions.created.total",
        description: "Total prescriptions created");

private static readonly Histogram<double> InrValueDistribution =
    Meter.CreateHistogram<double>(
        "clinical.inr.value",
        unit: "{INR}",   // INR is a dimensionless ratio; UCUM-style annotation instead of free text
        description: "Distribution of INR values recorded");

// In handler:
// TagList's indexer takes an int position, so tags are added with
// collection-initializer syntax, which calls TagList.Add(key, value)
PrescriptionsCreated.Add(1, new TagList
{
    { "medication_name", prescription.MedicationName },
    { "ward_id",         prescription.WardId?.ToString() ?? "unassigned" },
});

InrValueDistribution.Record(inrValue, new TagList
{
    { "patient_ward", wardCode },
    { "in_range",     (inrValue >= 2.0 && inrValue <= 3.0).ToString() },
});

// Dashboard: "INR values out of range in the last hour by ward" — visible in Prometheus/Grafana

Production issue I've seen: A clinical system had 9 microservices with no distributed tracing. When a ward nurse reported "prescriptions are taking too long," the on-call engineer had to check 9 separate log streams and manually correlate timestamps. It took 40 minutes to find that one upstream API (the patient demographics service) had a slow query degrading every downstream call. Adding OpenTelemetry with a single trace spanning all 9 services made the same root cause visible in 30 seconds in Jaeger.

Key Takeaway

Observability = logs + metrics + traces. OpenTelemetry is the standard for distributed tracing in .NET — instrument once, export to Jaeger, Grafana Tempo, or Azure Monitor. Propagate correlation IDs across all service boundaries. Implement liveness and readiness health checks separately. Add business-domain metrics (prescriptions created, INR values out of range) — not just HTTP latency. Without distributed tracing in microservices, debugging is guesswork.
