System Design · Advanced

Observability in Distributed Systems: Logs, Metrics & Traces

Master observability for distributed systems — structured logging with Serilog, metrics with Prometheus and Grafana, distributed tracing with OpenTelemetry, health checks, and alerting. Full .NET implementation guide.

Learnixo · April 13, 2026 · 8 min read
Observability · OpenTelemetry · Distributed Tracing · Serilog · Prometheus · Grafana · .NET

Why Observability Matters

In a monolith, you debug with a stack trace and a local debugger. In a distributed system, a single user request touches 5–10 services. When it fails, the error appears in Service A but the root cause is in Service E.

Observability is the ability to understand the internal state of a system from its external outputs. It has three pillars:

| Pillar | What It Answers | Tool |
|--------|-----------------|------|
| Logs | What happened and when? | Serilog + Seq / ELK |
| Metrics | How is the system performing? | Prometheus + Grafana |
| Traces | Which services did a request touch? | OpenTelemetry + Jaeger/Zipkin |

You need all three — each answers different questions.


Structured Logging with Serilog

Plain text logs are useless at scale. Structured logs emit key-value pairs that can be queried.

C#
// ❌ Unstructured — you cannot query "all orders over £100"
_logger.LogInformation("Order ord-123 confirmed for £150 by customer cust-456");

// ✅ Structured — every field is queryable
_logger.LogInformation(
    "Order {OrderId} confirmed for {Total} by customer {CustomerId}",
    orderId, total, customerId);
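
In Seq (or any CLEF-compatible sink), that call is stored as an event whose properties are individually queryable. Roughly this shape, where the timestamp and values are illustrative:

JSON
// Stored event (CLEF format) — values are illustrative
{
  "@t": "2026-04-13T10:24:01.123Z",
  "@mt": "Order {OrderId} confirmed for {Total} by customer {CustomerId}",
  "OrderId": "ord-123",
  "Total": 150,
  "CustomerId": "cust-456"
}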

Setup

C#
// Program.cs
builder.Host.UseSerilog((ctx, services, config) =>
{
    config
        .ReadFrom.Configuration(ctx.Configuration)
        .ReadFrom.Services(services)
        .Enrich.FromLogContext()
        .Enrich.WithMachineName()
        .Enrich.WithEnvironmentName()
        .Enrich.WithProperty("Application", "OrderService")
        .WriteTo.Console(new ExpressionTemplate(
            "[{@t:HH:mm:ss} {@l:u3}] {#if SourceContext is not null}{SourceContext}: {#end}{@m}\n{@x}"))
        .WriteTo.Seq(ctx.Configuration["Seq:Url"]!);
});
JSON
// appsettings.json
{
  "Serilog": {
    "MinimumLevel": {
      "Default": "Information",
      "Override": {
        "Microsoft.AspNetCore": "Warning",
        "Microsoft.EntityFrameworkCore": "Warning"
      }
    }
  },
  "Seq": { "Url": "http://seq:5341" }
}
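
Serilog also ships request-logging middleware that collapses ASP.NET Core's several framework messages per request into a single structured event; a minimal sketch:

C#
// Program.cs: one structured event per HTTP request
app.UseSerilogRequestLogging(options =>
{
    // Attach extra properties to the completion event
    options.EnrichDiagnosticContext = (diagnosticContext, httpContext) =>
    {
        diagnosticContext.Set("ClientIp", httpContext.Connection.RemoteIpAddress?.ToString());
    };
});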

Correlation IDs

Every request gets a correlation ID that flows through all downstream calls:

C#
// Middleware to propagate correlation ID
public class CorrelationIdMiddleware
{
    private const string HeaderName = "X-Correlation-ID";
    private readonly RequestDelegate _next;

    public CorrelationIdMiddleware(RequestDelegate next) => _next = next;

    public async Task InvokeAsync(HttpContext context)
    {
        var correlationId = context.Request.Headers[HeaderName].FirstOrDefault()
                            ?? Guid.NewGuid().ToString("N");

        context.Response.Headers[HeaderName] = correlationId;

        using (LogContext.PushProperty("CorrelationId", correlationId))
        {
            context.Items["CorrelationId"] = correlationId;
            await _next(context);
        }
    }
}
C#
// Pass correlation ID to downstream HTTP calls
public class CorrelationIdDelegatingHandler : DelegatingHandler
{
    private readonly IHttpContextAccessor _accessor;
    public CorrelationIdDelegatingHandler(IHttpContextAccessor accessor) => _accessor = accessor;

    protected override Task<HttpResponseMessage> SendAsync(
        HttpRequestMessage request, CancellationToken ct)
    {
        var correlationId = _accessor.HttpContext?.Items["CorrelationId"]?.ToString();
        if (correlationId is not null)
            request.Headers.TryAddWithoutValidation("X-Correlation-ID", correlationId);
        return base.SendAsync(request, ct);
    }
}
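
Both pieces need wiring up in Program.cs; a minimal sketch, where the HttpClient name "InventoryClient" is illustrative:

C#
// Program.cs: wiring for the middleware and the delegating handler
builder.Services.AddHttpContextAccessor();
builder.Services.AddTransient<CorrelationIdDelegatingHandler>();

// Attach the handler to any outgoing HttpClient (client name is illustrative)
builder.Services.AddHttpClient("InventoryClient")
    .AddHttpMessageHandler<CorrelationIdDelegatingHandler>();

var app = builder.Build();
app.UseMiddleware<CorrelationIdMiddleware>();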

Log Levels — What to Log Where

Verbose/Trace → internal state, loop iterations (dev only — never in prod)
Debug         → diagnostic info useful in development
Information   → significant events: request received, order confirmed, user logged in
Warning       → unexpected state, recoverable: cache miss, retry attempt, deprecated feature used
Error         → operation failed: exception caught, command handler threw
Fatal/Critical → system cannot continue: DB connection pool exhausted, config missing
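
For example, a retry attempt is a Warning (recoverable) while the final failure is an Error; a short sketch with illustrative variables:

C#
// Warning: unexpected but recoverable; the retry may still succeed
_logger.LogWarning(
    "Payment gateway call failed, retrying ({Attempt}/{MaxAttempts})",
    attempt, maxAttempts);

// Error: the operation itself has failed
_logger.LogError(ex,
    "Payment for order {OrderId} failed after {MaxAttempts} attempts",
    orderId, maxAttempts);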

Metrics with Prometheus and Grafana

Metrics track numbers over time — request rate, error rate, latency percentiles, queue depth.

The Four Golden Signals

| Signal | Description | Alert When |
|--------|-------------|------------|
| Latency | How long requests take (p50, p95, p99) | p99 > SLA threshold |
| Traffic | Request rate (req/s) | Sudden drop (maybe an upstream outage) |
| Errors | Error rate (5xx/s, failed transactions/s) | Error rate > X% |
| Saturation | How full the system is (CPU %, queue depth, connection pool) | Approaching limits |

Setup in .NET

C#
// Package: prometheus-net.AspNetCore
builder.Services.AddMetrics();

app.UseHttpMetrics(options =>
{
    options.AddCustomLabel("service", _ => "order-service");
    options.ReduceStatusCodeCardinality();  // report 2xx/4xx/5xx instead of every code
});

app.MapMetrics("/metrics");  // Prometheus scrapes this endpoint

Custom Business Metrics

C#
// Register metrics as singletons
public class OrderMetrics
{
    private readonly Counter     _ordersCreated;
    private readonly Counter     _ordersConfirmed;
    private readonly Counter     _ordersFailed;
    private readonly Histogram   _orderValue;
    private readonly Gauge       _pendingOrders;

    public OrderMetrics()
    {
        _ordersCreated  = Metrics.CreateCounter(
            "orders_created_total",
            "Total orders created",
            new CounterConfiguration { LabelNames = ["region"] });

        _ordersConfirmed = Metrics.CreateCounter(
            "orders_confirmed_total",
            "Total orders confirmed");

        _ordersFailed = Metrics.CreateCounter(
            "orders_failed_total",
            "Total orders failed",
            new CounterConfiguration { LabelNames = ["reason"] });

        _orderValue = Metrics.CreateHistogram(
            "order_value_gbp",
            "Distribution of order values in GBP",
            new HistogramConfiguration
            {
                Buckets = Histogram.LinearBuckets(start: 10, width: 10, count: 20)
            });

        _pendingOrders = Metrics.CreateGauge(
            "orders_pending",
            "Current count of orders in Pending state");
    }

    public void RecordOrderCreated(string region)
        => _ordersCreated.WithLabels(region).Inc();

    public void RecordOrderConfirmed(decimal value)
    {
        _ordersConfirmed.Inc();
        _orderValue.Observe((double)value);
    }

    public void RecordOrderFailed(string reason)
        => _ordersFailed.WithLabels(reason).Inc();

    public void SetPendingOrders(long count)
        => _pendingOrders.Set(count);
}
C#
// Use in command handler
public class ConfirmOrderCommandHandler : IRequestHandler<ConfirmOrderCommand>
{
    private readonly IOrderRepository _repo;
    private readonly OrderMetrics _metrics;

    public ConfirmOrderCommandHandler(IOrderRepository repo, OrderMetrics metrics)
        => (_repo, _metrics) = (repo, metrics);

    public async Task Handle(ConfirmOrderCommand cmd, CancellationToken ct)
    {
        var order = await _repo.GetByIdAsync(cmd.OrderId, ct)
            ?? throw new NotFoundException(cmd.OrderId);

        order.Confirm();
        await _repo.SaveChangesAsync(ct);

        _metrics.RecordOrderConfirmed(order.Total.Amount);
    }
}
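
As the comment above notes, OrderMetrics should live as a singleton so the counters are created exactly once per process; a one-line sketch:

C#
// Program.cs: one shared OrderMetrics instance for the process lifetime
builder.Services.AddSingleton<OrderMetrics>();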

Grafana Dashboard Queries (PromQL)

PROMQL
# Request rate (last 5 minutes)
rate(http_requests_total{job="order-service"}[5m])

# Error rate (5xx as % of total)
sum(rate(http_requests_total{status=~"5.."}[5m]))
/ sum(rate(http_requests_total[5m])) * 100

# p99 latency
histogram_quantile(0.99,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

# Orders confirmed per minute
rate(orders_confirmed_total[1m]) * 60

Distributed Tracing with OpenTelemetry

A trace represents the full lifecycle of a request. Each service adds a span — a named, timed unit of work. Spans are linked by a TraceId that flows across service boundaries.

TraceId: abc-123
  │
  ├── Span: OrderService.POST /orders (0ms → 48ms)
  │     ├── Span: EF Core: INSERT Orders (2ms → 8ms)
  │     └── Span: HTTP GET InventoryService (10ms → 45ms)
  │           ├── Span: EF Core: SELECT Stock (2ms → 6ms)
  │           └── Span: Redis GET cache (1ms → 2ms)
  └── Span: OrderConfirmedConsumer (200ms → 225ms)

Setup in .NET

C#
// Package: OpenTelemetry.Extensions.Hosting + exporters
builder.Services.AddOpenTelemetry()
    .ConfigureResource(resource =>
    {
        resource.AddService(
            serviceName:    "order-service",
            serviceVersion: "1.0.0");
    })
    .WithTracing(tracing =>
    {
        tracing
            .AddAspNetCoreInstrumentation(options =>
            {
                options.RecordException = true;
                options.EnrichWithHttpRequest = (activity, request) =>
                    activity.SetTag("user.id", request.HttpContext.User.FindFirst("sub")?.Value);
            })
            .AddEntityFrameworkCoreInstrumentation(options =>
            {
                options.SetDbStatementForText = true;  // include SQL in spans
            })
            .AddHttpClientInstrumentation()
            .AddSource("OrderService")            // custom spans
            .AddOtlpExporter(options =>
            {
                options.Endpoint = new Uri("http://otelcollector:4317");
            });
    })
    .WithMetrics(metrics =>
    {
        metrics
            .AddAspNetCoreInstrumentation()
            .AddRuntimeInstrumentation()
            .AddPrometheusExporter();
    });
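
The Prometheus exporter above publishes its metrics on a scrape endpoint that still needs mapping; a minimal sketch, assuming the OpenTelemetry.Exporter.Prometheus.AspNetCore package:

C#
// Exposes /metrics (by default) for Prometheus to scrape
app.UseOpenTelemetryPrometheusScrapingEndpoint();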

Custom Spans for Business Operations

C#
public class ConfirmOrderCommandHandler : IRequestHandler<ConfirmOrderCommand>
{
    private static readonly ActivitySource _activitySource = new("OrderService");
    private readonly IOrderRepository _repo;

    public ConfirmOrderCommandHandler(IOrderRepository repo) => _repo = repo;

    public async Task Handle(ConfirmOrderCommand cmd, CancellationToken ct)
    {
        using var activity = _activitySource.StartActivity("ConfirmOrder");
        activity?.SetTag("order.id", cmd.OrderId.ToString());

        var order = await _repo.GetByIdAsync(cmd.OrderId, ct)
            ?? throw new NotFoundException(cmd.OrderId);

        order.Confirm();

        activity?.SetTag("order.total",    order.Total.Amount.ToString());
        activity?.SetTag("order.currency", order.Total.Currency);

        await _repo.SaveChangesAsync(ct);

        activity?.SetStatus(ActivityStatusCode.Ok);
    }
}
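
If the business operation throws, mark the span as failed before rethrowing; a short sketch using OpenTelemetry's RecordException extension (from the OpenTelemetry.Trace namespace):

C#
try
{
    order.Confirm();
}
catch (Exception ex)
{
    // Mark the span failed and attach the exception as a span event
    activity?.SetStatus(ActivityStatusCode.Error, ex.Message);
    activity?.RecordException(ex);
    throw;
}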

OpenTelemetry Collector Config

YAML
# otel-collector.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  batch:
    timeout: 1s
    send_batch_size: 1024

exporters:
  jaeger:
    endpoint: jaeger:14250
    tls:
      insecure: true
  prometheus:
    endpoint: 0.0.0.0:8889

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [jaeger]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]

Health Checks

Health checks let Kubernetes (or Azure App Service) know if your service is ready to serve traffic.

C#
builder.Services.AddHealthChecks()
    .AddDbContextCheck<AppDbContext>("database")
    .AddRedis(builder.Configuration["Redis:ConnectionString"]!, "redis")
    .AddAzureServiceBusTopic(
        builder.Configuration["ServiceBus:ConnectionString"]!,
        "orders",
        "servicebus")
    .AddCheck<OutboxHealthCheck>("outbox");

// Separate liveness (is the process alive?) from readiness (is it ready for traffic?)
app.MapHealthChecks("/health/live",  new HealthCheckOptions { Predicate = _ => false });
app.MapHealthChecks("/health/ready", new HealthCheckOptions
{
    ResponseWriter = UIResponseWriter.WriteHealthCheckUIResponse,
});
C#
// Custom health check: alert if outbox is backing up
public class OutboxHealthCheck : IHealthCheck
{
    private readonly AppDbContext _db;
    public OutboxHealthCheck(AppDbContext db) => _db = db;

    public async Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context, CancellationToken ct)
    {
        var unprocessed = await _db.OutboxMessages
            .CountAsync(m => m.ProcessedAt == null && m.OccurredAt < DateTimeOffset.UtcNow.AddMinutes(-5), ct);

        if (unprocessed > 100)
            return HealthCheckResult.Unhealthy($"Outbox has {unprocessed} stuck messages.");

        if (unprocessed > 10)
            return HealthCheckResult.Degraded($"Outbox has {unprocessed} unprocessed messages.");

        return HealthCheckResult.Healthy();
    }
}
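
On the Kubernetes side, those two endpoints map directly onto probes; a minimal deployment excerpt, where the port and timings are illustrative:

YAML
# deployment.yaml (excerpt): port and timings are illustrative
livenessProbe:
  httpGet:
    path: /health/live
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 15
readinessProbe:
  httpGet:
    path: /health/ready
    port: 8080
  periodSeconds: 10
  failureThreshold: 3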

Alerting Rules

Write alerts against your metrics. Common ones:

YAML
# prometheus-alerts.yaml
groups:
  - name: order-service
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{service="order-service",status=~"5.."}[5m]))
          / sum(rate(http_requests_total{service="order-service"}[5m])) > 0.05
        for: 2m
        annotations:
          summary: "Error rate above 5% for 2 minutes"

      - alert: HighP99Latency
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket{service="order-service"}[5m])) by (le)
          ) > 2
        for: 5m
        annotations:
          summary: "p99 latency above 2s"

      - alert: OutboxBackingUp
        expr: orders_outbox_pending > 50
        for: 5m
        annotations:
          summary: "Outbox processor may be stuck"

      - alert: CircuitBreakerOpen
        expr: resilience_pipeline_open{service="order-service"} == 1
        for: 1m
        annotations:
          summary: "Circuit breaker is open — downstream dependency down"

Putting It Together: Local Dev Stack

YAML
# docker-compose.observability.yml
services:
  seq:
    image: datalust/seq:latest
    ports:
      - "5341:5341"   # ingestion (matches Seq:Url in appsettings)
      - "8081:80"     # web UI
    environment:
      ACCEPT_EULA: "Y"

  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    ports:
      - "9090:9090"

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    environment:
      GF_SECURITY_ADMIN_PASSWORD: admin
    volumes:
      - ./grafana/provisioning:/etc/grafana/provisioning

  jaeger:
    image: jaegertracing/all-in-one:latest
    ports:
      - "16686:16686"   # Jaeger UI
      - "14250:14250"   # gRPC receiver

  otel-collector:
    image: otel/opentelemetry-collector-contrib:latest
    command: ["--config=/etc/otel-collector.yaml"]
    volumes:
      - ./otel-collector.yaml:/etc/otel-collector.yaml
    ports:
      - "4317:4317"     # OTLP gRPC

Key Takeaways

  • Structured logs make searching and alerting possible — plain text logs are archaeology at scale
  • Correlation IDs are non-negotiable in distributed systems — without them you cannot trace a request across services
  • The four golden signals (latency, traffic, errors, saturation) cover 90% of what you need to alert on
  • OpenTelemetry is the standard — instrument once, export to Jaeger, Zipkin, Azure Monitor, Datadog, or any OTLP-compatible backend
  • Health checks with separate liveness and readiness endpoints let Kubernetes safely route traffic
  • Alerting on the right metrics (p99 latency, error rate, circuit breaker state) means you find out about problems before your users do
  • Observability is not something you add later — instrument from day one, it's far cheaper than debugging blind in production

Enjoyed this article?

Explore the System Design learning path for more.
