System Design · Lesson 11 of 26
Observability: Logs, Metrics & Traces
Why Observability Matters
In a monolith, you debug with a stack trace and a local debugger. In a distributed system, a single user request touches 5–10 services. When it fails, the error appears in Service A but the root cause is in Service E.
Observability is the ability to understand the internal state of a system from its external outputs. It has three pillars:
| Pillar | What It Answers | Tool |
|--------|-----------------|------|
| Logs | What happened and when? | Serilog + Seq / ELK |
| Metrics | How is the system performing? | Prometheus + Grafana |
| Traces | Which services did a request touch? | OpenTelemetry + Jaeger/Zipkin |
You need all three — each answers different questions.
Structured Logging with Serilog
Plain text logs are useless at scale. Structured logs emit key-value pairs that can be queried.
// ❌ Unstructured — you cannot query "all orders over £100"
_logger.LogInformation("Order ord-123 confirmed for £150 by customer cust-456");
// ✅ Structured — every field is queryable
_logger.LogInformation(
"Order {OrderId} confirmed for {Total} by customer {CustomerId}",
orderId, total, customerId);
Setup
// Program.cs
builder.Host.UseSerilog((ctx, services, config) =>
{
config
.ReadFrom.Configuration(ctx.Configuration)
.ReadFrom.Services(services)
.Enrich.FromLogContext()
.Enrich.WithMachineName()
.Enrich.WithEnvironmentName()
.Enrich.WithProperty("Application", "OrderService")
.WriteTo.Console(new ExpressionTemplate(
"[{@t:HH:mm:ss} {@l:u3}] {#if SourceContext is not null}{SourceContext}: {#end}{@m}\n{@x}"))
.WriteTo.Seq(ctx.Configuration["Seq:Url"]!);
});
// appsettings.json
{
"Serilog": {
"MinimumLevel": {
"Default": "Information",
"Override": {
"Microsoft.AspNetCore": "Warning",
"Microsoft.EntityFrameworkCore": "Warning"
}
}
},
"Seq": { "Url": "http://seq:5341" }
}
Correlation IDs
Every request gets a correlation ID that flows through all downstream calls:
// Middleware to propagate correlation ID
public class CorrelationIdMiddleware
{
private const string HeaderName = "X-Correlation-ID";
private readonly RequestDelegate _next;
public CorrelationIdMiddleware(RequestDelegate next) => _next = next;
public async Task InvokeAsync(HttpContext context)
{
var correlationId = context.Request.Headers[HeaderName].FirstOrDefault()
?? Guid.NewGuid().ToString("N");
context.Response.Headers[HeaderName] = correlationId;
using (LogContext.PushProperty("CorrelationId", correlationId))
{
context.Items["CorrelationId"] = correlationId;
await _next(context);
}
}
}
// Pass correlation ID to downstream HTTP calls
public class CorrelationIdDelegatingHandler : DelegatingHandler
{
private readonly IHttpContextAccessor _accessor;
public CorrelationIdDelegatingHandler(IHttpContextAccessor accessor) => _accessor = accessor;
protected override Task<HttpResponseMessage> SendAsync(
HttpRequestMessage request, CancellationToken ct)
{
var correlationId = _accessor.HttpContext?.Items["CorrelationId"]?.ToString();
if (correlationId is not null)
request.Headers.TryAddWithoutValidation("X-Correlation-ID", correlationId);
return base.SendAsync(request, ct);
}
}
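Neither piece does anything until it is wired up in Program.cs. A minimal registration sketch, assuming a named HttpClient called "inventory" (the client name is illustrative; the extension methods are standard ASP.NET Core):
// Program.cs (sketch): register the middleware and the delegating handler
builder.Services.AddHttpContextAccessor();
builder.Services.AddTransient<CorrelationIdDelegatingHandler>();
builder.Services.AddHttpClient("inventory")                    // hypothetical named client
    .AddHttpMessageHandler<CorrelationIdDelegatingHandler>();  // outgoing calls carry the ID
var app = builder.Build();
app.UseMiddleware<CorrelationIdMiddleware>();                  // incoming requests get an ID pushed into LogContext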
Log Levels — What to Log Where
Verbose/Trace → internal state, loop iterations (dev only — never in prod)
Debug → diagnostic info useful in development
Information → significant events: request received, order confirmed, user logged in
Warning → unexpected state, recoverable: cache miss, retry attempt, deprecated feature used
Error → operation failed: exception caught, command handler threw
Fatal/Critical → system cannot continue: DB connection pool exhausted, config missing
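A quick illustration of the levels in handler code (the variables and messages are made up for the example):
_logger.LogDebug("Loaded {LineCount} lines for order {OrderId}", lines.Count, orderId);
_logger.LogInformation("Order {OrderId} confirmed for {Total}", orderId, total);
_logger.LogWarning("Payment provider timeout, retry {Attempt} for order {OrderId}", attempt, orderId);
_logger.LogError(ex, "Failed to confirm order {OrderId}", orderId);
_logger.LogCritical("Database connection pool exhausted; order processing halted");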
Metrics with Prometheus and Grafana
Metrics track numbers over time — request rate, error rate, latency percentiles, queue depth.
The Four Golden Signals
| Signal | Description | Alert When |
|--------|-------------|------------|
| Latency | How long requests take (p50, p95, p99) | p99 > SLA threshold |
| Traffic | Request rate (req/s) | Sudden drop (maybe an upstream outage) |
| Errors | Error rate (5xx/s, failed transactions/s) | Error rate > X% |
| Saturation | How full is the system (CPU%, queue depth, connection pool) | Approaching limits |
Setup in .NET
// Package: prometheus-net.AspNetCore
builder.Services.AddMetrics();
app.UseHttpMetrics(options =>
{
options.AddCustomLabel("service", _ => "order-service");
options.ReduceStatusCodeCardinality(); // group 4xx together
});
app.MapMetrics("/metrics"); // Prometheus scrapes this endpoint
Custom Business Metrics
// Register metrics as singletons
public class OrderMetrics
{
private readonly Counter _ordersCreated;
private readonly Counter _ordersConfirmed;
private readonly Counter _ordersFailed;
private readonly Histogram _orderValue;
private readonly Gauge _pendingOrders;
public OrderMetrics()
{
_ordersCreated = Metrics.CreateCounter(
"orders_created_total",
"Total orders created",
new CounterConfiguration { LabelNames = ["region"] });
_ordersConfirmed = Metrics.CreateCounter(
"orders_confirmed_total",
"Total orders confirmed");
_ordersFailed = Metrics.CreateCounter(
"orders_failed_total",
"Total orders failed",
new CounterConfiguration { LabelNames = ["reason"] });
_orderValue = Metrics.CreateHistogram(
"order_value_gbp",
"Distribution of order values in GBP",
new HistogramConfiguration
{
Buckets = Histogram.LinearBuckets(start: 10, width: 10, count: 20)
});
_pendingOrders = Metrics.CreateGauge(
"orders_pending",
"Current count of orders in Pending state");
}
public void RecordOrderCreated(string region)
=> _ordersCreated.WithLabels(region).Inc();
public void RecordOrderConfirmed(decimal value)
{
_ordersConfirmed.Inc();
_orderValue.Observe((double)value);
}
public void RecordOrderFailed(string reason)
=> _ordersFailed.WithLabels(reason).Inc();
public void SetPendingOrders(long count)
=> _pendingOrders.Set(count);
}
// Use in command handler
public class ConfirmOrderCommandHandler : IRequestHandler<ConfirmOrderCommand>
{
private readonly IOrderRepository _repo;
private readonly OrderMetrics _metrics;
public ConfirmOrderCommandHandler(IOrderRepository repo, OrderMetrics metrics)
    => (_repo, _metrics) = (repo, metrics);
public async Task Handle(ConfirmOrderCommand cmd, CancellationToken ct)
{
var order = await _repo.GetByIdAsync(cmd.OrderId, ct)
?? throw new NotFoundException(cmd.OrderId);
order.Confirm();
await _repo.SaveChangesAsync(ct);
_metrics.RecordOrderConfirmed(order.Total.Amount);
}
}
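The handler receives OrderMetrics through its constructor, which only works if it is registered once for the whole process, as the "register metrics as singletons" comment above says. A one-line sketch:
// Program.cs: a single shared instance so counters accumulate across requests
builder.Services.AddSingleton<OrderMetrics>();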
Grafana Dashboard Queries (PromQL)
# Request rate (last 5 minutes)
rate(http_requests_total{job="order-service"}[5m])
# Error rate (5xx as % of total)
sum(rate(http_requests_total{status=~"5.."}[5m]))
/ sum(rate(http_requests_total[5m])) * 100
# p99 latency
histogram_quantile(0.99,
sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
# Orders confirmed per minute
rate(orders_confirmed_total[1m]) * 60
Distributed Tracing with OpenTelemetry
A trace represents the full lifecycle of a request. Each service adds a span — a named, timed unit of work. Spans are linked by a TraceId that flows across service boundaries.
TraceId: abc-123
│
├── Span: OrderService.POST /orders (0ms → 48ms)
│ ├── Span: EF Core: INSERT Orders (2ms → 8ms)
│ └── Span: HTTP GET InventoryService (10ms → 45ms)
│ ├── Span: EF Core: SELECT Stock (2ms → 6ms)
│ └── Span: Redis GET cache (1ms → 2ms)
└── Span: OrderConfirmedConsumer (200ms → 225ms)
Setup in .NET
// Package: OpenTelemetry.Extensions.Hosting + exporters
builder.Services.AddOpenTelemetry()
.ConfigureResource(resource =>
{
resource.AddService(
serviceName: "order-service",
serviceVersion: "1.0.0");
})
.WithTracing(tracing =>
{
tracing
.AddAspNetCoreInstrumentation(options =>
{
options.RecordException = true;
options.EnrichWithHttpRequest = (activity, request) =>
activity.SetTag("user.id", request.HttpContext.User.FindFirst("sub")?.Value);
})
.AddEntityFrameworkCoreInstrumentation(options =>
{
options.SetDbStatementForText = true; // include SQL in spans
})
.AddHttpClientInstrumentation()
.AddSource("OrderService") // custom spans
.AddOtlpExporter(options =>
{
options.Endpoint = new Uri("http://otelcollector:4317");
});
})
.WithMetrics(metrics =>
{
metrics
.AddAspNetCoreInstrumentation()
.AddRuntimeInstrumentation()
.AddPrometheusExporter();
});
Custom Spans for Business Operations
public class ConfirmOrderCommandHandler : IRequestHandler<ConfirmOrderCommand>
{
private static readonly ActivitySource _activitySource = new("OrderService");
private readonly IOrderRepository _repo;
public ConfirmOrderCommandHandler(IOrderRepository repo) => _repo = repo;
public async Task Handle(ConfirmOrderCommand cmd, CancellationToken ct)
{
using var activity = _activitySource.StartActivity("ConfirmOrder");
activity?.SetTag("order.id", cmd.OrderId.ToString());
var order = await _repo.GetByIdAsync(cmd.OrderId, ct)
?? throw new NotFoundException(cmd.OrderId);
order.Confirm();
activity?.SetTag("order.total", order.Total.Amount.ToString());
activity?.SetTag("order.currency", order.Total.Currency);
await _repo.SaveChangesAsync(ct);
activity?.SetStatus(ActivityStatusCode.Ok);
}
}
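The span above is only marked Ok on the happy path. If Confirm or SaveChangesAsync can throw, you would typically mark the span as failed before letting the exception propagate; a minimal sketch of that pattern (the try/catch placement is an assumption, not part of the original handler):
try
{
    order.Confirm();
    await _repo.SaveChangesAsync(ct);
    activity?.SetStatus(ActivityStatusCode.Ok);
}
catch (Exception ex)
{
    activity?.SetStatus(ActivityStatusCode.Error, ex.Message); // span shows as an error in Jaeger
    throw;
}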
OpenTelemetry Collector Config
# otel-collector.yaml
receivers:
otlp:
protocols:
grpc:
endpoint: 0.0.0.0:4317
processors:
batch:
timeout: 1s
send_batch_size: 1024
exporters:
jaeger:
endpoint: jaeger:14250
tls:
insecure: true
prometheus:
endpoint: 0.0.0.0:8889
service:
pipelines:
traces:
receivers: [otlp]
processors: [batch]
exporters: [jaeger]
metrics:
receivers: [otlp]
processors: [batch]
exporters: [prometheus]
Health Checks
Health checks let an orchestrator such as Kubernetes (or a platform like Azure App Service) know whether your service is alive and ready to receive traffic.
builder.Services.AddHealthChecks()
.AddDbContextCheck<AppDbContext>("database")
.AddRedis(builder.Configuration["Redis:ConnectionString"]!, "redis")
.AddAzureServiceBusTopic(
builder.Configuration["ServiceBus:ConnectionString"]!,
"orders",
"servicebus")
.AddCheck<OutboxHealthCheck>("outbox");
// Separate liveness (is the process alive?) from readiness (is it ready for traffic?)
app.MapHealthChecks("/health/live", new HealthCheckOptions { Predicate = _ => false });
app.MapHealthChecks("/health/ready", new HealthCheckOptions
{
ResponseWriter = UIResponseWriter.WriteHealthCheckUIResponse,
});
// Custom health check: alert if outbox is backing up
public class OutboxHealthCheck : IHealthCheck
{
private readonly AppDbContext _db;
public OutboxHealthCheck(AppDbContext db) => _db = db;
public async Task<HealthCheckResult> CheckHealthAsync(
HealthCheckContext context, CancellationToken ct)
{
var unprocessed = await _db.OutboxMessages
.CountAsync(m => m.ProcessedAt == null && m.OccurredAt < DateTimeOffset.UtcNow.AddMinutes(-5), ct);
if (unprocessed > 100)
return HealthCheckResult.Unhealthy($"Outbox has {unprocessed} stuck messages.");
if (unprocessed > 10)
return HealthCheckResult.Degraded($"Outbox has {unprocessed} unprocessed messages.");
return HealthCheckResult.Healthy();
}
}
Alerting Rules
Write alerts against your metrics. Common ones:
# prometheus-alerts.yaml
groups:
- name: order-service
rules:
- alert: HighErrorRate
expr: |
sum(rate(http_requests_total{service="order-service",status=~"5.."}[5m]))
/ sum(rate(http_requests_total{service="order-service"}[5m])) > 0.05
for: 2m
annotations:
summary: "Error rate above 5% for 2 minutes"
- alert: HighP99Latency
expr: |
histogram_quantile(0.99,
sum(rate(http_request_duration_seconds_bucket{service="order-service"}[5m])) by (le)
) > 2
for: 5m
annotations:
summary: "p99 latency above 2s"
- alert: OutboxBackingUp
expr: orders_outbox_pending > 50
for: 5m
annotations:
summary: "Outbox processor may be stuck"
- alert: CircuitBreakerOpen
expr: resilience_pipeline_open{service="order-service"} == 1
for: 1m
annotations:
summary: "Circuit breaker is open — downstream dependency down"Putting It Together: Local Dev Stack
Putting It Together: Local Dev Stack
# docker-compose.observability.yml
services:
seq:
image: datalust/seq:latest
ports:
- "5341:80"
environment:
ACCEPT_EULA: "Y"
prometheus:
image: prom/prometheus:latest
volumes:
- ./prometheus.yml:/etc/prometheus/prometheus.yml
ports:
- "9090:9090"
grafana:
image: grafana/grafana:latest
ports:
- "3000:3000"
environment:
GF_SECURITY_ADMIN_PASSWORD: admin
volumes:
- ./grafana/provisioning:/etc/grafana/provisioning
jaeger:
image: jaegertracing/all-in-one:latest
ports:
- "16686:16686" # Jaeger UI
- "14250:14250" # gRPC receiver
otel-collector:
image: otel/opentelemetry-collector-contrib:latest
command: ["--config=/etc/otel-collector.yaml"]
volumes:
- ./otel-collector.yaml:/etc/otel-collector.yaml
ports:
- "4317:4317" # OTLP gRPCKey Takeaways
- Structured logs make searching and alerting possible — plain text logs are archaeology at scale
- Correlation IDs are non-negotiable in distributed systems — without them you cannot trace a request across services
- The four golden signals (latency, traffic, errors, saturation) cover 90% of what you need to alert on
- OpenTelemetry is the standard — instrument once, export to Jaeger, Zipkin, Azure Monitor, Datadog, or any OTLP-compatible backend
- Health checks with separate liveness and readiness endpoints let Kubernetes safely route traffic
- Alerting on the right metrics (p99 latency, error rate, circuit breaker state) means you find out about problems before your users do
- Observability is not something you add later — instrument from day one; it's far cheaper than debugging blind in production