System Design · Lesson 11 of 26
Observability: Logs, Metrics & Traces
Why Observability Matters
In a monolith, you debug with a stack trace and a local debugger. In a distributed system, a single user request touches 5–10 services. When it fails, the error appears in Service A but the root cause is in Service E.
Observability is the ability to understand the internal state of a system from its external outputs. It has three pillars:
| Pillar | What It Answers | Tool |
|--------|-----------------|------|
| Logs | What happened and when? | Serilog + Seq / ELK |
| Metrics | How is the system performing? | Prometheus + Grafana |
| Traces | Which services did a request touch? | OpenTelemetry + Jaeger/Zipkin |
You need all three — each answers different questions.
Structured Logging with Serilog
Plain text logs are useless at scale. Structured logs emit key-value pairs that can be queried.
// ❌ Unstructured — you cannot query "all orders over £100"
_logger.LogInformation("Order ord-123 confirmed for £150 by customer cust-456");
// ✅ Structured — every field is queryable
_logger.LogInformation(
"Order {OrderId} confirmed for {Total} by customer {CustomerId}",
orderId, total, customerId);
Setup
// Program.cs
builder.Host.UseSerilog((ctx, services, config) =>
{
config
.ReadFrom.Configuration(ctx.Configuration)
.ReadFrom.Services(services)
.Enrich.FromLogContext()
.Enrich.WithMachineName()
.Enrich.WithEnvironmentName()
.Enrich.WithProperty("Application", "OrderService")
.WriteTo.Console(new ExpressionTemplate(
"[{@t:HH:mm:ss} {@l:u3}] {#if SourceContext is not null}{SourceContext}: {#end}{@m}\n{@x}"))
.WriteTo.Seq(ctx.Configuration["Seq:Url"]!);
});
// appsettings.json
{
"Serilog": {
"MinimumLevel": {
"Default": "Information",
"Override": {
"Microsoft.AspNetCore": "Warning",
"Microsoft.EntityFrameworkCore": "Warning"
}
}
},
"Seq": { "Url": "http://seq:5341" }
}
Correlation IDs
Every request gets a correlation ID that flows through all downstream calls:
// Middleware to propagate correlation ID
public class CorrelationIdMiddleware
{
private const string HeaderName = "X-Correlation-ID";
private readonly RequestDelegate _next;
public CorrelationIdMiddleware(RequestDelegate next) => _next = next;
public async Task InvokeAsync(HttpContext context)
{
var correlationId = context.Request.Headers[HeaderName].FirstOrDefault()
?? Guid.NewGuid().ToString("N");
context.Response.Headers[HeaderName] = correlationId;
using (LogContext.PushProperty("CorrelationId", correlationId))
{
context.Items["CorrelationId"] = correlationId;
await _next(context);
}
}
}
// Pass correlation ID to downstream HTTP calls
public class CorrelationIdDelegatingHandler : DelegatingHandler
{
private readonly IHttpContextAccessor _accessor;
public CorrelationIdDelegatingHandler(IHttpContextAccessor accessor) => _accessor = accessor;
protected override Task<HttpResponseMessage> SendAsync(
HttpRequestMessage request, CancellationToken ct)
{
var correlationId = _accessor.HttpContext?.Items["CorrelationId"]?.ToString();
if (correlationId is not null)
request.Headers.TryAddWithoutValidation("X-Correlation-ID", correlationId);
return base.SendAsync(request, ct);
}
}
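Neither piece does anything until it is wired up in Program.cs. A minimal registration sketch, assuming a named HttpClient called "inventory" (the client name is illustrative; the extension methods are standard ASP.NET Core):
// Program.cs (sketch): register the middleware and the delegating handler
builder.Services.AddHttpContextAccessor();
builder.Services.AddTransient<CorrelationIdDelegatingHandler>();
builder.Services.AddHttpClient("inventory")                    // hypothetical named client
    .AddHttpMessageHandler<CorrelationIdDelegatingHandler>();  // outgoing calls carry the ID
var app = builder.Build();
app.UseMiddleware<CorrelationIdMiddleware>();                  // incoming requests get an ID pushed into LogContext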
Log Levels — What to Log Where
Verbose/Trace → internal state, loop iterations (dev only — never in prod)
Debug → diagnostic info useful in development
Information → significant events: request received, order confirmed, user logged in
Warning → unexpected state, recoverable: cache miss, retry attempt, deprecated feature used
Error → operation failed: exception caught, command handler threw
Fatal/Critical → system cannot continue: DB connection pool exhausted, config missing
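A quick illustration of the levels in handler code (the variables and messages are made up for the example):
_logger.LogDebug("Loaded {LineCount} lines for order {OrderId}", lines.Count, orderId);
_logger.LogInformation("Order {OrderId} confirmed for {Total}", orderId, total);
_logger.LogWarning("Payment provider timeout, retry {Attempt} for order {OrderId}", attempt, orderId);
_logger.LogError(ex, "Failed to confirm order {OrderId}", orderId);
_logger.LogCritical("Database connection pool exhausted; order processing halted");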
Metrics with Prometheus and Grafana
Metrics track numbers over time — request rate, error rate, latency percentiles, queue depth.
The Four Golden Signals
| Signal | Description | Alert When |
|--------|-------------|------------|
| Latency | How long requests take (p50, p95, p99) | p99 > SLA threshold |
| Traffic | Request rate (req/s) | Sudden drop (maybe an upstream outage) |
| Errors | Error rate (5xx/s, failed transactions/s) | Error rate > X% |
| Saturation | How full is the system (CPU%, queue depth, connection pool) | Approaching limits |
Setup in .NET
// Package: prometheus-net.AspNetCore
builder.Services.AddMetrics();
app.UseHttpMetrics(options =>
{
options.AddCustomLabel("service", _ => "order-service");
options.ReduceStatusCodeCardinality(); // group 4xx together
});
app.MapMetrics("/metrics"); // Prometheus scrapes this endpoint
Custom Business Metrics
// Register metrics as singletons
public class OrderMetrics
{
private readonly Counter _ordersCreated;
private readonly Counter _ordersConfirmed;
private readonly Counter _ordersFailed;
private readonly Histogram _orderValue;
private readonly Gauge _pendingOrders;
public OrderMetrics()
{
_ordersCreated = Metrics.CreateCounter(
"orders_created_total",
"Total orders created",
new CounterConfiguration { LabelNames = ["region"] });
_ordersConfirmed = Metrics.CreateCounter(
"orders_confirmed_total",
"Total orders confirmed");
_ordersFailed = Metrics.CreateCounter(
"orders_failed_total",
"Total orders failed",
new CounterConfiguration { LabelNames = ["reason"] });
_orderValue = Metrics.CreateHistogram(
"order_value_gbp",
"Distribution of order values in GBP",
new HistogramConfiguration
{
Buckets = Histogram.LinearBuckets(start: 10, width: 10, count: 20)
});
_pendingOrders = Metrics.CreateGauge(
"orders_pending",
"Current count of orders in Pending state");
}
public void RecordOrderCreated(string region)
=> _ordersCreated.WithLabels(region).Inc();
public void RecordOrderConfirmed(decimal value)
{
_ordersConfirmed.Inc();
_orderValue.Observe((double)value);
}
public void RecordOrderFailed(string reason)
=> _ordersFailed.WithLabels(reason).Inc();
public void SetPendingOrders(long count)
=> _pendingOrders.Set(count);
}
// Use in command handler
public class ConfirmOrderCommandHandler : IRequestHandler<ConfirmOrderCommand>
{
private readonly IOrderRepository _repo;
private readonly OrderMetrics _metrics;
public ConfirmOrderCommandHandler(IOrderRepository repo, OrderMetrics metrics)
    => (_repo, _metrics) = (repo, metrics);
public async Task Handle(ConfirmOrderCommand cmd, CancellationToken ct)
{
var order = await _repo.GetByIdAsync(cmd.OrderId, ct)
?? throw new NotFoundException(cmd.OrderId);
order.Confirm();
await _repo.SaveChangesAsync(ct);
_metrics.RecordOrderConfirmed(order.Total.Amount);
}
}
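The handler receives OrderMetrics through its constructor, which only works if it is registered once for the whole process, as the "register metrics as singletons" comment above says. A one-line sketch:
// Program.cs: a single shared instance so counters accumulate across requests
builder.Services.AddSingleton<OrderMetrics>();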
Grafana Dashboard Queries (PromQL)
# Request rate (last 5 minutes)
rate(http_requests_total{job="order-service"}[5m])
# Error rate (5xx as % of total)
sum(rate(http_requests_total{status=~"5.."}[5m]))
/ sum(rate(http_requests_total[5m])) * 100
# p99 latency
histogram_quantile(0.99,
sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
# Orders confirmed per minute
rate(orders_confirmed_total[1m]) * 60
Distributed Tracing with OpenTelemetry
A trace represents the full lifecycle of a request. Each service adds a span — a named, timed unit of work. Spans are linked by a TraceId that flows across service boundaries.
TraceId: abc-123
│
├── Span: OrderService.POST /orders (0ms → 48ms)
│ ├── Span: EF Core: INSERT Orders (2ms → 8ms)
│ └── Span: HTTP GET InventoryService (10ms → 45ms)
│ ├── Span: EF Core: SELECT Stock (2ms → 6ms)
│ └── Span: Redis GET cache (1ms → 2ms)
└── Span: OrderConfirmedConsumer (200ms → 225ms)
Setup in .NET
// Package: OpenTelemetry.Extensions.Hosting + exporters
builder.Services.AddOpenTelemetry()
.ConfigureResource(resource =>
{
resource.AddService(
serviceName: "order-service",
serviceVersion: "1.0.0");
})
.WithTracing(tracing =>
{
tracing
.AddAspNetCoreInstrumentation(options =>
{
options.RecordException = true;
options.EnrichWithHttpRequest = (activity, request) =>
activity.SetTag("user.id", request.HttpContext.User.FindFirst("sub")?.Value);
})
.AddEntityFrameworkCoreInstrumentation(options =>
{
options.SetDbStatementForText = true; // include SQL in spans
})
.AddHttpClientInstrumentation()
.AddSource("OrderService") // custom spans
.AddOtlpExporter(options =>
{
options.Endpoint = new Uri("http://otelcollector:4317");
});
})
.WithMetrics(metrics =>
{
metrics
.AddAspNetCoreInstrumentation()
.AddRuntimeInstrumentation()
.AddPrometheusExporter();
});
Custom Spans for Business Operations
public class ConfirmOrderCommandHandler : IRequestHandler<ConfirmOrderCommand>
{
private static readonly ActivitySource _activitySource = new("OrderService");
private readonly IOrderRepository _repo;
public ConfirmOrderCommandHandler(IOrderRepository repo) => _repo = repo;
public async Task Handle(ConfirmOrderCommand cmd, CancellationToken ct)
{
using var activity = _activitySource.StartActivity("ConfirmOrder");
activity?.SetTag("order.id", cmd.OrderId.ToString());
var order = await _repo.GetByIdAsync(cmd.OrderId, ct)
?? throw new NotFoundException(cmd.OrderId);
order.Confirm();
activity?.SetTag("order.total", order.Total.Amount.ToString());
activity?.SetTag("order.currency", order.Total.Currency);
await _repo.SaveChangesAsync(ct);
activity?.SetStatus(ActivityStatusCode.Ok);
}
}
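The span above is only marked Ok on the happy path. If Confirm or SaveChangesAsync can throw, you would typically mark the span as failed before letting the exception propagate; a minimal sketch of that pattern (the try/catch placement is an assumption, not part of the original handler):
try
{
    order.Confirm();
    await _repo.SaveChangesAsync(ct);
    activity?.SetStatus(ActivityStatusCode.Ok);
}
catch (Exception ex)
{
    activity?.SetStatus(ActivityStatusCode.Error, ex.Message); // span shows as an error in Jaeger
    throw;
}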
OpenTelemetry Collector Config
# otel-collector.yaml
receivers:
otlp:
protocols:
grpc:
endpoint: 0.0.0.0:4317
processors:
batch:
timeout: 1s
send_batch_size: 1024
exporters:
jaeger:
endpoint: jaeger:14250
tls:
insecure: true
prometheus:
endpoint: 0.0.0.0:8889
service:
pipelines:
traces:
receivers: [otlp]
processors: [batch]
exporters: [jaeger]
metrics:
receivers: [otlp]
processors: [batch]
exporters: [prometheus]
Health Checks
Health checks let an orchestrator such as Kubernetes (or a platform like Azure App Service) know whether your service is alive and ready to receive traffic.
builder.Services.AddHealthChecks()
.AddDbContextCheck<AppDbContext>("database")
.AddRedis(builder.Configuration["Redis:ConnectionString"]!, "redis")
.AddAzureServiceBusTopic(
builder.Configuration["ServiceBus:ConnectionString"]!,
"orders",
"servicebus")
.AddCheck<OutboxHealthCheck>("outbox");
// Separate liveness (is the process alive?) from readiness (is it ready for traffic?)
app.MapHealthChecks("/health/live", new HealthCheckOptions { Predicate = _ => false });
app.MapHealthChecks("/health/ready", new HealthCheckOptions
{
ResponseWriter = UIResponseWriter.WriteHealthCheckUIResponse,
});
// Custom health check: alert if outbox is backing up
public class OutboxHealthCheck : IHealthCheck
{
private readonly AppDbContext _db;
public OutboxHealthCheck(AppDbContext db) => _db = db;
public async Task<HealthCheckResult> CheckHealthAsync(
HealthCheckContext context, CancellationToken ct)
{
var unprocessed = await _db.OutboxMessages
.CountAsync(m => m.ProcessedAt == null && m.OccurredAt < DateTimeOffset.UtcNow.AddMinutes(-5), ct);
if (unprocessed > 100)
return HealthCheckResult.Unhealthy($"Outbox has {unprocessed} stuck messages.");
if (unprocessed > 10)
return HealthCheckResult.Degraded($"Outbox has {unprocessed} unprocessed messages.");
return HealthCheckResult.Healthy();
}
}
Alerting Rules
Write alerts against your metrics. Common ones:
# prometheus-alerts.yaml
groups:
- name: order-service
rules:
- alert: HighErrorRate
expr: |
sum(rate(http_requests_total{service="order-service",status=~"5.."}[5m]))
/ sum(rate(http_requests_total{service="order-service"}[5m])) > 0.05
for: 2m
annotations:
summary: "Error rate above 5% for 2 minutes"
- alert: HighP99Latency
expr: |
histogram_quantile(0.99,
sum(rate(http_request_duration_seconds_bucket{service="order-service"}[5m])) by (le)
) > 2
for: 5m
annotations:
summary: "p99 latency above 2s"
- alert: OutboxBackingUp
expr: orders_outbox_pending > 50
for: 5m
annotations:
summary: "Outbox processor may be stuck"
- alert: CircuitBreakerOpen
expr: resilience_pipeline_open{service="order-service"} == 1
for: 1m
annotations:
summary: "Circuit breaker is open — downstream dependency down"Putting It Together: Local Dev Stack
Putting It Together: Local Dev Stack
# docker-compose.observability.yml
services:
seq:
image: datalust/seq:latest
ports:
- "5341:80"
environment:
ACCEPT_EULA: "Y"
prometheus:
image: prom/prometheus:latest
volumes:
- ./prometheus.yml:/etc/prometheus/prometheus.yml
ports:
- "9090:9090"
grafana:
image: grafana/grafana:latest
ports:
- "3000:3000"
environment:
GF_SECURITY_ADMIN_PASSWORD: admin
volumes:
- ./grafana/provisioning:/etc/grafana/provisioning
jaeger:
image: jaegertracing/all-in-one:latest
ports:
- "16686:16686" # Jaeger UI
- "14250:14250" # gRPC receiver
otel-collector:
image: otel/opentelemetry-collector-contrib:latest
command: ["--config=/etc/otel-collector.yaml"]
volumes:
- ./otel-collector.yaml:/etc/otel-collector.yaml
ports:
- "4317:4317" # OTLP gRPCKey Takeaways
- Structured logs make searching and alerting possible — plain text logs are archaeology at scale
- Correlation IDs are non-negotiable in distributed systems — without them you cannot trace a request across services
- The four golden signals (latency, traffic, errors, saturation) cover 90% of what you need to alert on
- OpenTelemetry is the standard — instrument once, export to Jaeger, Zipkin, Azure Monitor, Datadog, or any OTLP-compatible backend
- Health checks with separate liveness and readiness endpoints let Kubernetes safely route traffic
- Alerting on the right metrics (p99 latency, error rate, circuit breaker state) means you find out about problems before your users do
- Observability is not something you add later — instrument from day one; it's far cheaper than debugging blind in production