Distributed Tracing Patterns: Correlate Requests Across Services

Why Distributed Tracing?

When a request touches five services, a 3-second latency is somewhere in those five hops. Logs tell you what happened on each service. Tracing tells you where the time went across all of them.

HTTP Request: 3,200ms total
  ├── API Gateway:      8ms
  ├── Orders Service:   45ms
  │     ├── DB Query:   38ms  ← bottleneck
  │     └── Serialise:  5ms
  ├── Products Service: 120ms
  │     ├── Cache miss: 5ms
  │     └── DB Query:   112ms ← second bottleneck
  └── Notifications:    3,000ms ← ← ← THE PROBLEM
        └── Email SMTP: 2,980ms (timeout waiting for relay)

Without tracing: "requests are slow, we don't know why". With tracing: "3 seconds in the SMTP call — check the email relay".

How Trace Propagation Works

Every span has a TraceId (same across all services) and a SpanId (unique per span). The calling span's ID is passed to the next service in the W3C traceparent HTTP header, so the callee can record it as the parent of its own span.

Browser → API Gateway
  Header: traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
                       │  │                                │                └─ flags (01 = sampled)
                       │  │                                └─ parent-span-id (64-bit)
                       │  └─ trace-id (128-bit)
                       └─ version

API Gateway creates Span A (TraceId: 4bf92...)
  → calls Orders Service with traceparent header
    Orders Service creates Span B (TraceId: 4bf92..., ParentId: SpanA.Id)
      → calls Products Service with traceparent header
        Products Service creates Span C (TraceId: 4bf92..., ParentId: SpanB.Id)

OpenTelemetry handles this propagation automatically for HTTP clients and ASP.NET Core.
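To make the header layout concrete, here is a minimal sketch that parses the example traceparent value using only System.Diagnostics. The instrumentation libraries do this for you; the variable names are illustrative, and the hand-built outgoing header is a simplification that assumes the sampled flag stays set.

```csharp
using System;
using System.Diagnostics;

// Example header value from the diagram above.
const string traceparent = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01";

// ActivityContext.TryParse validates the W3C format and splits out the fields.
bool parsed = ActivityContext.TryParse(traceparent, null, out ActivityContext parent);
bool sampled = (parent.TraceFlags & ActivityTraceFlags.Recorded) != 0;

Console.WriteLine(parsed);          // True
Console.WriteLine(parent.TraceId);  // 4bf92f3577b34da6a3ce929d0e0e4736
Console.WriteLine(parent.SpanId);   // 00f067aa0ba902b7
Console.WriteLine(sampled);         // True

// A span created in this service keeps the trace-id and gets a fresh span-id.
// That new span-id is what the next service downstream sees as its parent.
ActivitySpanId ownSpanId = ActivitySpanId.CreateRandom();
string outgoing = $"00-{parent.TraceId}-{ownSpanId}-01";
Console.WriteLine(outgoing);
```

The trace-id segment of the outgoing header is unchanged, which is what lets the backend stitch spans from different services into one trace.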

Setup: Automatic Propagation

builder.Services.AddOpenTelemetry()
    .WithTracing(tracing => tracing
        .AddAspNetCoreInstrumentation()   // extracts traceparent on inbound
        .AddHttpClientInstrumentation()   // injects traceparent on outbound
        .AddGrpcClientInstrumentation()   // propagates through gRPC
        .AddSource("OrderFlow.*")
        .AddOtlpExporter(o => o.Endpoint = new Uri("http://otel-collector:4317")));

That's all that's needed for automatic propagation between .NET services.

Custom Spans with Meaningful Attributes

private static readonly ActivitySource Source = new("OrderFlow.Orders");

public async Task<Order> ProcessOrderAsync(CreateOrderCommand cmd, CancellationToken ct)
{
    using var activity = Source.StartActivity("ProcessOrder");

    // Add business context to the span — searchable in Jaeger/Grafana
    activity?.SetTag("order.customerId",   cmd.CustomerId.ToString());
    activity?.SetTag("order.lineCount",    cmd.Lines.Count);
    activity?.SetTag("order.channel",      cmd.Channel);
    activity?.SetTag("order.totalAmount",  cmd.Lines.Sum(l => l.Quantity * l.UnitPrice));

    // Add events (point-in-time moments within the span)
    activity?.AddEvent(new ActivityEvent("ValidationStarted"));

    await ValidateAsync(cmd, ct);

    activity?.AddEvent(new ActivityEvent("ValidationPassed"));

    var order = await CreateInDbAsync(cmd, ct);

    activity?.SetTag("order.id", order.Id.ToString());
    activity?.SetStatus(ActivityStatusCode.Ok);

    return order;
}
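ProcessOrderAsync above only marks the success path. The following sketch shows one way to record a failure on the span as well. It assumes a simplified stand-in method (ProcessAsync, hypothetical) and registers a bare ActivityListener so the example runs without the OpenTelemetry SDK; in a real service the SDK configured earlier plays that role.

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;

var source = new ActivitySource("OrderFlow.Orders");

// Without a listener, StartActivity returns null. The OpenTelemetry SDK
// normally registers one; this bare listener stands in for it here.
using var listener = new ActivityListener
{
    ShouldListenTo = s => s.Name.StartsWith("OrderFlow."),
    Sample = (ref ActivityCreationOptions<ActivityContext> options) =>
        ActivitySamplingResult.AllDataAndRecorded,
};
ActivitySource.AddActivityListener(listener);

Activity? captured = null;
try
{
    await ProcessAsync(fail: true);
}
catch (InvalidOperationException)
{
    // Swallowed only so the example can inspect the span afterwards.
}

Console.WriteLine(captured?.Status);            // Error
Console.WriteLine(captured?.StatusDescription); // stock check failed

// Hypothetical stand-in for ProcessOrderAsync with the failure path added.
async Task ProcessAsync(bool fail)
{
    using var activity = source.StartActivity("ProcessOrder");
    captured = activity;
    try
    {
        await Task.Yield();
        if (fail) throw new InvalidOperationException("stock check failed");
        activity?.SetStatus(ActivityStatusCode.Ok);
    }
    catch (Exception ex)
    {
        // Mark the span as failed and attach the exception details as an event.
        activity?.SetStatus(ActivityStatusCode.Error, ex.Message);
        activity?.AddEvent(new ActivityEvent("exception", tags: new ActivityTagsCollection
        {
            ["exception.type"] = ex.GetType().FullName,
            ["exception.message"] = ex.Message,
        }));
        throw; // let the caller handle the failure
    }
}
```

Setting the status to Error is also what makes the span match an error policy in a tail-sampling collector.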

Baggage: Cross-Service Context Propagation

Baggage travels with the trace across service boundaries — like trace context but for your own data.

// Set baggage at the entry point (API gateway, first service)
Activity.Current?.SetBaggage("tenant.id", tenantId);
Activity.Current?.SetBaggage("user.id",   userId);
Activity.Current?.SetBaggage("feature.flag.experiment", "variant-b");

// Read in any downstream service — automatically propagated
var tenantId = Activity.Current?.GetBaggageItem("tenant.id");
var userId   = Activity.Current?.GetBaggageItem("user.id");

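// Use for per-tenant logging context. A dictionary scope is captured as
// structured properties; many providers only stringify an anonymous object.
using (_logger.BeginScope(new Dictionary<string, object?>
       {
           ["TenantId"] = tenantId,
           ["UserId"]   = userId,
       }))
{
    // All logs within this scope include tenant/user context
}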

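Warning: baggage is sent in HTTP headers on every outbound request, so every entry adds to the size of every call in the trace. Keep it to a few short identifiers, and don't put secrets in it, because every downstream service receives it.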
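One way to enforce the warning above is a small guard around SetBaggage. This is only a sketch: TrySetSmallBaggage and the 64-character budget are assumptions made for illustration, not part of any library or specification.

```csharp
using System;
using System.Diagnostics;

// Assumed per-value budget for this sketch; pick a limit that suits your headers.
const int MaxValueLength = 64;

// Hypothetical helper: refuses empty or oversized values instead of propagating them.
bool TrySetSmallBaggage(Activity? activity, string key, string? value)
{
    if (activity is null || string.IsNullOrEmpty(value) || value.Length > MaxValueLength)
        return false;
    activity.SetBaggage(key, value);
    return true;
}

using var entry = new Activity("HandleRequest").Start();
bool tenantSet = TrySetSmallBaggage(entry, "tenant.id", "tenant-42");
bool blobSet = TrySetSmallBaggage(entry, "cart.json", new string('x', 5000));

// A child activity started in the same process sees the parent's baggage.
using var child = new Activity("CheckStock").Start();
Console.WriteLine(tenantSet);                              // True
Console.WriteLine(blobSet);                                // False
Console.WriteLine(child.GetBaggageItem("tenant.id"));      // tenant-42
Console.WriteLine(child.GetBaggageItem("cart.json") is null); // True
```

Rejecting the value outright, rather than truncating it, avoids downstream services acting on a partial identifier.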

Sampling Strategies

At production volume you usually can't afford to store every trace. Sampling selects which traces to keep.

Head-Based Sampling

Decision made at the first span. Simple and cheap.

// Sample 10% of requests
.SetSampler(new TraceIdRatioBasedSampler(0.1))

// Parent-based: respect upstream sampling decision
.SetSampler(new ParentBasedSampler(new TraceIdRatioBasedSampler(0.1)))
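A ratio-based sampler decides from the trace ID itself, which is why every service that sees the same trace ID reaches the same decision without coordination. The sketch below illustrates the idea in simplified form; ShouldKeep is a hypothetical helper and is not the exact algorithm used by TraceIdRatioBasedSampler.

```csharp
using System;
using System.Globalization;

// Simplified illustration: map the first 64 bits of the trace ID to [0, 1)
// and keep the trace when that fraction falls below the sampling ratio.
bool ShouldKeep(string traceIdHex, double ratio)
{
    ulong prefix = ulong.Parse(
        traceIdHex.Substring(0, 16),
        NumberStyles.HexNumber,
        CultureInfo.InvariantCulture);
    return prefix / (double)ulong.MaxValue < ratio;
}

const string traceId = "4bf92f3577b34da6a3ce929d0e0e4736";

// Two services evaluating the same trace ID independently.
bool gatewayDecision = ShouldKeep(traceId, 0.1);
bool ordersDecision = ShouldKeep(traceId, 0.1);

Console.WriteLine(gatewayDecision == ordersDecision); // True
Console.WriteLine(ShouldKeep(traceId, 0.1));          // False (this ID maps to about 0.30)
Console.WriteLine(ShouldKeep(traceId, 0.5));          // True
```

Because the decision is a pure function of the trace ID, a trace is either kept by every service or dropped by every service, provided they all use the same ratio.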

Always Sample Errors and Slow Requests

using System;
using System.Linq;
using OpenTelemetry.Trace;

public class SmartSampler : Sampler
{
    private readonly Sampler _inner;

    public SmartSampler(Sampler inner) => _inner = inner;

    public override SamplingResult ShouldSample(in SamplingParameters parameters)
    {
        // A sampler runs when the span starts, so it only sees tags supplied at
        // creation time. Keep spans that are already tagged error=true.
        if (parameters.Tags?.Any(t => t.Key == "error" &&
                string.Equals(t.Value?.ToString(), "true", StringComparison.OrdinalIgnoreCase)) == true)
            return new SamplingResult(SamplingDecision.RecordAndSample);

        // Duration and final status are unknown at span start, so slow or failed
        // requests cannot be selected here. Defer to the inner sampler.
        return _inner.ShouldSample(parameters);
    }
}

// Tail-based: keep slow/error traces retroactively
// Use OpenTelemetry Collector's tail sampling processor

Tail-Based Sampling in the Collector

# otel-collector-config.yaml
processors:
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: errors-policy
        type: status_code
        status_code: { status_codes: [ERROR] }

      - name: slow-requests
        type: latency
        latency: { threshold_ms: 2000 }

      - name: sample-rest
        type: probabilistic
        probabilistic: { sampling_percentage: 5 }
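
The three policies above combine as a logical OR: a trace is kept if any policy votes to keep it. The sketch below expresses that decision in C# to make the logic explicit. The span shape, the KeepTrace helper, and the roll parameter are all hypothetical; the collector's real implementation differs (for example, its probabilistic policy hashes the trace ID rather than rolling a random number).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical span shape for illustration: (start ms, end ms, error flag).
// `roll` is a number in [0, 1) supplied by the caller.
bool KeepTrace(IReadOnlyList<(double StartMs, double EndMs, bool IsError)> spans, double roll)
{
    // errors-policy: any span with ERROR status keeps the whole trace.
    if (spans.Any(s => s.IsError)) return true;

    // slow-requests: trace duration from earliest start to latest end.
    double durationMs = spans.Max(s => s.EndMs) - spans.Min(s => s.StartMs);
    if (durationMs > 2000) return true;

    // sample-rest: keep roughly 5% of everything else.
    return roll < 0.05;
}

var fast = new List<(double, double, bool)> { (0, 40, false), (5, 30, false) };
var slow = new List<(double, double, bool)> { (0, 3400, false), (10, 3390, false) };
var failed = new List<(double, double, bool)> { (0, 40, false), (5, 30, true) };

Console.WriteLine(KeepTrace(fast, roll: 0.50));   // False
Console.WriteLine(KeepTrace(fast, roll: 0.01));   // True
Console.WriteLine(KeepTrace(slow, roll: 0.50));   // True
Console.WriteLine(KeepTrace(failed, roll: 0.50)); // True
```

The decision needs the whole trace, which is why the collector waits (decision_wait) before evaluating the policies.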

Correlating Logs and Traces

With OTel, logs automatically include TraceId and SpanId:

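// Microsoft.Extensions.Logging with the OpenTelemetry log exporter
builder.Logging.AddOpenTelemetry(options =>
{
    options.IncludeScopes                     = true;
    options.IncludeFormattedMessage           = true;
    options.SetResourceBuilder(resourceBuilder);
    options.AddOtlpExporter();
});

// Or add trace context to Serilog events with an enricher. Enrich.WithProperty
// would read Activity.Current once, at configuration time, so every event would
// carry the same (usually empty) value.
public sealed class ActivityEnricher : ILogEventEnricher
{
    public void Enrich(LogEvent logEvent, ILogEventPropertyFactory propertyFactory)
    {
        var activity = Activity.Current;
        if (activity is null) return;
        logEvent.AddPropertyIfAbsent(propertyFactory.CreateProperty("TraceId", activity.TraceId.ToString()));
        logEvent.AddPropertyIfAbsent(propertyFactory.CreateProperty("SpanId",  activity.SpanId.ToString()));
    }
}

Log.Logger = new LoggerConfiguration()
    .Enrich.With<ActivityEnricher>()
    .CreateLogger();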

In Grafana, click a trace span → "View Logs" → shows all logs with that TraceId. The connection between traces and logs is the TraceId.

Debugging a Latency Problem with Traces

Scenario: users report checkout is slow (3–5 seconds) but only sometimes.

Step 1: Find slow traces in Jaeger/Grafana Tempo

Search: service=orders-api duration>2000ms last 1h

Step 2: Click a slow trace, look at the waterfall

Trace: abc123 (3,420ms total)
  ├─ orders-api: ProcessOrder (3,400ms)
  │    ├─ ValidateOrder (5ms) ← fast
  │    ├─ CheckStock (3,380ms) ← THIS IS THE PROBLEM
  │    │    └─ products-api: GetStock (3,370ms)
  │    │         └─ DB: SELECT stock (3,360ms)
  │    │              └─ wait: 3,200ms ← lock wait
  │    └─ CreateOrder (15ms) ← fast

Step 3: The DB call in products-api is waiting on a lock. Look at the span attributes:

db.system: sqlserver
db.statement: SELECT stock FROM Products WHERE Id = @id
db.sql.table: Products
lock_wait_ms: 3200

Step 4: Check products-api logs around that TraceId — find the locking query.

Finding: a batch import job was holding a table lock. The fix was to add a NOLOCK hint to the read queries (accepting that they may read uncommitted data) and to serve stock checks from a separate read replica.
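
Reading the waterfall in Step 2 comes down to finding the span with the most self time, meaning duration not accounted for by its children. The sketch below applies that to the spans from the example trace. The tuple shape and IDs are made up for illustration, and it assumes child spans run sequentially rather than in parallel.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Spans from the waterfall above: (id, parent id, name, duration in ms).
var spans = new List<(int Id, int? ParentId, string Name, double Ms)>
{
    (1, null, "orders-api: ProcessOrder", 3400),
    (2, 1, "ValidateOrder", 5),
    (3, 1, "CheckStock", 3380),
    (4, 3, "products-api: GetStock", 3370),
    (5, 4, "DB: SELECT stock", 3360),
    (6, 5, "wait", 3200),
    (7, 1, "CreateOrder", 15),
};

// Total duration of the direct children of each span.
var childTotals = spans
    .Where(s => s.ParentId.HasValue)
    .GroupBy(s => s.ParentId.GetValueOrDefault())
    .ToDictionary(g => g.Key, g => g.Sum(s => s.Ms));

// Self time = own duration minus time spent in direct children.
var selfTimes = spans
    .Select(s => (s.Name, SelfMs: s.Ms - childTotals.GetValueOrDefault(s.Id)))
    .OrderByDescending(s => s.SelfMs)
    .ToList();

Console.WriteLine($"{selfTimes[0].Name}: {selfTimes[0].SelfMs}ms"); // wait: 3200ms
```

ProcessOrder is the longest span at 3,400ms, but its self time is zero; the lock wait is where the time actually goes.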

Trace-Based Alerting

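# Prometheus-style alert rule: fire when the P99 duration of ProcessOrder spans exceeds 2s
# The metric name depends on how span metrics are generated; adjust it to match your setup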
- alert: SlowOrderProcessing
  expr: |
    histogram_quantile(0.99,
      sum(rate(traces_spanmetrics_duration_milliseconds_bucket{
        service_name="orders-api",
        span_name="ProcessOrder"
      }[5m])) by (le)
    ) > 2000
  for: 5m
  labels:
    severity: warning

Interview Questions

Q: What is a trace ID and why is it the same across all services? The trace ID identifies a single end-to-end request as it passes through multiple services. It's generated at the first service (or injected by the load balancer) and propagated in the traceparent HTTP header. Every span in the trace shares the same trace ID — this is how Jaeger/Grafana can assemble the full picture of a single request.

Q: What is the difference between head-based and tail-based sampling? Head-based: the sampling decision is made at the first span, before anything about the trace's outcome is known. It is simple and needs no buffering, but it drops errors and slow requests at the same rate as everything else. Tail-based: the collector buffers the spans of a trace, then decides whether to keep it based on the outcome (error, latency). That lets you keep every error and slow request while sampling only a small share of the rest, at the cost of collector memory and a delay before the decision.

Q: What is OpenTelemetry Baggage? Key-value pairs that propagate with the trace across service boundaries in HTTP headers. Unlike span attributes (visible in that span only), baggage is available to all downstream services. Use it for cross-cutting concerns like tenant ID, user ID, or A/B test variant. Keep values small — they're sent on every HTTP request.

Q: How do you find which service caused a latency problem using traces? Open a slow trace in Jaeger or Grafana Tempo and look at the waterfall diagram. Parent spans include the time of their children, so follow the longest child at each level until you reach the span whose time is not explained by its own children; that span is the bottleneck. Click it to see attributes (DB query, HTTP URL, cache hit/miss). If it is a DB call, look for lock waits or slow query warnings. If it is an HTTP call to another service, find that service's span and repeat.