Distributed Tracing Patterns: Correlate Requests Across Services
Master distributed tracing in .NET microservices. Covers trace propagation, sampling strategies, baggage, span attributes, tail-based sampling, Jaeger vs Tempo, and debugging latency across service boundaries.
Why Distributed Tracing?
When a request touches five services, a 3-second latency is somewhere in those five hops. Logs tell you what happened on each service. Tracing tells you where the time went across all of them.
HTTP Request: 3,200ms total
├── API Gateway: 8ms
├── Orders Service: 45ms
│   ├── DB Query: 38ms ← bottleneck
│   └── Serialise: 5ms
├── Products Service: 120ms
│   ├── Cache miss: 5ms
│   └── DB Query: 112ms ← second bottleneck
└── Notifications: 3,000ms ← ← ← THE PROBLEM
    └── Email SMTP: 2,980ms (timeout waiting for relay)
Without tracing: "requests are slow, we don't know why". With tracing: "3 seconds in the SMTP call, check the email relay".
How Trace Propagation Works
Every span has a TraceId (the same across all services) and a SpanId (unique per span). The trace ID and the calling span's ID travel to the next service in the W3C traceparent HTTP header.
Browser → API Gateway
  Header: traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
  Fields: version, trace-id (128-bit), parent-span-id (64-bit), flags
API Gateway creates Span A (TraceId: 4bf92...)
  → calls Orders Service with traceparent header
Orders Service creates Span B (TraceId: 4bf92..., ParentId: SpanA.Id)
  → calls Products Service with traceparent header
Products Service creates Span C (TraceId: 4bf92..., ParentId: SpanB.Id)
OpenTelemetry handles this propagation automatically for HTTP clients and ASP.NET Core.
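To make the header layout concrete, here is a minimal sketch that splits a traceparent value into its parts using only System.Diagnostics. The class and method names (TraceParentDemo, Parse) are made up for this illustration; ActivityContext.TryParse is the built-in .NET parser.

```csharp
using System;
using System.Diagnostics;

public static class TraceParentDemo
{
    // Parses a W3C traceparent header value into its parts.
    // Returns null when the header is malformed.
    public static (string TraceId, string ParentSpanId, bool Sampled)? Parse(string traceparent)
    {
        if (!ActivityContext.TryParse(traceparent, null, out ActivityContext context))
            return null;

        return (
            context.TraceId.ToHexString(),
            context.SpanId.ToHexString(),
            (context.TraceFlags & ActivityTraceFlags.Recorded) != 0);
    }
}
```

For the header shown above, Parse yields the trace ID 4bf92f3577b34da6a3ce929d0e0e4736, the parent span ID 00f067aa0ba902b7, and Sampled = true, because the flags byte is 01.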
Setup: Automatic Propagation
builder.Services.AddOpenTelemetry()
    .WithTracing(tracing => tracing
        .AddAspNetCoreInstrumentation()   // extracts traceparent on inbound
        .AddHttpClientInstrumentation()   // injects traceparent on outbound
        .AddGrpcClientInstrumentation()   // propagates through gRPC
        .AddSource("OrderFlow.*")
        .AddOtlpExporter(o => o.Endpoint = new Uri("http://otel-collector:4317")));
That's all that's needed for automatic propagation between .NET services.
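For intuition about what the instrumentation does on your behalf, the sketch below continues a trace by hand: it parses the inbound traceparent, starts a server span under that parent, and returns the value the next hop would receive. This is a simplified illustration, not the OpenTelemetry implementation. The source name Demo.ManualPropagation and the listener helper are assumptions made for this example; in a real service the SDK registers the listener.

```csharp
using System;
using System.Diagnostics;

public static class ManualPropagationSketch
{
    // Hypothetical source name, used only for this illustration.
    private static readonly ActivitySource Source = new("Demo.ManualPropagation");

    // Without a listener, StartActivity returns null. The OpenTelemetry SDK
    // normally registers one; this stand-in records everything.
    public static ActivityListener RecordEverything()
    {
        var listener = new ActivityListener
        {
            ShouldListenTo = source => source.Name == "Demo.ManualPropagation",
            Sample = (ref ActivityCreationOptions<ActivityContext> options) =>
                ActivitySamplingResult.AllDataAndRecorded
        };
        ActivitySource.AddActivityListener(listener);
        return listener;
    }

    // Inbound: continue the caller's trace. Outbound: return the traceparent
    // value for the next hop (null if no span was created).
    public static string? ContinueTrace(string inboundTraceparent)
    {
        ActivityContext.TryParse(inboundTraceparent, null, out ActivityContext parent);

        using Activity? span = Source.StartActivity("HandleRequest", ActivityKind.Server, parent);

        // With the default W3C ID format, Activity.Id is already a traceparent string.
        return span?.Id;
    }
}
```

The returned value keeps the inbound trace ID but carries a new span ID, which is exactly the relationship between Span A, Span B, and Span C described earlier.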
Custom Spans with Meaningful Attributes
private static readonly ActivitySource Source = new("OrderFlow.Orders");

public async Task<Order> ProcessOrderAsync(CreateOrderCommand cmd, CancellationToken ct)
{
    using var activity = Source.StartActivity("ProcessOrder");

    // Add business context to the span (searchable in Jaeger/Grafana)
    activity?.SetTag("order.customerId", cmd.CustomerId.ToString());
    activity?.SetTag("order.lineCount", cmd.Lines.Count);
    activity?.SetTag("order.channel", cmd.Channel);
    activity?.SetTag("order.totalAmount", cmd.Lines.Sum(l => l.Quantity * l.UnitPrice));

    // Add events (point-in-time moments within the span)
    activity?.AddEvent(new ActivityEvent("ValidationStarted"));
    await ValidateAsync(cmd, ct);
    activity?.AddEvent(new ActivityEvent("ValidationPassed"));

    var order = await CreateInDbAsync(cmd, ct);
    activity?.SetTag("order.id", order.Id.ToString());
    activity?.SetStatus(ActivityStatusCode.Ok);
    return order;
}
Baggage: Cross-Service Context Propagation
Baggage travels with the trace across service boundaries, like trace context but for your own data.
// Set baggage at the entry point (API gateway, first service)
Activity.Current?.SetBaggage("tenant.id", tenantId);
Activity.Current?.SetBaggage("user.id", userId);
Activity.Current?.SetBaggage("feature.flag.experiment", "variant-b");

// Read in any downstream service (automatically propagated)
var tenantId = Activity.Current?.GetBaggageItem("tenant.id");
var userId = Activity.Current?.GetBaggageItem("user.id");

// Use for per-tenant logging context
using (_logger.BeginScope(new { TenantId = tenantId, UserId = userId }))
{
    // All logs within this scope include tenant/user context
}
Warning: baggage is sent in HTTP headers on every request. Keep it small. Don't put large values in baggage.
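The sketch below shows both points in-process: a child activity can read baggage set on its parent, and a small guard can reject oversized values before they reach the headers. TrySetBaggage and its 256-byte default are hypothetical choices for this illustration, not part of any library. Across real service boundaries the same lookup works once the instrumentation has restored the parent context.

```csharp
using System;
using System.Diagnostics;
using System.Text;

public static class BaggageSketch
{
    // Hypothetical guard: refuses values above a byte budget so that
    // baggage headers stay small. The 256-byte default is arbitrary.
    public static bool TrySetBaggage(Activity activity, string key, string value, int maxBytes = 256)
    {
        if (Encoding.UTF8.GetByteCount(value) > maxBytes)
            return false;

        activity.SetBaggage(key, value);
        return true;
    }

    // Starts a parent and a child activity in-process and reads the
    // parent's baggage from the child.
    public static string? ReadFromChild(string tenantId)
    {
        using var parent = new Activity("Gateway").Start();
        TrySetBaggage(parent, "tenant.id", tenantId);

        // The child picks up Activity.Current (the parent) automatically.
        using var child = new Activity("OrdersService").Start();
        return child.GetBaggageItem("tenant.id");
    }
}
```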
Sampling Strategies
You can't afford to store every trace. Sampling selects which traces to keep.
Head-Based Sampling
Decision made at the first span. Simple and cheap.
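A ratio sampler can make this decision without any coordination by deriving it from the trace ID itself. The sketch below illustrates the idea under one assumption: the first 8 bytes of the ID are treated as a uniformly distributed number. It is not a copy of the OpenTelemetry sampler, whose details may differ.

```csharp
using System;
using System.Buffers.Binary;
using System.Diagnostics;

public static class RatioSamplingSketch
{
    // Keeps a trace when the first 8 bytes of its ID, read as an unsigned
    // integer, fall below ratio * 2^64. The same trace ID always gives the
    // same answer, so every service reaches the same decision independently.
    public static bool ShouldSample(ActivityTraceId traceId, double ratio)
    {
        if (ratio <= 0) return false;
        if (ratio >= 1) return true;

        Span<byte> bytes = stackalloc byte[16];
        traceId.CopyTo(bytes);
        ulong value = BinaryPrimitives.ReadUInt64BigEndian(bytes);

        ulong threshold = (ulong)(ratio * ulong.MaxValue);
        return value < threshold;
    }
}
```

For the trace ID 4bf92f35... used earlier, the leading bytes sit at roughly 30% of the ID space, so the trace is dropped at a ratio of 0.1 and kept at 0.5.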
// Sample 10% of requests
.SetSampler(new TraceIdRatioBasedSampler(0.1))

// Parent-based: respect upstream sampling decision
.SetSampler(new ParentBasedSampler(new TraceIdRatioBasedSampler(0.1)))
Always Sample Errors and Slow Requests
using System.Linq;
using OpenTelemetry.Trace;

public class SmartSampler : Sampler
{
    private readonly Sampler _inner;
    private readonly double _slowRequestThresholdMs;

    public SmartSampler(Sampler inner, double slowMs = 1000)
    {
        _inner = inner;
        _slowRequestThresholdMs = slowMs;
    }

    public override SamplingResult ShouldSample(in SamplingParameters parameters)
    {
        // Always sample spans that are already tagged as errors when they are created.
        // Errors raised later in the span are not visible to a head sampler.
        if (parameters.Tags?.Any(t => t.Key == "error" && t.Value?.ToString() == "true") == true)
            return new SamplingResult(SamplingDecision.RecordAndSample);

        // Slow requests are only known at span end, not start, so
        // _slowRequestThresholdMs cannot be applied here; use a processor instead.
        return _inner.ShouldSample(parameters);
    }
}

// Tail-based: keep slow/error spans retroactively
// Use OpenTelemetry Collector's tail sampling processor
Tail-Based Sampling in the Collector
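Conceptually, the collector policies configured in this section reduce to one decision per buffered trace: keep it if any span failed, keep it if it was slow, otherwise keep a small random share. The C# sketch below expresses that logic. It is an illustration, not the Collector's implementation: FinishedSpan is a hypothetical record, the longest span stands in for the trace duration, and the random roll is passed in so the result is reproducible.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical, simplified span record for this sketch.
public record FinishedSpan(string TraceId, double DurationMs, bool IsError);

public static class TailSamplingSketch
{
    // Decides once all buffered spans of one trace are available.
    // keepRoll is a number in [0, 100) supplied by the caller so the
    // probabilistic branch stays deterministic.
    public static bool ShouldKeep(
        IReadOnlyCollection<FinishedSpan> trace,
        double latencyThresholdMs,
        double samplingPercentage,
        double keepRoll)
    {
        if (trace.Count == 0) return false;

        // Errors policy: any failed span keeps the whole trace.
        if (trace.Any(s => s.IsError)) return true;

        // Latency policy: the longest span approximates the trace duration.
        if (trace.Max(s => s.DurationMs) >= latencyThresholdMs) return true;

        // Probabilistic policy: keep a fixed share of everything else.
        return keepRoll < samplingPercentage;
    }
}
```

The decision needs the complete trace, which is why a tail sampler has to wait (decision_wait) and hold spans in memory before exporting anything.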
# otel-collector-config.yaml
processors:
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: errors-policy
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: slow-requests
        type: latency
        latency: { threshold_ms: 2000 }
      - name: sample-rest
        type: probabilistic
        probabilistic: { sampling_percentage: 5 }
Correlating Logs and Traces
With OTel, logs automatically include TraceId and SpanId:
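// Microsoft.Extensions.Logging with the OpenTelemetry logging provider
builder.Logging.AddOpenTelemetry(options =>
{
    options.IncludeScopes = true;
    options.IncludeFormattedMessage = true;
    options.SetResourceBuilder(resourceBuilder);
    options.AddOtlpExporter();
});

// Or inject trace context into Serilog manually. Use an enricher so the IDs are read
// per log event; Enrich.WithProperty would capture Activity.Current only once, at startup.
Log.Logger = new LoggerConfiguration()
    .Enrich.With<ActivityEnricher>()
    .CreateLogger();

// ActivityEnricher is an illustrative name for a custom enricher.
public class ActivityEnricher : ILogEventEnricher
{
    public void Enrich(LogEvent logEvent, ILogEventPropertyFactory propertyFactory)
    {
        var activity = Activity.Current;
        if (activity is null) return;
        logEvent.AddPropertyIfAbsent(propertyFactory.CreateProperty("TraceId", activity.TraceId.ToString()));
        logEvent.AddPropertyIfAbsent(propertyFactory.CreateProperty("SpanId", activity.SpanId.ToString()));
    }
}
In Grafana, click a trace span, then "View Logs", to see all logs with that TraceId. The connection between traces and logs is the TraceId.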
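Since the TraceId is the join key, it has to be read at the moment each log line is written, not once at startup. The sketch below shows that lookup with nothing but System.Diagnostics. The class name TraceLogSketch and the trace_id/span_id output format are assumptions for this example, not a logging library's format.

```csharp
using System;
using System.Diagnostics;

public static class TraceLogSketch
{
    // Reads Activity.Current at call time, so each log line carries the IDs
    // of the span that was active when the line was written.
    public static string Format(string message)
    {
        Activity? current = Activity.Current;
        string traceId = current?.TraceId.ToHexString() ?? "-";
        string spanId = current?.SpanId.ToHexString() ?? "-";
        return $"trace_id={traceId} span_id={spanId} {message}";
    }
}
```

Calling Format inside an active span yields a line that can be found by searching for the trace ID; outside any span the placeholders make the missing context visible.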
Debugging a Latency Problem with Traces
Scenario: users report checkout is slow (3 to 5 seconds) but only sometimes.
Step 1: Find slow traces in Jaeger/Grafana Tempo
Search: service=orders-api duration>2000ms last 1h
Step 2: Click a slow trace, look at the waterfall
Trace: abc123 (3,420ms total)
└─ orders-api: ProcessOrder (3,400ms)
   ├─ ValidateOrder (5ms) ← fast
   ├─ CheckStock (3,380ms) ← THIS IS THE PROBLEM
   │  └─ products-api: GetStock (3,370ms)
   │     └─ DB: SELECT stock (3,360ms)
   │        └─ wait: 3,200ms ← lock wait
   └─ CreateOrder (15ms) ← fast
Step 3: The DB call in products-api is waiting on a lock. Look at the span attributes:
db.system: sqlserver
db.statement: SELECT stock FROM Products WHERE Id = @id
db.sql.table: Products
lock_wait_ms: 3200
Step 4: Check products-api logs around that TraceId to find the locking query.
Finding: A batch import job was holding a table lock. The fix was to add a NOLOCK hint to the stock read queries (accepting that NOLOCK permits dirty reads) and to move stock checks to a separate read replica.
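The reading of the waterfall in Steps 2 and 3 can be made mechanical by computing each span's self time, its own duration minus that of its direct children. The sketch below does this for the trace above. SpanRow and WaterfallSketch are hypothetical names, and the calculation assumes child spans run one after another, which holds for this trace.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical flattened span row: name, parent name (null for the root), duration.
public record SpanRow(string Name, string? Parent, double DurationMs);

public static class WaterfallSketch
{
    // The spans of trace abc123 as shown in the waterfall.
    public static IReadOnlyList<SpanRow> SampleTrace() => new[]
    {
        new SpanRow("ProcessOrder", null, 3400),
        new SpanRow("ValidateOrder", "ProcessOrder", 5),
        new SpanRow("CheckStock", "ProcessOrder", 3380),
        new SpanRow("GetStock", "CheckStock", 3370),
        new SpanRow("DB: SELECT stock", "GetStock", 3360),
        new SpanRow("wait", "DB: SELECT stock", 3200),
        new SpanRow("CreateOrder", "ProcessOrder", 15),
    };

    // Self time = a span's duration minus the durations of its direct children.
    // The span with the largest self time is where the request actually waited.
    public static (string Name, double SelfMs) FindBottleneck(IReadOnlyList<SpanRow> spans)
    {
        return spans
            .Select(s => (
                s.Name,
                SelfMs: s.DurationMs - spans.Where(c => c.Parent == s.Name).Sum(c => c.DurationMs)))
            .OrderByDescending(x => x.SelfMs)
            .First();
    }
}
```

For the sample trace, FindBottleneck returns the wait span with 3,200ms of self time. ProcessOrder and CheckStock are long only because they contain it.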
Trace-Based Alerting
# Grafana alert rule: alert when P99 trace duration > 2s
- alert: SlowOrderProcessing
  expr: |
    histogram_quantile(0.99,
      sum(rate(traces_spanmetrics_duration_milliseconds_bucket{
        service_name="orders-api",
        span_name="ProcessOrder"
      }[5m])) by (le)
    ) > 2000
  for: 5m
  labels:
    severity: warning
Interview Questions
Q: What is a trace ID and why is it the same across all services?
The trace ID identifies a single end-to-end request as it passes through multiple services. It's generated at the first service (or injected by the load balancer) and propagated in the traceparent HTTP header. Every span in the trace shares the same trace ID, which is how Jaeger/Grafana can assemble the full picture of a single request.
Q: What is the difference between head-based and tail-based sampling?
Head-based: the sampling decision is made at the first span, before anything about the rest of the trace is known. Simple, no buffering needed. Tail-based: the collector buffers the full trace, then decides whether to keep it based on the outcome (error, latency). Tail-based is more useful for debugging because errors and slow requests can always be kept, whereas a head-based sampler decides before it knows whether the request will fail or be slow. The price is the memory and delay of buffering every trace until the decision is made.
Q: What is OpenTelemetry Baggage?
Key-value pairs that propagate with the trace across service boundaries in HTTP headers. Unlike span attributes (visible in that span only), baggage is available to all downstream services. Use it for cross-cutting concerns like tenant ID, user ID, or A/B test variant. Keep values small, because they're sent on every HTTP request.
Q: How do you find which service caused a latency problem using traces?
Open a slow trace in Jaeger or Grafana Tempo and look at the waterfall diagram. Follow the longest child span down the tree; the deepest span that still accounts for most of the duration is the bottleneck. Click it to see attributes (DB query, HTTP URL, cache hit/miss). If the long span is a DB call, look for lock waits or slow query warnings. If it's an HTTP call to another service, find that service's span and repeat. The waterfall makes the culprit obvious.