System Design · Advanced

Reliability, Testing & Monitoring: The Senior Engineer's Playbook

The complete operational reliability picture for distributed systems — testing pyramid for microservices, structured logging, distributed tracing, SLIs/SLOs, circuit breaker and retry patterns, and the error handling model that makes systems production-grade.

SystemForge · April 18, 2026 · 16 min read
Reliability · Testing · Monitoring · Observability · Circuit Breaker · SLO · Distributed Systems · .NET · Architecture · Interview

The difference between a junior and a senior engineer is often not the code they write but whether the systems they build keep working when things go wrong. Reliability is built at three levels: testing (preventing failures), observability (detecting failures), and resilience patterns (containing failures). This guide covers all three as a unified discipline.


The Reliability Mindset

Start with this: everything fails. Networks partition, disks fill, third-party APIs return 503, pods restart mid-request, database connection pools exhaust. The question is never "will this fail?" but "how does the system behave when it does?"

Failure is normal. Design for it.

Bad:    "This can't happen in production"
Good:   "When this happens in production, here's what the system does"

A production system must answer four questions at any moment:

  1. Is it working? (availability)
  2. What is it doing? (observability)
  3. When did it break? (detection)
  4. Why did it break? (diagnosis)

Part 1: Testing — The Distributed Systems Testing Pyramid

The classic testing pyramid (unit → integration → E2E) does not map well to microservices. A revised model:

                    ┌─────────────────────┐
                    │   End-to-End Tests  │  Few, slow, fragile
                    │   (real environment)│  verify critical paths only
                    ├─────────────────────┤
                    │  Contract Tests     │  Consumer-driven, per integration
                    │  (Pact / MSW)       │  catch breaking API changes
                    ├─────────────────────┤
                    │ Integration Tests   │  Per service, real dependencies
                    │ (WebAppFactory +    │  in containers (Testcontainers)
                    │  Testcontainers)    │
                    ├─────────────────────┤
                    │    Unit Tests       │  Many, fast, isolated
                    │    (domain logic)   │  pure business rules
                    └─────────────────────┘

Unit Tests: What to Actually Test

Unit tests should cover domain logic and business rules, not infrastructure or framework behaviour.

C#
// Test the rule, not the plumbing
[Fact]
public void Order_CannotShip_WhenNotPaid()
{
    var order = Order.Place(new OrderId("ORD-1"), items, customerId);
    // order is in Pending state

    var act = () => order.Ship(new TrackingId("TRK-1"));

    act.Should().Throw<InvalidOperationException>()
       .WithMessage("Order must be paid before shipping.");
}

[Fact]
public void Order_Total_IncludesAllLineItems()
{
    var order = Order.Place(new OrderId("ORD-1"), new[]
    {
        new OrderItem(ProductId.New(), quantity: 2, unitPrice: 49.99m),
        new OrderItem(ProductId.New(), quantity: 1, unitPrice: 10.00m)
    }, customerId);

    order.Total.Should().Be(109.98m);
}

Do not unit-test:

  • EF Core queries (they test against a mock that doesn't reflect actual SQL behaviour)
  • HTTP controllers (no business logic should be in controllers)
  • Serialisation (test it end-to-end via integration test)

Integration Tests: The Most Valuable Tests

An integration test runs your actual application stack — real HTTP, real database, real middleware — against controlled data. This is what catches the bugs unit tests miss: query plans, middleware ordering, auth policy evaluation, JSON serialization, EF Core relationship behaviour.

Setup with WebApplicationFactory + Testcontainers:

C#
// Shared test fixture — starts once per test collection
public class IntegrationTestFixture : IAsyncLifetime
{
    private readonly PostgreSqlContainer _postgres = new PostgreSqlBuilder()
        .WithImage("postgres:16-alpine")
        .Build();

    private WebApplicationFactory<Program> _factory = default!;

    public HttpClient Client { get; private set; } = default!;
    public IServiceProvider Services => _factory.Services;  // lets tests inspect DB state

    public async Task InitializeAsync()
    {
        await _postgres.StartAsync();

        _factory = new WebApplicationFactory<Program>()
            .WithWebHostBuilder(builder =>
            {
                builder.ConfigureServices(services =>
                {
                    // Replace the real DB with the test container
                    services.RemoveAll<DbContextOptions<AppDbContext>>();
                    services.AddDbContext<AppDbContext>(opts =>
                        opts.UseNpgsql(_postgres.GetConnectionString()));

                    // Swap external HTTP calls with stubs
                    services.AddHttpClient<IPaymentClient, PaymentClient>()
                        .ConfigurePrimaryHttpMessageHandler(() => new StubPaymentHandler());
                });
            });

        Client = _factory.CreateClient();
        await RunMigrationsAsync(_postgres.GetConnectionString());
    }

    public async Task DisposeAsync()
    {
        await _factory.DisposeAsync();
        await _postgres.DisposeAsync();
    }
}

// Test — real HTTP, real DB
[Collection("Integration")]
public class OrderApiTests(IntegrationTestFixture fixture)
{
    [Fact]
    public async Task PlaceOrder_ReturnsCreated_AndPersistsOrder()
    {
        var request = new { Items = new[] { new { ProductId = "P-1", Quantity = 2 } } };

        var response = await fixture.Client.PostAsJsonAsync("/api/orders", request);

        response.StatusCode.Should().Be(HttpStatusCode.Created);

        var order = await response.Content.ReadFromJsonAsync<OrderResponse>();
        order!.Status.Should().Be("Pending");
        order.Total.Should().Be(99.98m);  // price comes from seeded catalogue data, not the request

        // Verify DB state — not just the response
        using var scope = fixture.Services.CreateScope();
        var db = scope.ServiceProvider.GetRequiredService<AppDbContext>();
        var persisted = await db.Orders.FindAsync(order.Id);
        persisted.Should().NotBeNull();
    }
}

Key principle: reset database state between tests. Use transactions that roll back, or truncate tables before each test. Shared state between tests produces brittle tests that fail when run in a different order.

C#
// Clean state via transaction rollback (fields live on a per-test fixture)
private IDbContextTransaction _transaction = default!;

public async Task InitializeAsync()
{
    _transaction = await _db.Database.BeginTransactionAsync();
}

public async Task DisposeAsync()
{
    await _transaction.RollbackAsync();
}
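
When tests commit data (for example, the code under test owns its transactions), rollback is not an option and you truncate instead. A sketch using the Respawn library, assuming the Postgres container from the fixture above; Respawn deletes rows in a foreign-key-safe order:

C#
// Reset to a clean slate between tests (Respawn against the Postgres container)
private Respawner _respawner = default!;
private NpgsqlConnection _conn = default!;

public async Task InitializeAsync()
{
    _conn = new NpgsqlConnection(_postgres.GetConnectionString());
    await _conn.OpenAsync();
    _respawner = await Respawner.CreateAsync(_conn, new RespawnerOptions
    {
        DbAdapter = DbAdapter.Postgres,
        TablesToIgnore = new Table[] { "__EFMigrationsHistory" }  // keep migration history
    });
}

public async Task DisposeAsync() => await _respawner.ResetAsync(_conn);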

Contract Tests: Preventing Breaking API Changes

In microservices, Service A is the consumer of Service B's API. When Service B changes its contract, Service A breaks — often discovered in production.

Consumer-Driven Contract Testing (Pact) flips this: Consumer A defines what it expects from Service B. Provider B runs consumer A's expectations as part of its test suite. If Provider B breaks Consumer A's contract, CI fails before deployment.

Consumer (Order Service) defines:
  GET /api/products/{id}
  Response: { "id": "...", "name": "...", "price": 0.0, "inStock": true }

Provider (Product Service) runs the consumer's expectations:
  1. Start real Product Service (against in-memory store)
  2. Send the request Consumer defined
  3. Assert the response matches Consumer's contract
  → If Product team renames "price" to "unitPrice", this test fails
  → PR is blocked before deployment

This replaces end-to-end tests for API contract verification and gives the provider team immediate feedback when a change would break a consumer.
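
A consumer-side sketch with PactNet v4 (builder names can vary between versions; ProductClient and GetProductAsync are hypothetical stand-ins for your typed client):

C#
// Order Service's consumer test: define the expected interaction, then
// verify the real typed client against PactNet's mock server. The pact
// file this emits is what the Product Service replays in its own CI.
var pact = Pact.V3("OrderService", "ProductService", new PactConfig { PactDir = "./pacts" });
var pactBuilder = pact.WithHttpInteractions();

pactBuilder
    .UponReceiving("a request for product P-1")
        .WithRequest(HttpMethod.Get, "/api/products/P-1")
    .WillRespond()
        .WithStatus(HttpStatusCode.OK)
        .WithJsonBody(new
        {
            id = Match.Type("P-1"),
            name = Match.Type("Widget"),
            price = Match.Decimal(49.99m),
            inStock = Match.Type(true)
        });

await pactBuilder.VerifyAsync(async ctx =>
{
    var client = new ProductClient(new HttpClient { BaseAddress = ctx.MockServerUri });
    var product = await client.GetProductAsync("P-1");
    product!.Price.Should().Be(49.99m);
});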

What Integration Testing Looks Like for Messaging

For event-driven services, integration tests need a real broker or an in-process substitute:

C#
[Fact]
public async Task ProcessOrder_PublishesOrderPlacedEvent()
{
    // Uses MassTransit's in-memory test harness (registration shown below);
    // a real broker container (RabbitMQ, Service Bus emulator) works too
    var consumer = _harness.GetConsumerHarness<PlaceOrderConsumer>();

    await _harness.Bus.Publish(new PlaceOrderCommand { ... });

    // Wait for the consumer to process (up to 5 seconds)
    (await consumer.Consumed.Any<PlaceOrderCommand>()).Should().BeTrue();

    // Assert event was published downstream
    (await _harness.Published.Any<OrderPlacedEvent>()).Should().BeTrue();
}

MassTransit ships an in-memory test harness for exactly this pattern (AddMassTransitTestHarness in v8): a real in-process bus for testing message flows without an external broker.
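
Registering the harness is a few lines; a sketch following MassTransit v8's documented pattern (consumer name matches the test above):

C#
// Build a DI container with the in-memory harness and the consumer under test.
await using var provider = new ServiceCollection()
    .AddMassTransitTestHarness(x => x.AddConsumer<PlaceOrderConsumer>())
    .BuildServiceProvider(true);

_harness = provider.GetRequiredService<ITestHarness>();
await _harness.Start();   // in-process bus starts; no broker required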


Part 2: Observability — The Three Pillars

Observability means being able to answer "what is the system doing right now?" from the system's external outputs. The three pillars are logs, metrics, and traces; you need all three together.

Logs: Structure First

Unstructured logs (Console.WriteLine) are for development. Production logs are structured — machine-readable JSON that can be queried, aggregated, and correlated.

C#
// Bad — unstructured
logger.LogInformation($"Order {orderId} placed by user {userId}");

// Good — structured
logger.LogInformation("Order {OrderId} placed by {UserId}", orderId, userId);

// Best — with correlation context
using (logger.BeginScope(new Dictionary<string, object>
{
    ["CorrelationId"] = correlationId,
    ["OrderId"] = orderId,
    ["UserId"] = userId
}))
{
    logger.LogInformation("Order placement started");
    // All log entries in this scope carry the correlation context
}

What to log at each level:

| Level       | When                               | Example                               |
|-------------|------------------------------------|---------------------------------------|
| Trace       | Detailed diagnostics (dev only)    | SQL query text                        |
| Debug       | Diagnostic info (disabled in prod) | Cache hit/miss                        |
| Information | Normal operations                  | Request received, order placed        |
| Warning     | Unexpected but handled             | Retry attempt 2/3, fallback triggered |
| Error       | Operation failed, action required  | Payment processing failed             |
| Critical    | System-level failure               | DB connection pool exhausted          |

What never belongs in logs:

  • Passwords, tokens, keys
  • PII (email, phone, SSN) — pseudonymise at the logging boundary (sketch below)
  • Credit card numbers, health record numbers
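
Pseudonymising at the boundary can be as small as hashing the identifier before it reaches the logger. A minimal sketch; the LogSafe helper and plain SHA-256 are illustrative choices, not a prescription:

C#
using System.Security.Cryptography;
using System.Text;

public static class LogSafe
{
    // Same input produces the same token, so log entries still correlate
    // per user, but the raw value never reaches the sink.
    public static string Pseudonymise(string pii) =>
        Convert.ToHexString(SHA256.HashData(Encoding.UTF8.GetBytes(pii)))[..16];
}

// Usage:
logger.LogInformation("Password reset requested for {UserRef}",
    LogSafe.Pseudonymise(email));

In practice prefer a keyed hash (HMAC): a plain hash of low-entropy PII like phone numbers can be brute-forced from the log alone.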

Distributed Tracing: Following a Request Across Services

A trace is a collection of spans representing work done across services for a single logical operation. Each span has a TraceId (same across all services) and a SpanId (unique per operation).

TraceId: 4bf92f3577b34da6a3ce929d0e0e4736

  ┌────────────────────────────────────────────────────────┐
  │ Span: API Gateway POST /orders (12ms)                  │
  │  ├─ Span: OrderService.PlaceOrder (10ms)               │
  │  │    ├─ Span: OrderService → PostgreSQL (2ms)         │
  │  │    └─ Span: OrderService → ServiceBus.Publish (1ms) │
  │  └─ Span: Auth Token Validation (1ms)                  │
  └────────────────────────────────────────────────────────┘

In .NET, OpenTelemetry propagates trace context automatically:

C#
// Program.cs
builder.Services.AddOpenTelemetry()
    .WithTracing(tracing => tracing
        .AddAspNetCoreInstrumentation()    // HTTP requests
        .AddEntityFrameworkCoreInstrumentation()  // DB calls
        .AddHttpClientInstrumentation()    // outbound HTTP
        .AddAzureMonitorTraceExporter(o =>
            o.ConnectionString = builder.Configuration["APPLICATIONINSIGHTS_CONNECTION_STRING"]));

When TraceId is propagated in HTTP headers (traceparent) and Service Bus message properties, Application Insights reconstructs the full end-to-end trace across service boundaries.
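
The HTTP and Azure SDK instrumentation stamp and read these automatically. If you hand-roll message publishing, a sketch of doing it manually (default W3C ID format; sender, received, and orderPlaced assumed in scope):

C#
// Producer: carry the current trace context on the outgoing message.
var message = new ServiceBusMessage(BinaryData.FromObjectAsJson(orderPlaced));
if (Activity.Current is { } current)
    message.ApplicationProperties["traceparent"] = current.Id;  // W3C traceparent string
await sender.SendMessageAsync(message);

// Consumer: parent the processing span on the incoming trace context.
var parentId = (string)received.ApplicationProperties["traceparent"];
using var activity = _activitySource.StartActivity(
    "ProcessOrderPlaced", ActivityKind.Consumer, parentId);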

Custom spans for business operations:

C#
private static readonly ActivitySource _activitySource = new("OrderService");

public async Task<Order> PlaceOrderAsync(PlaceOrderCommand command)
{
    using var activity = _activitySource.StartActivity("PlaceOrder");
    activity?.SetTag("order.customerId", command.CustomerId);
    activity?.SetTag("order.itemCount", command.Items.Count);

    var order = Order.Place(OrderId.New(), command.Items, command.CustomerId);
    await _repository.SaveAsync(order);

    activity?.SetTag("order.id", order.Id.ToString());
    return order;
}

These custom spans appear in Application Insights / Jaeger alongside framework-generated spans — giving you business-level trace visibility.

Metrics: SLIs and SLOs

SLI (Service Level Indicator): a measurable quantity that describes system behaviour. Error rate, P99 latency, throughput.

SLO (Service Level Objective): a target for an SLI. "99.9% of requests complete in < 500ms."

SLA (Service Level Agreement): a contract with a customer. "We will maintain 99.9% uptime."

SLI: P99 latency of POST /api/orders
SLO: P99 latency < 500ms, measured over a 30-day rolling window
SLA: 99.9% monthly availability with contractual penalties (deliberately looser than the internal SLOs it derives from)

Define SLOs before you deploy. Instrument to measure them:

C#
// Custom metrics with .NET Meter API (OpenTelemetry)
private static readonly Meter _meter = new("OrderService");
private static readonly Histogram<double> _orderDuration =
    _meter.CreateHistogram<double>("order.placement.duration", "ms");
private static readonly Counter<int> _orderErrors =
    _meter.CreateCounter<int>("order.placement.errors");

public async Task<Order> PlaceOrderAsync(PlaceOrderCommand command)
{
    var sw = Stopwatch.StartNew();
    try
    {
        var order = await _inner.PlaceOrderAsync(command);
        _orderDuration.Record(sw.Elapsed.TotalMilliseconds,
            new KeyValuePair<string, object?>("result", "success"));
        return order;
    }
    catch (Exception ex)
    {
        _orderErrors.Add(1,
            new KeyValuePair<string, object?>("exception", ex.GetType().Name));
        throw;
    }
}

Error budget: if your SLO is 99.9% success rate over 30 days, you have 43.2 minutes of "error budget" — time you can be degraded without breaching the SLO. When the error budget is burning fast, stop new feature work and focus on reliability.
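
The arithmetic, as code (a sketch; the burn-rate threshold is illustrative):

C#
double slo = 0.999;                              // 99.9% success target
TimeSpan window = TimeSpan.FromDays(30);
TimeSpan budget = window * (1 - slo);            // 43.2 minutes of error budget

// Burn rate = observed error rate / allowed error rate.
// At 1.0 the budget lasts exactly the window; at 14.4 a
// 30-day budget is gone in ~2 days, which means: page someone.
double observedErrorRate = 0.0144;
double burnRate = observedErrorRate / (1 - slo); // 14.4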

Alerting: Alert on Symptoms, Not Causes

Alert on what users experience, not on internal system state:

Bad alert:  "CPU > 80%" → may not affect users
Bad alert:  "Disk > 70%" → may never be a problem

Good alert: "Error rate > 1% over 5 minutes"   ← user-facing symptom
Good alert: "P99 latency > 2 seconds"           ← user-facing symptom
Good alert: "DLQ message count > 0"             ← processing failure
Good alert: "Consumer lag > 10,000 messages"    ← falling behind

Page (urgent):  Error rate > 5%, SLO breach imminent
Ticket (next business day): Consumer lag growing slowly over 24h

Part 3: Resilience Patterns — Containing Failures

The Failure Cascade (Why Resilience Matters)

User → API Gateway → Order Service → Inventory Service (slow)
                            │
             Threads pile up here, connection pool exhausts
                            │
             Order Service starts rejecting ALL requests
                            │
             API Gateway sees errors from Order Service
                            │
             Everything fails — for every user

One slow downstream service took out the entire platform. Resilience patterns exist to contain this.

Retry with Exponential Backoff + Jitter

Retry is for transient failures — errors that resolve themselves (network blip, pod restart, brief resource contention).

C#
// .NET 8 — Microsoft.Extensions.Resilience (Polly under the hood)
builder.Services.AddHttpClient<IInventoryClient, InventoryClient>()
    .AddResilienceHandler("inventory", pipeline =>
    {
        pipeline.AddRetry(new HttpRetryStrategyOptions
        {
            MaxRetryAttempts = 3,
            Delay = TimeSpan.FromMilliseconds(200),
            BackoffType = DelayBackoffType.Exponential,
            UseJitter = true,    // ← critical: prevents thundering herd
            ShouldHandle = args => args.Outcome.Result?.StatusCode is
                HttpStatusCode.ServiceUnavailable or
                HttpStatusCode.TooManyRequests or
                HttpStatusCode.RequestTimeout
                    ? new ValueTask<bool>(true)
                    : new ValueTask<bool>(args.Outcome.Exception is not null)
        });
    });

Why jitter? Without jitter, all retrying clients back off for the same duration — and then all slam the recovering service simultaneously. Jitter spreads retries over a random window, giving the recovering service a chance.
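
Roughly what the handler computes per attempt ("full jitter" shown; the built-in strategy's exact formula may differ):

C#
// Exponential backoff with full jitter: a uniformly random delay in
// [0, baseDelay * 2^attempt]. Clients spread out instead of retrying in sync.
static TimeSpan FullJitter(int attempt, TimeSpan baseDelay) =>
    baseDelay * Math.Pow(2, attempt) * Random.Shared.NextDouble();

// attempt 0: 0-200ms, attempt 1: 0-400ms, attempt 2: 0-800ms (base 200ms)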

Do not retry:

  • 400 Bad Request — invalid input, retrying will not fix it
  • 401 Unauthorized — auth problem, retrying will not fix it
  • 403 Forbidden — permission problem
  • 404 Not Found — if the resource does not exist, it won't appear on retry
  • Business logic errors — only transient infrastructure errors should retry

Circuit Breaker: Failing Fast

Retrying a service that is down for 30 minutes means every request to your service takes 3× timeout before failing. The circuit breaker detects this and fails immediately for a period, preventing thread exhaustion and giving the downstream service time to recover.

CLOSED (normal):
  Request → Downstream ← responds
  Circuit records: success / failure

  After threshold failures (e.g., 50% fail in 10 seconds):
  → Circuit transitions to OPEN

OPEN (broken downstream):
  Request → Circuit Breaker → FAIL IMMEDIATELY (no call made)
  No threads consumed. Fast failure to caller.

  After break duration (e.g., 30 seconds):
  → Circuit transitions to HALF-OPEN

HALF-OPEN (probing):
  One request → Downstream
  Success → Circuit closes (CLOSED)
  Failure → Circuit reopens (OPEN)
C#
pipeline.AddCircuitBreaker(new HttpCircuitBreakerStrategyOptions
{
    SamplingDuration             = TimeSpan.FromSeconds(10),
    FailureRatio                 = 0.5,    // 50% failures within sampling window
    MinimumThroughput            = 10,     // need at least 10 requests to evaluate
    BreakDuration                = TimeSpan.FromSeconds(30),
    OnOpened  = args => { logger.LogWarning("Circuit opened for {Service}", "Inventory"); return ValueTask.CompletedTask; },
    OnClosed  = args => { logger.LogInformation("Circuit closed — {Service} healthy", "Inventory"); return ValueTask.CompletedTask; }
});

Timeout: Bounding the Worst Case

Without explicit timeouts, a hung downstream holds your thread indefinitely. Timeouts bound the worst case:

C#
pipeline.AddTimeout(new HttpTimeoutStrategyOptions
{
    Timeout = TimeSpan.FromSeconds(3)   // fail after 3s, do not wait indefinitely
});

Set timeouts at every layer. API Gateway timeout, HTTP client timeout, and database command timeout should all be set and should be consistent (database timeout < HTTP client timeout < API Gateway timeout).
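
A sketch of aligning the layers (values illustrative; Npgsql shown for the database layer):

C#
// Tightest innermost: DB (2s) < HTTP client (5s) < gateway (~10s at the edge).
services.AddDbContext<AppDbContext>(opts =>
    opts.UseNpgsql(connectionString, npgsql => npgsql.CommandTimeout(2)));

services.AddHttpClient<IInventoryClient, InventoryClient>(client =>
    client.Timeout = TimeSpan.FromSeconds(5));

// The API Gateway (YARP, APIM, nginx, ...) enforces the outermost request timeout.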

Bulkhead: Isolating Failure Domains

The bulkhead pattern limits concurrent calls to a downstream service. If Inventory Service is slow, only the bulkhead's allocated threads are consumed — the rest of the Order Service's thread pool remains available for other operations.

C#
// Separate HttpClient per downstream service — each has its own connection pool
// If Inventory is slow: it exhausts its own pool, not the shared pool

services.AddHttpClient<IInventoryClient>()
    .ConfigurePrimaryHttpMessageHandler(() => new SocketsHttpHandler
    {
        PooledConnectionLifetime = TimeSpan.FromMinutes(10),
        MaxConnectionsPerServer = 20  // bulkhead: max 20 concurrent connections to Inventory
    });

services.AddHttpClient<IPaymentClient>()
    .ConfigurePrimaryHttpMessageHandler(() => new SocketsHttpHandler
    {
        MaxConnectionsPerServer = 10  // Payment gets its own pool
    });
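
Connection-pool isolation caps outbound connections per service. For an explicit in-process bulkhead, Polly v8 also ships a concurrency limiter (Polly.RateLimiting package; limits illustrative):

C#
// At most 20 in-flight calls to Inventory; 10 more may queue; further
// calls are rejected immediately instead of tying up threads.
pipeline.AddConcurrencyLimiter(permitLimit: 20, queueLimit: 10);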

The Resilience Pipeline Order

Wrapping all four patterns in a pipeline:

C#
pipeline
    .AddTimeout(TimeSpan.FromSeconds(3))         // 1. outer timeout — total budget
    .AddRetry(retryOptions)                      // 2. retry on transient failure
    .AddCircuitBreaker(circuitBreakerOptions)    // 3. open circuit if downstream is broken
    .AddTimeout(TimeSpan.FromSeconds(1));        // 4. inner timeout — per attempt

Order matters. Retry wraps the circuit breaker, so every attempt passes through it: while the circuit is open, attempts fail fast instead of hammering the recovering downstream. The inner timeout limits each attempt; the outer timeout caps the total operation.


Error Handling Model

The Permanent vs Transient Distinction

Every error in a distributed system is either:

Transient: will likely succeed if retried (network timeout, 503 from pod restart, connection reset)
Permanent: retrying will not help (invalid data, business rule violation, resource not found, auth failure)

Your error handling must classify errors before deciding what to do:

C#
public static class ErrorClassifier
{
    public static bool IsTransient(Exception ex) => ex switch
    {
        TimeoutException          => true,
        HttpRequestException http => IsTransientStatusCode(http),
        SqlException sql          => IsTransientSqlError(sql.Number),
        OperationCanceledException => false,
        _                         => false
    };

    private static bool IsTransientStatusCode(HttpRequestException ex) =>
        ex.StatusCode is HttpStatusCode.ServiceUnavailable
                      or HttpStatusCode.TooManyRequests
                      or HttpStatusCode.GatewayTimeout;

    private static bool IsTransientSqlError(int errorNumber) =>
        errorNumber is 1205 or 40613 or 49918;  // deadlock, DB unavailable, etc.
}

Problem Details: Consistent Error Responses

APIs must return consistent, machine-readable errors. RFC 9457 (Problem Details) is the standard:

JSON
{
  "type": "https://errors.myapi.com/order/insufficient-stock",
  "title": "Insufficient stock",
  "status": 422,
  "detail": "Product P-123 has 2 units in stock; 5 were requested.",
  "instance": "/api/orders/ORD-456",
  "traceId": "4bf92f3577b34da6a3ce929d0e0e4736"
}

The traceId in the error response is the key link between what the user sees and what you see in Application Insights. Every error response must include it.
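
Wiring this into ASP.NET Core (.NET 8 shown) takes a few lines; a minimal sketch:

C#
builder.Services.AddProblemDetails(options =>
    options.CustomizeProblemDetails = ctx =>
    {
        // Attach the current trace id so every error response links
        // back to its distributed trace.
        ctx.ProblemDetails.Extensions["traceId"] =
            Activity.Current?.TraceId.ToString() ?? ctx.HttpContext.TraceIdentifier;
    });

var app = builder.Build();
app.UseExceptionHandler();   // unhandled exceptions → Problem Details
app.UseStatusCodePages();    // bare 4xx/5xx status codes → Problem Details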


The Senior-Level Checklist

This is what distinguishes production-ready from demo-ready:

Testing

  • [ ] Domain logic covered by fast unit tests
  • [ ] Every API endpoint has at least one integration test (happy path + primary error cases)
  • [ ] Testcontainers for database — no mocked repositories
  • [ ] DB state reset between tests (transactions or truncate)
  • [ ] Contract tests for cross-service API dependencies
  • [ ] Messaging tested with in-memory harness or broker container

Observability

  • [ ] Structured logging (Serilog/OpenTelemetry) — no string interpolation in log calls
  • [ ] CorrelationId propagated in all logs, all events, all error responses
  • [ ] PII never in logs — pseudonymised at logging boundary
  • [ ] Distributed tracing wired up (OpenTelemetry → Application Insights / Jaeger)
  • [ ] Custom spans for business operations
  • [ ] SLIs defined and measured (error rate, P99 latency, throughput)
  • [ ] SLO targets documented and dashboarded
  • [ ] Alerts on symptoms (error rate, latency) not causes (CPU, memory)
  • [ ] DLQ depth alerting on all queues

Resilience

  • [ ] Retry with exponential backoff + jitter on all outbound HTTP and messaging calls
  • [ ] No retry on permanent errors (4xx, business failures)
  • [ ] Circuit breaker on every downstream service dependency
  • [ ] Explicit timeouts at every layer (HTTP client, DB command, Function timeout)
  • [ ] Separate HttpClient per downstream service (bulkhead via connection pool isolation)
  • [ ] Idempotency enforced on all consumer handlers

Interview Questions You Will Be Asked

"What is the difference between a circuit breaker and a retry?" Retry handles transient failures by repeating the operation — the downstream is expected to recover quickly. Circuit breaker handles persistent downstream failure — after a threshold of failures, it stops making calls entirely (fails fast) to prevent thread exhaustion and allow the downstream to recover. They are complementary: retry for blips, circuit breaker for prolonged outages.

"What is observability and how is it different from monitoring?" Monitoring tells you whether predefined checks are passing. Observability lets you answer arbitrary questions about system behaviour from its outputs — without having instrumented for those specific questions in advance. Monitoring: "is the server up?" Observability: "why is this one user's requests taking 3× longer than everyone else's?"

"How do you test event-driven systems?" Unit-test domain logic in isolation. Integration-test each service with a real broker (or in-memory harness) — verify the correct events are published and consumed. Contract-test the event schema between producer and consumer. End-to-end test only the critical business flows. The key challenge is async assertions — tests must wait for events to propagate rather than asserting immediately.

"What is an SLO and how does it relate to error budget?" An SLO is a target for a service level indicator — "99.9% of requests succeed in 30 days." Error budget is the allowable downtime derived from the SLO: 0.1% of 30 days = 43 minutes. Error budget makes reliability concrete: when it is burning fast, reliability work takes priority over features. When the budget is healthy, teams can take more risk.

"How do you handle cascading failures?" Prevent thread exhaustion with timeouts and bulkheads. Prevent retry storms with exponential backoff + jitter. Prevent extended degradation with circuit breakers. At the architecture level: prefer async decoupled communication (Service Bus, Kafka) so that downstream unavailability queues work rather than propagating failure upstream.


Related: Microservices Resilience Patterns — Polly v8 deep dive
Related: Integration Testing with WebApplicationFactory
Related: Observability in Distributed Systems
Related: Event-Driven Architecture Deep Dive

Enjoyed this article?

Explore the System Design learning path for more.
