System Design · Lesson 17 of 26
Resilience Patterns — Circuit Breaker, Retry & Bulkhead
Why Resilience Matters
In a monolith, a slow database call blocks one thread. In a microservices system, that same slow call can cascade across a dozen services, taking the entire platform down. This is called a cascading failure.
The classic scenario:
User → API Gateway → Order Service → Inventory Service (slow/down)
                          ↑
              Threads pile up here.
              Connection pool exhausts.
              Order Service starts timing out.
              Gateway starts timing out.
              All users affected.

The thundering herd makes it worse: once Inventory Service comes back up, every retry that was queued fires at once, immediately overloading it again.
Resilience patterns exist to contain failures, not eliminate them. The goal is partial degradation — the order service keeps working even when inventory is slow, and inventory service recovers gracefully when the storm passes.
The Resilience Patterns
| Pattern | Problem it solves |
|---------|-------------------|
| Retry | Transient failures (network blip, pod restart) |
| Circuit Breaker | Cascading failures from a persistently unhealthy downstream |
| Bulkhead | One slow downstream exhausting shared resources (thread pool, connections) |
| Timeout | Calls that hang forever, blocking threads |
| Hedging | Latency tail — slow P99 responses hurting user experience |
Pattern 1: Retry with Exponential Backoff + Jitter
Why plain retry makes things worse
Suppose 1,000 clients all get a 503 at time T and all retry after exactly 1 second. At T+1 you now have 1,000 requests hitting a recovering service simultaneously — the thundering herd. The service goes down again.
Exponential backoff spreads retries out over time: 1s, 2s, 4s, 8s...
Jitter adds randomness so clients don't all fire at the same moment: each client picks a random delay within the backoff window.
Attempt 1: wait 0.8s  (1s * random 0.5–1.0)
Attempt 2: wait 1.6s  (2s * random 0.5–1.0)
Attempt 3: wait 3.2s  (4s * random 0.5–1.0)
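The delay computation is simple enough to sketch by hand. A minimal version of the 0.5–1.0 windowed jitter used in the numbers above (illustrative only, not Polly's internal formula):

```csharp
// Sketch: exponential backoff with jitter (illustrative, not Polly's exact formula)
static TimeSpan BackoffWithJitter(int attempt, TimeSpan baseDelay)
{
    // Exponential window: 1s, 2s, 4s, ... capped at 30s
    var window = Math.Min(baseDelay.TotalSeconds * Math.Pow(2, attempt - 1), 30);

    // Jitter: pick a random point in the 0.5–1.0 portion of the window
    return TimeSpan.FromSeconds(window * (0.5 + Random.Shared.NextDouble() * 0.5));
}
```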
When to retry vs when not to

Retry only transient failures:
- HTTP 408 (Request Timeout), 429 (Rate Limited), 503 (Service Unavailable), 504 (Gateway Timeout)
- Network exceptions: HttpRequestException, SocketException
Never retry:
- HTTP 400 (Bad Request) — retrying won't fix a validation error
- HTTP 401/403 — retrying won't fix an auth error
- HTTP 404 — the resource doesn't exist, retrying is pointless
- Idempotency concerns — only retry if the operation is safe to repeat (see the sketch below)
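For operations that are not naturally idempotent, a client-generated idempotency key is a common way to make retries safe. A sketch, where the header name and the server-side deduplication are assumptions rather than part of this lesson's services:

```csharp
// Sketch: client-generated idempotency key so a retried POST executes only once
// (the header name and server-side dedup are assumed, not shown here)
var request = new HttpRequestMessage(HttpMethod.Post, "/api/inventory/reserve")
{
    Content = JsonContent.Create(new { productId, quantity }),
};
request.Headers.Add("Idempotency-Key", Guid.NewGuid().ToString());

// The server deduplicates on the key, so a retried POST reserves stock only once.
var response = await httpClient.SendAsync(request, ct);
```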
Polly v8 / .NET 8 Retry
// Program.cs — register with DI
builder.Services.AddHttpClient<InventoryClient>()
    .AddResilienceHandler("inventory-pipeline", (pipeline, context) =>
    {
        // Resolve a logger from DI for the OnRetry callback below
        var logger = context.ServiceProvider
            .GetRequiredService<ILogger<InventoryClient>>();

        pipeline.AddRetry(new HttpRetryStrategyOptions
        {
            MaxRetryAttempts = 3,
            Delay = TimeSpan.FromSeconds(1),
            BackoffType = DelayBackoffType.Exponential,
            UseJitter = true, // randomizes delays so clients don't retry in lockstep
            ShouldHandle = new PredicateBuilder<HttpResponseMessage>()
                .Handle<HttpRequestException>()
                .HandleResult(r => r.StatusCode is
                    HttpStatusCode.RequestTimeout or
                    HttpStatusCode.ServiceUnavailable or
                    HttpStatusCode.TooManyRequests or
                    HttpStatusCode.GatewayTimeout),
            OnRetry = args =>
            {
                logger.LogWarning(
                    "Retry {Attempt} after {Delay}ms. Reason: {Reason}",
                    args.AttemptNumber,
                    args.RetryDelay.TotalMilliseconds,
                    args.Outcome.Exception?.Message ?? args.Outcome.Result?.StatusCode.ToString());
                return ValueTask.CompletedTask;
            },
        });
});

Pattern 2: Circuit Breaker
The circuit breaker sits between the caller and the downstream service. It monitors failures and, when the failure rate crosses a threshold, opens the circuit — subsequent calls fail fast (no network call made) and return a fallback immediately.
State machine
              failure rate > threshold
   CLOSED ────────────────────────────────> OPEN
     ▲                                        │
     │ probe call succeeds                    │ break timeout
     │ (reset failure count)                  │ expires
     │                                        ▼
     └───────────────────────────────── HALF-OPEN
                                              │ probe call fails
                                              ▼
                                       OPEN (timer restarts)

CLOSED — normal operation, calls go through. Failures are counted in a sliding window.
OPEN — circuit is tripped. All calls fail immediately with BrokenCircuitException — no network calls made. A timer starts.
HALF-OPEN — after the timer expires, one probe call is allowed through. If it succeeds, the circuit resets to CLOSED. If it fails, it returns to OPEN.
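To observe these transitions from the outside, for example in a health endpoint or a metrics exporter, Polly v8 can expose the breaker's current state through a CircuitBreakerStateProvider. A minimal sketch:

```csharp
// Sketch: expose the breaker's current state for health checks / metrics
var stateProvider = new CircuitBreakerStateProvider();

pipeline.AddCircuitBreaker(new HttpCircuitBreakerStrategyOptions
{
    StateProvider = stateProvider,
});

// Later, e.g. in a health check:
// stateProvider.CircuitState is CircuitState.Closed, Open, HalfOpen, or Isolated
```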
Circuit Breaker in .NET 8
pipeline.AddCircuitBreaker(new HttpCircuitBreakerStrategyOptions
{
// Trip when 50% of calls in the last 10 seconds fail
FailureRatio = 0.5,
SamplingDuration = TimeSpan.FromSeconds(10),
MinimumThroughput = 5, // need at least 5 calls to trip
// Stay open for 30 seconds before probing
BreakDuration = TimeSpan.FromSeconds(30),
ShouldHandle = new PredicateBuilder<HttpResponseMessage>()
.Handle<HttpRequestException>()
.HandleResult(r => r.StatusCode == HttpStatusCode.ServiceUnavailable),
OnOpened = args =>
{
logger.LogError(
"Circuit opened for {Duration}s. Last failure: {Reason}",
args.BreakDuration.TotalSeconds,
args.Outcome.Exception?.Message);
return ValueTask.CompletedTask;
},
OnClosed = args =>
{
logger.LogInformation("Circuit closed — service recovered.");
return ValueTask.CompletedTask;
},
OnHalfOpened = args =>
{
logger.LogInformation("Circuit half-open — probing...");
return ValueTask.CompletedTask;
},
});

Fallback when the circuit is open
// Return a cached/default response when the circuit is open
pipeline.AddFallback(new FallbackStrategyOptions<HttpResponseMessage>
{
ShouldHandle = new PredicateBuilder<HttpResponseMessage>()
.Handle<BrokenCircuitException>()
.Handle<RateLimiterRejectedException>(), // thrown by the bulkhead (concurrency limiter)
FallbackAction = args =>
{
// Return a degraded response instead of an error
var response = new HttpResponseMessage(HttpStatusCode.OK)
{
Content = JsonContent.Create(new { available = false, reason = "service_unavailable" })
};
return Outcome.FromResultAsValueTask(response);
},
});

Pattern 3: Bulkhead (Thread Pool Isolation)
Named after the watertight compartments in a ship's hull — if one compartment floods, the others stay dry.
Without a bulkhead, all outbound HTTP calls from Order Service share one HttpClient connection pool. If Inventory Service becomes slow, it fills up the connection pool, and now calls to Catalog Service (completely unrelated) start failing too.
A bulkhead limits the concurrent calls to a downstream service. Excess calls are rejected immediately (fail fast) rather than queueing up and consuming resources.
pipeline.AddConcurrencyLimiter(new ConcurrencyLimiterOptions
{
// Max 10 concurrent calls to this service
PermitLimit = 10,
// Queue up to 5 more while waiting for a permit
QueueLimit = 5,
});

When the permit limit and the queue are both full, additional calls throw RateLimiterRejectedException — pair this with the fallback pattern above.
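Isolation only works if each downstream gets its own pipeline. A sketch of the registration, assuming a hypothetical CatalogClient alongside the InventoryClient and using the shorthand AddConcurrencyLimiter overload:

```csharp
// Sketch: one bulkhead per downstream, so Inventory's backlog can't starve Catalog
builder.Services.AddHttpClient<InventoryClient>()
    .AddResilienceHandler("inventory-bulkhead", pipeline =>
        pipeline.AddConcurrencyLimiter(permitLimit: 10, queueLimit: 5));

builder.Services.AddHttpClient<CatalogClient>() // hypothetical second client
    .AddResilienceHandler("catalog-bulkhead", pipeline =>
        pipeline.AddConcurrencyLimiter(permitLimit: 10, queueLimit: 5));
```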
Pattern 4: Timeout
Every outbound call needs a timeout. Without one, a hung downstream service will hold a thread indefinitely.
Two timeout levels to consider:
Per-attempt timeout — how long a single attempt can take; each retry gets its own per-attempt budget.
Overall timeout — the total time budget across all retry attempts.
// Per-attempt: each individual call gets 2 seconds
pipeline.AddTimeout(TimeSpan.FromSeconds(2));
// Combine with retry — the overall budget is ~10s across all attempts
// (2s * 3 retries + backoff time)

Set the per-attempt timeout shorter than the upstream caller's timeout. If your API gateway has a 10-second timeout, set the internal service call timeout to 3 seconds so retries can fit within the budget.
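The two levels compose in a single Polly v8 pipeline, because the strategy added first is the outermost. A minimal sketch with illustrative values:

```csharp
// Sketch: total budget outermost, per-attempt timeout innermost (inside the retry)
pipeline.AddTimeout(TimeSpan.FromSeconds(10)); // overall budget across all attempts
pipeline.AddRetry(new HttpRetryStrategyOptions { MaxRetryAttempts = 3 });
pipeline.AddTimeout(TimeSpan.FromSeconds(2));  // per attempt: each try gets 2s
```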
Pattern 5: Hedging
Hedging is an advanced latency-reduction technique. Instead of waiting for a slow response, send a second request to a different instance (or the same endpoint) after a delay. Use whichever response arrives first.
Request 1 ───────────────────────────────> slow...
│
hedge delay │
(e.g. 200ms) ▼
Request 2 ──────────────────> fast response ✓
               (cancel Request 1)

Use hedging when:
- You have multiple healthy replicas behind a load balancer
- Your P99 latency is significantly higher than your P50
- The operation is idempotent (safe to call twice)
pipeline.AddHedging(new HedgingStrategyOptions<HttpResponseMessage>
{
MaxHedgedAttempts = 2,
Delay = TimeSpan.FromMilliseconds(200), // send hedge after 200ms
ShouldHandle = new PredicateBuilder<HttpResponseMessage>()
.Handle<HttpRequestException>()
.HandleResult(r => !r.IsSuccessStatusCode),
});

Full Resilience Pipeline
The order of strategies in a pipeline matters: in Polly v8, the strategy added first is the outermost. The recommended order is:

Request → Fallback → Timeout → Retry → Circuit Breaker → Bulkhead → Service

Read it outside-in: the fallback catches anything that escapes the inner strategies (an open circuit, a bulkhead rejection), the timeout enforces the overall budget, retry fires on transient failures, the circuit breaker stops calls when the service is down, and the bulkhead limits concurrency.
// appsettings.json
{
"Resilience": {
"Inventory": {
"TimeoutSeconds": 8,
"RetryCount": 3,
"CircuitBreakerFailureRatio": 0.5,
"CircuitBreakerBreakSeconds": 30,
"BulkheadPermitLimit": 10
}
}
}

// ResilienceExtensions.cs
public static IHttpClientBuilder AddInventoryResilience(
this IHttpClientBuilder builder,
IConfiguration config)
{
var section = config.GetSection("Resilience:Inventory");
return builder.AddResilienceHandler("inventory", pipeline =>
{
        // 1. Fallback (outermost, added first) for open circuit or rejected calls
        pipeline.AddFallback(new FallbackStrategyOptions<HttpResponseMessage>
        {
            ShouldHandle = new PredicateBuilder<HttpResponseMessage>()
                .Handle<BrokenCircuitException>()
                .Handle<RateLimiterRejectedException>(),
            FallbackAction = _ => Outcome.FromResultAsValueTask(
                new HttpResponseMessage(HttpStatusCode.ServiceUnavailable)),
        });

        // 2. Overall timeout: the total budget across all retry attempts
        pipeline.AddTimeout(TimeSpan.FromSeconds(section.GetValue<int>("TimeoutSeconds")));

        // 3. Retry with jitter
        pipeline.AddRetry(new HttpRetryStrategyOptions
        {
            MaxRetryAttempts = section.GetValue<int>("RetryCount"),
            BackoffType = DelayBackoffType.Exponential,
            UseJitter = true,
            Delay = TimeSpan.FromSeconds(1),
            ShouldHandle = new PredicateBuilder<HttpResponseMessage>()
                .Handle<HttpRequestException>()
                .HandleResult(r => (int)r.StatusCode >= 500
                    || r.StatusCode == HttpStatusCode.TooManyRequests),
        });

        // 4. Circuit breaker
        pipeline.AddCircuitBreaker(new HttpCircuitBreakerStrategyOptions
        {
            FailureRatio = section.GetValue<double>("CircuitBreakerFailureRatio"),
            SamplingDuration = TimeSpan.FromSeconds(10),
            MinimumThroughput = 5,
            BreakDuration = TimeSpan.FromSeconds(
                section.GetValue<int>("CircuitBreakerBreakSeconds")),
        });

        // 5. Bulkhead (innermost): limits concurrent calls to the service
        pipeline.AddConcurrencyLimiter(new ConcurrencyLimiterOptions
        {
            PermitLimit = section.GetValue<int>("BulkheadPermitLimit"),
            QueueLimit = 5,
        });
});
}

Testing Resilience with Chaos Engineering (Simmy)
Writing a resilience pipeline is not enough — you need to verify it works under failure. Simmy is the chaos engineering extension for Polly. It lets you inject faults into your pipeline in tests or even in staging.
dotnet add package Polly.Simmy

// Test: verify the circuit breaker opens after repeated failures, then fails fast
[Fact]
public async Task CircuitBreaker_OpensAfterThreshold_FailsFast()
{
    // Arrange — inject a 503 result on every call.
    // Chaos strategies go last (innermost) so the circuit breaker sees the faults.
    var pipeline = new ResiliencePipelineBuilder<HttpResponseMessage>()
        .AddCircuitBreaker(new HttpCircuitBreakerStrategyOptions
        {
            FailureRatio = 0.5,
            MinimumThroughput = 3,
            SamplingDuration = TimeSpan.FromSeconds(5),
            BreakDuration = TimeSpan.FromSeconds(10),
        })
        .AddChaosOutcome(1.0, // 100% fault injection
            () => new HttpResponseMessage(HttpStatusCode.ServiceUnavailable))
        .Build();

    // Act — record enough failures to trip the circuit (MinimumThroughput = 3)
    for (int i = 0; i < 3; i++)
    {
        await pipeline.ExecuteAsync(ct => ValueTask.FromResult(
            new HttpResponseMessage(HttpStatusCode.OK)), CancellationToken.None);
    }

    // Assert — the circuit is now open, so the next call fails fast
    var act = () => pipeline.ExecuteAsync(ct => ValueTask.FromResult(
        new HttpResponseMessage(HttpStatusCode.OK)), CancellationToken.None).AsTask();
    await act.Should().ThrowAsync<BrokenCircuitException>();
}

Chaos in staging environments
For staging, use a feature flag to enable/disable chaos injection:
// Program.cs — resolve IFeatureManager when the pipeline is built
builder.Services.AddHttpClient<InventoryClient>()
    .AddResilienceHandler("inventory-chaos", (pipeline, context) =>
    {
        var featureManager = context.ServiceProvider
            .GetRequiredService<IFeatureManager>();

        pipeline.AddChaosLatency(new ChaosLatencyStrategyOptions
        {
            InjectionRate = 0.1, // 10% of calls get a 2s delay
            Latency = TimeSpan.FromSeconds(2),
            // Only inject chaos if the feature flag is on
            EnabledGenerator = async _ =>
                await featureManager.IsEnabledAsync("chaos-latency"),
        });
    });

MicroMart: Order Service Circuit-Breaks Against Inventory
In MicroMart, the Order Service calls Inventory Service to reserve stock during order placement. If Inventory is slow or down, the circuit breaker protects Order Service from being dragged down.
// services/orders/Infrastructure/Http/InventoryClient.cs
public class InventoryClient(HttpClient httpClient, ILogger<InventoryClient> logger)
{
public async Task<ReservationResult> ReserveStockAsync(
Guid productId, int quantity, CancellationToken ct)
{
try
{
var response = await httpClient.PostAsJsonAsync(
"/api/inventory/reserve",
new { productId, quantity },
ct);
if (response.StatusCode == HttpStatusCode.ServiceUnavailable)
{
// Circuit open — return degraded response
logger.LogWarning("Inventory service unavailable — deferring reservation.");
return ReservationResult.Deferred("inventory_unavailable");
}
response.EnsureSuccessStatusCode();
return await response.Content.ReadFromJsonAsync<ReservationResult>(ct)
?? throw new InvalidOperationException("Empty reservation response.");
}
catch (BrokenCircuitException ex)
{
logger.LogWarning(ex, "Circuit open — skipping inventory reservation.");
return ReservationResult.Deferred("circuit_open");
}
}
}

// Program.cs — wire up the pipeline
builder.Services.AddHttpClient<InventoryClient>(client =>
client.BaseAddress = new Uri(builder.Configuration["Services:Inventory"]!))
    .AddInventoryResilience(builder.Configuration);

With this setup:
- Transient Inventory failures are retried (up to 3 times with jitter)
- After 50% failure rate over 10 seconds, the circuit opens
- Open circuit: orders are accepted with deferred stock reservation (saga compensates later)
- After 30 seconds, the circuit probes — if Inventory is healthy, it resets
- Max 10 concurrent calls to Inventory (bulkhead prevents pool exhaustion)
Summary
| Pattern | When to use | Key config |
|---------|-------------|------------|
| Retry | Transient failures (5xx, timeouts) | Exponential backoff + jitter, 3 attempts |
| Circuit Breaker | Persistent downstream failures | 50% failure rate, 30s break, min 5 calls |
| Bulkhead | One downstream exhausting shared resources | 10 permit limit, 5 queue |
| Timeout | Calls that can hang indefinitely | Per-attempt shorter than upstream timeout |
| Hedging | High P99 latency, idempotent calls | 200ms hedge delay |
Resilience is not optional in production microservices. Every service-to-service HTTP call should have at minimum a timeout, retry, and circuit breaker.