System Design · Lesson 17 of 26
Resilience Patterns — Circuit Breaker, Retry & Bulkhead
Why Resilience Matters
In a monolith, a slow database call blocks one thread. In a microservices system, that same slow call can cascade across a dozen services, taking the entire platform down. This is called a cascading failure.
The classic scenario:
User → API Gateway → Order Service → Inventory Service (slow/down)
                          ↑
              Threads pile up here.
              Connection pool exhausts.
              Order Service starts timing out.
              Gateway starts timing out.
              All users affected.

The thundering herd makes it worse: once Inventory Service comes back up, every retry that was queued fires at once, immediately overloading it again.
Resilience patterns exist to contain failures, not eliminate them. The goal is partial degradation — the order service keeps working even when inventory is slow, and inventory service recovers gracefully when the storm passes.
The Resilience Patterns
| Pattern | Problem it solves |
|---------|-------------------|
| Retry | Transient failures (network blip, pod restart) |
| Circuit Breaker | Cascading failures from a persistently unhealthy downstream |
| Bulkhead | One slow downstream exhausting shared resources (thread pool, connections) |
| Timeout | Calls that hang forever, blocking threads |
| Hedging | Latency tail — slow P99 responses hurting user experience |
Pattern 1: Retry with Exponential Backoff + Jitter
Why plain retry makes things worse
Suppose 1,000 clients all get a 503 at time T and all retry after exactly 1 second. At T+1 you now have 1,000 requests hitting a recovering service simultaneously — the thundering herd. The service goes down again.
Exponential backoff spreads retries out over time: 1s, 2s, 4s, 8s...
Jitter adds randomness so clients don't all fire at the same moment: each client picks a random delay within the backoff window.
Attempt 1: wait 0.8s  (1s * random 0.5–1.0)
Attempt 2: wait 1.6s  (2s * random 0.5–1.0)
Attempt 3: wait 3.2s  (4s * random 0.5–1.0)
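The delay computation is simple enough to sketch by hand. A minimal version of the 0.5–1.0 windowed jitter used in the numbers above (illustrative only, not Polly's internal formula):

```csharp
// Sketch: exponential backoff with jitter (illustrative, not Polly's exact formula)
static TimeSpan BackoffWithJitter(int attempt, TimeSpan baseDelay)
{
    // Exponential window: 1s, 2s, 4s, ... capped at 30s
    var window = Math.Min(baseDelay.TotalSeconds * Math.Pow(2, attempt - 1), 30);

    // Jitter: pick a random point in the 0.5–1.0 portion of the window
    return TimeSpan.FromSeconds(window * (0.5 + Random.Shared.NextDouble() * 0.5));
}
```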
When to retry vs when not to

Retry only transient failures:
- HTTP 408 (Request Timeout), 429 (Rate Limited), 503 (Service Unavailable), 504 (Gateway Timeout)
- Network exceptions: HttpRequestException, SocketException
Never retry:
- HTTP 400 (Bad Request) — retrying won't fix a validation error
- HTTP 401/403 — retrying won't fix an auth error
- HTTP 404 — the resource doesn't exist, retrying is pointless
- Idempotency concerns — only retry if the operation is safe to repeat (see the sketch below)
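For operations that are not naturally idempotent, a client-generated idempotency key is a common way to make retries safe. A sketch, where the header name and the server-side deduplication are assumptions rather than part of this lesson's services:

```csharp
// Sketch: client-generated idempotency key so a retried POST executes only once
// (the header name and server-side dedup are assumed, not shown here)
var request = new HttpRequestMessage(HttpMethod.Post, "/api/inventory/reserve")
{
    Content = JsonContent.Create(new { productId, quantity }),
};
request.Headers.Add("Idempotency-Key", Guid.NewGuid().ToString());

// The server deduplicates on the key, so a retried POST reserves stock only once.
var response = await httpClient.SendAsync(request, ct);
```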
Polly v8 / .NET 8 Retry
// Program.cs — register with DI
builder.Services.AddHttpClient<InventoryClient>()
    .AddResilienceHandler("inventory-pipeline", (pipeline, context) =>
    {
        // Resolve a logger from DI for the OnRetry callback below
        var logger = context.ServiceProvider
            .GetRequiredService<ILogger<InventoryClient>>();

        pipeline.AddRetry(new HttpRetryStrategyOptions
        {
            MaxRetryAttempts = 3,
            Delay = TimeSpan.FromSeconds(1),
            BackoffType = DelayBackoffType.Exponential,
            UseJitter = true, // randomizes delays so clients don't retry in lockstep
            ShouldHandle = new PredicateBuilder<HttpResponseMessage>()
                .Handle<HttpRequestException>()
                .HandleResult(r => r.StatusCode is
                    HttpStatusCode.RequestTimeout or
                    HttpStatusCode.ServiceUnavailable or
                    HttpStatusCode.TooManyRequests or
                    HttpStatusCode.GatewayTimeout),
            OnRetry = args =>
            {
                logger.LogWarning(
                    "Retry {Attempt} after {Delay}ms. Reason: {Reason}",
                    args.AttemptNumber,
                    args.RetryDelay.TotalMilliseconds,
                    args.Outcome.Exception?.Message ?? args.Outcome.Result?.StatusCode.ToString());
                return ValueTask.CompletedTask;
            },
        });
});

Pattern 2: Circuit Breaker
The circuit breaker sits between the caller and the downstream service. It monitors failures and, when the failure rate crosses a threshold, opens the circuit — subsequent calls fail fast (no network call made) and return a fallback immediately.
State machine
              failure rate > threshold
   CLOSED ────────────────────────────────> OPEN
     ▲                                        │
     │ probe call succeeds                    │ break timeout
     │ (reset failure count)                  │ expires
     │                                        ▼
     └───────────────────────────────── HALF-OPEN
                                              │ probe call fails
                                              ▼
                                       OPEN (timer restarts)

CLOSED — normal operation, calls go through. Failures are counted in a sliding window.
OPEN — circuit is tripped. All calls fail immediately with BrokenCircuitException — no network calls made. A timer starts.
HALF-OPEN — after the timer expires, one probe call is allowed through. If it succeeds, the circuit resets to CLOSED. If it fails, it returns to OPEN.
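To observe these transitions from the outside, for example in a health endpoint or a metrics exporter, Polly v8 can expose the breaker's current state through a CircuitBreakerStateProvider. A minimal sketch:

```csharp
// Sketch: expose the breaker's current state for health checks / metrics
var stateProvider = new CircuitBreakerStateProvider();

pipeline.AddCircuitBreaker(new HttpCircuitBreakerStrategyOptions
{
    StateProvider = stateProvider,
});

// Later, e.g. in a health check:
// stateProvider.CircuitState is CircuitState.Closed, Open, HalfOpen, or Isolated
```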
Circuit Breaker in .NET 8
pipeline.AddCircuitBreaker(new HttpCircuitBreakerStrategyOptions
{
// Trip when 50% of calls in the last 10 seconds fail
FailureRatio = 0.5,
SamplingDuration = TimeSpan.FromSeconds(10),
MinimumThroughput = 5, // need at least 5 calls to trip
// Stay open for 30 seconds before probing
BreakDuration = TimeSpan.FromSeconds(30),
ShouldHandle = new PredicateBuilder<HttpResponseMessage>()
.Handle<HttpRequestException>()
.HandleResult(r => r.StatusCode == HttpStatusCode.ServiceUnavailable),
OnOpened = args =>
{
logger.LogError(
"Circuit opened for {Duration}s. Last failure: {Reason}",
args.BreakDuration.TotalSeconds,
args.Outcome.Exception?.Message);
return ValueTask.CompletedTask;
},
OnClosed = args =>
{
logger.LogInformation("Circuit closed — service recovered.");
return ValueTask.CompletedTask;
},
OnHalfOpened = args =>
{
logger.LogInformation("Circuit half-open — probing...");
return ValueTask.CompletedTask;
},
});

Fallback when the circuit is open
// Return a cached/default response when the circuit is open
pipeline.AddFallback(new FallbackStrategyOptions<HttpResponseMessage>
{
ShouldHandle = new PredicateBuilder<HttpResponseMessage>()
.Handle<BrokenCircuitException>()
.Handle<RateLimiterRejectedException>(), // thrown by the bulkhead (concurrency limiter)
FallbackAction = args =>
{
// Return a degraded response instead of an error
var response = new HttpResponseMessage(HttpStatusCode.OK)
{
Content = JsonContent.Create(new { available = false, reason = "service_unavailable" })
};
return Outcome.FromResultAsValueTask(response);
},
});

Pattern 3: Bulkhead (Thread Pool Isolation)
Named after the watertight compartments in a ship's hull — if one compartment floods, the others stay dry.
Without a bulkhead, all outbound HTTP calls from Order Service share one HttpClient connection pool. If Inventory Service becomes slow, it fills up the connection pool, and now calls to Catalog Service (completely unrelated) start failing too.
A bulkhead limits the concurrent calls to a downstream service. Excess calls are rejected immediately (fail fast) rather than queueing up and consuming resources.
pipeline.AddConcurrencyLimiter(new ConcurrencyLimiterOptions
{
// Max 10 concurrent calls to this service
PermitLimit = 10,
// Queue up to 5 more while waiting for a permit
QueueLimit = 5,
});

When the permit limit and the queue are both full, additional calls throw RateLimiterRejectedException — pair this with the fallback pattern above.
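Isolation only works if each downstream gets its own pipeline. A sketch of the registration, assuming a hypothetical CatalogClient alongside the InventoryClient and using the shorthand AddConcurrencyLimiter overload:

```csharp
// Sketch: one bulkhead per downstream, so Inventory's backlog can't starve Catalog
builder.Services.AddHttpClient<InventoryClient>()
    .AddResilienceHandler("inventory-bulkhead", pipeline =>
        pipeline.AddConcurrencyLimiter(permitLimit: 10, queueLimit: 5));

builder.Services.AddHttpClient<CatalogClient>() // hypothetical second client
    .AddResilienceHandler("catalog-bulkhead", pipeline =>
        pipeline.AddConcurrencyLimiter(permitLimit: 10, queueLimit: 5));
```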
Pattern 4: Timeout
Every outbound call needs a timeout. Without one, a hung downstream service will hold a thread indefinitely.
Two timeout levels to consider:
Per-attempt timeout — how long a single attempt can take; each retry gets its own per-attempt budget.
Overall timeout — the total time budget across all retry attempts.
// Per-attempt: each individual call gets 2 seconds
pipeline.AddTimeout(TimeSpan.FromSeconds(2));
// Combine with retry — the overall budget is ~10s across all attempts
// (2s * 3 retries + backoff time)

Set the per-attempt timeout shorter than the upstream caller's timeout. If your API gateway has a 10-second timeout, set the internal service call timeout to 3 seconds so retries can fit within the budget.
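The two levels compose in a single Polly v8 pipeline, because the strategy added first is the outermost. A minimal sketch with illustrative values:

```csharp
// Sketch: total budget outermost, per-attempt timeout innermost (inside the retry)
pipeline.AddTimeout(TimeSpan.FromSeconds(10)); // overall budget across all attempts
pipeline.AddRetry(new HttpRetryStrategyOptions { MaxRetryAttempts = 3 });
pipeline.AddTimeout(TimeSpan.FromSeconds(2));  // per attempt: each try gets 2s
```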
Pattern 5: Hedging
Hedging is an advanced latency-reduction technique. Instead of waiting for a slow response, send a second request to a different instance (or the same endpoint) after a delay. Use whichever response arrives first.
Request 1 ───────────────────────────────> slow...
│
hedge delay │
(e.g. 200ms) ▼
Request 2 ──────────────────> fast response ✓
               (cancel Request 1)

Use hedging when:
- You have multiple healthy replicas behind a load balancer
- Your P99 latency is significantly higher than your P50
- The operation is idempotent (safe to call twice)
pipeline.AddHedging(new HedgingStrategyOptions<HttpResponseMessage>
{
MaxHedgedAttempts = 2,
Delay = TimeSpan.FromMilliseconds(200), // send hedge after 200ms
ShouldHandle = new PredicateBuilder<HttpResponseMessage>()
.Handle<HttpRequestException>()
.HandleResult(r => !r.IsSuccessStatusCode),
});

Full Resilience Pipeline
The order of strategies in a pipeline matters: in Polly v8, the strategy added first is the outermost. The recommended order is:

Request → Fallback → Timeout → Retry → Circuit Breaker → Bulkhead → Service

Read it outside-in: the fallback catches anything that escapes the inner strategies (an open circuit, a bulkhead rejection), the timeout enforces the overall budget, retry fires on transient failures, the circuit breaker stops calls when the service is down, and the bulkhead limits concurrency.
// appsettings.json
{
"Resilience": {
"Inventory": {
"TimeoutSeconds": 8,
"RetryCount": 3,
"CircuitBreakerFailureRatio": 0.5,
"CircuitBreakerBreakSeconds": 30,
"BulkheadPermitLimit": 10
}
}
}

// ResilienceExtensions.cs
public static IHttpClientBuilder AddInventoryResilience(
this IHttpClientBuilder builder,
IConfiguration config)
{
var section = config.GetSection("Resilience:Inventory");
return builder.AddResilienceHandler("inventory", pipeline =>
{
        // 1. Fallback (outermost, added first) for open circuit or rejected calls
        pipeline.AddFallback(new FallbackStrategyOptions<HttpResponseMessage>
        {
            ShouldHandle = new PredicateBuilder<HttpResponseMessage>()
                .Handle<BrokenCircuitException>()
                .Handle<RateLimiterRejectedException>(),
            FallbackAction = _ => Outcome.FromResultAsValueTask(
                new HttpResponseMessage(HttpStatusCode.ServiceUnavailable)),
        });

        // 2. Overall timeout: the total budget across all retry attempts
        pipeline.AddTimeout(TimeSpan.FromSeconds(section.GetValue<int>("TimeoutSeconds")));

        // 3. Retry with jitter
        pipeline.AddRetry(new HttpRetryStrategyOptions
        {
            MaxRetryAttempts = section.GetValue<int>("RetryCount"),
            BackoffType = DelayBackoffType.Exponential,
            UseJitter = true,
            Delay = TimeSpan.FromSeconds(1),
            ShouldHandle = new PredicateBuilder<HttpResponseMessage>()
                .Handle<HttpRequestException>()
                .HandleResult(r => (int)r.StatusCode >= 500
                    || r.StatusCode == HttpStatusCode.TooManyRequests),
        });

        // 4. Circuit breaker
        pipeline.AddCircuitBreaker(new HttpCircuitBreakerStrategyOptions
        {
            FailureRatio = section.GetValue<double>("CircuitBreakerFailureRatio"),
            SamplingDuration = TimeSpan.FromSeconds(10),
            MinimumThroughput = 5,
            BreakDuration = TimeSpan.FromSeconds(
                section.GetValue<int>("CircuitBreakerBreakSeconds")),
        });

        // 5. Bulkhead (innermost): limits concurrent calls to the service
        pipeline.AddConcurrencyLimiter(new ConcurrencyLimiterOptions
        {
            PermitLimit = section.GetValue<int>("BulkheadPermitLimit"),
            QueueLimit = 5,
        });
});
}

Testing Resilience with Chaos Engineering (Simmy)
Writing a resilience pipeline is not enough — you need to verify it works under failure. Simmy is the chaos engineering extension for Polly. It lets you inject faults into your pipeline in tests or even in staging.
dotnet add package Polly.Simmy

// Test: verify the circuit breaker opens after repeated failures, then fails fast
[Fact]
public async Task CircuitBreaker_OpensAfterThreshold_FailsFast()
{
    // Arrange — inject a 503 result on every call.
    // Chaos strategies go last (innermost) so the circuit breaker sees the faults.
    var pipeline = new ResiliencePipelineBuilder<HttpResponseMessage>()
        .AddCircuitBreaker(new HttpCircuitBreakerStrategyOptions
        {
            FailureRatio = 0.5,
            MinimumThroughput = 3,
            SamplingDuration = TimeSpan.FromSeconds(5),
            BreakDuration = TimeSpan.FromSeconds(10),
        })
        .AddChaosOutcome(1.0, // 100% fault injection
            () => new HttpResponseMessage(HttpStatusCode.ServiceUnavailable))
        .Build();

    // Act — record enough failures to trip the circuit (MinimumThroughput = 3)
    for (int i = 0; i < 3; i++)
    {
        await pipeline.ExecuteAsync(ct => ValueTask.FromResult(
            new HttpResponseMessage(HttpStatusCode.OK)), CancellationToken.None);
    }

    // Assert — the circuit is now open, so the next call fails fast
    var act = () => pipeline.ExecuteAsync(ct => ValueTask.FromResult(
        new HttpResponseMessage(HttpStatusCode.OK)), CancellationToken.None).AsTask();
    await act.Should().ThrowAsync<BrokenCircuitException>();
}

Chaos in staging environments
For staging, use a feature flag to enable/disable chaos injection:
// Program.cs — resolve IFeatureManager when the pipeline is built
builder.Services.AddHttpClient<InventoryClient>()
    .AddResilienceHandler("inventory-chaos", (pipeline, context) =>
    {
        var featureManager = context.ServiceProvider
            .GetRequiredService<IFeatureManager>();

        pipeline.AddChaosLatency(new ChaosLatencyStrategyOptions
        {
            InjectionRate = 0.1, // 10% of calls get a 2s delay
            Latency = TimeSpan.FromSeconds(2),
            // Only inject chaos if the feature flag is on
            EnabledGenerator = async _ =>
                await featureManager.IsEnabledAsync("chaos-latency"),
        });
    });

MicroMart: Order Service Circuit-Breaks Against Inventory
In MicroMart, the Order Service calls Inventory Service to reserve stock during order placement. If Inventory is slow or down, the circuit breaker protects Order Service from being dragged down.
// services/orders/Infrastructure/Http/InventoryClient.cs
public class InventoryClient(HttpClient httpClient, ILogger<InventoryClient> logger)
{
public async Task<ReservationResult> ReserveStockAsync(
Guid productId, int quantity, CancellationToken ct)
{
try
{
var response = await httpClient.PostAsJsonAsync(
"/api/inventory/reserve",
new { productId, quantity },
ct);
if (response.StatusCode == HttpStatusCode.ServiceUnavailable)
{
// Circuit open — return degraded response
logger.LogWarning("Inventory service unavailable — deferring reservation.");
return ReservationResult.Deferred("inventory_unavailable");
}
response.EnsureSuccessStatusCode();
return await response.Content.ReadFromJsonAsync<ReservationResult>(ct)
?? throw new InvalidOperationException("Empty reservation response.");
}
catch (BrokenCircuitException ex)
{
logger.LogWarning(ex, "Circuit open — skipping inventory reservation.");
return ReservationResult.Deferred("circuit_open");
}
}
}

// Program.cs — wire up the pipeline
builder.Services.AddHttpClient<InventoryClient>(client =>
client.BaseAddress = new Uri(builder.Configuration["Services:Inventory"]!))
    .AddInventoryResilience(builder.Configuration);

With this setup:
- Transient Inventory failures are retried (up to 3 times with jitter)
- After 50% failure rate over 10 seconds, the circuit opens
- Open circuit: orders are accepted with deferred stock reservation (saga compensates later)
- After 30 seconds, the circuit probes — if Inventory is healthy, it resets
- Max 10 concurrent calls to Inventory (bulkhead prevents pool exhaustion)
Summary
| Pattern | When to use | Key config |
|---------|-------------|------------|
| Retry | Transient failures (5xx, timeouts) | Exponential backoff + jitter, 3 attempts |
| Circuit Breaker | Persistent downstream failures | 50% failure rate, 30s break, min 5 calls |
| Bulkhead | One downstream exhausting shared resources | 10 permit limit, 5 queue |
| Timeout | Calls that can hang indefinitely | Per-attempt shorter than upstream timeout |
| Hedging | High P99 latency, idempotent calls | 200ms hedge delay |
Resilience is not optional in production microservices. Every service-to-service HTTP call should have at minimum a timeout, retry, and circuit breaker.