.NET Aspire · Lesson 3 of 5
Resilience in .NET Aspire — Retries and Circuit Breakers
What AddServiceDefaults Provides
// In each service's Program.cs:
builder.AddServiceDefaults();
// What this wires up automatically:
// → OpenTelemetry tracing, metrics, and logging
// → Health check endpoints (/health for readiness, /alive for liveness),
//   mapped by app.MapDefaultEndpoints(); the template only maps them in Development
// → Service discovery resolver for HttpClient
// → Default resilience pipeline for every HttpClient created through IHttpClientFactory
// AddServiceDefaults lives in the ServiceDefaults shared project generated by the Aspire templates
// Customise it to your needs: it's your code, not a library

Default Resilience Pipeline
// Aspire's default resilience pipeline (from ServiceDefaults):
// NuGet: Microsoft.Extensions.Http.Resilience
// What's included by default:
builder.Services.AddHttpClient<IPatientServiceClient, PatientServiceClient>(...)
    .AddStandardResilienceHandler();
// StandardResilienceHandler includes (in order, outermost first):
// 1. RateLimiter: max 1,000 concurrent requests
// 2. TotalRequestTimeout: 30 seconds total, across all attempts
// 3. Retry: up to 3 retries on transient failures (408, 429, 5xx),
//    exponential backoff with jitter
// 4. CircuitBreaker: opens when at least 10% of requests fail within a
//    30-second window (minimum 100 requests), then stays open for 5 seconds
// 5. AttemptTimeout: 10 seconds per individual attempt
// These defaults are sensible for most services.
// Customise for specific clinical endpoints that have tighter SLAs.

Customising Retry for Clinical Services
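Before writing a fully custom pipeline, it may be enough to tune the standard handler in place. The sketch below assumes the defaults are registered through ConfigureHttpClientDefaults, as the Aspire ServiceDefaults template does. The values are illustrative, not recommendations. The options are validated at startup, so keep the attempt timeout below the total timeout and the breaker's sampling window at least twice the attempt timeout. The retry delay is lowered as well so the retries fit inside the shorter total timeout.

```csharp
// Sketch only: assumes the Microsoft.Extensions.Http.Resilience and
// Microsoft.Extensions.Hosting NuGet packages. Values are illustrative.
using System;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Hosting;

var builder = Host.CreateApplicationBuilder(args);

// Applies to every HttpClient created through IHttpClientFactory.
builder.Services.ConfigureHttpClientDefaults(http =>
{
    http.AddStandardResilienceHandler(options =>
    {
        options.Retry.MaxRetryAttempts = 2;                              // default is 3
        options.Retry.Delay = TimeSpan.FromMilliseconds(200);            // base delay for backoff
        options.AttemptTimeout.Timeout = TimeSpan.FromSeconds(2);        // per attempt
        options.TotalRequestTimeout.Timeout = TimeSpan.FromSeconds(5);   // across all attempts
        options.CircuitBreaker.BreakDuration = TimeSpan.FromSeconds(15); // time spent open
    });
});

using var host = builder.Build();
```

In an Aspire solution the same lambda would go inside the ServiceDefaults project's AddServiceDefaults method rather than in a standalone Program.cs.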
// Configure retry specifically for the FHIR patient service
// INR checks must complete or fail fast, with no long retry loops
// Note: if ServiceDefaults already adds the standard handler to every client,
// this handler runs in addition to it rather than replacing it.
builder.Services.AddHttpClient<IFhirPatientClient, FhirPatientClient>(
        client => client.BaseAddress = new Uri("https+http://patient-service"))
    .AddResilienceHandler("fhir-patient", resilienceBuilder =>
    {
        // Strategies run in the order they are added: the first one added is outermost.
        // Retry: up to 2 retries on transient errors (fewer than the default, to fail fast)
        resilienceBuilder.AddRetry(new HttpRetryStrategyOptions
        {
            MaxRetryAttempts = 2,
            Delay = TimeSpan.FromMilliseconds(200),
            BackoffType = DelayBackoffType.Exponential,
            UseJitter = true,
            ShouldHandle = args =>
                ValueTask.FromResult(
                    args.Outcome.Exception is HttpRequestException ||
                    (args.Outcome.Result?.StatusCode is
                        HttpStatusCode.RequestTimeout or
                        HttpStatusCode.TooManyRequests or
                        HttpStatusCode.ServiceUnavailable))
        });
        // Circuit breaker: opens when at least 50% of requests fail within 30 seconds,
        // once at least 5 requests have been seen in that window
        resilienceBuilder.AddCircuitBreaker(new HttpCircuitBreakerStrategyOptions
        {
            FailureRatio = 0.5, // open when 50% of requests fail
            SamplingDuration = TimeSpan.FromSeconds(30),
            MinimumThroughput = 5,
            BreakDuration = TimeSpan.FromSeconds(15)
        });
        // Attempt timeout: 3 seconds per FHIR lookup attempt (clinical SLA).
        // Added last, so it is innermost: it bounds each attempt, and the circuit
        // breaker counts the resulting timeouts as failures. It is not a total budget
        // across retries; for that, add another timeout before AddRetry.
        resilienceBuilder.AddTimeout(TimeSpan.FromSeconds(3));
    });

Circuit Breaker Behaviour
Closed state (normal):
→ All requests pass through
→ Failure rate is tracked over SamplingDuration
Open state (tripped):
→ All requests fail immediately with BrokenCircuitException
→ No requests reach the downstream service
→ Remains open for BreakDuration
Half-open state (testing):
→ After BreakDuration, one probe request is allowed through
→ If probe succeeds: circuit closes (back to normal)
→ If probe fails: circuit opens again for BreakDuration
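The closed-to-open transition described above can be watched outside an HTTP pipeline. This is a minimal console sketch using Polly directly, with the same ratio, throughput and duration settings as the FHIR example. The always-failing callPatientService delegate is a hypothetical stand-in for the downstream call. Once enough failures have been recorded, later calls should be rejected with BrokenCircuitException without invoking the delegate at all.

```csharp
// Sketch only: assumes the Polly.Core NuGet package (Polly v8).
using System;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;
using Polly;
using Polly.CircuitBreaker;

var pipeline = new ResiliencePipelineBuilder()
    .AddCircuitBreaker(new CircuitBreakerStrategyOptions
    {
        FailureRatio = 0.5,
        MinimumThroughput = 5,
        SamplingDuration = TimeSpan.FromSeconds(30),
        BreakDuration = TimeSpan.FromSeconds(15),
        // Called when the circuit moves from closed (or half-open) to open.
        OnOpened = args =>
        {
            Console.WriteLine($"circuit opened for {args.BreakDuration}");
            return default;
        }
    })
    .Build();

// Hypothetical downstream call that always fails.
Func<CancellationToken, ValueTask> callPatientService =
    _ => throw new HttpRequestException("patient-service unreachable");

for (var i = 1; i <= 7; i++)
{
    try
    {
        await pipeline.ExecuteAsync(callPatientService);
    }
    catch (HttpRequestException)
    {
        // The circuit was closed, so the call reached the (failing) downstream.
        Console.WriteLine($"call {i}: reached the downstream and failed");
    }
    catch (BrokenCircuitException)
    {
        // The circuit is open, so the delegate was never invoked.
        Console.WriteLine($"call {i}: rejected by the open circuit");
    }
}
```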
Clinical impact:
→ If FHIR patient service is down, the circuit opens once at least 5 requests have been seen in a 30-second window and at least half of them failed
→ For 15 seconds, prescription lookups fail fast (no timeout delays)
→ After 15 seconds, one probe request tests if FHIR is back up
→ Prevents cascading failure: PrescriptionService doesn't bog down
waiting for timeouts from a dead PatientService
Monitor circuit state:
→ Use OpenTelemetry metrics to track circuit breaker state changes
→ Alert when the circuit stays open, or keeps re-opening, for more than 60 seconds (the downstream service is seriously degraded)

Hedging for Low-Latency Clinical Reads
// Hedging: send a second request if the first doesn't respond within a threshold
// Use when: request latency must be low and occasional slowness is unacceptable
// Example: INR check must complete in under 500ms for clinical workflow
builder.Services.AddHttpClient<ILabResultsClient, LabResultsClient>(...)
    .AddResilienceHandler("lab-results-hedged", resilienceBuilder =>
    {
        // Hard limit for the whole call; added first so it wraps every hedged attempt
        resilienceBuilder.AddTimeout(TimeSpan.FromMilliseconds(800));
        resilienceBuilder.AddHedging(new HttpHedgingStrategyOptions
        {
            // Hedged attempts are in addition to the original request,
            // so 1 means at most two requests in flight
            MaxHedgedAttempts = 1,
            Delay = TimeSpan.FromMilliseconds(300),
            // After 300ms with no response, send a second parallel request
            // First response to come back wins; other is cancelled
        });
    });
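Microsoft.Extensions.Http.Resilience also ships AddStandardHedgingHandler, which is provided specifically for HTTP hedging and may be a safer starting point than a hand-built pipeline. The sketch below mirrors the timings used above. The "lab-results" client name and the 500ms per-endpoint timeout are assumptions for illustration. As with the standard resilience handler, the options are validated at startup, so the per-endpoint timeout is kept below the total timeout.

```csharp
// Sketch only: assumes the Microsoft.Extensions.Http.Resilience and
// Microsoft.Extensions.Hosting NuGet packages. Client name and values are illustrative.
using System;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Hosting;
using Microsoft.Extensions.Http.Resilience;

var builder = Host.CreateApplicationBuilder(args);

builder.Services.AddHttpClient("lab-results") // hypothetical named client
    .AddStandardHedgingHandler()
    .Configure(options =>
    {
        // Hard limit for the whole call, including hedged attempts.
        options.TotalRequestTimeout.Timeout = TimeSpan.FromMilliseconds(800);
        // One extra attempt, started if the first has not answered after 300ms.
        options.Hedging.MaxHedgedAttempts = 1;
        options.Hedging.Delay = TimeSpan.FromMilliseconds(300);
        // Timeout applied to each individual attempt.
        options.Endpoint.Timeout.Timeout = TimeSpan.FromMilliseconds(500);
    });

using var host = builder.Build();
```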
// Trade-off:
// + Median (p50) latency is largely unchanged; tail (p99) latency typically drops
// - Up to double the request load on the LabResults service during slow periods
// Use hedging only for read-only, idempotent endpoints

Rate Limiting Outbound Calls
// Prevent your service from overwhelming a downstream with too many concurrent calls
builder.Services.AddHttpClient<IMhraReportingClient, MhraReportingClient>(...)
    .AddResilienceHandler("mhra-rate-limited", resilienceBuilder =>
    {
        // Max 10 concurrent calls to MHRA (they have rate limits in their SLA);
        // up to 50 more wait in a queue, and calls beyond that are rejected
        // with RateLimiterRejectedException
        resilienceBuilder.AddConcurrencyLimiter(new ConcurrencyLimiterOptions
        {
            PermitLimit = 10,
            QueueLimit = 50
        });
        resilienceBuilder.AddRetry(new HttpRetryStrategyOptions
        {
            MaxRetryAttempts = 3,
            // Retry on 429 Too Many Requests from MHRA:
            ShouldHandle = args =>
                ValueTask.FromResult(
                    args.Outcome.Result?.StatusCode == HttpStatusCode.TooManyRequests)
        });
    });

Production issue I've seen: A prescription service was calling a patient demographics service synchronously. The demographics service deployed a slow migration and response times went from 50ms to 8 seconds. Without a circuit breaker, every prescription request held a thread for 8 seconds. Within 2 minutes, the prescription service's thread pool was exhausted and it stopped responding to all requests, not just the prescription ones. The entire clinical platform went down because of a slow migration in one service. Adding a circuit breaker with a 3-second timeout would have isolated the failure: prescriptions would have failed with "patient service unavailable" (a graceful degradation) while ward lookup, lab results, and billing continued unaffected.
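On the calling side, that graceful degradation could look something like the sketch below. It assumes the HttpClient passed in already has a circuit breaker and timeout attached through a resilience handler. PatientDemographics, PatientDemographicsLookup and the /patients/{id} path are hypothetical names for illustration.

```csharp
// Sketch only: assumes the Polly.Core package (pulled in by
// Microsoft.Extensions.Http.Resilience). Type names and the URL path are hypothetical.
using System;
using System.Net.Http;
using System.Net.Http.Json;
using System.Threading;
using System.Threading.Tasks;
using Polly.CircuitBreaker;
using Polly.Timeout;

public sealed record PatientDemographics(string Id, string FullName);

public sealed class PatientDemographicsLookup
{
    private readonly HttpClient _http;

    public PatientDemographicsLookup(HttpClient http) => _http = http;

    // Returns null when the patient service is unavailable, so the caller can
    // answer "patient service unavailable" instead of waiting on a slow dependency.
    public async Task<PatientDemographics?> TryGetAsync(
        string patientId, CancellationToken ct = default)
    {
        try
        {
            return await _http.GetFromJsonAsync<PatientDemographics>(
                $"/patients/{Uri.EscapeDataString(patientId)}", ct);
        }
        catch (BrokenCircuitException)
        {
            return null; // circuit is open: failed fast, downstream was not called
        }
        catch (TimeoutRejectedException)
        {
            return null; // a resilience timeout fired before a response arrived
        }
    }
}
```

Returning null keeps the sketch short. A real service would more likely map these exceptions to a 503 response or a domain-specific result type.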
Key Takeaway
Aspire's AddServiceDefaults wires a standard resilience pipeline (retry, circuit breaker, timeouts) for all HTTP clients automatically. Customise per client with AddResilienceHandler for clinical SLAs: fewer retries for time-critical endpoints, shorter timeouts for real-time clinical workflows. Circuit breakers prevent cascading failures: when a downstream service is slow or failing, the circuit opens and callers fail fast instead of exhausting thread pools. Use hedging for read-only, latency-critical paths. Limit outbound calls to external services with known rate limits (MHRA, FHIR registries).