Resilience in .NET Aspire: Retries and Circuit Breakers

What AddServiceDefaults Provides

// In each service's Program.cs:
builder.AddServiceDefaults();

// What this wires up automatically:
//  → OpenTelemetry tracing, metrics, and logging
//  → Default health checks (the template's MapDefaultEndpoints() exposes them at
//    /health and /alive, in Development only by default)
//  → Service discovery resolver for HttpClient
//  → Default resilience pipeline for every HttpClient created through IHttpClientFactory

// AddServiceDefaults lives in the ServiceDefaults shared project generated by the Aspire templates.
// Customise it to your needs: it's your code, not a library.
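
For orientation, the HttpClient portion of the template-generated extension looks roughly like the sketch below. It is trimmed to the calls relevant here (the real file also configures OpenTelemetry and health checks), the class and method names are changed so they do not clash with the generated ones, and the exact contents vary between Aspire template versions, so check the Extensions.cs in your own ServiceDefaults project.

```csharp
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Hosting;

public static class ServiceDefaultsSketch
{
    // Trimmed approximation of the HttpClient part of AddServiceDefaults.
    // Requires the Microsoft.Extensions.Http.Resilience and
    // Microsoft.Extensions.ServiceDiscovery packages.
    public static IHostApplicationBuilder AddHttpClientDefaultsSketch(
        this IHostApplicationBuilder builder)
    {
        // Registers the resolver that understands "https+http://service-name" URIs
        builder.Services.AddServiceDiscovery();

        // Applies to every HttpClient created through IHttpClientFactory
        builder.Services.ConfigureHttpClientDefaults(http =>
        {
            http.AddStandardResilienceHandler(); // default resilience pipeline
            http.AddServiceDiscovery();          // resolve logical service names
        });

        return builder;
    }
}
```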

Default Resilience Pipeline

// Aspire's default resilience pipeline (from ServiceDefaults):
// NuGet: Microsoft.Extensions.Http.Resilience

// ServiceDefaults applies it to every client through ConfigureHttpClientDefaults.
// The per-client equivalent is:
builder.Services.AddHttpClient<IPatientServiceClient, PatientServiceClient>(...)
    .AddStandardResilienceHandler();

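// StandardResilienceHandler includes (in order, outermost first):
//  1. RateLimiter:         max 1,000 concurrent requests, no queue
//  2. TotalRequestTimeout: 30 seconds across all attempts
//  3. Retry:               up to 3 retries on transient failures (408, 429, 5xx,
//                          HttpRequestException, attempt timeouts),
//                          exponential backoff with jitter, 2-second base delay
//  4. CircuitBreaker:      opens when at least 10% of requests fail within a
//                          30-second window (minimum 100 requests); stays open 5 seconds
//  5. AttemptTimeout:      10 seconds per individual attempt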

// These defaults are sensible for most services.
// Customise for specific clinical endpoints that have tighter SLAs.
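
When an endpoint only needs tighter numbers rather than a different set of strategies, the standard handler can be tuned in place. The following is a minimal sketch: the client name "patient-lookup" and the specific values are assumptions chosen for illustration, not recommendations from the original text.

```csharp
using System;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Hosting;

var builder = Host.CreateApplicationBuilder(args);

// "patient-lookup" is a hypothetical named client used only for this sketch
builder.Services.AddHttpClient("patient-lookup")
    .AddStandardResilienceHandler(options =>
    {
        // Fewer retries than the default of 3
        options.Retry.MaxRetryAttempts = 2;

        // Per-attempt limit; must stay below the total timeout
        options.AttemptTimeout.Timeout = TimeSpan.FromSeconds(2);

        // Upper bound across all attempts and backoff delays
        options.TotalRequestTimeout.Timeout = TimeSpan.FromSeconds(5);
    });

using var host = builder.Build();
```

The options are validated when the pipeline is built, for example the total timeout has to be longer than the attempt timeout. A handler added per client is applied in addition to anything ConfigureHttpClientDefaults already attached, so it is worth verifying that the client does not end up with two stacked pipelines.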

Customising Retry for Clinical Services

// Configure the pipeline specifically for the FHIR patient service.
// INR checks must complete or fail fast: no long retry loops.
// Strategies run in the order they are added, outermost first.

// At the top of Program.cs:
using System.Net;
using Microsoft.Extensions.Http.Resilience;
using Polly;

builder.Services.AddHttpClient<IFhirPatientClient, FhirPatientClient>(
    client => client.BaseAddress = new Uri("https+http://patient-service"))
    .AddResilienceHandler("fhir-patient", resilienceBuilder =>
    {
        // Total timeout: 3 seconds for FHIR lookups (clinical SLA).
        // Added first so it is outermost and caps all attempts plus backoff delays;
        // added last, it would only limit each individual attempt.
        resilienceBuilder.AddTimeout(TimeSpan.FromSeconds(3));

        // Retry: up to 2 retries on transient errors (fewer than default, to fail fast)
        resilienceBuilder.AddRetry(new HttpRetryStrategyOptions
        {
            MaxRetryAttempts  = 2,
            Delay             = TimeSpan.FromMilliseconds(200),
            BackoffType       = DelayBackoffType.Exponential,
            UseJitter         = true,
            // Replaces the default predicate: only these cases are retried
            ShouldHandle      = args =>
                ValueTask.FromResult(
                    args.Outcome.Exception is HttpRequestException ||
                    (args.Outcome.Result?.StatusCode is
                        HttpStatusCode.RequestTimeout or
                        HttpStatusCode.TooManyRequests or
                        HttpStatusCode.ServiceUnavailable))
        });

        // Circuit breaker: opens when at least 50% of calls fail within 30 seconds,
        // once at least 5 calls have been made in that window
        resilienceBuilder.AddCircuitBreaker(new HttpCircuitBreakerStrategyOptions
        {
            FailureRatio        = 0.5,       // open when 50% of requests fail
            SamplingDuration    = TimeSpan.FromSeconds(30),
            MinimumThroughput   = 5,         // ignore windows with fewer than 5 calls
            BreakDuration       = TimeSpan.FromSeconds(15)
        });
    });

Circuit Breaker Behaviour

Closed state (normal):
  → All requests pass through
  → Failure rate is tracked over SamplingDuration

Open state (tripped):
  → All requests fail immediately with BrokenCircuitException
  → No requests reach the downstream service
  → Remains open for BreakDuration

Half-open state (testing):
  → After BreakDuration, one probe request is allowed through
  → If probe succeeds: circuit closes (back to normal)
  → If probe fails: circuit opens again for BreakDuration

Clinical impact:
  → If FHIR patient service is down, the circuit opens once at least 5 calls
    have been made in a 30-second window and half or more of them failed
  → For 15 seconds, prescription lookups fail fast (no timeout delays)
  → After 15 seconds, one probe request tests if FHIR is back up
  → Prevents cascading failure: PrescriptionService doesn't bog down
    waiting for timeouts from a dead PatientService
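
To turn the fast failure into graceful degradation, the calling code has to catch the exceptions the pipeline throws. The sketch below assumes a hypothetical PatientSummary DTO, a hypothetical "patients/{id}" route, and an HttpClient whose BaseAddress and resilience handler are configured elsewhere; none of these names come from the original article.

```csharp
using System;
using System.Net.Http;
using System.Net.Http.Json;
using System.Threading;
using System.Threading.Tasks;
using Polly.CircuitBreaker;
using Polly.Timeout;

// Hypothetical DTO, for illustration only
public sealed record PatientSummary(string Id, string FamilyName);

public sealed class PatientLookup(HttpClient http)
{
    // Returns null when the patient service is unavailable, so the caller can
    // report "patient service unavailable" instead of waiting on a dead dependency.
    public async Task<PatientSummary?> TryGetAsync(
        string patientId, CancellationToken ct = default)
    {
        try
        {
            // Hypothetical route, resolved against the client's BaseAddress
            return await http.GetFromJsonAsync<PatientSummary>(
                $"patients/{Uri.EscapeDataString(patientId)}", ct);
        }
        catch (BrokenCircuitException)
        {
            // Circuit is open: the call was rejected without reaching the service
            return null;
        }
        catch (TimeoutRejectedException)
        {
            // A Polly timeout strategy in the pipeline cancelled the call
            return null;
        }
    }
}
```

Other failures, such as an HttpRequestException for a non-success status code, are deliberately left to propagate in this sketch.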

Monitor circuit state:
  → Use OpenTelemetry metrics to track circuit breaker state changes
  → Alert when the circuit keeps re-opening for more than 60 seconds
    (the downstream service is seriously degraded)
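
Alongside metrics, state changes can be surfaced through the circuit breaker's own callbacks. This is a minimal sketch that logs each transition; the client name "fhir-patient-monitored" and the logger category are assumptions made for the example, and the thresholds simply repeat the values used earlier.

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Hosting;
using Microsoft.Extensions.Http.Resilience;
using Microsoft.Extensions.Logging;
using Polly;

var builder = Host.CreateApplicationBuilder(args);

builder.Services.AddHttpClient("fhir-patient-monitored") // hypothetical client name
    .AddResilienceHandler("fhir-patient-monitored", (pipeline, context) =>
    {
        // The context overload exposes the service provider, which gives access to logging
        var logger = context.ServiceProvider
            .GetRequiredService<ILoggerFactory>()
            .CreateLogger("FhirPatientCircuit"); // hypothetical category name

        pipeline.AddCircuitBreaker(new HttpCircuitBreakerStrategyOptions
        {
            FailureRatio      = 0.5,
            SamplingDuration  = TimeSpan.FromSeconds(30),
            MinimumThroughput = 5,
            BreakDuration     = TimeSpan.FromSeconds(15),
            OnOpened = opened =>
            {
                logger.LogWarning(
                    "FHIR circuit opened for {BreakDuration}", opened.BreakDuration);
                return ValueTask.CompletedTask;
            },
            OnHalfOpened = _ =>
            {
                logger.LogInformation("FHIR circuit half-open: probing");
                return ValueTask.CompletedTask;
            },
            OnClosed = _ =>
            {
                logger.LogInformation("FHIR circuit closed: back to normal");
                return ValueTask.CompletedTask;
            }
        });
    });

using var host = builder.Build();
```

Counting "opened" log entries over a sliding window is one simple way to drive the alert described above.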

Hedging for Low-Latency Clinical Reads

// Hedging: send a second request if the first doesn't respond within a threshold
// Use when: request latency must be low and occasional slowness is unacceptable

// Example: INR check should normally complete in under 500ms, with an 800ms hard limit
builder.Services.AddHttpClient<ILabResultsClient, LabResultsClient>(...)
    .AddResilienceHandler("lab-results-hedged", pipeline =>
    {
        // Hard limit for the whole call. Added first so it is outermost and
        // wraps every hedged attempt; added last, it would apply per attempt.
        pipeline.AddTimeout(TimeSpan.FromMilliseconds(800));

        pipeline.AddHedging(new HttpHedgingStrategyOptions
        {
            // Counts extra attempts on top of the original request:
            // 1 means at most two requests in flight
            MaxHedgedAttempts = 1,
            Delay             = TimeSpan.FromMilliseconds(300),
            // After 300ms with no response, send a second parallel request
            // First successful response wins; the other is cancelled
        });
    });

// Trade-off:
// + p50 latency unchanged, p99 latency usually much lower
// - Up to double the request load on the LabResults service during slow periods
// Use hedging only for read-only, idempotent endpoints
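
Microsoft.Extensions.Http.Resilience also ships a prebuilt hedging pipeline, AddStandardHedgingHandler, which bundles a total timeout, hedging, and per-endpoint rate limiter, circuit breaker, and attempt timeout. The sketch below shows how the same latency budget might be expressed with it; the client name and the 500ms attempt timeout are assumptions for illustration, and the option names should be checked against the package version in use.

```csharp
using System;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Hosting;
using Microsoft.Extensions.Http.Resilience;

var builder = Host.CreateApplicationBuilder(args);

builder.Services.AddHttpClient("lab-results-standard-hedging") // hypothetical client name
    .AddStandardHedgingHandler()
    .Configure(options =>
    {
        // One extra attempt on top of the original request
        options.Hedging.MaxHedgedAttempts = 1;

        // Start the hedged attempt after 300ms without a response
        options.Hedging.Delay = TimeSpan.FromMilliseconds(300);

        // Per-attempt limit (assumed value); must not exceed the total timeout
        options.Endpoint.Timeout.Timeout = TimeSpan.FromMilliseconds(500);

        // Hard limit across all attempts
        options.TotalRequestTimeout.Timeout = TimeSpan.FromMilliseconds(800);
    });

using var host = builder.Build();
```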

Rate Limiting Outbound Calls

// Prevent your service from overwhelming a downstream with too many concurrent calls

builder.Services.AddHttpClient<IMhraReportingClient, MhraReportingClient>(...)
    .AddResilienceHandler("mhra-rate-limited", pipeline =>
    {
        // Max 10 concurrent calls to MHRA (they have rate limits in their SLA).
        // Up to 50 further calls wait in the queue; beyond that, calls are
        // rejected immediately with RateLimiterRejectedException.
        pipeline.AddConcurrencyLimiter(new ConcurrencyLimiterOptions
        {
            PermitLimit = 10,
            QueueLimit  = 50
        });

        pipeline.AddRetry(new HttpRetryStrategyOptions
        {
            MaxRetryAttempts = 3,
            // Retry only on 429 Too Many Requests from MHRA
            // (this replaces the default transient-error predicate):
            ShouldHandle = args =>
                ValueTask.FromResult(
                    args.Outcome.Result?.StatusCode == HttpStatusCode.TooManyRequests)
        });
    });
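
When both the permits and the queue are full, the limiter rejects the call locally instead of sending it. A caller can treat that as a signal to requeue the work. The sketch below assumes a hypothetical AdverseEventReport DTO and a hypothetical "reports" route; neither is taken from the original article.

```csharp
using System;
using System.Net.Http;
using System.Net.Http.Json;
using System.Threading;
using System.Threading.Tasks;
using Polly.RateLimiting;

// Hypothetical payload, for illustration only
public sealed record AdverseEventReport(string ReportId, string Summary);

public sealed class MhraReportSender(HttpClient http)
{
    // Returns false when the local concurrency limiter rejected the call,
    // so the caller can put the report back on its own queue and try later.
    public async Task<bool> TrySendAsync(
        AdverseEventReport report, CancellationToken ct = default)
    {
        try
        {
            // Hypothetical route, resolved against the client's BaseAddress
            using var response = await http.PostAsJsonAsync("reports", report, ct);
            response.EnsureSuccessStatusCode();
            return true;
        }
        catch (RateLimiterRejectedException)
        {
            // No permit and no queue slot available: nothing was sent to MHRA
            return false;
        }
    }
}
```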

Production issue I've seen: A prescription service was calling a patient demographics service synchronously. The demographics service deployed a slow migration, and its response times went from 50ms to 8 seconds. Without a timeout or circuit breaker, every prescription request held a thread for 8 seconds. Within 2 minutes the prescription service's thread pool was exhausted and it stopped responding to all requests, not just the ones that needed demographics data. The entire clinical platform went down because of a slow migration in one service. A circuit breaker combined with a 3-second timeout would have isolated the failure: prescriptions would have failed with "patient service unavailable" (a graceful degradation) while ward lookup, lab results, and billing continued unaffected.

Key Takeaway

Aspire's AddServiceDefaults wires a standard resilience pipeline (rate limiter, timeouts, retry, circuit breaker) into all HTTP clients automatically. Customise per-client with AddResilienceHandler for clinical SLAs: fewer retries for time-critical endpoints, shorter timeouts for real-time clinical workflows. Circuit breakers prevent cascading failures: when a dependency is slow, the circuit opens and callers fail fast instead of exhausting their thread pools. Use hedging for read-only, latency-critical paths. Limit outbound concurrency to external services with known rate limits (MHRA, FHIR registries).