Aspire Resilience — Polly Retry and Circuit Breaker Patterns

Add resilience to .NET Aspire services: Polly retry policies, circuit breakers, hedging, rate limiters, and how Aspire's AddServiceDefaults wires resilience for all HTTP clients automatically.

Asma Hafeez Khan · May 16, 2026 · 5 min read
.NET Aspire · Resilience · Polly · .NET · Microservices

What AddServiceDefaults Provides

C#
// In each service's Program.cs:
builder.AddServiceDefaults();

// What this wires up automatically:
//  → OpenTelemetry tracing, metrics, and logging
//  → Health checks, exposed once app.MapDefaultEndpoints() is called:
//      /health (readiness) and /alive (liveness); the template maps them in Development only
//  → Service discovery for every HttpClient created through IHttpClientFactory
//  → The standard resilience pipeline for those same HttpClients

// AddServiceDefaults lives in the ServiceDefaults shared project generated by the Aspire templates.
// Customise it to your needs: it's your code, not a library.

Default Resilience Pipeline

C#
// Aspire's default resilience pipeline (from ServiceDefaults):
// NuGet: Microsoft.Extensions.Http.Resilience

// The template applies it to every HttpClient created through IHttpClientFactory:
builder.Services.ConfigureHttpClientDefaults(http =>
{
    http.AddStandardResilienceHandler();
    http.AddServiceDiscovery();
});

// AddStandardResilienceHandler chains five strategies, outermost first (library defaults):
//  1. RateLimiter:         concurrency limiter with 1,000 permits
//  2. TotalRequestTimeout: 30 seconds across all attempts
//  3. Retry:               up to 3 retries on transient failures
//                          (HttpRequestException, 408, 429, 5xx),
//                          exponential backoff with jitter, 2-second base delay
//  4. CircuitBreaker:      opens when at least 10% of requests fail within a 30-second
//                          window (minimum 100 requests); stays open for 5 seconds
//  5. AttemptTimeout:      10 seconds per individual attempt

// These defaults are sensible for most services.
// Customise them for specific clinical endpoints that have tighter SLAs.
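
The defaults listed above can be tuned without replacing the whole pipeline. The following is a minimal sketch, assuming a named client called "patient-service" (a hypothetical name) and illustrative timeout values; it uses the `AddStandardResilienceHandler` overload that takes an options callback. It is written as a standalone host, so it does not include the handler that AddServiceDefaults registers for every client.

```csharp
using System;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Hosting;
using Microsoft.Extensions.Http.Resilience;

var builder = Host.CreateApplicationBuilder(args);

// "patient-service" is a hypothetical client name used only for this sketch.
builder.Services.AddHttpClient("patient-service")
    .AddStandardResilienceHandler(options =>
    {
        // Illustrative values, not recommendations.
        options.AttemptTimeout.Timeout      = TimeSpan.FromSeconds(2);
        options.TotalRequestTimeout.Timeout = TimeSpan.FromSeconds(5);
        options.Retry.MaxRetryAttempts      = 2;

        // Options validation expects the total timeout to exceed the attempt timeout
        // and the sampling window to be at least twice the attempt timeout.
        options.CircuitBreaker.SamplingDuration = TimeSpan.FromSeconds(30);
    });

using var host = builder.Build();
```

Keeping the five-strategy shape and only adjusting numbers is usually the smaller change; a fully custom pipeline is worth it when the strategy order or the set of strategies itself needs to differ.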

Customising Retry for Clinical Services

C#
// Configure retry specifically for the FHIR patient service
// INR checks must complete or fail fast, with no long retry loops
// Needs: using System.Net; using Microsoft.Extensions.Http.Resilience; using Polly;
// Note: this handler is added on top of the standard handler that AddServiceDefaults
// applies to every client; drop the default for this client if you don't want both.

builder.Services.AddHttpClient<IFhirPatientClient, FhirPatientClient>(
    client => client.BaseAddress = new Uri("https+http://patient-service"))
    .AddResilienceHandler("fhir-patient", resilienceBuilder =>
    {
        // Strategies run in the order they are added; the first one is the outermost.
        // Total timeout: 3 seconds for FHIR lookups (clinical SLA), covering every retry
        resilienceBuilder.AddTimeout(TimeSpan.FromSeconds(3));

        // Retry: up to 2 retries on transient errors (fewer than default, to fail fast)
        resilienceBuilder.AddRetry(new HttpRetryStrategyOptions
        {
            MaxRetryAttempts  = 2,
            Delay             = TimeSpan.FromMilliseconds(200),
            BackoffType       = DelayBackoffType.Exponential,
            UseJitter         = true,
            ShouldHandle      = args =>
                ValueTask.FromResult(
                    args.Outcome.Exception is HttpRequestException ||
                    (args.Outcome.Result?.StatusCode is
                        HttpStatusCode.RequestTimeout or
                        HttpStatusCode.TooManyRequests or
                        HttpStatusCode.ServiceUnavailable))
        });

        // Circuit breaker: opens when at least 50% of calls fail within 30 seconds
        // (evaluated only once 5 or more calls fall inside that window)
        resilienceBuilder.AddCircuitBreaker(new HttpCircuitBreakerStrategyOptions
        {
            FailureRatio        = 0.5,       // open when 50% of requests fail
            SamplingDuration    = TimeSpan.FromSeconds(30),
            MinimumThroughput   = 5,
            BreakDuration       = TimeSpan.FromSeconds(15)
        });
    });
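
To see why two retries fit a 3-second budget, it helps to work through the backoff arithmetic. The sketch below assumes the nominal exponential schedule `baseDelay * 2^retryIndex` and ignores jitter, so real delays will differ somewhat; `NominalDelay` is a helper written for this illustration, not a library call.

```csharp
using System;

// Nominal exponential backoff: baseDelay * 2^retryIndex (jitter ignored for clarity).
TimeSpan NominalDelay(TimeSpan baseDelay, int retryIndex) =>
    baseDelay * Math.Pow(2, retryIndex);

var baseDelay = TimeSpan.FromMilliseconds(200);   // Delay from the retry options
const int maxRetryAttempts = 2;                   // MaxRetryAttempts from the retry options
var totalBudget = TimeSpan.FromSeconds(3);        // the 3-second clinical SLA

var totalBackoff = TimeSpan.Zero;
for (var retry = 0; retry < maxRetryAttempts; retry++)
{
    var delay = NominalDelay(baseDelay, retry);
    totalBackoff += delay;
    Console.WriteLine($"retry {retry + 1}: wait ~{delay.TotalMilliseconds} ms");
}

// One original attempt plus the retries share whatever the backoff leaves over.
var attempts = maxRetryAttempts + 1;
var perAttempt = (totalBudget - totalBackoff) / attempts;
Console.WriteLine($"total backoff: {totalBackoff.TotalMilliseconds} ms");
Console.WriteLine($"average time left per attempt: {perAttempt.TotalMilliseconds} ms");
```

With these numbers the nominal backoff adds up to 600 ms, leaving about 800 ms on average for each of the three attempts. A third retry would add another 800 ms of waiting, which is why the retry count is kept low for this endpoint.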

Circuit Breaker Behaviour

Closed state (normal):
  → All requests pass through
  → Failure rate is tracked over SamplingDuration

Open state (tripped):
  → All requests fail immediately with BrokenCircuitException
  → No requests reach the downstream service
  → Remains open for BreakDuration

Half-open state (testing):
  → After BreakDuration, one probe request is allowed through
  → If probe succeeds: circuit closes (back to normal)
  → If probe fails: circuit opens again for BreakDuration

Clinical impact:
  → If FHIR patient service is down, circuit opens once at least 5 calls in 30 seconds are sampled and 50% or more failed
  → For 15 seconds, prescription lookups fail fast (no timeout delays)
  → After 15 seconds, one probe request tests if FHIR is back up
  → Prevents cascading failure: PrescriptionService doesn't bog down
    waiting for timeouts from a dead PatientService

Monitor circuit state:
  → Use OpenTelemetry metrics to track circuit breaker state changes
  → Alert when the circuit stays open, or keeps re-opening, for more than 60 seconds (the downstream service is seriously degraded)
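
The fail-fast behaviour described above can be tried without a real outage. This is a minimal sketch using Polly directly rather than an HttpClient: `CircuitBreakerManualControl` forces the circuit open, `LookupPatientAsync` is a hypothetical stand-in for the real FHIR call, and the fallback string is only an illustration of graceful degradation.

```csharp
using System;
using System.Threading.Tasks;
using Polly;
using Polly.CircuitBreaker;

// Manual control lets the sketch open and close the circuit on demand.
var manualControl = new CircuitBreakerManualControl();

ResiliencePipeline<string> pipeline = new ResiliencePipelineBuilder<string>()
    .AddCircuitBreaker(new CircuitBreakerStrategyOptions<string>
    {
        FailureRatio      = 0.5,
        SamplingDuration  = TimeSpan.FromSeconds(30),
        MinimumThroughput = 5,
        BreakDuration     = TimeSpan.FromSeconds(15),
        ManualControl     = manualControl,
        // State-change callback: a natural place for a log line or a custom metric.
        OnOpened = args =>
        {
            Console.WriteLine($"circuit opened (manual: {args.IsManual})");
            return ValueTask.CompletedTask;
        }
    })
    .Build();

// Hypothetical lookup: the lambda stands in for the real HTTP call.
async Task<string> LookupPatientAsync(string id)
{
    try
    {
        return await pipeline.ExecuteAsync(ct => ValueTask.FromResult($"patient:{id}"));
    }
    catch (BrokenCircuitException)
    {
        // Open circuit: return a degraded answer immediately instead of waiting.
        return "patient service unavailable";
    }
}

var before = await LookupPatientAsync("123");
await manualControl.IsolateAsync();            // force the circuit open
var during = await LookupPatientAsync("123");
await manualControl.CloseAsync();              // close it again
var after = await LookupPatientAsync("123");

Console.WriteLine($"{before} | {during} | {after}");
```

The call made while the circuit is open never reaches the lambda and takes the fallback path. In a typed HttpClient the same `catch (BrokenCircuitException)` would sit around the `HttpClient` call.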

Hedging for Low-Latency Clinical Reads

C#
// Hedging: send a second request if the first doesn't respond within a threshold
// Use when: request latency must be low and occasional slowness is unacceptable

// Example: INR check must complete in under 500ms for clinical workflow
// The lambda parameter is named "pipeline" because "builder" is already taken
// by the host builder in Program.cs (reusing it is compiler error CS0136).
builder.Services.AddHttpClient<ILabResultsClient, LabResultsClient>(...)
    .AddResilienceHandler("lab-results-hedged", pipeline =>
    {
        // Hard limit for the whole operation; added first so it wraps every hedged attempt
        pipeline.AddTimeout(TimeSpan.FromMilliseconds(800));

        pipeline.AddHedging(new HttpHedgingStrategyOptions
        {
            MaxHedgedAttempts = 1,   // extra attempts on top of the original request
            Delay             = TimeSpan.FromMilliseconds(300),
            // After 300ms with no response, send a second parallel request
            // First response to come back wins; other is cancelled
        });
    });

// Trade-off:
// + p50 latency unchanged, p99 latency dramatically reduced
// - Doubles request load on LabResults service during slow periods
// Use hedging only for read-only, idempotent endpoints

Rate Limiting Outbound Calls

C#
// Prevent your service from overwhelming a downstream with too many concurrent calls

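// The lambda parameter is named "pipeline" because "builder" is already taken
// by the host builder in Program.cs (reusing it is compiler error CS0136).
// ConcurrencyLimiterOptions comes from System.Threading.RateLimiting.
builder.Services.AddHttpClient<IMhraReportingClient, MhraReportingClient>(...)
    .AddResilienceHandler("mhra-rate-limited", pipeline =>
    {
        // Max 10 concurrent calls to MHRA (they have rate limits in their SLA);
        // up to 50 more wait in the queue, anything beyond that is rejected
        pipeline.AddConcurrencyLimiter(new ConcurrencyLimiterOptions
        {
            PermitLimit = 10,
            QueueLimit  = 50
        });

        pipeline.AddRetry(new HttpRetryStrategyOptions
        {
            MaxRetryAttempts = 3,
            // Retry on 429 Too Many Requests from MHRA:
            ShouldHandle = args =>
                ValueTask.FromResult(
                    args.Outcome.Result?.StatusCode == HttpStatusCode.TooManyRequests)
        });
    });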
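
When both the permits and the queue are full, the limiter rejects the call instead of sending it, and callers need to handle that. Here is a minimal sketch of the rejection path using Polly directly, with one permit and no queue so the effect shows up with just two calls; the `TaskCompletionSource` simulates a slow in-flight report and is purely illustrative.

```csharp
using System;
using System.Threading.Tasks;
using Polly;
using Polly.RateLimiting;

// One permit and no queue, so a second concurrent call is rejected immediately.
ResiliencePipeline pipeline = new ResiliencePipelineBuilder()
    .AddConcurrencyLimiter(1, 0)   // permit limit, queue limit
    .Build();

// Simulates a slow in-flight call that holds the only permit.
var slowCall = new TaskCompletionSource();
Task inFlight = pipeline.ExecuteAsync(async ct => await slowCall.Task).AsTask();

var rejected = false;
try
{
    // No permit is free and nothing can queue, so this call is not executed.
    await pipeline.ExecuteAsync(ct => ValueTask.CompletedTask);
}
catch (RateLimiterRejectedException)
{
    rejected = true;   // in a real service: defer the report or return a "try later" result
}

slowCall.SetResult();  // let the first call finish and release its permit
await inFlight;

Console.WriteLine($"second call rejected: {rejected}");
```

With the settings from the MHRA example (10 permits, queue of 50), the same exception would surface on the 61st concurrent call.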

Production issue I've seen: A prescription service was calling a patient demographics service synchronously. The demographics service deployed a slow migration and response times went from 50ms to 8 seconds. Without a circuit breaker or a short timeout, every prescription request held a thread for 8 seconds. Within 2 minutes, the prescription service's thread pool was exhausted and it stopped responding to all requests, not just the ones that needed demographics data. The entire clinical platform went down because of a slow migration in one service. A circuit breaker combined with a 3-second timeout would have isolated the failure: prescriptions would have failed with "patient service unavailable" (a graceful degradation) while ward lookup, lab results, and billing continued unaffected.


Key Takeaway

Aspire's AddServiceDefaults wires a standard resilience pipeline (rate limiter, retry, circuit breaker, timeouts) for all HTTP clients automatically. Customise per-client with AddResilienceHandler for clinical SLAs: fewer retries for time-critical endpoints, shorter timeouts for real-time clinical workflows. Circuit breakers prevent cascading failures: when a dependency is slow or down, the circuit opens and callers fail fast instead of exhausting their thread pools. Use hedging for read-only, latency-critical paths. Limit outbound calls to external services with known rate limits (MHRA, FHIR registries).
