Chaos Engineering in .NET: Test Resilience Before Production Does

What is Chaos Engineering?

Chaos engineering is the practice of deliberately injecting failures into your system to verify it behaves correctly when things go wrong — before production does it for you.

The question isn't if your dependencies will fail. It's whether your system handles it gracefully.

What chaos engineering tests:

Does your circuit breaker actually open?
Does your retry policy back off correctly?
Does a slow database degrade or crash the service?
Does the UI show a friendly error or an ugly stack trace?
Does monitoring alert when a dependency is degraded?

Polly Resilience Pipeline (.NET 8+)

Before injecting chaos, you need resilience policies in place. Polly 8 uses a composable pipeline:

Bash

dotnet add package Microsoft.Extensions.Http.Resilience  # Polly 8 + HttpClient
dotnet add package Polly.Extensions

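using Polly;  // DelayBackoffType

// Standard resilience pipeline for HTTP clients. Strategies run in this order:
// rate limiter -> total timeout -> retry -> circuit breaker -> attempt timeout
builder.Services.AddHttpClient<IProductService, ProductService>()
    .AddStandardResilienceHandler(options =>
    {
        // Up to 3 retries with exponential backoff, starting at 200 ms
        options.Retry.MaxRetryAttempts           = 3;
        options.Retry.Delay                      = TimeSpan.FromMilliseconds(200);
        options.Retry.BackoffType                = DelayBackoffType.Exponential;

        // Open the circuit for 30 s once half the calls in a 60 s window fail
        options.CircuitBreaker.BreakDuration     = TimeSpan.FromSeconds(30);
        options.CircuitBreaker.FailureRatio      = 0.5;
        options.CircuitBreaker.SamplingDuration  = TimeSpan.FromSeconds(60);

        // Overall budget for one logical call, retries included.
        // The per-attempt timeout keeps its default (10 s), so lower
        // options.AttemptTimeout.Timeout if a slow first try should leave room for retries.
        options.TotalRequestTimeout.Timeout      = TimeSpan.FromSeconds(10);
    });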

Simmy: Chaos Policies for Polly

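Simmy is Polly's chaos engineering toolkit: strategies that inject faults on demand. Since Polly 8.3.0 it ships inside Polly.Core (namespace Polly.Simmy), so no separate Simmy package is required:

Bash

dotnet add package Polly.Core  # 8.3.0 or later includes the chaos strategies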

Fault Injection (Exceptions)

// Inject exceptions into 20% of calls
// (usings: Polly, Polly.Simmy, Polly.Simmy.Fault)
var chaosPipeline = new ResiliencePipelineBuilder<HttpResponseMessage>()
    .AddChaosFault(new ChaosFaultStrategyOptions
    {
        InjectionRate    = 0.2,  // 20% of calls
        EnabledGenerator = _ => ValueTask.FromResult(IsChaosEnabled()),
        FaultGenerator   = _ => ValueTask.FromResult<Exception?>(
            new HttpRequestException("Chaos: simulated service unavailable"))
    })
    .Build();

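// Compose with the resilience pipeline. Order matters: strategies added first are
// outermost, so chaos goes last (closest to the real call) and the retry and
// circuit breaker in front of it see the injected faults.
// `resilience` is assumed to be an existing ResiliencePipeline<HttpResponseMessage>.
var fullPipeline = new ResiliencePipelineBuilder<HttpResponseMessage>()
    .AddPipeline(resilience)
    .AddPipeline(chaosPipeline)
    .Build();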
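
To see why the chaos strategy belongs last in the pipeline, the following minimal sketch counts how often a fault is injected during a single logical call. It assumes Polly.Core 8.3 or later; the class and method names (`CompositionDemo`, `CountInjectedFaultsAsync`) are hypothetical, and the 10 ms retry delay is chosen only to keep the demo fast.

```csharp
using System;
using System.Net;
using System.Net.Http;
using System.Threading.Tasks;
using Polly;
using Polly.Retry;
using Polly.Simmy;
using Polly.Simmy.Fault;

public static class CompositionDemo
{
    // Runs one logical call through retry + chaos and returns the number of injected faults.
    public static async Task<int> CountInjectedFaultsAsync()
    {
        var injected = 0;

        var pipeline = new ResiliencePipelineBuilder<HttpResponseMessage>()
            // Outer strategy: retries whenever the inner call throws HttpRequestException
            .AddRetry(new RetryStrategyOptions<HttpResponseMessage>
            {
                ShouldHandle     = new PredicateBuilder<HttpResponseMessage>().Handle<HttpRequestException>(),
                MaxRetryAttempts = 3,
                Delay            = TimeSpan.FromMilliseconds(10)
            })
            // Inner strategy: fails every attempt before the real callback runs
            .AddChaosFault(new ChaosFaultStrategyOptions
            {
                InjectionRate    = 1.0,
                EnabledGenerator = _ => ValueTask.FromResult(true),
                FaultGenerator   = _ => ValueTask.FromResult<Exception?>(
                    new HttpRequestException("Chaos: simulated service unavailable")),
                OnFaultInjected  = _ => { injected++; return default; }
            })
            .Build();

        try
        {
            await pipeline.ExecuteAsync(
                _ => ValueTask.FromResult(new HttpResponseMessage(HttpStatusCode.OK)));
        }
        catch (HttpRequestException)
        {
            // Expected: every attempt failed, so the last injected fault surfaces here.
        }

        return injected;
    }
}
```

With the chaos strategy innermost, each retry attempt passes through it again, so the count should be one initial attempt plus three retries. If the two strategies were swapped, the fault would be injected once, outside the retry, and the retry policy would never be exercised.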

Latency Injection

// Delay 30% of calls by 3 seconds
// (usings: Polly, Polly.Simmy, Polly.Simmy.Latency)
var latencyPipeline = new ResiliencePipelineBuilder<HttpResponseMessage>()
    .AddChaosLatency(new ChaosLatencyStrategyOptions  // latency options are not generic
    {
        InjectionRate     = 0.3,                       // 30% of calls
        Latency           = TimeSpan.FromSeconds(3),   // add a 3 second delay
        EnabledGenerator  = _ => ValueTask.FromResult(IsChaosEnabled())
    })
    .Build();

HTTP Response Code Injection

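// Return HTTP 503 instead of calling the dependency for 10% of calls
// (usings: System.Net, Polly, Polly.Simmy, Polly.Simmy.Outcomes)
var responsePipeline = new ResiliencePipelineBuilder<HttpResponseMessage>()
    .AddChaosOutcome(new ChaosOutcomeStrategyOptions<HttpResponseMessage>
    {
        InjectionRate    = 0.1,
        EnabledGenerator = _ => ValueTask.FromResult(IsChaosEnabled()),
        // The generator returns a nullable Outcome; null means "inject nothing"
        OutcomeGenerator = _ => ValueTask.FromResult<Outcome<HttpResponseMessage>?>(
            Outcome.FromResult(new HttpResponseMessage(HttpStatusCode.ServiceUnavailable)))
    })
    .Build();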

Feature-Flag Controlled Chaos

Chaos should be controllable — not always on:

public class ChaosController
{
    private volatile bool _enabled = false;
    private double _faultRate      = 0.1;
    private TimeSpan _latency      = TimeSpan.FromSeconds(1);

    public bool IsEnabled()       => _enabled;
    public double GetFaultRate()  => _faultRate;
    public TimeSpan GetLatency()  => _latency;

    public void Enable(double faultRate = 0.1, int latencyMs = 1000)
    {
        _faultRate = faultRate;
        _latency   = TimeSpan.FromMilliseconds(latencyMs);
        _enabled   = true;
    }

    public void Disable() => _enabled = false;
}

// Register as singleton
builder.Services.AddSingleton<ChaosController>();

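// Admin endpoints (secured) to enable/disable chaos
[Authorize(Roles = "Admin")]
[HttpPost("admin/chaos/enable")]
public IActionResult EnableChaos(
    [FromQuery] double faultRate = 0.1,
    [FromQuery] int latencyMs    = 1000)
{
    // Reject rates outside 0..1 and negative delays before touching the toggle
    if (faultRate is < 0 or > 1 || latencyMs < 0)
        return BadRequest("faultRate must be between 0 and 1; latencyMs must not be negative");

    _chaosController.Enable(faultRate, latencyMs);
    return Ok("Chaos enabled");
}

// Kill switch used by the rollback plan
[Authorize(Roles = "Admin")]
[HttpPost("admin/chaos/disable")]
public IActionResult DisableChaos()
{
    _chaosController.Disable();
    return Ok("Chaos disabled");
}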
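
The toggle only matters if the chaos strategies read it on every call. The sketch below shows one way to wire the controller in, using the generator callbacks on Polly's chaos options (`EnabledGenerator`, `InjectionRateGenerator`, `LatencyGenerator`) instead of fixed values. The `ControlledChaos.Build` helper is a hypothetical name, and the controller is repeated in trimmed form only so the example compiles on its own; treat it as a starting point and check the option names against your Polly version.

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;
using Polly;
using Polly.Simmy;
using Polly.Simmy.Fault;
using Polly.Simmy.Latency;

// Trimmed copy of the ChaosController shown earlier.
public class ChaosController
{
    private volatile bool _enabled;
    private double _faultRate = 0.1;
    private TimeSpan _latency = TimeSpan.FromSeconds(1);

    public bool IsEnabled()      => _enabled;
    public double GetFaultRate() => _faultRate;
    public TimeSpan GetLatency() => _latency;

    public void Enable(double faultRate = 0.1, int latencyMs = 1000)
    {
        _faultRate = faultRate;
        _latency   = TimeSpan.FromMilliseconds(latencyMs);
        _enabled   = true;
    }

    public void Disable() => _enabled = false;
}

public static class ControlledChaos
{
    // Builds a chaos pipeline whose switch, rate and delay are read from the controller per call.
    public static ResiliencePipeline<HttpResponseMessage> Build(ChaosController chaos) =>
        new ResiliencePipelineBuilder<HttpResponseMessage>()
            .AddChaosFault(new ChaosFaultStrategyOptions
            {
                EnabledGenerator       = _ => ValueTask.FromResult(chaos.IsEnabled()),
                InjectionRateGenerator = _ => ValueTask.FromResult(chaos.GetFaultRate()),
                FaultGenerator         = _ => ValueTask.FromResult<Exception?>(
                    new HttpRequestException("Chaos: simulated service unavailable"))
            })
            .AddChaosLatency(new ChaosLatencyStrategyOptions
            {
                EnabledGenerator       = _ => ValueTask.FromResult(chaos.IsEnabled()),
                InjectionRateGenerator = _ => ValueTask.FromResult(chaos.GetFaultRate()),
                LatencyGenerator       = _ => ValueTask.FromResult(chaos.GetLatency())
            })
            .Build();
}
```

Because the callbacks run on each execution, a call to `Enable` or `Disable` takes effect on the next request without rebuilding the pipeline or restarting the service.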

Integration Tests with Chaos

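using System.Diagnostics;
using System.Net;
using System.Net.Http.Json;
using Microsoft.AspNetCore.Mvc;
using Microsoft.AspNetCore.Mvc.Testing;
using Microsoft.Extensions.DependencyInjection;
using Polly;
using Polly.Simmy;
using Polly.Simmy.Fault;
using Polly.Simmy.Latency;
using Xunit;

public class OrderServiceChaosTests : IClassFixture<WebApplicationFactory<Program>>
{
    private readonly WebApplicationFactory<Program> _factory;

    // Payload shared by the tests (same shape as the inline request in the first test)
    private static readonly object ValidRequest = new
    {
        customerId = Guid.NewGuid(),
        lines      = new[] { new { productId = Guid.NewGuid(), quantity = 1, unitPrice = 10.0 } }
    };

    // xUnit injects the shared factory through the constructor
    public OrderServiceChaosTests(WebApplicationFactory<Program> factory) => _factory = factory;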

    [Fact]
    public async Task WhenProductServiceFails_OrderCreation_ReturnsError()
    {
        var client = _factory.WithWebHostBuilder(builder =>
        {
            builder.ConfigureServices(services =>
            {
                // Append a chaos handler to the IProductService client. Handlers added
                // later sit closer to the network, so a resilience handler registered
                // by the app itself should still wrap the injected faults.
                services.AddHttpClient<IProductService, ProductService>()
                    .AddResilienceHandler("chaos", pipeline =>
                    {
                        pipeline.AddChaosFault(new ChaosFaultStrategyOptions
                        {
                            InjectionRate    = 1.0,  // always fail
                            EnabledGenerator = _ => ValueTask.FromResult(true),
                            FaultGenerator   = _ => ValueTask.FromResult<Exception?>(
                                new HttpRequestException("Product service unavailable"))
                        });
                    });
            });
        }).CreateClient();

        var response = await client.PostAsJsonAsync("/api/orders", new
        {
            customerId = Guid.NewGuid(),
            lines      = new[] { new { productId = Guid.NewGuid(), quantity = 1, unitPrice = 10.0 } }
        });

        // Order creation should return 503, not 500
        Assert.Equal(HttpStatusCode.ServiceUnavailable, response.StatusCode);

        var body = await response.Content.ReadFromJsonAsync<ProblemDetails>();
        Assert.Contains("product service", body?.Detail, StringComparison.OrdinalIgnoreCase);
    }

    [Fact]
    public async Task WhenProductServiceSlow_OrderCreation_TimesOut_Gracefully()
    {
        var client = _factory.WithWebHostBuilder(builder =>
        {
            builder.ConfigureServices(services =>
            {
                services.AddHttpClient<IProductService, ProductService>()
                    .AddResilienceHandler("chaos", pipeline =>
                    {
                        pipeline.AddChaosLatency(new ChaosLatencyStrategyOptions
                        {
                            InjectionRate    = 1.0,                        // delay every call
                            Latency          = TimeSpan.FromSeconds(15),   // longer than the 10 s timeout
                            EnabledGenerator = _ => ValueTask.FromResult(true)
                        });
                    });
            });
        }).CreateClient();

        var sw       = Stopwatch.StartNew();
        var response = await client.PostAsJsonAsync("/api/orders", ValidRequest);
        sw.Stop();

        // The configured timeout (e.g., 10s) should cut the call short; it must not hang for the full 15s
        Assert.True(sw.ElapsedMilliseconds < 12_000, $"Request took {sw.ElapsedMilliseconds}ms");
        Assert.Equal(HttpStatusCode.GatewayTimeout, response.StatusCode);
    }
}
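
The first question on the opening list, whether the circuit breaker actually opens, can also be checked without a web host. The following is a minimal sketch under stated assumptions: it uses Polly directly, lowers `MinimumThroughput` to 5 so the breaker can trip after a handful of calls, and the names `CircuitBreakerChaosCheck` and `CircuitOpensUnderFaultsAsync` are hypothetical.

```csharp
using System;
using System.Net;
using System.Net.Http;
using System.Threading.Tasks;
using Polly;
using Polly.CircuitBreaker;
using Polly.Simmy;
using Polly.Simmy.Fault;

public static class CircuitBreakerChaosCheck
{
    // Returns true if the breaker opens (rejects a call) while every call is failing.
    public static async Task<bool> CircuitOpensUnderFaultsAsync()
    {
        var pipeline = new ResiliencePipelineBuilder<HttpResponseMessage>()
            .AddCircuitBreaker(new CircuitBreakerStrategyOptions<HttpResponseMessage>
            {
                ShouldHandle      = new PredicateBuilder<HttpResponseMessage>().Handle<HttpRequestException>(),
                FailureRatio      = 0.5,
                MinimumThroughput = 5,                         // small window so the demo trips quickly
                SamplingDuration  = TimeSpan.FromSeconds(10),
                BreakDuration     = TimeSpan.FromSeconds(30)
            })
            // Chaos goes last so the breaker observes every injected fault
            .AddChaosFault(new ChaosFaultStrategyOptions
            {
                InjectionRate    = 1.0,
                EnabledGenerator = _ => ValueTask.FromResult(true),
                FaultGenerator   = _ => ValueTask.FromResult<Exception?>(
                    new HttpRequestException("Chaos: simulated service unavailable"))
            })
            .Build();

        for (var i = 0; i < 10; i++)
        {
            try
            {
                await pipeline.ExecuteAsync(
                    _ => ValueTask.FromResult(new HttpResponseMessage(HttpStatusCode.OK)));
            }
            catch (HttpRequestException)
            {
                // Injected fault reached the caller: the circuit is still closed.
            }
            catch (BrokenCircuitException)
            {
                // The breaker rejected the call without invoking the dependency.
                return true;
            }
        }

        return false;
    }
}
```

In an xUnit suite this could be wrapped in a `[Fact]` that asserts the method returns true. If it returns false, the usual suspects are a `MinimumThroughput` the test never reaches or a `ShouldHandle` predicate that does not match the injected exception type.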

Game Day

A Game Day is a structured exercise where the team runs chaos experiments on production (or production-like staging) and observes the system's behaviour.

Running a Game Day:

1. Define the hypothesis: "When the product service is unavailable, order creation returns a 503 within 5 seconds and does not surface unhandled exceptions."
2. Set the blast radius:
   - Inject faults on only 10% of traffic to start
   - Have a rollback plan (turn off the chaos toggle)
   - Inform the on-call team
3. Inject the fault:
   - Enable chaos via the admin endpoint
   - Set the fault rate to 10%
4. Observe:
   - Error rate in Grafana
   - Latency P99
   - Circuit breaker state
   - Whether alerts fire
5. Verify the hypothesis:
   - Did the service degrade gracefully?
   - Did monitoring catch it?
   - Did the circuit breaker open?
6. Clean up and document:
   - Disable chaos
   - Document findings, create tickets for failures
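
A hypothesis is easiest to verify when it is written as a check that can be run against recorded observations. The sketch below encodes the example hypothesis from the steps above: every response arrives within the latency budget, and anything that is not a success is a 503. The `Observation` record and `GameDayHypothesis` class are hypothetical names; how the observations are collected (load test output, access logs, a probe loop) is left open.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Net;

// One request made during the experiment: the status returned and how long it took.
public sealed record Observation(HttpStatusCode Status, TimeSpan Elapsed);

public static class GameDayHypothesis
{
    // True when every observation is within the latency budget and is either
    // a success (2xx) or the agreed graceful failure (503).
    public static bool Holds(IEnumerable<Observation> observations, TimeSpan maxLatency) =>
        observations.All(o =>
            o.Elapsed <= maxLatency &&
            (IsSuccess(o.Status) || o.Status == HttpStatusCode.ServiceUnavailable));

    private static bool IsSuccess(HttpStatusCode status) => (int)status is >= 200 and < 300;
}
```

For example, a run containing a 201 after 120 ms and a 503 after 2 s satisfies the hypothesis with a 5 second budget, while a single 500 or a 503 that took 6 s does not.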

Resilience Checklist

Before a Game Day:

☐ Retry policy configured for all outbound calls
☐ Circuit breaker configured for external dependencies
☐ Timeouts defined for all outbound calls (not infinite)
☐ Fallback responses defined (cached data, graceful degradation)
☐ Cancellation tokens threaded through the call chain
☐ Health checks report dependency failures
☐ Alerts fire within 2 minutes of sustained errors
☐ On-call runbook exists for each dependency failure
☐ Chaos toggle secured (admin-only endpoint or feature flag)

Interview Questions

Q: What is chaos engineering and how does it differ from testing?
Testing verifies the system does what it's supposed to do in known conditions. Chaos engineering verifies the system behaves acceptably under unknown or failure conditions. You're testing the unknown: what happens when the database is 5x slower than normal, when a dependency returns 503, or when a network partition occurs.

Q: What is Simmy and how does it integrate with Polly?
Simmy is Polly's chaos engineering toolkit, built into Polly itself since version 8.3. It adds chaos strategies for fault (exception) injection, latency injection, and outcome injection (for example an HTTP status code), applied per call at a configurable injection rate. They are added as extra steps in a Polly resilience pipeline, placed last so that the retry, circuit breaker, and timeout strategies in front of them react to the injected failures.

Q: Why is controlling chaos important? Why not just always inject faults?
Chaos in production must be controllable to avoid impacting real users beyond what the experiment intends. A feature-flag-based toggle lets you enable chaos for a small percentage of traffic, observe the effect, and disable it instantly if something unexpected happens. Uncontrolled chaos that can't be quickly turned off is just sabotage.

Q: What is a Game Day in chaos engineering?
A structured exercise where the team deliberately injects faults into a production-like environment and observes system behaviour. The goal is to validate resilience hypotheses ("the circuit breaker opens within 10 seconds") and discover gaps in monitoring and runbooks. Unlike ad-hoc testing, a Game Day is planned, has a hypothesis, a defined blast radius, monitoring in place, and a rollback plan.