DevOps · Intermediate · 8 min read

Netflix

Netflix's Chaos Engineering: Building Resilience by Breaking Things

How Netflix intentionally kills production services to find failure modes before customers do

Key outcome: Industry-defining resilience
Chaos Engineering · Resilience · AWS · Distributed Systems · SRE

The Problem with Traditional Testing

In 2010, Netflix moved from a self-managed data centre to AWS. Betting on public cloud infrastructure was still unusual at the time, and the migration exposed a fundamental problem.

Cloud infrastructure is unreliable by design. EC2 instances terminate unexpectedly. Network partitions happen. Availability zones go dark. The infrastructure is cheap and scalable precisely because it doesn't guarantee the five-nines uptime of traditional data centre hardware.

Netflix's architecture — hundreds of services talking to each other — meant that any individual failure could cascade. If the recommendations service went down, did the homepage degrade gracefully or did it take down streaming? No one knew for certain, because no one had ever tested it.

The traditional response: write better tests. Unit tests, integration tests, end-to-end tests. But these tests ran in staging environments designed to be stable. They couldn't reveal how the system behaved under real, unexpected production failures.

Chaos Monkey was the answer.


What Chaos Monkey Does

Chaos Monkey is a service that randomly terminates EC2 instances during business hours.

That's it. It's conceptually simple — and it was deeply controversial when introduced.
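
In spirit, the core loop is small. The sketch below, using the AWS SDK for Java v2, shows the idea; the opt-in tag, business-hours window, and class name are illustrative assumptions, not Netflix's actual implementation (the real Chaos Monkey targets instances within auto-scaling groups and adds scheduling, opt-outs, and auditing).

Java
import software.amazon.awssdk.services.ec2.Ec2Client;
import software.amazon.awssdk.services.ec2.model.*;

import java.time.LocalTime;
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class MiniChaosMonkey {
    public static void main(String[] args) {
        // Only act during business hours, when engineers are at their desks
        LocalTime now = LocalTime.now();
        if (now.isBefore(LocalTime.of(9, 0)) || now.isAfter(LocalTime.of(17, 0))) return;

        try (Ec2Client ec2 = Ec2Client.create()) {
            // Find running instances that have opted in to chaos (hypothetical tag)
            DescribeInstancesResponse resp = ec2.describeInstances(DescribeInstancesRequest.builder()
                    .filters(Filter.builder().name("tag:chaos.opt-in").values("true").build(),
                             Filter.builder().name("instance-state-name").values("running").build())
                    .build());

            List<String> ids = new ArrayList<>();
            resp.reservations().forEach(r -> r.instances().forEach(i -> ids.add(i.instanceId())));
            if (ids.isEmpty()) return;

            // Pick one victim at random and terminate it
            String victim = ids.get(new Random().nextInt(ids.size()));
            ec2.terminateInstances(TerminateInstancesRequest.builder().instanceIds(victim).build());
            System.out.println("Chaos Monkey terminated " + victim);
        }
    }
}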

The reasoning:

  • EC2 instances will fail in production, whether you're ready or not
  • Better to discover failure modes during business hours, when engineers are awake
  • Each failure it surfaces gets investigated and fixed, turning an unknown failure mode into a known, handled one
  • Over time, the system becomes resilient to all known failure modes

The first time Chaos Monkey ran, it revealed dozens of services that crashed when their dependencies went down instead of degrading gracefully. Engineers spent weeks hardening services. Then Chaos Monkey ran again. More failures. More hardening.

Over months, Netflix's production services became genuinely resilient to single-instance failures — not because they were theoretically designed to be, but because they had been empirically tested under real failure conditions.


The Evolution: The Simian Army

Chaos Monkey was just the beginning. Netflix expanded the concept into a suite of tools they called the Simian Army:

Latency Monkey

Injects artificial network latency into service calls. Reveals services that don't have proper timeouts — a call that normally takes 10ms might hang forever if the downstream service becomes slow.

Java
// What Latency Monkey tests for:
// Services MUST have timeouts on all downstream calls
SimpleClientHttpRequestFactory factory = new SimpleClientHttpRequestFactory();
factory.setConnectTimeout(1000);  // 1s connect timeout
factory.setReadTimeout(3000);     // 3s read timeout
RestTemplate template = new RestTemplate(factory);

// Without these: slow downstream = slow (or hung) upstream

Conformity Monkey

Identifies EC2 instances that don't conform to Netflix's best practices — wrong AMI, missing security groups, not in an auto-scaling group. Terminates non-conforming instances to enforce standards.

Doctor Monkey

Monitors instance health metrics. Identifies and terminates "sick" instances (high CPU, memory pressure, degraded disk) before they cause cascading failures.

Janitor Monkey

Cleans up unused cloud resources — orphaned EBS volumes, old snapshots, security groups no longer referenced. Reduces cloud costs.
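
A minimal sketch of the orphaned-volume part of that job, again using the AWS SDK for Java v2 (the class name is illustrative): EBS volumes whose status is "available" are attached to nothing and are candidates for clean-up.

Java
import software.amazon.awssdk.services.ec2.Ec2Client;
import software.amazon.awssdk.services.ec2.model.DescribeVolumesRequest;
import software.amazon.awssdk.services.ec2.model.Filter;

public class OrphanedVolumeFinder {
    public static void main(String[] args) {
        try (Ec2Client ec2 = Ec2Client.create()) {
            // "available" = created but not attached to any instance
            ec2.describeVolumes(DescribeVolumesRequest.builder()
                        .filters(Filter.builder().name("status").values("available").build())
                        .build())
               .volumes()
               .forEach(v -> System.out.println("Orphaned volume: " + v.volumeId()));
        }
    }
}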

Chaos Gorilla

Simulates the failure of an entire Availability Zone. An AZ failure is a relatively rare but devastating AWS event. Chaos Gorilla verifies that Netflix can survive it without customer impact.

Chaos Kong

Simulates the failure of an entire AWS Region. This is Netflix's most extreme chaos test; running it in production is a statement of confidence in their multi-region architecture.


The Principles Behind Chaos Engineering

Netflix's practice evolved into a formal discipline, codified in the Principles of Chaos Engineering (2016):

1. Build a Hypothesis Around Steady State

Define measurable business metrics — streaming starts per second, successful API requests, error rates. A chaos experiment asks: "Does this failure change steady state?"

Hypothesis: Terminating 10% of recommendation service instances
will NOT change streaming start rate by more than 1%

2. Vary Real-World Events

Don't invent failures. Test for failures that actually happen in production:

  • Instance termination
  • Network latency spikes
  • High CPU or memory pressure
  • Downstream service unavailability
  • Bad deployments (via feature flags)

3. Run Experiments in Production

Staging environments don't behave like production. Traffic patterns differ. Data sizes differ. Caches are cold. The only reliable test environment is production.

This is the most controversial principle — and the most important.

4. Automate Experiments to Run Continuously

One-time chaos tests rot. The system changes; new services are added; old failure modes reappear. Continuous automated experiments catch regressions.

5. Minimise Blast Radius

Start small. Terminate one instance, not a hundred. Inject latency to 1% of requests, not 100%. Expand scope only after demonstrating that the system handles small failures correctly.
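
As a concrete illustration of "inject latency into 1% of requests", a sampling filter might look like the sketch below; the class name, sample rate, and delay are assumptions for illustration, not a Netflix tool.

Java
import java.util.concurrent.ThreadLocalRandom;

public class LatencyInjector {
    private final double sampleRate;   // e.g. 0.01 = 1% of requests
    private final long delayMillis;    // artificial delay to inject

    public LatencyInjector(double sampleRate, long delayMillis) {
        this.sampleRate = sampleRate;
        this.delayMillis = delayMillis;
    }

    // Call this in the request path of the service under test
    public void maybeInjectLatency() throws InterruptedException {
        if (ThreadLocalRandom.current().nextDouble() < sampleRate) {
            Thread.sleep(delayMillis);  // simulate a slow downstream dependency
        }
    }
}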


The Technical Infrastructure

Running chaos experiments safely at Netflix's scale required tooling:

Automatic Rollback

Every chaos experiment has a defined blast radius and automatic rollback criteria:

Python
import time

class ChaosExperiment:
    # metrics, inject_failure, rollback, alert and record_result are assumed to be
    # provided by the surrounding experiment harness
    def run(self, target: str, duration_seconds: int):
        baseline_sps = self.metrics.streaming_starts_per_second()

        self.inject_failure(target)

        for _ in range(duration_seconds):
            current_sps = self.metrics.streaming_starts_per_second()
            if current_sps < baseline_sps * 0.95:  # 5% degradation threshold
                self.rollback()
                self.alert("Chaos experiment rolled back: SPS degraded")
                return

            time.sleep(1)

        self.rollback()
        self.record_result("success")

Feature Flags as a Chaos Tool

Netflix uses feature flags to simulate a different kind of chaos — bad code paths:

Java
public List<Show> getRecommendations(String userId) {
    // The flag decides which code path real production traffic exercises
    if (featureFlags.isEnabled("use-new-recommendation-algorithm")) {
        return newAlgorithm.recommend(userId);
    }
    return legacyAlgorithm.recommend(userId);
}

Gradually rolling out a flag (1% → 5% → 25% → 100%) is controlled chaos — exposing the new code to real traffic before full rollout.
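
One common way to implement those percentages is to hash each user into a stable bucket, so a given user stays in or out of the rollout as the percentage grows. A minimal sketch (names are illustrative, not Netflix's flag system):

Java
public class PercentageRollout {
    // Returns true for roughly rolloutPercent% of users, stably per user and flag
    public static boolean isEnabled(String flagName, String userId, int rolloutPercent) {
        int bucket = Math.floorMod((flagName + ":" + userId).hashCode(), 100);
        return bucket < rolloutPercent;  // e.g. 5 → ~5% of users see the new path
    }
}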


What Netflix's Architecture Looks Like After Chaos

After years of Chaos Engineering, Netflix's architecture embodies these patterns:

Fallback Chains

Every service call has a fallback chain — what to show the user when the service is unavailable:

Personalised Recommendations
  → Cached Recommendations (from 1 hour ago)
    → Trending in Your Region
      → Most Popular Globally
        → Static Editorial List

The user always sees something. The experience degrades, but never breaks completely.
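
In code, a chain like this can be expressed as an ordered list of sources tried until one succeeds. A minimal sketch (the generic helper and its names are illustrative, not Netflix's actual API):

Java
import java.util.List;
import java.util.function.Supplier;

public class FallbackChain {
    // Try each source in order; on failure, fall through to the next, less
    // personalised source. The last resort must never fail.
    public static <T> T firstAvailable(List<Supplier<T>> sources, T lastResort) {
        for (Supplier<T> source : sources) {
            try {
                T result = source.get();
                if (result != null) return result;
            } catch (Exception ignored) {
                // e.g. timeout or service unavailable: degrade to the next source
            }
        }
        return lastResort;  // static editorial list: always available
    }
}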

Hystrix (Circuit Breaker)

Netflix open-sourced Hystrix — a circuit breaker library that prevents cascade failures:

Java
@HystrixCommand(fallbackMethod = "getRecommendationsFallback",
                commandKey = "GetRecommendations",
                commandProperties = {
                    @HystrixProperty(name = "execution.isolation.thread.timeoutInMilliseconds", value = "1000"),
                    @HystrixProperty(name = "circuitBreaker.errorThresholdPercentage", value = "50")
                })
public List<Show> getRecommendations(String userId) {
    return recommendationService.fetch(userId);
}

public List<Show> getRecommendationsFallback(String userId) {
    return trendingService.getGlobalTrending();
}

If 50% of calls to recommendationService fail within a window, the circuit opens — subsequent calls go directly to the fallback without even attempting the failing service.


The Broader Lesson

Chaos Engineering is not about being reckless. It's about empirical confidence instead of theoretical confidence.

Traditional testing answers: "Does the code do what we wrote it to do?" Chaos Engineering answers: "Does the system behave correctly when reality diverges from assumptions?"

Every production system has hidden assumptions:

  • "The database will always be reachable"
  • "The downstream service will respond in under 500ms"
  • "DNS resolution will always work"

Chaos Engineering surfaces these assumptions systematically, before customers surface them accidentally.


Getting Started Without Netflix's Scale

You don't need Chaos Monkey to start practising chaos engineering. The principles apply at any scale:

  1. Define steady state — what metrics say "the system is healthy"?
  2. Run a game day — pick a realistic failure scenario, schedule it during business hours, observe the system
  3. Document failure modes — what broke? What didn't?
  4. Harden the failure modes — fallbacks, timeouts, circuit breakers
  5. Repeat — gradually expand scope

Tools like Chaos Toolkit, Gremlin, and AWS Fault Injection Simulator make this accessible without building the tooling yourself.

