Platform Engineering: Chaos Engineering and SLO Management — Chaos Mesh, Sloth, Error Budgets, and Game Days

Why Reliability Engineering Belongs in the Platform

Most teams only discover their reliability weaknesses during incidents. By then, customers are affected, engineers are stressed, and the fix is reactive.

Platform engineering can change this by providing:

SLO tooling — define reliability targets, track error budgets, and get warned before you break your SLA
Chaos engineering — proactively discover weaknesses in controlled experiments before they become incidents

Together, these create a reliability feedback loop: know your targets, inject failures deliberately, measure the impact, fix the weakest link, repeat.

The key insight, building on the SLO practice described in Google's SRE book: an SLO without a chaos program is a target you never test.

SLOs: Defining Reliability Targets

SLI → SLO → SLA

SLI (Service Level Indicator): a metric that measures reliability (e.g., success rate of HTTP requests)
SLO (Service Level Objective): the target (e.g., 99.9% success rate over 30 days)
SLA (Service Level Agreement): the contract (e.g., "we promise 99.5% — below that, you get a refund")

Your SLO should be stricter than your SLA — catch problems before they breach the customer-facing SLA.
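
To make that gap concrete, the sketch below converts the example targets above (a 99.9% SLO and a 99.5% SLA over 30 days) into allowed downtime. The function name `allowed_downtime_minutes` is illustrative, and treating all unreliability as full-service downtime is a simplifying assumption.

```python
def allowed_downtime_minutes(target_percent: float, window_days: int = 30) -> float:
    """Minutes of full downtime a target tolerates over the window."""
    window_minutes = window_days * 24 * 60
    return (1 - target_percent / 100) * window_minutes


slo_minutes = allowed_downtime_minutes(99.9)   # internal objective
sla_minutes = allowed_downtime_minutes(99.5)   # customer-facing contract
margin = sla_minutes - slo_minutes             # room to react before the SLA is at risk

print(f"SLO allows {slo_minutes:.1f} min, SLA allows {sla_minutes:.1f} min, margin {margin:.1f} min")
```

The margin is the time you have to notice and fix an SLO breach before it turns into a contractual one.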

Good SLIs for web services

Availability SLI = (successful requests) / (total requests)
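  Successful = HTTP 2xx or 3xx; 5xx counts as failed
  Decide explicitly how to treat 4xx (usually client errors, so often not counted as failures)

Latency SLI = (requests completed within 200ms) / (total requests)
  Derive the threshold from a P95 or P99 target, not the average (the average hides tail latency)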

Error rate SLI = 1 - (error_rate)
  Error = 5xx responses or timed-out requests
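
As a rough illustration of these ratios, the following sketch computes an availability SLI from per-status-code request counts and a latency SLI from raw durations. The function names and the sample data are made up for the example, and it assumes 4xx responses are not counted as failures.

```python
def availability_sli(status_counts: dict[int, int]) -> float:
    """Fraction of requests that did not end in a 5xx response."""
    total = sum(status_counts.values())
    if total == 0:
        return 1.0  # assumption: no traffic means nothing failed
    errors = sum(n for code, n in status_counts.items() if 500 <= code <= 599)
    return (total - errors) / total


def latency_sli(durations_ms: list[float], threshold_ms: float = 200.0) -> float:
    """Fraction of requests that completed within the threshold."""
    if not durations_ms:
        return 1.0
    fast = sum(1 for d in durations_ms if d <= threshold_ms)
    return fast / len(durations_ms)


counts = {200: 9950, 302: 30, 404: 15, 503: 5}   # made-up sample data
print(availability_sli(counts))                   # 5 server errors out of 10,000 requests
print(latency_sli([120, 180, 250, 90]))           # 3 of 4 requests within 200 ms
```

In production these ratios come from Prometheus counters and histogram buckets rather than raw lists, but the arithmetic is the same.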

Implementing SLOs with Sloth

Sloth generates Prometheus recording rules and alerting rules from a simple SLO spec. No manual Prometheus rule writing.

YAML

# payment-service-slos.yaml
version: prometheus/v1
service: payment-service
labels:
  team: team-payments
  env: production

slos:
  - name: requests-availability
    objective: 99.9       # 99.9% of requests must succeed
    description: "99.9% of payment requests succeed"
    sli:
      events:
        error_query: |
          sum(rate(http_requests_total{service="payment-service", code=~"5.."}[{{.window}}]))
        total_query: |
          sum(rate(http_requests_total{service="payment-service"}[{{.window}}]))
    alerting:
      name: PaymentServiceHighErrorRate
      labels:
        severity: page
      annotations:
        summary: "Payment service error rate burning error budget"
        runbook: "https://runbook.internal/payment-service/high-error-rate"
      page_alert:
        labels:
          severity: critical
      ticket_alert:
        labels:
          severity: warning

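  - name: requests-latency
    objective: 99.5       # 99.5% of requests must complete within 200ms
    description: "99.5% of payment requests complete within 200ms"
    sli:
      events:
        # Errors are the slow requests: total minus those in the le="0.2" bucket
        error_query: |
          sum(rate(http_request_duration_seconds_count{service="payment-service"}[{{.window}}]))
          -
          sum(rate(http_request_duration_seconds_bucket{service="payment-service", le="0.2"}[{{.window}}]))
        total_query: |
          sum(rate(http_request_duration_seconds_count{service="payment-service"}[{{.window}}]))
    alerting:
      name: PaymentServiceHighLatency   # illustrative alert name for this SLO
      page_alert:
        labels:
          severity: critical
      ticket_alert:
        labels:
          severity: warning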

Generate Prometheus rules:

Bash

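sloth generate -i payment-service-slos.yaml -o payment-service-rules.yaml
# The prometheus/v1 spec yields a plain Prometheus rules file (not a Kubernetes manifest):
# validate it, then load it via rule_files. kubectl apply needs Sloth's Kubernetes spec instead.
promtool check rules payment-service-rules.yaml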

Sloth generates recording rules for each burn-rate window (5m, 30m, 1h, 2h, 6h, 1d, 3d, and the full 30d period) and multi-window, multi-burn-rate alerts following the Google SRE Workbook methodology.
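
The idea behind a multi-window burn-rate alert can be sketched in a few lines. Burn rate is the observed error ratio divided by the error ratio the SLO allows; a rate of 1 spends the budget exactly over the full window. The 14.4 threshold below is the SRE Workbook's paging value for a 1h window (2% of a 30-day budget spent in one hour: 0.02 × 720 h = 14.4). The function names are illustrative, not part of Sloth.

```python
def burn_rate(error_ratio: float, objective_percent: float) -> float:
    """How many times faster than 'exactly on budget' errors are accruing."""
    budget_ratio = 1 - objective_percent / 100
    return error_ratio / budget_ratio


def should_page(long_window_error_ratio: float,
                short_window_error_ratio: float,
                objective_percent: float,
                threshold: float = 14.4) -> bool:
    """Fire only if both windows burn faster than the threshold.

    The long window (e.g. 1h) shows the burn is significant; the short
    window (e.g. 5m) shows it is still happening.
    """
    return (burn_rate(long_window_error_ratio, objective_percent) > threshold
            and burn_rate(short_window_error_ratio, objective_percent) > threshold)


# 2% errors over the last hour and 3% over the last 5 minutes, 99.9% SLO
print(should_page(0.02, 0.03, 99.9))
# Same hour, but the last 5 minutes have recovered: no page
print(should_page(0.02, 0.0005, 99.9))
```

Requiring both windows keeps the alert from firing long after an incident has ended, which is the main reason the short window exists.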

Error Budget

With a 99.9% SLO over 30 days:

Error budget = (1 - 0.999) × 30 days × 24 hours × 60 minutes
             = 0.001 × 43,200 minutes
             = 43.2 minutes of downtime allowed per 30 days

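Sloth generates a recording rule for the remaining budget: slo:period_error_budget_remaining:ratio, labeled with sloth_service and sloth_slo.

A value of 1 means the budget is untouched. If it drops below 0, you've breached the SLO for the current window. If it drops below 0.10, less than 10% of the budget is left and any further burn puts the SLO at risk.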
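
For intuition, the remaining-budget ratio can be reproduced from raw event counts. This is a minimal sketch of the arithmetic, not Sloth's implementation; the function name and the request counts are assumptions for the example.

```python
def error_budget_remaining(good_events: int, total_events: int, objective_percent: float) -> float:
    """Fraction of the window's error budget still unspent (negative once breached)."""
    if total_events == 0:
        return 1.0  # no traffic, nothing spent
    allowed_error_ratio = 1 - objective_percent / 100
    observed_error_ratio = (total_events - good_events) / total_events
    return 1 - observed_error_ratio / allowed_error_ratio


# 1,000,000 requests in the window against a 99.9% SLO (1,000 failures allowed)
print(round(error_budget_remaining(999_400, 1_000_000, 99.9), 3))   # 600 failures so far
print(round(error_budget_remaining(998_500, 1_000_000, 99.9), 3))   # 1,500 failures: breached
```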

Error Budget Policy

The error budget policy translates budget status into deployment decisions:

| Budget Remaining | Policy |
|------------------|--------|
| > 50% | Full deployment velocity, can experiment |
| 25-50% | Normal deployments, increase monitoring |
| 10-25% | No risky changes, all deploys require extra review |
| < 10% | Freeze new features, only reliability fixes |
| 0% (breached) | All deploys halted, reliability sprint |
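
The table maps directly onto a small decision function. The sketch below is one possible encoding; the tier labels are made up, and because the table's ranges share their endpoints, it assumes that a value sitting exactly on a boundary (0.25 or 0.10) falls into the more permissive tier.

```python
def deploy_policy(budget_remaining: float) -> str:
    """Map the remaining error budget (1.0 = untouched, <= 0 = breached) to a policy."""
    if budget_remaining <= 0:
        return "halt: all deploys stopped, reliability sprint"
    if budget_remaining < 0.10:
        return "freeze: reliability fixes only"
    if budget_remaining < 0.25:
        return "restricted: no risky changes, extra review"
    if budget_remaining <= 0.50:
        return "normal: deploy as usual, increase monitoring"
    return "full: full velocity, experiments allowed"


for value in (0.8, 0.4, 0.2, 0.05, -0.1):
    print(f"{value:>5}: {deploy_policy(value)}")
```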

Implement this in your CI/CD pipeline:

YAML

# GitHub Actions: check error budget before deploy
- name: Check SLO error budget
  run: |
    BUDGET=$(curl -s http://prometheus.internal/api/v1/query \
      --data-urlencode 'query=slo:period_error_budget_remaining:ratio{sloth_service="payment-service", sloth_slo="requests-availability"}' \
      | jq -r '.data.result[0].value[1]')

    # Fail closed if the query returned no data
    if [ -z "$BUDGET" ] || [ "$BUDGET" = "null" ]; then
      echo "ERROR: could not read error budget from Prometheus."
      exit 1
    fi

    echo "Error budget remaining: ${BUDGET}"

    if (( $(echo "$BUDGET < 0.10" | bc -l) )); then
      echo "ERROR: Error budget < 10%. Deploy blocked by SLO policy."
      exit 1
    fi
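
If the gate grows beyond a few shell lines, the response handling is easier to get right in a script. The sketch below parses the JSON body of a Prometheus instant query and fails loudly on an error status or an empty result. The function name is hypothetical, and the sample body is hand-written to match the documented response shape.

```python
import json


def budget_from_response(body: str) -> float:
    """Extract the first sample value from a Prometheus instant-query response."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    results = payload["data"]["result"]
    if not results:
        raise ValueError("query returned no series; check the metric name and labels")
    # Each instant-vector sample is [timestamp, "value-as-string"].
    return float(results[0]["value"][1])


sample = ('{"status":"success","data":{"resultType":"vector",'
          '"result":[{"metric":{},"value":[1700000000,"0.42"]}]}}')
print(budget_from_response(sample))
```

Raising on an empty result matters: a typo in the metric name should block the deploy rather than be read as a healthy budget.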

Implementing SLOs with Pyrra

Pyrra is a Kubernetes-native alternative to Sloth — define SLOs as CRDs, Pyrra generates the Prometheus rules and provides a built-in SLO dashboard.

YAML

apiVersion: pyrra.dev/v1alpha1
kind: ServiceLevelObjective
metadata:
  name: payment-availability
  namespace: monitoring
spec:
  target: "99.9"
  window: 30d
  description: "99.9% of payment requests must succeed"
  indicator:
    ratio:
      errors:
        metric: http_requests_total{service="payment-service", code=~"5.."}
      total:
        metric: http_requests_total{service="payment-service"}
  alerting:
    disabled: false

Bash

# Apply the SLO CRD
kubectl apply -f payment-slo.yaml

# Pyrra generates:
# - Recording rules for error budget
# - Multi-burn-rate alert rules
# - SLO dashboard data

Pyrra vs Sloth:

Pyrra: K8s-native, CRD-based, built-in UI — better for platform self-service (teams apply YAML, get SLO)
Sloth: CLI-first (a Kubernetes operator mode also exists), SLIs written as raw PromQL queries, output is Prometheus rule files or PrometheusRule objects

For a platform where teams self-service their SLOs: use Pyrra. Teams apply a YAML file, the platform handles the rest.

Platform SLO Service for Teams

The platform team provides:

SLO CRD (Pyrra) — teams define their target in YAML
Automatic Grafana dashboard — provisioned from ConfigMap when SLO CRD is applied
PagerDuty routing — burn rate alerts route to team's on-call schedule
Backstage SLO widget — each catalog entry shows current error budget status

YAML

# What a product team creates (one file):
apiVersion: pyrra.dev/v1alpha1
kind: ServiceLevelObjective
metadata:
  name: order-service-availability
  namespace: team-orders
  labels:
    team: team-orders
    service: order-service
spec:
  target: "99.95"
  window: 30d
  indicator:
    ratio:
      errors:
        metric: http_requests_total{service="order-service", code=~"5.."}
      total:
        metric: http_requests_total{service="order-service"}

The platform provides everything else — rules, alerts, dashboards, routing. The team just defines the target.
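
A platform team that wants to go one step further can generate this file from a service template. The sketch below builds a manifest with the same fields as the example above and prints it as JSON, which kubectl also accepts. The `render_slo` helper and its parameters are hypothetical, and it assumes every service exposes `http_requests_total` with `service` and `code` labels.

```python
import json


def render_slo(service: str, namespace: str, team: str, target: str, window: str = "30d") -> dict:
    """Build a ServiceLevelObjective manifest with the same fields as the example above."""
    selector = f'service="{service}"'
    return {
        "apiVersion": "pyrra.dev/v1alpha1",
        "kind": "ServiceLevelObjective",
        "metadata": {
            "name": f"{service}-availability",
            "namespace": namespace,
            "labels": {"team": team, "service": service},
        },
        "spec": {
            "target": target,
            "window": window,
            "indicator": {
                "ratio": {
                    "errors": {"metric": f'http_requests_total{{{selector}, code=~"5.."}}'},
                    "total": {"metric": f"http_requests_total{{{selector}}}"},
                }
            },
        },
    }


manifest = render_slo("order-service", "team-orders", "team-orders", "99.95")
print(json.dumps(manifest, indent=2))
```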

Chaos Engineering with Chaos Mesh

Chaos Mesh is a CNCF-incubated chaos engineering platform for Kubernetes. Define experiments as CRDs; Chaos Mesh executes them against live workloads.

Installation

Bash

helm repo add chaos-mesh https://charts.chaos-mesh.org
helm repo update
helm install chaos-mesh chaos-mesh/chaos-mesh \
  --namespace=chaos-testing \
  --create-namespace \
  --version 2.6.0 \
  --set dashboard.create=true

Experiment Types

PodChaos — Pod kill

YAML

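# Chaos Mesh 2.x has no inline scheduler field; recurring runs wrap the experiment in a Schedule
apiVersion: chaos-mesh.org/v1alpha1
kind: Schedule
metadata:
  name: payment-pod-kill
  namespace: chaos-testing
spec:
  schedule: "*/10 * * * *"     # run every 10 minutes
  concurrencyPolicy: Forbid    # never overlap runs
  historyLimit: 5
  type: PodChaos
  podChaos:
    action: pod-kill
    mode: random-max-percent   # kill a random percentage of pods
    value: "30"                # kill up to 30% of pods
    selector:
      namespaces:
        - payments
      labelSelectors:
        app: payment-service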

NetworkChaos — Latency injection

YAML

apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: payment-network-delay
  namespace: chaos-testing
spec:
  action: delay
  mode: one                   # affect one pod
  selector:
    namespaces: [payments]
    labelSelectors:
      app: payment-service
  delay:
    latency: "500ms"          # add 500ms of latency
    correlation: "25"         # 25% correlation between packets
    jitter: "100ms"           # ± 100ms jitter
  direction: to               # delay packets the selected pod sends (outbound)
  duration: "5m"              # run for 5 minutes

StressChaos — CPU and memory stress

YAML

apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: order-service-cpu-stress
spec:
  mode: one
  selector:
    namespaces: [orders]
    labelSelectors:
      app: order-service
  stressors:
    cpu:
      workers: 2              # 2 workers generating CPU load
      load: 80                # 80% CPU load
    memory:
      workers: 1
      size: 512MB             # 512MB memory pressure
  duration: "3m"

DNSChaos — DNS resolution failure

YAML

apiVersion: chaos-mesh.org/v1alpha1
kind: DNSChaos
metadata:
  name: database-dns-failure
spec:
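  # Needs the Chaos Mesh DNS server (Helm value dnsServer.create=true)
  action: error              # return an error for matching queries (random would return a random IP)
  mode: all
  selector:
    namespaces: [orders]
  patterns:
    - "postgres.*"           # only fail DB DNS; a wildcard is only valid at the end of a pattern
  duration: "2m"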

Chaos Workflows

Combine multiple experiments into a workflow with timing and dependencies:

YAML

apiVersion: chaos-mesh.org/v1alpha1
kind: Workflow
metadata:
  name: order-service-resilience-test
spec:
  entry: entry
  templates:
    - name: entry
      templateType: Serial
      children:
        - baseline              # measure baseline
        - pod-kill              # kill 1 pod
        - wait-recovery         # wait for recovery
        - network-delay         # inject latency
        - wait-recovery
        - full-chaos            # kill pod + inject latency simultaneously

    - name: baseline
      templateType: Suspend
      deadline: 2m              # observe baseline for 2 minutes

    - name: pod-kill
      templateType: PodChaos
      # podChaos: (inline pod kill spec)

    - name: wait-recovery
      templateType: Suspend
      deadline: 3m

    - name: network-delay
      templateType: NetworkChaos
      # deadline: how long to keep the fault active
      # networkChaos: (inline network delay spec)

    - name: full-chaos
      templateType: Parallel    # run both faults at the same time
      children:
        - pod-kill
        - network-delay

Running a Game Day

A Game Day is a structured, announced chaos exercise where the team tests their system's response to failures.

Game Day structure

Before (1 week ahead):

Define the hypothesis: "If pod-kill removes 30% of payment-service pods, the service will remain within SLO (< 0.1% error rate) due to HPA and health probes"
Define blast radius: staging environment, not production
Notify all stakeholders
Verify monitoring is in place (dashboards, alerts working)
Brief the on-call engineer

During (1-2 hours):

0:00 - Baseline: measure current error rate, latency, and resource usage
0:05 - Inject failure #1: kill 30% of pods (PodChaos)
0:08 - Observe: were the killed pods replaced, did HPA kick in? What's the error rate?
0:15 - Stop experiment: measure recovery time
0:20 - Rest: confirm system has returned to baseline
0:25 - Inject failure #2: network latency 500ms to database
0:30 - Observe: does the app gracefully degrade? Circuit breaker trip?
0:35 - Stop experiment
0:40 - Inject failure #3: simultaneous pod kill + DB latency
0:50 - Stop experiment, observe recovery
1:00 - Debrief: what worked, what didn't, what surprised us

After (blameless post-mortem):

Document findings: which experiments passed (SLO held), which failed (SLO breached)
Create tickets for reliability improvements
Share results across engineering (build organizational learning)
Schedule next Game Day (quarterly)

Measuring game day success

Use Prometheus to compare SLI during experiment vs baseline:

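PromQL

# Error ratio over the trailing 5 minutes. Evaluate it twice: at 15:05 (during the
# experiment) and at 14:55 (baseline), using the evaluation time picker in the
# Prometheus UI or the time parameter of /api/v1/query.
# The @ modifier accepts only Unix timestamps, so "@ 15:05:00" is not valid PromQL.
sum(rate(http_requests_total{service="payment-service", code=~"5.."}[5m]))
/
sum(rate(http_requests_total{service="payment-service"}[5m]))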

A successful experiment: SLO held during the chaos. A failed experiment: even better — you found a weakness before a real incident did.
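
The pass/fail decision can be written down explicitly so every Game Day is judged the same way. This is a minimal sketch that works on raw counter readings taken at the start and end of each interval; the function names and counter values are invented for illustration, and the 0.1% limit matches the hypothesis used earlier.

```python
def error_ratio(errors_start: float, errors_end: float,
                total_start: float, total_end: float) -> float:
    """Error ratio over an interval, from counter readings at its start and end."""
    total = total_end - total_start
    if total <= 0:
        return 0.0  # no traffic (or a counter reset) in the interval
    return (errors_end - errors_start) / total


def verdict(baseline: float, experiment: float, slo_error_ratio: float = 0.001) -> str:
    """'passed' if the SLO held during the experiment, otherwise 'failed'."""
    status = "passed" if experiment <= slo_error_ratio else "failed"
    return (f"{status}: baseline {baseline:.4%}, "
            f"experiment {experiment:.4%}, limit {slo_error_ratio:.4%}")


baseline = error_ratio(100, 103, 500_000, 530_000)      # 3 errors in 30,000 requests
experiment = error_ratio(103, 163, 530_000, 560_000)    # 60 errors in 30,000 requests
print(verdict(baseline, experiment))
```

Recording the baseline next to the experiment value keeps a noisy service from being blamed on the chaos run.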

Continuous Chaos: Beyond Game Days

Game Days are scheduled. Real incidents aren't. Mature organizations run continuous chaos in production.

Principles for production chaos

Start in staging, graduate to production slowly
Start small: kill 1 pod, not 50%
Define a steady state and abort if breached
Always have a kill switch (the Chaos Mesh dashboard's pause button, or the experiment.chaos-mesh.org/pause annotation)
Monitor SLOs during experiments — if budget starts burning fast, stop
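
The "define a steady state and abort if breached" principle boils down to a watch loop around the experiment. The sketch below is deliberately generic: it consumes burn-rate readings from any source and calls a stop callback on the first breach. The function name, the callback, and the threshold of 2.0 are assumptions for illustration, not Chaos Mesh features.

```python
from typing import Callable, Iterable


def run_with_guard(burn_rate_samples: Iterable[float],
                   stop_experiment: Callable[[], None],
                   max_burn_rate: float = 2.0) -> bool:
    """Watch burn-rate samples during an experiment; abort on the first breach.

    Returns True if the experiment ran to completion, False if it was aborted.
    """
    for sample in burn_rate_samples:
        if sample > max_burn_rate:
            stop_experiment()   # the kill switch, e.g. pausing the chaos resource
            return False
    return True


log = []
samples = [0.4, 0.9, 3.1, 0.5]   # made-up burn-rate readings taken during a run
completed = run_with_guard(samples, stop_experiment=lambda: log.append("paused"))
print(completed, log)
```

In a real setup the samples would come from polling Prometheus, and the callback would pause or delete the running experiment.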

Automated chaos with steady-state checks

YAML

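# Chaos Workflow with abort conditions
apiVersion: chaos-mesh.org/v1alpha1
kind: Workflow
metadata:
  name: continuous-resilience
spec:
  entry: check-slo-before
  templates:
    - name: check-slo-before   # gate: continue only if the budget is healthy
      templateType: Task
      task:
        # Custom task: query Prometheus, fail if budget < 50%
        container:
          name: check-slo
          image: curlimages/curl
          command:
            - sh
            - -c
            - |
              # The image has no jq or bc, so let Prometheus do the comparison:
              # the query returns a sample only when at least 50% of the budget remains.
              RESULT=$(curl -sf "http://prometheus.internal/api/v1/query" \
                --data-urlencode 'query=slo:period_error_budget_remaining:ratio{sloth_service="payment-service", sloth_slo="requests-availability"} >= 0.50')
              if ! echo "$RESULT" | grep -q '"value"'; then
                echo "Error budget < 50% (or no data), skipping chaos"
                exit 1
              fi
      conditionalBranches:
        - target: pod-kill-with-guard
          expression: 'exitCode == 0'

    - name: pod-kill-with-guard
      templateType: Serial
      children:
        - pod-kill             # run experiment
        - check-slo-after      # verify SLO held

    # pod-kill and check-slo-after templates follow the same pattern (omitted here)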

Chaos engineering maturity model

| Level | Practice |
|-------|----------|
| 1 - Reactive | Only learn from real incidents |
| 2 - Game Days | Quarterly announced chaos in staging |
| 3 - Proactive staging | Weekly automated chaos in staging |
| 4 - Production chaos | Continuous low-impact chaos in production with SLO guards |
| 5 - System chaos | Multi-region failure scenarios, dependency removal experiments |

Most organizations should target Level 3. Level 4+ requires strong observability, incident response maturity, and organizational trust.

The Reliability Loop

The platform team's role in reliability:

Platform provides:                    Teams use to:
─────────────────                     ─────────────────
SLO CRDs (Pyrra)           →         Define their reliability targets
Error budget dashboards    →         Monitor their budget status
Multi-burn-rate alerts     →         Get warned before breaching SLA
Error budget policy in CI  →         Pause risky deploys when budget is low
Chaos Mesh                 →         Run experiments against their services
Game Day templates         →         Structure quarterly reliability exercises
Post-incident templates    →         Document and share learnings

Result:
Teams own their reliability.
Incidents are fewer, shorter, and better understood.
SLAs are met with data, not hope.

Chaos engineering without SLOs is just breaking things for fun. The SLO is the measure. The chaos experiment tests whether you meet it under adverse conditions. Together they're the only honest way to know if your system is actually reliable.
