Platform Engineering: Chaos Engineering and SLO Management — Chaos Mesh, Sloth, Error Budgets, and Game Days
Deep guide to reliability engineering for platform teams — SLO definition with Sloth and Pyrra, error budget policies, chaos experiments with Chaos Mesh and LitmusChaos, running a structured Game Day, and continuous chaos in production.
Why Reliability Engineering Belongs in the Platform
Most teams only discover their reliability weaknesses during incidents. By then, customers are affected, engineers are stressed, and the fix is reactive.
Platform engineering can change this by providing:
- SLO tooling — define reliability targets, track error budgets, and get warned before you break your SLA
- Chaos engineering — proactively discover weaknesses in controlled experiments before they become incidents
Together, these create a reliability feedback loop: know your targets, inject failures deliberately, measure the impact, fix the weakest link, repeat.
A key lesson, in the spirit of Google's SRE book: an SLO without a chaos program is a target you never test.
SLOs: Defining Reliability Targets
SLI → SLO → SLA
- SLI (Service Level Indicator): a metric that measures reliability (e.g., success rate of HTTP requests)
- SLO (Service Level Objective): the target (e.g., 99.9% success rate over 30 days)
- SLA (Service Level Agreement): the contract (e.g., "we promise 99.5% — below that, you get a refund")
Your SLO should be stricter than your SLA — catch problems before they breach the customer-facing SLA.
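A minimal sketch of how the three terms relate, using the example targets above (99.9% SLO, 99.5% SLA). The function names and the request counts are illustrative assumptions, not part of any tool discussed here.

```python
def availability_sli(successful: int, total: int) -> float:
    """Fraction of requests that succeeded; treated as 1.0 when there was no traffic."""
    return successful / total if total else 1.0


def classify(sli: float, slo: float = 0.999, sla: float = 0.995) -> str:
    """Compare a measured SLI against the internal SLO and the contractual SLA."""
    if sli >= slo:
        return "within SLO"
    if sli >= sla:
        return "SLO breached, SLA still met"
    return "SLA breached"


# Hypothetical month: 3,000 failed requests out of 1,000,000
sli = availability_sli(successful=997_000, total=1_000_000)
print(f"{sli:.4f}", classify(sli))
```

The middle outcome is the reason for keeping the SLO stricter than the SLA: it is the window in which the team can react before the contract is violated.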
Good SLIs for web services
Availability SLI = (successful requests) / (total requests)
Successful = HTTP 2xx or 3xx (NOT 5xx)
Latency SLI = (requests completed within 200ms) / (total requests)
P95 or P99 — not average (average hides tail latency)
Error rate SLI = 1 - (error_rate)
Error = 5xx responses or timed-out requests
Implementing SLOs with Sloth
Sloth generates Prometheus recording rules and alerting rules from a simple SLO spec. No manual Prometheus rule writing.
# payment-service-slos.yaml
version: prometheus/v1
service: payment-service
labels:
  team: team-payments
  env: production
slos:
  - name: requests-availability
    objective: 99.9   # 99.9% of requests must succeed
    description: "99.9% of payment requests succeed"
    sli:
      events:
        error_query: |
          sum(rate(http_requests_total{service="payment-service", code=~"5.."}[{{.window}}]))
        total_query: |
          sum(rate(http_requests_total{service="payment-service"}[{{.window}}]))
    alerting:
      name: PaymentServiceHighErrorRate
      labels:
        severity: page
      annotations:
        summary: "Payment service error rate burning error budget"
        runbook: "https://runbook.internal/payment-service/high-error-rate"
      page_alert:
        labels:
          severity: critical
      ticket_alert:
        labels:
          severity: warning
  - name: requests-latency
    objective: 99.5   # 99.5% of requests must complete within 200ms
    description: "99.5% of payment requests complete within 200ms"
    sli:
      events:
        # Errors are the SLOW requests: all requests minus those in the le="0.2" bucket.
        # (The le="0.2" bucket on its own counts the fast, i.e. good, requests.)
        error_query: |
          sum(rate(http_request_duration_seconds_count{service="payment-service"}[{{.window}}]))
          -
          sum(rate(http_request_duration_seconds_bucket{service="payment-service", le="0.2"}[{{.window}}]))
        total_query: |
          sum(rate(http_request_duration_seconds_count{service="payment-service"}[{{.window}}]))
Generate Prometheus rules:
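sloth generate -i payment-service-slos.yaml -o payment-service-rules.yaml
# Note: kubectl apply needs a Kubernetes object. Sloth emits a PrometheusRule only when
# the input uses its Kubernetes spec (kind: PrometheusServiceLevel); a prometheus/v1 spec
# like the one above produces a plain rules file to load through Prometheus rule_files.
kubectl apply -f payment-service-rules.yaml
Sloth generates SLI recording rules for several windows (5m, 30m, 1h, 2h, 6h, 1d, 3d, and the full 30d period) and multi-window, multi-burn-rate alerts following the Google SRE Workbook methodology.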
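To make the multi-window burn-rate idea concrete, here is a rough sketch of the decision logic. The window pairs and thresholds follow the SRE Workbook's recommended values for a 30-day SLO; the function names and the sample error ratios are assumptions for illustration and are not Sloth's generated rules.

```python
# (long window, short window, burn-rate threshold, severity)
WINDOWS = [
    ("1h", "5m", 14.4, "page"),
    ("6h", "30m", 6.0, "page"),
    ("1d", "2h", 3.0, "ticket"),
    ("3d", "6h", 1.0, "ticket"),
]


def burn_rate(error_ratio: float, objective: float) -> float:
    """How many times faster than 'exactly on budget' errors are being produced."""
    return error_ratio / (1.0 - objective)


def firing_alerts(error_ratios: dict[str, float], objective: float) -> list[str]:
    """An alert fires only when BOTH the long and the short window exceed the threshold."""
    fired = []
    for long_w, short_w, threshold, severity in WINDOWS:
        long_burn = burn_rate(error_ratios[long_w], objective)
        short_burn = burn_rate(error_ratios[short_w], objective)
        if long_burn > threshold and short_burn > threshold:
            fired.append(f"{severity}:{long_w}/{short_w}")
    return fired


# Hypothetical error ratios per window during a sharp, recent spike
ratios = {"5m": 0.02, "30m": 0.02, "1h": 0.02, "2h": 0.005,
          "6h": 0.004, "1d": 0.002, "3d": 0.0005}
print(firing_alerts(ratios, objective=0.999))
```

Requiring the short window as well is what lets the alert clear quickly once the problem stops, instead of staying red until the long window drains.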
Error Budget
With a 99.9% SLO over 30 days:
Error budget = (1 - 0.999) × 30 days × 24 hours × 60 minutes
= 0.001 × 43,200 minutes
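= 43.2 minutes of downtime allowed per 30 days
Sloth generates a recording rule for the remaining budget, referred to in this guide as slo:error_budget_remaining:ratio. Recent Sloth releases name it slo:period_error_budget_remaining:ratio and attach sloth_service / sloth_slo labels, so check the generated rules file for the exact name.
If this drops below 0, you've breached the SLO for the current window. If it drops below 0.10 (10%), nearly all of the budget is spent and the error budget policy should restrict deploys.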
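The arithmetic above generalizes to any objective and window. A small sketch, with function names chosen here for illustration:

```python
def error_budget_minutes(objective_percent: float, window_days: int = 30) -> float:
    """Minutes of full downtime the SLO tolerates over the window."""
    window_minutes = window_days * 24 * 60
    return (1 - objective_percent / 100) * window_minutes


def budget_remaining_ratio(bad_minutes: float, objective_percent: float,
                           window_days: int = 30) -> float:
    """1.0 = untouched budget, 0.0 = fully spent, negative = SLO breached."""
    budget = error_budget_minutes(objective_percent, window_days)
    return 1 - bad_minutes / budget


print(round(error_budget_minutes(99.9), 1))           # 99.9% over 30 days
print(round(error_budget_minutes(99.95), 1))          # a stricter target halves the budget
print(round(budget_remaining_ratio(10.8, 99.9), 2))   # after 10.8 bad minutes
```

Note that this treats the budget as time-based downtime, as the text does; a request-based SLO spends budget per failed request rather than per minute.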
Error Budget Policy
The error budget policy translates budget status into deployment decisions:
| Budget Remaining | Policy |
|------------------|--------|
| > 50% | Full deployment velocity, can experiment |
| 25-50% | Normal deployments, increase monitoring |
| 10-25% | No risky changes, all deploys require extra review |
| < 10% | Freeze new features, only reliability fixes |
| 0% (breached) | All deploys halted, reliability sprint |
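The policy table can be expressed as a small lookup. This is a sketch only: the table does not say which tier a boundary value such as exactly 25% belongs to, so the choice below (boundaries fall into the more permissive tier) is an assumption, as are the function and tier names.

```python
def deploy_policy(budget_remaining: float) -> str:
    """Map remaining error budget (1.0 = 100%) to the policy tier from the table."""
    if budget_remaining <= 0:
        return "halt all deploys, reliability sprint"
    if budget_remaining < 0.10:
        return "freeze features, reliability fixes only"
    if budget_remaining < 0.25:
        return "no risky changes, extra review"
    if budget_remaining <= 0.50:
        return "normal deployments, increased monitoring"
    return "full velocity"


for remaining in (0.8, 0.4, 0.15, 0.05, -0.1):
    print(f"{remaining:>5}: {deploy_policy(remaining)}")
```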
Implement this in your CI/CD pipeline:
# GitHub Actions: check error budget before deploy
- name: Check SLO error budget
  run: |
    BUDGET=$(curl -s http://prometheus.internal/api/v1/query \
      --data-urlencode 'query=slo:error_budget_remaining:ratio{service="payment-service"}' \
      | jq -r '.data.result[0].value[1] // empty')
    # Fail closed if Prometheus returned no sample for the budget metric
    if [ -z "$BUDGET" ]; then
      echo "ERROR: could not read error budget from Prometheus."
      exit 1
    fi
    echo "Error budget remaining: ${BUDGET}"
    # awk does the floating-point comparison (exit status 0 when below 10%)
    if awk -v b="$BUDGET" 'BEGIN { exit !(b < 0.10) }'; then
      echo "ERROR: Error budget < 10%. Deploy blocked by SLO policy."
      exit 1
    fi
Implementing SLOs with Pyrra
Pyrra is a Kubernetes-native alternative to Sloth — define SLOs as CRDs, Pyrra generates the Prometheus rules and provides a built-in SLO dashboard.
apiVersion: pyrra.dev/v1alpha1
kind: ServiceLevelObjective
metadata:
  name: payment-availability
  namespace: monitoring
spec:
  target: "99.9"
  window: 30d
  description: "99.9% of payment requests must succeed"
  indicator:
    ratio:
      errors:
        metric: http_requests_total{service="payment-service", code=~"5.."}
      total:
        metric: http_requests_total{service="payment-service"}
  alerting:
    disabled: false
# Apply the SLO CRD
kubectl apply -f payment-slo.yaml
# Pyrra generates:
# - Recording rules for error budget
# - Multi-burn-rate alert rules
# - SLO dashboard data
Pyrra vs Sloth:
- Pyrra: K8s-native, CRD-based, built-in UI — better for platform self-service (teams apply YAML, get SLO)
- Sloth: CLI-first (a Kubernetes controller mode also exists), SLIs written as raw PromQL, output as a plain Prometheus rules file or a PrometheusRule object
For a platform where teams self-service their SLOs: use Pyrra. Teams apply a YAML file, the platform handles the rest.
Platform SLO Service for Teams
The platform team provides:
- SLO CRD (Pyrra) — teams define their target in YAML
- Automatic Grafana dashboard — provisioned from ConfigMap when SLO CRD is applied
- PagerDuty routing — burn rate alerts route to team's on-call schedule
- Backstage SLO widget — each catalog entry shows current error budget status
# What a product team creates (one file):
apiVersion: pyrra.dev/v1alpha1
kind: ServiceLevelObjective
metadata:
  name: order-service-availability
  namespace: team-orders
  labels:
    team: team-orders
    service: order-service
spec:
  target: "99.95"
  window: 30d
  indicator:
    ratio:
      errors:
        metric: http_requests_total{service="order-service", code=~"5.."}
      total:
        metric: http_requests_total{service="order-service"}
The platform provides everything else: rules, alerts, dashboards, routing. The team just defines the target.
Chaos Engineering with Chaos Mesh
Chaos Mesh is a CNCF-incubated chaos engineering platform for Kubernetes. Define experiments as CRDs; Chaos Mesh executes them against live workloads.
Installation
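# Register the chart repository first, otherwise chaos-mesh/chaos-mesh cannot be resolved
helm repo add chaos-mesh https://charts.chaos-mesh.org
helm repo update
helm install chaos-mesh chaos-mesh/chaos-mesh \
  --namespace=chaos-testing \
  --create-namespace \
  --version 2.6.0 \
  --set dashboard.create=true
Experiment Types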
PodChaos — Pod kill
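# Chaos Mesh 2.x removed the inline "scheduler" field from experiments;
# recurring runs are declared with a Schedule that wraps the PodChaos spec.
apiVersion: chaos-mesh.org/v1alpha1
kind: Schedule
metadata:
  name: payment-pod-kill
  namespace: chaos-testing
spec:
  schedule: "*/10 * * * *"      # run every 10 minutes
  type: PodChaos
  historyLimit: 5               # keep the last 5 runs
  concurrencyPolicy: Forbid     # never start a run while one is still active
  podChaos:
    action: pod-kill
    mode: random-max-percent    # kill a random % of pods
    value: "30"                 # kill up to 30% of pods
    selector:
      namespaces:
        - payments
      labelSelectors:
        app: payment-service
NetworkChaos — Latency injection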
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: payment-network-delay
  namespace: chaos-testing
spec:
  action: delay
  mode: one                 # affect one pod
  selector:
    namespaces: [payments]
    labelSelectors:
      app: payment-service
  delay:
    latency: "500ms"        # add 500ms of latency
    correlation: "25"       # 25% correlation between packets
    jitter: "100ms"         # ± 100ms jitter
  direction: to             # delay traffic sent from the selected pod ("to" is the default)
  duration: "5m"            # run for 5 minutes
StressChaos — CPU and memory stress
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: order-service-cpu-stress
spec:
  mode: one
  selector:
    namespaces: [orders]
    labelSelectors:
      app: order-service
  stressors:
    cpu:
      workers: 2        # 2 workers generating CPU load
      load: 80          # 80% CPU load
    memory:
      workers: 1
      size: "512MB"     # 512MB memory pressure
  duration: "3m"
DNSChaos — DNS resolution failure
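# Requires the Chaos Mesh DNS server (install with --set dnsServer.create=true)
apiVersion: chaos-mesh.org/v1alpha1
kind: DNSChaos
metadata:
  name: database-dns-failure
spec:
  action: error         # matching lookups fail ("random" would return a random IP instead)
  mode: all
  selector:
    namespaces: [orders]
  patterns:
    - "postgres.*"      # only fail DB DNS; the * wildcard must be at the end of the pattern
  duration: "2m"
Chaos Workflows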
Combine multiple experiments into a workflow with timing and dependencies:
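apiVersion: chaos-mesh.org/v1alpha1
kind: Workflow
metadata:
  name: order-service-resilience-test
spec:
  entry: entry
  templates:
    - name: entry
      templateType: Serial      # Workflow templates use templateType/children/deadline
      children:
        - baseline        # measure baseline
        - pod-kill        # kill 1 pod
        - wait-recovery   # wait for recovery
        - network-delay   # inject latency
        - wait-recovery
        - full-chaos      # kill pod + inject latency simultaneously
    - name: baseline
      templateType: Suspend
      deadline: 2m        # observe baseline for 2 minutes
    - name: pod-kill
      templateType: PodChaos
      deadline: 1m
      podChaos:           # assumed spec, adapted from the PodChaos example
        action: pod-kill
        mode: one
        selector:
          namespaces: [orders]
          labelSelectors:
            app: order-service
    - name: wait-recovery
      templateType: Suspend
      deadline: 3m
    - name: network-delay
      templateType: NetworkChaos
      deadline: 5m
      networkChaos:       # assumed spec, adapted from the NetworkChaos example
        action: delay
        mode: one
        selector:
          namespaces: [orders]
          labelSelectors:
            app: order-service
        delay:
          latency: "500ms"
    - name: full-chaos
      templateType: Parallel    # run both experiments at the same time
      children:
        - pod-kill
        - network-delay
Running a Game Day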
A Game Day is a structured, announced chaos exercise where the team tests their system's response to failures.
Game Day structure
Before (1 week ahead):
- Define the hypothesis: "If pod-kill removes 30% of payment-service pods, the service will remain within SLO (< 0.1% error rate) due to HPA and health probes"
- Define blast radius: staging environment, not production
- Notify all stakeholders
- Verify monitoring is in place (dashboards, alerts working)
- Brief the on-call engineer
During (1-2 hours):
0:00 - Baseline: measure current error rate, latency, and resource usage
0:05 - Inject failure #1: kill 30% of pods (PodChaos)
0:08 - Observe: did HPA kick in? What's the error rate?
0:15 - Stop experiment: measure recovery time
0:20 - Rest: confirm system has returned to baseline
0:25 - Inject failure #2: network latency 500ms to database
0:30 - Observe: does the app gracefully degrade? Circuit breaker trip?
0:35 - Stop experiment
0:40 - Inject failure #3: simultaneous pod kill + DB latency
0:50 - Stop experiment, observe recovery
1:00 - Debrief: what worked, what didn't, what surprised us
After (blameless post-mortem):
- Document findings: which experiments passed (SLO held), which failed (SLO breached)
- Create tickets for reliability improvements
- Share results across engineering (build organizational learning)
- Schedule next Game Day (quarterly)
Measuring game day success
Use Prometheus to compare SLI during experiment vs baseline:
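# Error ratio during the experiment (15:00-15:10). The @ modifier takes a Unix timestamp;
# 1700000000 is a placeholder for the time the experiment ended (15:10).
sum(rate(http_requests_total{service="payment-service", code=~"5.."}[10m] @ 1700000000))
  / sum(rate(http_requests_total{service="payment-service"}[10m] @ 1700000000))
# Error ratio before the experiment (14:50-15:00): the same query evaluated 600 seconds earlier
sum(rate(http_requests_total{service="payment-service", code=~"5.."}[10m] @ 1699999400))
  / sum(rate(http_requests_total{service="payment-service"}[10m] @ 1699999400))
A successful experiment: the SLO held during the chaos. A failed experiment is even better: you found a weakness before a real incident did.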
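Once the baseline and experiment numbers are collected, the pass/fail verdict is a simple comparison against the SLO. A sketch with hypothetical request counts; the Window class and evaluate function are names introduced here, not part of Prometheus or Chaos Mesh.

```python
from dataclasses import dataclass


@dataclass
class Window:
    """Request counts observed over one measurement window."""
    errors: float
    total: float

    @property
    def error_ratio(self) -> float:
        return self.errors / self.total if self.total else 0.0


def evaluate(baseline: Window, experiment: Window, objective: float = 0.999) -> dict:
    """Report both error ratios and whether the SLO held during the experiment."""
    allowed = 1.0 - objective
    return {
        "baseline_error_ratio": baseline.error_ratio,
        "experiment_error_ratio": experiment.error_ratio,
        "slo_held": experiment.error_ratio <= allowed,
    }


# Hypothetical counts: 12 errors before, 240 errors during, both out of 60,000 requests
result = evaluate(Window(errors=12, total=60_000), Window(errors=240, total=60_000))
print(result)
```

Recording the baseline alongside the experiment matters: a service that already runs close to its limit can "fail" a Game Day for reasons unrelated to the injected fault.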
Continuous Chaos: Beyond Game Days
Game Days are scheduled. Real incidents aren't. Mature organizations run continuous chaos in production.
Principles for production chaos
- Start in staging, graduate to production slowly
- Start small: kill 1 pod, not 50%
- Define a steady state and abort if breached
- Always have a kill switch (Chaos Mesh pause button)
- Monitor SLOs during experiments — if budget starts burning fast, stop
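The "steady state plus kill switch" principles can be sketched as a guard loop. Everything here is an assumption for illustration: in a real setup the burn-rate samples would come from polling Prometheus and the abort callback would pause the experiment (for example through Chaos Mesh's pause annotation); the 14.4 default simply mirrors the fast-burn paging threshold.

```python
from typing import Callable, Iterable


def run_with_guard(burn_rates: Iterable[float],
                   abort: Callable[[], None],
                   max_burn_rate: float = 14.4) -> bool:
    """Consume burn-rate samples while an experiment runs.

    Calls abort() and returns False on the first sample above the limit;
    returns True if every sample stayed within the steady state.
    """
    for rate in burn_rates:
        if rate > max_burn_rate:
            abort()
            return False
    return True


# Hypothetical samples taken every 30s during an experiment
events = []
completed = run_with_guard([0.8, 2.1, 19.5, 3.0], abort=lambda: events.append("paused"))
print(completed, events)
```

Stopping at the first breach, rather than averaging over the run, keeps the blast radius small at the cost of occasionally aborting on a transient spike.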
Automated chaos with steady-state checks
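# Chaos Workflow with abort conditions
apiVersion: chaos-mesh.org/v1alpha1
kind: Workflow
metadata:
  name: continuous-resilience
spec:
  entry: check-slo-before
  templates:
    - name: check-slo-before        # gate: continue only if budget >= 50%
      templateType: Task
      task:
        # Custom task: query Prometheus, fail if budget < 50%
        container:
          name: check-budget
          image: curlimages/curl    # provides curl and BusyBox tools, but no jq or bc
          command:
            - sh
            - -c
            - |
              BUDGET=$(curl -s "http://prometheus.internal/api/v1/query" \
                --data-urlencode 'query=slo:error_budget_remaining:ratio{service="payment-service"}' \
                | sed -n 's/.*"value":\[[^,]*,"\([^"]*\)"\].*/\1/p')
              # Exit non-zero when the budget is missing or below 50%
              if [ -z "$BUDGET" ] || awk -v b="$BUDGET" 'BEGIN { exit !(b < 0.50) }'; then
                echo "Error budget missing or < 50%, skipping chaos"
                exit 1
              fi
      conditionalBranches:
        - target: pod-kill-with-guard
          expression: 'exitCode == 0'   # run the experiment only if the check passed
    - name: pod-kill-with-guard
      templateType: Serial
      children:
        - pod-kill          # run experiment (PodChaos template, not shown)
        - check-slo-after   # verify SLO held (Task template, not shown)
Chaos engineering maturity model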
| Level | Practice |
|-------|----------|
| 1 — Reactive | Only learn from real incidents |
| 2 — Game Days | Quarterly announced chaos in staging |
| 3 — Proactive staging | Weekly automated chaos in staging |
| 4 — Production chaos | Continuous low-impact chaos in production with SLO guards |
| 5 — System chaos | Multi-region failure scenarios, dependency removal experiments |
Most organizations should target Level 3. Level 4+ requires strong observability, incident response maturity, and organizational trust.
The Reliability Loop
The platform team's role in reliability:
Platform provides: Teams use to:
───────────────── ─────────────────
SLO CRDs (Pyrra) → Define their reliability targets
Error budget dashboards → Monitor their budget status
Multi-burn-rate alerts → Get warned before breaching SLA
Error budget policy in CI → Pause risky deploys when budget is low
Chaos Mesh → Run experiments against their services
Game Day templates → Structure quarterly reliability exercises
Post-incident templates → Document and share learnings
Result:
Teams own their reliability.
Incidents are fewer, shorter, and better understood.
SLAs are met with data, not hope.
Chaos engineering without SLOs is just breaking things for fun. The SLO is the measure. The chaos experiment tests whether you meet it under adverse conditions. Together they're the only honest way to know if your system is actually reliable.