A comprehensive Staff SRE guide to testing production systems beyond correctness — covering load testing, fault injection, chaos engineering, canary analysis, disaster recovery drills, game days, and how to build a mature reliability testing program integrated into CI/CD.
Functional testing answers one question: does the code do what the specification says? Unit tests verify that calculateTotal() returns the right number. Integration tests verify that the checkout service charges the payment provider correctly. Acceptance tests verify that users can complete a purchase end-to-end. These tests are essential and non-negotiable — but they cannot tell you what happens when the payment provider responds in 8 seconds instead of 80 milliseconds, when your database runs out of connections under Black Friday load, or when the network between your application tier and your cache drops 5% of packets.
Reliability testing answers a fundamentally different question: does the system survive the real world? The real world is adversarial in ways that specifications never capture. Disks fill up. Memory leaks accumulate. Third-party APIs degrade gracefully, then ungracefully, then stop responding entirely. Traffic spikes arrive without warning on Saturday mornings when no engineers are online. Hardware fails. Configuration changes go wrong. Reliability testing is the engineering discipline of creating these conditions deliberately, in a controlled environment, before they happen to real users.
The Fundamental Gap: Correctness vs Resilience
A correct system does the right thing when everything works. A resilient system does the right thing when things break. Most testing programs invest almost entirely in the former and almost nothing in the latter. The cost of this imbalance is paid in production: in three-hour incident calls, in SLO violations, in user churn, and in the bone-deep exhaustion of on-call engineers who keep seeing the same failures that could have been caught in a test environment.
The test pyramid — unit tests at the base, integration tests in the middle, end-to-end tests at the top — is a useful framework for functional testing. Reliability testing adds a fourth dimension that cuts across all layers: resilience tests that ask "what happens when this dependency fails?" at every level. A unit test for retry logic. An integration test that injects network latency between services. A load test that validates the system under 10x normal traffic. A chaos experiment that kills a random pod in staging while load is running. A game day that simulates a full datacenter failure.
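The "unit test for retry logic" mentioned above can be made concrete. The sketch below is illustrative, not a prescribed implementation: `call_with_retries` is a hypothetical wrapper, and the test uses `unittest.mock` to simulate a dependency that fails transiently before succeeding — a resilience assertion at the unit level.

```python
from unittest import mock

def call_with_retries(fn, max_attempts=3, retryable=(ConnectionError,)):
    """Call fn, retrying on transient, retryable exceptions up to max_attempts times."""
    last_exc = None
    for _ in range(max_attempts):
        try:
            return fn()
        except retryable as exc:
            last_exc = exc
    raise last_exc

# Resilience unit test: the dependency fails twice with a transient
# error, then succeeds -- the wrapper should absorb both failures.
dep = mock.Mock(side_effect=[ConnectionError(), ConnectionError(), "ok"])
assert call_with_retries(dep) == "ok"
assert dep.call_count == 3
```

The same pattern covers the failure path: a mock that always raises should exhaust all attempts and re-raise, proving the retry logic gives up rather than looping forever.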
What breaks in production that functional tests miss
The Reliability Test Pyramid
Think of reliability testing as its own pyramid. At the base: automated load tests in CI that validate performance baselines on every build. In the middle: fault injection tests that verify individual service resilience to specific failures. At the top: chaos experiments and game days that validate system-wide behavior under realistic failure scenarios. A mature reliability program has all three layers running continuously.
The "It Works in Staging" Fallacy
Staging environments are typically at 10% of production capacity, serve no real users, have no production data patterns, and have all dependencies always available. Passing staging tests tells you almost nothing about production reliability. The goal of reliability testing is to bring production conditions into your testing environment — or to test safely in production itself.
The reliability testing maturity ladder — where does your team sit?
1. Level 0 — No reliability testing: All reliability knowledge is discovered in production incidents. No load tests, no fault injection, no chaos. Every outage is a surprise.
2. Level 1 — Ad hoc testing: Some engineers run load tests before major launches. A few chaos experiments have been run manually. Nothing is automated or repeatable.
3. Level 2 — Scheduled testing: Monthly load tests against staging. Documented fault injection procedures. Periodic game days. Results are reviewed but not acted on systematically.
4. Level 3 — Automated and integrated: Load tests run in CI on every significant build. Fault injection is automated and runs weekly. Game days are quarterly. Results feed into SLO dashboards.
5. Level 4 — Continuous reliability validation: Production traffic is used for testing via shadowing and dark launches. Automated canary analysis gates every deployment. Chaos experiments run continuously in production with automatic blast radius control.
Load testing is the practice of applying artificial traffic to your system to measure performance characteristics, identify bottlenecks, and determine capacity limits before real users encounter them. The goal is not to prove that your system works — you have functional tests for that. The goal is to find where it breaks, how it breaks, and how gracefully it breaks. A system that degrades predictably under load is far more operable than one that falls over catastrophically without warning.
| Load Test Type | Purpose | Duration | Traffic Pattern | What It Finds |
|---|---|---|---|---|
| Smoke test | Verify basic functionality under minimal load | 5–10 minutes | 1–5% of expected peak | Obvious configuration errors, basic regressions in new deployments |
| Load test | Validate performance at expected production load | 30–60 minutes | 80–100% of expected peak | P99 latency targets, error rates, resource utilization patterns |
| Stress test | Find the breaking point of the system | Until failure | Ramp from 100% to 200%+ of peak | Maximum throughput, failure mode, recovery behavior after overload |
| Spike test | Validate behavior during sudden traffic bursts | 10–30 minutes | Instant jump to 5–10x normal load | Autoscaling lag, thundering herd, queue buildup and recovery |
| Soak test (endurance) | Find resource leaks and drift under sustained load | 12–72 hours | 70–80% of expected peak | Memory leaks, connection pool exhaustion, disk accumulation, log rotation failures |
Choosing the right load testing tool matters less than running load tests at all — but the tool choice does affect what is easy and what is hard. The three most widely used open-source options each occupy a different niche. k6 is the modern default for teams with developers who know JavaScript. Apache JMeter is the incumbent with the broadest feature set but the steepest learning curve. Locust is the Python-native choice, best for teams that need maximum programmability and already live in the Python ecosystem.
k6: The FAANG Default for Modern Load Testing
k6 (Grafana k6) is written in Go with a JavaScript API. Tests are JavaScript scripts that define virtual users and scenarios. k6 produces excellent output: request rate, latency percentiles (p50, p95, p99, p999), error rates, and data transfer metrics. It integrates natively with Prometheus, Grafana, and Datadog. It can run distributed across multiple machines for massive scale tests. The CLI output is clean enough to read in a terminal during a live test.
Interpreting load test results — what each metric tells you
Common Load Testing Mistake: Testing the Wrong Thing
The most common load testing mistake is testing the load generator, not the system under test. If your k6 script sends 10,000 RPS but all requests hit a static endpoint that returns a hardcoded string, you have proven nothing about your application. Load test scripts must exercise realistic user journeys: authentication, database reads and writes, cache misses, third-party API calls. Profile a real production trace and use it as the basis for your load test scenario.
Establishing Performance Baselines
The output of every load test should update a performance baseline document: "as of commit abc123, the service handles 5,000 RPS with p99 latency under 200ms and less than 0.01% errors." Every future load test result is compared to this baseline. If p99 regresses by more than 20% without a corresponding feature justification, it is a performance bug to be fixed before deployment. This is performance regression testing applied to reliability.
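The baseline comparison described above is simple enough to automate in CI. The sketch below is a minimal illustration, assuming a hypothetical baseline record and result format; the 20% threshold mirrors the rule stated in the text.

```python
# Hypothetical baseline record, mirroring the example in the text.
BASELINE = {"commit": "abc123", "max_rps": 5000, "p99_ms": 200, "error_rate": 0.0001}

def check_regression(result, baseline=BASELINE, threshold=0.20):
    """Compare a load test result to the recorded baseline.
    Returns a list of regression findings; an empty list means no regression."""
    findings = []
    if result["p99_ms"] > baseline["p99_ms"] * (1 + threshold):
        findings.append(
            f"p99 regressed: {result['p99_ms']}ms vs baseline {baseline['p99_ms']}ms"
        )
    if result["error_rate"] > baseline["error_rate"]:
        findings.append("error rate above baseline")
    if result["max_rps"] < baseline["max_rps"] * (1 - threshold):
        findings.append("throughput dropped more than threshold")
    return findings
```

Wired into CI, a non-empty findings list fails the build — turning the baseline document from a wiki page into an enforced contract.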
Running a production-representative load test — step by step
1. Baseline your production traffic: capture real request distributions (endpoint mix, payload sizes, session patterns) from production logs or APM traces. Do not invent synthetic traffic patterns.
2. Size your test environment: staging should be at least 20–30% of production capacity for load tests to produce meaningful results. A 10-node staging cluster tested with production traffic volume will just tell you it is undersized.
3. Define success criteria before running: specify the acceptable p50, p99 latency, error rate, and minimum throughput. Write these down. Do not decide post-hoc whether the results are acceptable.
4. Run a smoke test first: 5% of target load for 5 minutes. Verify the test is hitting the right endpoints, generating realistic traffic, and that your monitoring dashboards are receiving data.
5. Ramp up gradually: increase load in 10–20% increments every 5 minutes. Gradual ramp-up distinguishes the steady-state performance ceiling from the thundering herd effect of sudden load.
6. Observe and record: watch CPU, memory, database connection counts, garbage collection metrics, error rates, and latency percentiles in real time. Note the load level at which each resource begins to saturate.
7. Run the soak phase: hold target load for at least 30 minutes after the ramp-up. Shorter tests miss memory leaks, connection pool drift, and log accumulation.
8. Document findings: update the baseline, file performance bugs for any regressions, and record the breaking point for capacity planning.
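The smoke-then-ramp pattern from the steps above can be expressed as a schedule generator. This is an illustrative sketch — `ramp_schedule` is a hypothetical helper producing `(minute, rps)` pairs that a load tool's staging configuration could be built from.

```python
def ramp_schedule(target_rps, step_pct=15, step_minutes=5, smoke_pct=5):
    """Build a gradual ramp-up plan: a smoke step at 5% of target load,
    then ~15% increments every 5 minutes until full load is reached.
    Returns a list of (minute, rps) pairs."""
    schedule = [(0, target_rps * smoke_pct // 100)]  # smoke test step
    minute, pct = step_minutes, smoke_pct
    while pct < 100:
        pct = min(100, pct + step_pct)
        schedule.append((minute, target_rps * pct // 100))
        minute += step_minutes
    return schedule
```

For a 5,000 RPS target this yields a 250 RPS smoke step followed by increments up to full load — each plateau long enough to observe which resource saturates first.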
Fault injection is the practice of deliberately introducing specific failure conditions into a system to verify that it handles them correctly. Where load testing asks "how much can the system handle?", fault injection asks "what happens when specific things go wrong?" The goal is to verify your resilience mechanisms — circuit breakers, timeouts, retries, fallbacks, graceful degradation — are actually working as designed, not just as documented.
The failure modes that matter in production are not the ones your engineers anticipate when writing code. They are the ones that emerge from the interaction of multiple components, each individually healthy, combining in unexpected ways. Fault injection is how you discover these interactions before they find you.
The failure mode taxonomy — what to inject
tc netem: The Swiss Army Knife of Network Fault Injection
Linux traffic control (tc) with the netem queuing discipline is the standard tool for network fault injection at the OS level. tc netem works at the network interface level and affects all traffic through that interface. Key commands:

```shell
# Add 200ms of latency to all traffic on eth0
tc qdisc add dev eth0 root netem delay 200ms
# Add 200ms of latency with 50ms of jitter
tc qdisc add dev eth0 root netem delay 200ms 50ms
# Drop 5% of packets
tc qdisc add dev eth0 root netem loss 5%
# Corrupt 1% of packets
tc qdisc add dev eth0 root netem corrupt 1%
# Remove the netem qdisc and restore normal behavior
tc qdisc del dev eth0 root
```
Fault Injection Libraries for Application-Level Testing
For testing failure handling within your application code (rather than at the network layer), use purpose-built fault injection libraries. Toxiproxy (Shopify) is a TCP proxy that you route traffic through; it supports latency, bandwidth limits, packet loss, and connection resets via a REST API, controllable from your test code. Istio (service mesh) has native fault injection in traffic policies — inject HTTP errors or delays for specific routes without touching application code. Chaos Toolkit provides a declarative Python-based framework for defining and running fault injection experiments.
Running a fault injection test — the FIRE framework
1. Failure mode: Specify exactly what failure you are injecting. "Network issues between the API service and the database" is not specific enough. "100ms of added latency on all TCP connections from api-service to db-primary:5432" is specific enough to reproduce and automate.
2. Impact hypothesis: State your prediction before injecting. "The API service will timeout database connections after 500ms and return cached data from Redis for read requests, returning HTTP 503 for write requests." This prevents you from retroactively deciding the result was acceptable.
3. Run and observe: Inject the fault with a defined blast radius (one host, one service, one route) while monitoring all relevant metrics. Capture the actual behavior: did the circuit breaker trip? Did retries fire? What did users see?
4. Evidence collection: Screenshot dashboards, capture logs, record the duration. Good fault injection test reports include: fault injected, blast radius, hypothesis, actual behavior, dashboard screenshots, next actions.
iptables for Fast Dependency Blocking
To instantly block all traffic to a downstream dependency during a fault injection test, use iptables. This is faster and more precise than stopping the dependency service, and it simulates a real network partition rather than a clean shutdown.

```shell
# Drop all outbound traffic to the dependency
iptables -A OUTPUT -d <dependency-ip> -j DROP
# Restore connectivity by deleting the same rule
iptables -D OUTPUT -d <dependency-ip> -j DROP
# Alternatively, block only a specific port (here, PostgreSQL)
iptables -A OUTPUT -d <dependency-ip> -p tcp --dport 5432 -j DROP
```
Test Timeouts, Not Just Availability
Most engineering teams test the case where a dependency is fully down (connection refused). Almost no teams test the case where a dependency is up but slow (hanging connections). The slow dependency case is more dangerous: connection refused trips circuit breakers quickly; a hanging connection holds threads, connections, and file descriptors for the full timeout duration. Verify that every external call in your application has an explicit timeout configured — and then inject that exact timeout scenario to confirm the timeout actually fires.
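The hanging-dependency scenario can be reproduced end-to-end in a few lines. The sketch below is an illustrative simulation, not production code: a local TCP server accepts a connection and never responds, and a client with an explicit read timeout falls back instead of blocking a thread for the full duration.

```python
import socket
import threading
import time

def hanging_server(port_holder, ready):
    """A dependency that is up but slow: it accepts the connection
    and then sends nothing -- the dangerous case described above."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind(("127.0.0.1", 0))  # ephemeral port
    srv.listen(1)
    port_holder.append(srv.getsockname()[1])
    ready.set()
    conn, _ = srv.accept()
    time.sleep(5)  # hold the connection open without responding
    conn.close()
    srv.close()

def call_dependency(port, timeout_s=0.5):
    """Client call with an explicit timeout; degrades to a fallback
    value when the dependency hangs instead of holding the thread."""
    try:
        with socket.create_connection(("127.0.0.1", port), timeout=timeout_s) as s:
            s.settimeout(timeout_s)
            s.sendall(b"ping")
            return s.recv(1024)  # blocks until the read timeout fires
    except socket.timeout:
        return "fallback"

ports, ready = [], threading.Event()
threading.Thread(target=hanging_server, args=(ports, ready), daemon=True).start()
ready.wait()
print(call_dependency(ports[0]))  # the 0.5s read timeout fires -> 'fallback'
```

Note that the connect succeeds immediately — only the read hangs. Without `settimeout`, `recv` would block for the server's full 5 seconds, which is exactly how thread pools and connection pools get exhausted in production.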
Chaos Engineering is the discipline of experimenting on a system in order to build confidence in its capability to withstand turbulent conditions in production. This definition, from the Principles of Chaos Engineering (principlesofchaos.org), contains two critical words: "experimenting" and "confidence." Chaos Engineering is not about randomly breaking things. It is a scientific method applied to distributed systems: form a hypothesis, design an experiment with controlled variables, measure the output, update your understanding of the system.
The Netflix Chaos Monkey, released in 2011, is the most famous early chaos engineering tool. It randomly terminated EC2 instances in Netflix's production environment during business hours. The explicit goal was to force Netflix engineers to build systems that were resilient to instance failures — because if you know a random instance might die at any time, you design for it. Chaos Monkey worked: Netflix's ability to handle AWS region failures improved dramatically because the practice of dealing with random instance failures was baked into their operational culture.
The Chaos Engineering Hypothesis-Driven Approach
Every chaos experiment must have a hypothesis stated before the experiment runs: "We believe that when one of three instances of the payments service is terminated, the load balancer will redirect traffic to the remaining two instances within 10 seconds, and the p99 latency for payment requests will not exceed 500ms." After the experiment, you compare actual behavior to the hypothesis. If the hypothesis was wrong, you have discovered a real weakness. If it was right, you have evidence of resilience.
```mermaid
graph TD
    A[Define Steady State<br/>What does healthy look like?] --> B[Form Hypothesis<br/>What should happen when X fails?]
    B --> C[Define Blast Radius<br/>Which systems, how many instances, which users?]
    C --> D[Design Experiment<br/>What to inject, how to inject, for how long]
    D --> E[Run in Staging First<br/>Validate the experiment itself works]
    E --> F{Staging Results OK?}
    F -->|Yes| G[Run in Production<br/>With monitoring and abort criteria]
    F -->|No| H[Fix the Experiment<br/>or Fix the System]
    H --> E
    G --> I[Observe and Measure<br/>Compare actual vs hypothesis]
    I --> J{Hypothesis Confirmed?}
    J -->|Yes| K[Document Confidence<br/>System behaves as expected]
    J -->|No| L[File Reliability Bug<br/>Fix the weakness found]
    K --> M[Expand Blast Radius<br/>Increase scope gradually]
    L --> N[Fix → Retest]
```

The chaos engineering experiment lifecycle. Every experiment follows this loop: hypothesis, controlled blast radius, observe, compare to hypothesis, act on the findings.
Blast radius control — the non-negotiable safety discipline
The Steady State Definition
Before every chaos experiment, formally define steady state: the specific measurable properties of the system when it is healthy. Example: "Steady state = checkout success rate > 99.9%, p99 latency < 300ms, cart service CPU < 60%, all health checks green." The experiment is valid only if you confirm steady state at the start. If the system is already degraded when you begin, your experiment results are uninterpretable.
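A steady-state definition is most useful when it is executable. The sketch below encodes the example thresholds from the text as a gate function; the metric names and structure are illustrative assumptions.

```python
# Steady-state conditions, mirroring the example in the text.
# Metric names are illustrative.
STEADY_STATE = {
    "checkout_success_rate": lambda v: v > 0.999,
    "p99_latency_ms": lambda v: v < 300,
    "cart_cpu_pct": lambda v: v < 60,
}

def confirm_steady_state(metrics):
    """Return the list of violated steady-state conditions.
    The chaos experiment may proceed only if this list is empty."""
    return [
        name for name, healthy in STEADY_STATE.items()
        if name not in metrics or not healthy(metrics[name])
    ]
```

Running this check at experiment start (and as a continuous abort condition during the experiment) is what makes the results interpretable.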
Chaos Toolkit and AWS Fault Injection Simulator
For teams building a chaos engineering practice, start with managed tools rather than writing injection scripts from scratch. AWS Fault Injection Simulator (FIS) provides native AWS-aware chaos experiments: terminate EC2 instances, throttle ECS tasks, inject latency into RDS queries, simulate AZ outages. It integrates with CloudWatch for automated stop conditions. Chaos Toolkit (open source) provides a declarative YAML-based experiment format that works across cloud providers and tools. Both produce structured experiment reports suitable for compliance documentation.
Chaos Engineering Is Not a Substitute for Fixing Known Issues
Chaos Engineering is for discovering unknown weaknesses, not for validating that known problems are not as bad as you think. If you know your service lacks a circuit breaker, write the circuit breaker — do not run a chaos experiment to see what happens without one. Chaos engineering should be run on systems you believe are resilient, to discover the ways they are not resilient that you do not yet know about.
A canary deployment sends a small fraction of production traffic (typically 1–5%) to a new version of a service while the majority of traffic continues to the stable version. The name comes from the historical practice of miners carrying canary birds into coal mines — if the canary died from toxic gas, miners knew to evacuate. In software, if the canary deployment shows elevated error rates or latency regression, engineers roll back before the bad version reaches all users.
Manual canary analysis — engineers watching dashboards and deciding whether the canary looks healthy — does not scale. A single canary deployment might require watching 20 different metrics across multiple dashboards for 30 minutes. When deployments happen 10 times per day across 50 services, manual analysis is impossible. Automated canary analysis uses statistical comparison to determine whether the canary version is better than, equivalent to, or worse than the baseline version — and to make the rollout/rollback decision automatically.
How automated canary analysis works
Spinnaker Canary Analysis: The FAANG Standard
Spinnaker (Netflix, open source) with Kayenta (canary analysis service) is the reference implementation of automated canary analysis. Kayenta fetches metrics from Prometheus, Datadog, or Stackdriver, runs statistical comparisons between baseline and canary pods, and produces a score from 0 to 100. Spinnaker pipelines gate each rollout increment on the Kayenta score. A canary scoring below 75 is automatically rolled back; scoring 75–90 triggers a human review gate; scoring above 90 continues the rollout. These thresholds are configurable per service.
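The score-to-action mapping described above is a small piece of logic worth making explicit. This sketch is illustrative — `canary_decision` is a hypothetical function using the example thresholds, not Kayenta's actual API.

```python
def canary_decision(score, rollback_below=75, review_upper=90):
    """Map a canary analysis score (0-100) to a pipeline action,
    mirroring the example thresholds: below 75 rolls back, 75-90
    requires human review, above 90 continues the rollout."""
    if score < rollback_below:
        return "rollback"
    if score <= review_upper:
        return "human-review"
    return "continue"
```

The important property is that the default action for an ambiguous score is a human gate, not a continued rollout — the pipeline fails toward safety.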
Canary Analysis Window: How Long Is Long Enough?
The analysis window must be long enough to observe statistically meaningful behavior. For most web services, 10–15 minutes is the minimum; 30 minutes is standard; 1 hour is appropriate for services with low traffic volume or infrequent error patterns. For services with strong time-of-day traffic patterns (e.g., peak traffic during business hours), run the canary during peak traffic, not at 3am when traffic is minimal and anomalies are hard to distinguish from noise.
Implementing SLO-based canary gates — the checklist
1. Define per-service canary SLO thresholds: the canary version must maintain error rate below X% and p99 latency below Y ms. These thresholds can be stricter than the production SLO to catch regressions early.
2. Configure both absolute gates (SLO-based) and comparative gates (canary vs baseline). Absolute gates catch outright failures. Comparative gates catch subtle regressions that are below the SLO threshold but significantly worse than baseline.
3. Include business metrics alongside technical metrics. For an e-commerce service, add cart conversion rate and checkout completion rate to the canary analysis. A code change that improves p99 latency but reduces conversion rate is still a bad change.
4. Automate rollback, not just alerting. An alert that tells an engineer "the canary looks bad" requires a human to initiate rollback, which takes 5–15 minutes. An automated rollback triggered by the same signal takes 30 seconds. Every minute the bad canary serves traffic, it burns error budget.
5. Run a post-canary analysis review after every rollout. Why did the canary pass or fail? Were the thresholds appropriate? Did the automated analysis agree with what a human observer would have decided? Use these reviews to continuously calibrate your analysis.
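The distinction between absolute and comparative gates in the checklist above can be shown in a few lines. This is an illustrative sketch with assumed thresholds and metric names, not a specific tool's API.

```python
def canary_gates(canary, baseline, slo_error_rate=0.001, slo_p99_ms=500,
                 max_relative_regression=0.10):
    """Evaluate both gate types from the checklist. Thresholds are
    illustrative. Returns (passed, list_of_failure_reasons)."""
    reasons = []
    # Absolute gates: the canary must meet the SLO outright.
    if canary["error_rate"] > slo_error_rate:
        reasons.append("absolute: error rate exceeds SLO")
    if canary["p99_ms"] > slo_p99_ms:
        reasons.append("absolute: p99 exceeds SLO")
    # Comparative gates: the canary must not be significantly worse
    # than the baseline, even while still inside the SLO.
    if canary["p99_ms"] > baseline["p99_ms"] * (1 + max_relative_regression):
        reasons.append("comparative: p99 regressed vs baseline")
    if canary["error_rate"] > baseline["error_rate"] * (1 + max_relative_regression):
        reasons.append("comparative: error rate regressed vs baseline")
    return (not reasons, reasons)
```

Note how a canary at 300ms p99 passes the 500ms absolute gate but fails the comparative gate against a 200ms baseline — the subtle-regression case the checklist warns about.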
Canary Analysis Requires Good Metrics Infrastructure
Automated canary analysis is only as good as the metrics it analyzes. If your service does not emit per-pod error rates and latency histograms to a metrics backend, canary analysis has nothing to compare. Before implementing automated canary analysis, ensure: every service emits standard metrics (request count, error count, latency histogram) with pod-level labels; your metrics backend retains high-cardinality pod-level metrics; your deployment tool can route traffic at the pod level. This infrastructure investment pays for itself in the first prevented production outage.
Recovery Time Objective (RTO) is the maximum acceptable time to restore a service after a failure. Recovery Point Objective (RPO) is the maximum acceptable data loss measured in time — how old can the data be when you restore from backup? A disaster recovery plan that has never been tested is not a plan; it is a hypothesis. The only way to know your actual RTO and RPO is to measure them during a test, not estimate them from documentation.
The most common DR testing failure is discovering during a real disaster that the documented recovery procedure has not been followed for months, has dependencies that no longer exist, or requires access to systems that are themselves part of the disaster. DR tests reveal these gaps before they become catastrophic.
The three tiers of disaster recovery testing
The Backup Restoration Test Mandate
A backup that has never been restored is not a backup — it is an untested hypothesis. Backup files can be corrupt. Backup processes can silently fail while reporting success. Restoration procedures can depend on tooling that is no longer available. At minimum, restoration tests must run monthly for all critical databases. The test is not complete until data integrity is verified: not just "the database started" but "querying the restored database returns the expected data from the last known-good backup timestamp."
| DR Test Type | Cost | Frequency | What It Tests | Output |
|---|---|---|---|---|
| Tabletop exercise | Low (team time only) | Quarterly | Documentation completeness, team knowledge, escalation paths | Action items list, updated runbooks |
| Backup restoration test | Low-Medium | Monthly | Backup integrity, restoration procedure, data completeness | Measured restoration time, data integrity report |
| Failover drill (component) | Medium | Monthly (per critical service) | Individual failover procedures, RTO for specific components | Measured RTO, deviation log |
| Regional failover test | High | Quarterly | Cross-region failover, DNS, load balancer routing, all services | Full RTO and RPO measurement, gap report |
| Full DR test | Very High | Annually | Complete disaster scenario, organization-wide coordination, external dependencies | RTO/RPO vs targets, comprehensive gap report |
RTO and RPO Measurement During DR Tests
During every DR test, assign a dedicated timer and note-taker. Record the exact wall-clock time for each step: disaster declared, first response action taken, failover initiated, failover completed, first successful health check, full service restoration. The time from "disaster declared" to "full service restoration" is your measured RTO. Compare it to your RTO target. Any gap between measured and target RTO is a reliability debt that must be paid before the next test.
Conducting a DR test — the pre-flight checklist
01
Notify all stakeholders: engineering, product, support, and executive leadership. Schedule during a low-traffic window for non-critical tests. For high-fidelity tests, consider running during normal hours to test realistic conditions.
02
Freeze all non-emergency changes: no deployments, no config changes, no infrastructure modifications for 24 hours before and during the test. You need a stable baseline to measure from.
03
Assign roles before starting: incident commander (coordinates the test), scribe (documents every action and timestamp), observers (watch metrics and dashboards), and a "break glass" engineer who has authority to abort if real damage is occurring.
04
Confirm current state: take a snapshot of all relevant metrics, record the current database state, confirm backup timestamps. This is the baseline you will compare against after restoration.
05
Execute and measure: follow the DR runbook exactly, recording actual execution time for each step. Note any steps that differ from the documented procedure.
06
Verify recovery, not just availability: after services are restored, run a test suite against the DR environment. Query the database for known records. Process a test transaction end-to-end. "Services are up" is not recovery; "services are up and producing correct results" is recovery.
07
Debrief within 24 hours: write a DR test report with measured RTO and RPO, deviations from documented procedures, action items with owners and deadlines, and sign-off from engineering and product leadership.
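The measured RTO and RPO that go into the debrief report can be derived mechanically from the timestamps the scribe records during the test. A minimal sketch, assuming RTO is measured from failure to verified recovery and RPO from the last confirmed backup to the failure (all timestamps below are hypothetical):

```python
from datetime import datetime

def measure_rto_rpo(failure_at, recovery_verified_at, last_backup_at):
    """Compute RTO/RPO in minutes from a DR test timeline.

    RTO: how long the system was down (failure -> verified recovery).
    RPO: how much data was at risk (last backup -> failure).
    """
    rto = (recovery_verified_at - failure_at).total_seconds() / 60
    rpo = (failure_at - last_backup_at).total_seconds() / 60
    return rto, rpo

# Hypothetical timeline from a scribe's log
failure = datetime(2024, 3, 9, 10, 0)
backup = datetime(2024, 3, 9, 9, 15)      # last confirmed backup before failure
verified = datetime(2024, 3, 9, 11, 30)   # test suite passed against DR environment

rto_min, rpo_min = measure_rto_rpo(failure, verified, backup)
print(f"Measured RTO: {rto_min:.0f} min, RPO: {rpo_min:.0f} min")
# Compare against targets in the DR test report, e.g. RTO <= 60 min, RPO <= 30 min
```

Note that "verified" here must mean step 6 (correct results), not merely "services responding" — using the service-up timestamp instead understates RTO.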
A game day is a planned, coordinated exercise in which a team deliberately introduces failure scenarios into a system and practices responding to them — as a team, in real time, with all the communication and decision-making pressure of a real incident. One of the best-known examples of the practice is the Google DiRT (Disaster Recovery Testing) program, one of the most sophisticated reliability exercise programs in the industry. DiRT exercises are run annually at Google and have included scenarios like simulating a full data center loss, losing the internal DNS system, and losing the central authentication service.
The crucial distinction between a game day and fault injection or chaos engineering is the human element. Fault injection tests the system. A game day tests the team: the on-call response playbook, the communication between teams, the decision-making under pressure, the tools used to diagnose incidents, and the escalation paths. A system that handles a database failure gracefully is only as resilient as the team's ability to diagnose and respond when something goes wrong in a way the system does not handle automatically.
The Google DiRT Program
Google's Disaster Recovery Testing (DiRT) program runs large-scale exercises that simulate catastrophic failures across multiple services simultaneously. DiRT scenarios are kept secret from participating teams until the exercise begins, to ensure realistic response conditions. Scenarios have included: simulating the loss of the central identity management system (no one can authenticate), simulating a building evacuation (the on-call engineer has to hand off mid-incident), and simulating the loss of internal tooling including the incident management system itself. DiRT findings directly drive infrastructure investment and process improvements.
Planning a game day — the preparation timeline
1. Four weeks before: Select the scenario and define learning objectives. Example scenario: "Primary database for the user service becomes unavailable for 45 minutes during peak traffic." Learning objectives: validate automatic failover works, measure time to detect, validate runbook accuracy, observe cross-team communication.
2. Three weeks before: Identify participants. You need: a facilitator (runs the exercise, knows the scenario), an observer team (watches metrics and records actions without helping), the engineering team that will respond (does not know the scenario), and engineering leadership (available to authorize escalations).
3. Two weeks before: Design the failure injection. Decide exactly what you will do, when you will do it, and how you will restore the system if something goes wrong. Have a documented abort procedure. Brief all non-responding participants on the scenario and their role.
4. One week before: Run a pre-game-day check: verify all monitoring is working, confirm the failure injection method is safe and reversible, notify adjacent teams so they are not alarmed by anomalies, freeze all non-essential changes.
5. The day before: Send a final briefing to all observers and facilitators. Confirm the communications channel for the exercise (a dedicated Slack channel, a bridge line). Confirm the abort criteria and who has authority to call an abort.
6. Day of the game day: Begin with a 15-minute orientation for the responding team — tell them only that a reliability exercise will occur sometime in the next 2 hours and that they should respond as they would to a real incident.
7. After: Run a structured debrief within 2 hours. The facilitator leads a chronological review of the exercise. Every participant answers: what went well, what surprised them, and what would they do differently. Document all findings and create action items with owners and deadlines.
Scenario types for game days — from low to high complexity
Measuring Game Day Learning Outcomes
Game days are only valuable if they produce measurable improvements. Track: time from fault injection to alert fire (detection latency), time from alert to first response action (acknowledgment latency), time from incident declaration to service restoration (MTTR), number of runbook steps that were incorrect or missing, number of cross-team escalations required versus expected, participant confidence scores before and after (self-reported). Compare these metrics across game days over time. Improving MTTR from 45 minutes to 20 minutes across three game days is the kind of evidence that justifies the investment.
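Those latency metrics fall straight out of the timeline the scribe records. A minimal sketch of the computation, assuming a simple list of named, timestamped events (event names are hypothetical labels, not from any specific tool):

```python
from datetime import datetime, timedelta

def game_day_metrics(events):
    """Derive the key game day latencies from a list of (name, timestamp) events."""
    t = dict(events)
    return {
        "detection_latency": t["alert_fired"] - t["fault_injected"],
        "ack_latency": t["first_response_action"] - t["alert_fired"],
        "mttr": t["service_restored"] - t["incident_declared"],
    }

# Hypothetical timeline captured by the scribe
base = datetime(2024, 6, 1, 14, 0)
timeline = [
    ("fault_injected", base),
    ("alert_fired", base + timedelta(minutes=4)),
    ("first_response_action", base + timedelta(minutes=7)),
    ("incident_declared", base + timedelta(minutes=9)),
    ("service_restored", base + timedelta(minutes=41)),
]

m = game_day_metrics(timeline)
print({k: f"{v.total_seconds() / 60:.0f} min" for k, v in m.items()})
```

Storing each game day's metrics in the same shape makes the cross-exercise comparison (the MTTR trend the section describes) a trivial query rather than a manual archaeology project.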
The Facilitator Role: Guide, Not Helper
The game day facilitator has one primary responsibility: keep the exercise running within scope without helping the responding team solve the problem. Facilitators inject the failures, manage the timeline, watch for safety issues, and call an abort if needed. They do not answer questions, provide hints, or assist with diagnosis. If the responding team is stuck and the exercise is revealing a genuine gap, that gap is exactly what the game day was designed to find. The facilitator documents it and moves on.
The only environment that is perfectly representative of production is production. All the traffic patterns, data volumes, user behaviors, hardware configurations, network topologies, and dependency interactions are real. Staging environments, no matter how carefully maintained, are approximations. The logical conclusion — run tests against production — sounds terrifying to most engineering teams. But with appropriate safeguards it is standard practice at Netflix, Google, and other companies that operate at scale.
Testing in production does not mean running load tests against your homepage or killing random servers during peak traffic without warning. It means using production traffic as a testing signal while carefully controlling the risk to users. The key techniques — traffic shadowing, dark launching, and feature flags — are designed to extract production-representative test results while protecting users from the effects of bad code.
Production testing techniques and their safety profiles
The Netflix Testing in Production Philosophy
Netflix has codified testing in production into their engineering culture under the motto "test in production, but safely." Their approach relies on: Zuul (edge proxy) for traffic splitting and shadowing; Hystrix (now in maintenance mode, with Resilience4j as its recommended successor) for circuit breakers that limit the blast radius of failures; Chaos Monkey for continuous production resilience validation; and Kayenta for automated canary analysis. Every Netflix engineer understands that production is the only real test environment — the discipline is in how you limit the risk while extracting the signal.
Feature Flags as Reliability Infrastructure
Feature flags (also called feature toggles) are not just for A/B testing or gradual rollouts. They are reliability infrastructure. Every major feature should be behind a flag that can be disabled in under 30 seconds without a deployment. This turns a "rollback the entire service" scenario into a "flip a flag" scenario. The MTTR for a bad feature goes from 20 minutes (detect, prepare rollback, deploy rollback, verify) to 30 seconds (detect, flip flag, verify). Feature flags are one of the highest ROI reliability investments an engineering team can make.
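The kill-switch mechanic is simple enough to sketch. A minimal illustration, assuming flag state lives in a fast shared store (the in-memory class below is a stand-in; real systems use Redis, a config service, or a vendor like LaunchDarkly or Unleash — all names here are hypothetical):

```python
class FlagStore:
    """In-memory stand-in for a shared flag store (Redis, config service, etc.)."""
    def __init__(self):
        self._flags = {}

    def set(self, name, enabled):
        self._flags[name] = enabled

    def get(self, name, default=False):
        return self._flags.get(name, default)

flags = FlagStore()
flags.set("new_checkout_flow", True)

def checkout(cart):
    # The flag check is the kill switch: flipping it reverts behavior in
    # seconds, with no deployment or rollback required.
    if flags.get("new_checkout_flow"):
        return new_checkout(cart)
    return legacy_checkout(cart)

def new_checkout(cart):
    return {"path": "new", "total": sum(cart)}

def legacy_checkout(cart):
    return {"path": "legacy", "total": sum(cart)}

assert checkout([10, 20])["path"] == "new"
flags.set("new_checkout_flow", False)   # "flip the flag": MTTR measured in seconds
assert checkout([10, 20])["path"] == "legacy"
```

The design point is that both code paths ship together, so disabling the flag is a data change, not a code change — which is exactly why the rollback takes seconds instead of a deploy cycle.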
Testing in Production Is Not Permission to Skip Pre-Production Testing
Testing in production complements pre-production testing — it does not replace it. Functional correctness must be validated before code reaches production. Load and fault injection testing in staging catches capacity and resilience issues before they affect users. Testing in production provides the final layer of validation: real traffic patterns, real data, real user behavior. The appropriate mental model is defense in depth: multiple testing layers, each catching different categories of issues.
Implementing production traffic shadowing — the technical approach
1. Deploy a shadow instance of the new service version alongside the production instance. The shadow must be isolated: it should not write to the production database, not trigger real external API calls, and not send real emails or notifications.
2. Configure your edge proxy (Nginx, Envoy, or a service mesh) to duplicate requests to both the production and shadow instances. In Envoy, configure a request mirror policy (`request_mirror_policies`) on the route; in Nginx, use the `mirror` directive.
3. Capture and compare responses: log the shadow's response status, latency, and response body for every request. Run automated comparison: did the shadow return the same status code as production? Was the response body semantically equivalent? Was the shadow latency within acceptable bounds?
4. Monitor shadow infrastructure separately: the shadow instance should have its own metrics, alerts, and logs. Shadow errors should not page the on-call team for production (they are expected), but should be monitored for analysis.
5. Gradually increase the shadow fraction: start with 1% of traffic shadowed. After validating the shadow infrastructure, increase to 10%, then 50%, then 100%. At 100% shadowing, you have complete production traffic flowing through the new version with zero user impact.
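The automated comparison step can be sketched as follows. All field names and the latency budget are hypothetical; real diffing systems (Twitter's Diffy is the canonical example) add noise filtering and statistical handling of nondeterministic fields on top of this basic shape:

```python
def compare_shadow(prod, shadow, latency_budget=1.5):
    """Compare a production response against its shadow copy.

    Returns a list of discrepancies; an empty list means the shadow matched
    within tolerance. `latency_budget` is the allowed ratio of shadow latency
    to production latency (an illustrative threshold, tune per service).
    """
    issues = []
    if shadow["status"] != prod["status"]:
        issues.append(f"status mismatch: {prod['status']} vs {shadow['status']}")
    # Semantic comparison: ignore fields expected to differ between copies
    volatile = {"request_id", "served_at"}
    prod_body = {k: v for k, v in prod["body"].items() if k not in volatile}
    shadow_body = {k: v for k, v in shadow["body"].items() if k not in volatile}
    if prod_body != shadow_body:
        issues.append("body mismatch")
    if shadow["latency_ms"] > prod["latency_ms"] * latency_budget:
        issues.append(f"latency regression: {shadow['latency_ms']}ms vs {prod['latency_ms']}ms")
    return issues

prod = {"status": 200, "latency_ms": 80,
        "body": {"total": 42, "request_id": "a1"}}
shadow = {"status": 200, "latency_ms": 95,
          "body": {"total": 42, "request_id": "b2"}}
print(compare_shadow(prod, shadow))  # → []
```

Aggregating these per-request results (mismatch rate, latency ratio distribution) is what tells you whether the new version is safe to promote.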
Individual reliability tests — a one-off load test before launch, a chaos experiment run by an enthusiastic engineer, a game day conducted by the SRE team — provide value but not organizational reliability maturity. A reliability testing program is a systematic, repeatable, and continuously improving set of practices that are integrated into the development and deployment lifecycle. The difference between running reliability tests and having a reliability testing program is the difference between exercising occasionally and having a training program.
The maturity model for reliability testing programs maps roughly to the levels introduced in Section 1. But moving from Level 0 to Level 4 is a multi-year journey that requires investment in tooling, culture, and process. The single most impactful first step is always the same: automate your load tests and run them on every significant build. Everything else — fault injection, chaos engineering, game days, production traffic testing — builds on a foundation of continuous load testing.
Key metrics for measuring reliability testing program health
Integrating Reliability Testing into CI/CD
The highest leverage integration point for reliability testing is the CI/CD pipeline. Every build should run: smoke load test (5 minutes, 5% of baseline load, must pass before staging deploy). Every staging deploy should run: full load test (30 minutes, 100% of baseline load). Every production deploy should run: automated canary analysis (30 minutes, 1–5% traffic, auto-rollback on failure). This progression ensures that reliability is continuously validated without gating on infrequent manual processes.
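The canary gate in that progression reduces to comparing canary metrics against a same-sized baseline group and deciding promote or rollback. A hedged sketch with illustrative fixed-ratio thresholds (production systems such as Kayenta use statistical comparison across many metrics rather than two hard-coded ratios):

```python
def canary_verdict(baseline, canary, max_error_ratio=1.5, max_p99_ratio=1.2):
    """Decide pass/fail for a canary against a same-sized baseline group.

    The ratios are illustrative thresholds: fail if the canary's error rate
    or p99 latency regresses beyond them, triggering auto-rollback.
    """
    reasons = []
    # Guard against divide-by-zero when the baseline is error-free
    base_err = max(baseline["error_rate"], 1e-6)
    if canary["error_rate"] / base_err > max_error_ratio:
        reasons.append("error rate regression")
    if canary["p99_ms"] / baseline["p99_ms"] > max_p99_ratio:
        reasons.append("p99 latency regression")
    return ("rollback", reasons) if reasons else ("promote", [])

baseline = {"error_rate": 0.002, "p99_ms": 310}
canary = {"error_rate": 0.009, "p99_ms": 420}
verdict, why = canary_verdict(baseline, canary)
print(verdict, why)
```

Comparing the canary against a freshly deployed baseline replica of the old version, rather than against the whole production fleet, is the standard trick for removing fleet-age and warm-cache effects from the comparison.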
| Program Maturity | Load Testing | Fault Injection | Chaos Engineering | Game Days | Production Testing |
|---|---|---|---|---|---|
| Level 0 — Ad hoc | Manual, pre-launch only | None | None | None | None |
| Level 1 — Repeatable | Scripted, run manually on demand | Manual procedure per service | None | Annual, informal | Synthetic monitoring only |
| Level 2 — Scheduled | Run monthly in CI, results reviewed | Scheduled quarterly per critical service | Manual experiments on top services | Quarterly, documented | Synthetics + basic RUM |
| Level 3 — Automated | Every staging deploy, gates deployment | Automated weekly for all services | Monthly automated experiments | Quarterly with formal outcomes tracking | Feature flags + canary analysis |
| Level 4 — Continuous | Every build, performance regression gates | Continuous automated injection | Continuous in production with auto-blast-radius control | Monthly, full DR test annually | Traffic shadowing + production A/B testing + full RUM SLOs |
Starting a Reliability Testing Program from Zero
If your team has no reliability testing program today, here is the 90-day plan. Days 1–30: instrument your most critical service with a k6 load test. Run it manually and establish a baseline. Add it to your CI pipeline. Days 31–60: write a chaos experiment for the top failure mode identified in your last postmortem. Run it in staging. Document the hypothesis and findings. Days 61–90: run your first game day. Use the scenario from the chaos experiment. Measure detection time and MTTR. Document the gap between actual and target MTTR. This 90-day foundation is enough to identify the most impactful reliability investments and to build organizational confidence in the approach.
The Organizational Prerequisite: Blameless Culture
No reliability testing program succeeds in a blame culture. If engineers fear that a game day exercise that reveals a gap will be used against them in a performance review, they will resist the program. If postmortems are investigations to find who made a mistake rather than systemic explorations of why the system allowed a mistake, they will produce superficial findings and no lasting improvements. Blameless postmortems, psychological safety for raising reliability concerns, and explicit recognition of engineers who identify weaknesses before they become incidents are the cultural prerequisites for a functional reliability testing program.
Building the business case for reliability testing investment
1. Calculate the cost of your last three incidents: engineering hours spent on response and recovery (at fully-loaded hourly cost), revenue lost per minute of downtime (from finance or analytics), customer churn attributed to the incidents (from customer success), and reputational cost (support tickets, social media mentions, customer escalations).
2. Estimate what each incident would have cost if caught by a reliability test: the test running time, the engineer time to write the test, and the engineer time to fix the issue in staging. This is typically 5–20x lower than the production incident cost.
3. Calculate the ROI: divide the production incident cost by the estimated test cost. An incident that costs $500,000 in lost revenue and engineering time, preventable by a $25,000 testing investment, has a 20x ROI. Present this math to engineering leadership.
4. Start with the highest-ROI investments: automated load tests (low cost, high catch rate for capacity issues) and automated canary analysis (medium cost, prevents bad deployments from reaching all users) before investing in expensive chaos engineering programs.
5. Track and report quarterly: present the number of issues caught by reliability tests before production, the estimated production cost of those issues, and the actual cost of the testing program. This data drives continued investment.
Reliability testing questions appear in SRE interviews at FAANG companies both as direct technical questions and embedded in system design scenarios. System design interviews will probe whether you include load testing, canary deployments, and chaos engineering in your design proposals — candidates who only describe functional architecture without discussing reliability validation are missing a critical dimension. Behavioral interviews ask about past incidents and what testing would have caught the failure earlier. Staff-level SRE interviews specifically probe whether you can design and run a reliability testing program, not just execute individual tests.
Common questions:
What is the difference between a stress test and a soak test, and what distinct failure categories does each one catch?
A stress test ramps load beyond the expected production peak until the system fails, finding the breaking point and the failure mode at that point (connection pool exhaustion, CPU saturation, queue overflow). The stress test reveals the system's hard capacity limits and how it fails — catastrophically or gracefully. A soak test runs at 70–80% of expected peak load for 12–72 hours without stopping, finding failure modes that only emerge from sustained operation: memory leaks, connection pool drift, disk accumulation, log rotation failures, gradual GC pressure increase. The stress test is a sprint; the soak test is a marathon. Both are required because a system can pass a 30-minute load test at peak load and still fail after 48 hours of sustained operation at moderate load.
Why does Netflix run chaos experiments during business hours instead of off-peak windows, and what does this reveal about the purpose of chaos engineering?
Netflix runs chaos experiments during business hours because that is when the system is most stressed (highest traffic, most concurrent users, most dependencies under load) and when the most experienced engineers are available to observe and respond. Running chaos experiments at 3am on low traffic tells you how the system behaves at 3am on low traffic — not during the Friday evening peak when a real failure would be most damaging. More fundamentally, this reveals that chaos engineering is not about proving the system is broken; it is about building confidence in resilience under realistic conditions. An experiment that passes at 3am on minimal traffic provides almost no confidence about Friday evening behavior.
Amazon's 2004 pre-AWS holiday outage was caused by testing DR failover only at low-traffic windows. Explain specifically how production-representative load changes what a DR test can reveal, using the replication lag bug as an example.
At low traffic, database write volume is minimal, so the primary-to-replica replication lag stays near zero (milliseconds to seconds). The failover code that assumed near-zero lag worked correctly because the assumption was true. At production holiday load, write volume was massive, causing replication lag to grow to minutes. The same failover code, making the same near-zero-lag assumption, now made incorrect data consistency decisions. The bug was invisible at low traffic because the condition that triggered it (significant replication lag) never occurred. This illustrates a general principle: failure modes that emerge from scale, load, or sustained operation are categorically different from failures that appear at low traffic. Any test that does not match production conditions filters out the entire class of scale-dependent failure modes, which are often the most severe because they only manifest when the system is already under maximum stress.
Key takeaways
💡 Analogy
Reliability testing is earthquake preparedness for software systems. You do not wait for an earthquake to find out whether your evacuation plan works. Buildings are tested with controlled stress on shake tables before they are occupied. Evacuation drills happen regularly, with feedback loops that improve the plan after each drill. Backup emergency systems are tested monthly. The difference between a building that survives a major earthquake and one that collapses is almost always whether it was designed and tested with earthquakes in mind — not whether the architects hoped it would be OK. In software, the difference between a team that survives a major outage (database failure, cloud provider outage, traffic spike) and one that endures a six-hour all-hands incident is almost always whether they practiced: whether they ran load tests before launch, whether they tested the failover procedure last week rather than eighteen months ago, whether they ran a game day that revealed the exact failure mode that just hit production.
⚡ Core Idea
Create controlled failures before the real ones happen. Load tests find capacity limits. Fault injection tests find resilience gaps. Chaos engineering finds unknown weaknesses. Game days train the team. DR tests validate recovery. Canary analysis gates deployments. None of these are optional in a production system that is expected to be reliable.
🎯 Why It Matters
A production system that has never been tested for reliability is a system that has been tested for reliability by its users — at the worst possible time, with no abort button. Every reliability test you run before users encounter a failure is an opportunity to fix the problem cheaply, in a controlled environment, without waking anyone up at 3am. The cost of reliability testing is a fraction of the cost of production incidents. The math is not close.