Chaos engineering experiment
Automated chaos test phases: hypothesis, failure injection, observation, analysis, and recovery.
Chaos engineering runs intentional failure experiments to verify that systems tolerate and recover from outages gracefully. Instead of waiting for users to report a production incident, chaos teams proactively crash pods, drop packets, inject latency, or consume resources — then observe whether the system detects the failure, fails over correctly, and heals without manual intervention.
This template maps the chaos engineering timeline: hypothesis (what do we expect to happen?), failure injection (kill a pod, simulate latency), observation (collect metrics and logs), analysis (did the system behave as expected?), and recovery (did it heal automatically?). The insights from each experiment feed into runbooks and on-call playbooks that guide real-world incident response.
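As a minimal sketch of how the hypothesis and analysis phases fit together, the check below compares observed metrics against an expectation written down before injection. It is illustrative only: the `Hypothesis` class, the `p99` and `evaluate` helpers, and the threshold values are assumptions introduced here, not part of the template or of any chaos tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hypothesis:
    """Steady-state expectation, stated before the failure is injected."""
    description: str
    p99_latency_ms_max: float  # illustrative SLO threshold (assumption)
    max_failed_requests: int   # illustrative error allowance (assumption)


def p99(samples: list[float]) -> float:
    """Nearest-rank 99th percentile of the observed latencies."""
    if not samples:
        raise ValueError("no latency samples were collected")
    ordered = sorted(samples)
    # Integer ceiling of 0.99 * n, giving the 1-based nearest-rank position.
    rank = -(-99 * len(ordered) // 100)
    return ordered[rank - 1]


def evaluate(
    hypothesis: Hypothesis,
    latencies_ms: list[float],
    failed_requests: int,
) -> tuple[bool, list[str]]:
    """Return (passed, findings) by comparing observations to the hypothesis."""
    findings: list[str] = []
    observed = p99(latencies_ms)
    if observed > hypothesis.p99_latency_ms_max:
        findings.append(
            f"p99 latency {observed:.0f} ms exceeded "
            f"{hypothesis.p99_latency_ms_max:.0f} ms"
        )
    if failed_requests > hypothesis.max_failed_requests:
        findings.append(
            f"{failed_requests} failed requests exceeded the allowance of "
            f"{hypothesis.max_failed_requests}"
        )
    return (not findings, findings)
```

Calling `evaluate` with the latencies and error count gathered during the observation phase returns a pass/fail flag plus the list of violated expectations, which can go straight into the experiment report.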
When to use this template
- Chaos experiment planning: organize your test phases with time estimates so teams know what to watch for and for how long. A time-boxed plan keeps the experiment focused and gives it a clear stopping point.
- Resilience architecture review: walk the team through each experiment phase and discuss which signals indicate success (latency stayed within SLO, no customer-facing errors) and which indicate failure (cascading failures, data corruption).
- On-call training: use successful chaos experiments as runbook examples. When an engineer is paged at 3 AM about a pod crash, they have already seen the same failure pattern and recovery sequence in your chaos timeline.
How to adapt it
Adjust the phases and timeline to match your system:
- Expand Execution if you run multiple failure modes in parallel (e.g., pod kill AND network latency AND high CPU) to test cascading failures.
- Add a Rollback phase between Observe and Analysis if your system can't recover automatically (e.g., data corruption that requires manual intervention).
- Extend Planning if you need approval from compliance or security before running chaos in production (government or financial systems often do).
Visual edits regenerate the Gantt syntax, so you can sketch different experiment timelines without writing date/duration math.
Mermaid code
Copy it anywhere Mermaid is supported — GitHub, Notion, or your docs.
gantt
title Chaos Engineering Experiment Timeline
dateFormat YYYY-MM-DD
section Planning
Define hypothesis :hyp, 2026-06-27, 1d
Baseline metrics :base, after hyp, 1d
section Execution
Inject failure (pod kill) :inject, after base, 30m
Observe impact :obs, after inject, 2h
section Analysis
Collect logs & traces :logs, after obs, 1h
Root cause analysis :rca, after logs, 2h
section Recovery
Rollback/heal system :heal, after rca, 30m
Verify system healthy :verify, after heal, 1h
section Reporting
Write runbook :book, after verify, 2h
Debrief with team :debrief, after book, 1h
Frequently asked questions
- What is chaos engineering?
- Chaos engineering deliberately injects failures into production or staging systems to test how well they tolerate and recover from outages. Instead of assuming infrastructure is reliable, chaos experiments assume failure is inevitable and verify that your system handles it gracefully — by retrying, failing over, or degrading functionality. Teams run chaos experiments regularly to build confidence in resilience before users experience real outages.
- What kind of failures should I inject?
- Start with high-impact, likely failures: pod crashes (simulating deployment rollouts or node failures), network latency (simulating geographic distance or congestion), a full disk (simulating database bloat), and dependency timeouts (simulating a degraded downstream service). Rank candidate failures by business impact and probability. Run experiments in staging first, then in production during low-traffic windows once you've gained confidence.
- How do I know if my system passed the chaos test?
- Before injecting failure, define your hypothesis: 'If a database pod crashes, the load balancer redirects traffic to replicas and requests succeed within SLO (e.g., p99 latency < 500ms).' During the test, collect metrics and logs. After, compare actual behavior to the hypothesis. If users experienced errors, your system needs better redundancy or failover logic. If latency spiked but stayed in SLO, you passed — the system is resilient to that failure.
- Can chaos engineering break production?
- Yes, if done carelessly. Mitigate the risk in four ways: stage your rollout (staging first, then low-traffic production windows), limit the blast radius (kill one pod at a time, not the entire service), have a rollback plan (abort the experiment after N minutes, even if the system hasn't recovered), and monitor continuously so that dashboards and alerts surface problems while the experiment is still running. Chaos tools like Gremlin and Pumba let you set blast radius and time limits.
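The blast-radius and time-limit safeguards from the last answer can be sketched as a small guard around the injection step. This is a hedged illustration, not the API of Gremlin, Pumba, or any other tool: `SafetyLimits`, `pick_targets`, and `run_with_abort` are hypothetical names, and the `inject`, `rollback`, and `is_healthy` callables stand in for whatever your tooling provides.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class SafetyLimits:
    max_targets: int       # blast radius: most instances one run may touch
    max_duration_s: float  # hard time limit before the experiment is aborted


def pick_targets(candidates: list[str], limits: SafetyLimits) -> list[str]:
    """Cap the blast radius by taking at most max_targets instances."""
    return candidates[: limits.max_targets]


def run_with_abort(
    inject: Callable[[], None],
    rollback: Callable[[], None],
    is_healthy: Callable[[], bool],
    limits: SafetyLimits,
    poll_s: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Inject a fault, poll for recovery, and always remove the fault afterwards.

    Returns "recovered" if the health check passes before the time limit,
    otherwise "aborted". Rollback runs in both cases.
    """
    deadline = clock() + limits.max_duration_s
    inject()
    try:
        while clock() < deadline:
            if is_healthy():
                return "recovered"
            sleep(poll_s)
        return "aborted"
    finally:
        rollback()
```

Passing `clock` and `sleep` as parameters is a design choice made for this sketch so the time limit can be exercised in a unit test without real waiting.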
Related templates
Load testing strategy flowchart
Plan baseline, ramp up, spike, and soak tests to validate system performance.
Auto-scaling decision tree
CPU, memory, request volume, and cost trade-off decisions.
Database backup and recovery process
Disaster recovery decision flow from incident to restore.