Incident state machine
States and transitions during an incident from detection to postmortem.
Every team has an incident process, and every team discovers gaps in it during a crisis. This template maps the journey from alert to postmortem: Detected (an alert fired or a user reported an issue), Investigating (hunting for the cause), Escalated (severity increased and a lead is being assigned), Mitigating (applying a fix), Resolved (the service is healthy), Postmortem (the 24-hour follow-up), and Closed (action items are assigned for future prevention).
The key insight is that Investigating and Mitigating are separate states: a good incident process never waits for root cause before acting — you suppress symptoms, restore service, and hunt for the cause in parallel. Once you can see the states your incidents actually flow through, you can measure how long each phase takes and optimize the bottlenecks.
When to use this template
- Incident process documentation — post this in your runbooks so every engineer knows which state the incident is in, who owns the next step, and what they should be doing right now.
- Metrics and dashboards: measure mean-time-to-detect, mean-time-to-mitigate, and mean-time-to-postmortem by timestamping each transition. Comparing these numbers across incidents reveals which ones you handle well and which ones spiral.
- Training new team members — the state machine replaces a 10-minute conversation with a visual contract: "when do we escalate, when do we call the lead, when is it safe to turn monitoring back on?"
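As a minimal sketch of the metrics use case above, the Python below derives per-state durations from a log of transitions. The log format (state name plus the time it was entered) and the sample timestamps are assumptions made for illustration, not something the template prescribes.

```python
from datetime import datetime, timedelta

# Hypothetical transition log for one incident: (state entered, when).
# The timestamps are invented for illustration.
incident_log = [
    ("Detected", datetime(2024, 1, 1, 10, 0)),
    ("Investigating", datetime(2024, 1, 1, 10, 5)),
    ("Mitigating", datetime(2024, 1, 1, 10, 35)),
    ("Resolved", datetime(2024, 1, 1, 11, 0)),
]


def time_in_states(log):
    """Sum the time spent in each state, using the next entry as the exit time."""
    durations = {}
    for (state, entered), (_next_state, left) in zip(log, log[1:]):
        durations[state] = durations.get(state, timedelta()) + (left - entered)
    return durations


def time_to_reach(log, target):
    """Elapsed time from the first log entry until `target` is first entered."""
    start = log[0][1]
    for state, entered in log:
        if state == target:
            return entered - start
    return None  # the incident never reached that state


print(time_in_states(incident_log))
print(time_to_reach(incident_log, "Mitigating"))
```

Averaging `time_to_reach(log, "Mitigating")` over many incidents would give a mean-time-to-mitigate measured from detection. Mean-time-to-detect also needs the time the problem actually started, which a log that begins at Detected does not record.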
How to adapt it
Extend the state machine to match your organization's reality:
- Add severity-based branching: P1 and P2 incidents might follow different paths (P1 goes straight to Escalated, P2 stays in Investigating).
- Insert a staging state between Resolved and Postmortem where you run a post-incident checklist (communication, metrics review, blame-free retrospective).
- Add transitions from Mitigating or Postmortem back to Investigating for cases where new information comes to light.
The visual editor makes it easy to reorder states and add transitions — your changes regenerate clean Mermaid state diagram syntax, so the result fits straight into your incident-management documentation.
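If you would rather script the adaptations listed above, one option is to keep the extra transitions as data and print them as Mermaid lines. This is only a sketch: the `PostIncidentChecklist` state name and the transition labels are hypothetical, and it does not describe how the visual editor produces its output.

```python
def mermaid_lines(transitions):
    """Format (source, target, label) tuples as stateDiagram-v2 transition lines."""
    return [f"{src} --> {dst}: {label}" for src, dst, label in transitions]


# Hypothetical additions; state names and labels are illustrative only.
extra_transitions = [
    ("Mitigating", "Investigating", "New information"),
    ("Postmortem", "Investigating", "New information"),
    ("Resolved", "PostIncidentChecklist", "Service healthy"),
    ("PostIncidentChecklist", "Postmortem", "Checklist complete"),
]

print("\n".join(mermaid_lines(extra_transitions)))
```

Paste the printed lines into the diagram under `stateDiagram-v2`. If you add the checklist state, remove the template's direct `Resolved --> Postmortem` line so the checklist cannot be skipped.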
Mermaid code
Copy it anywhere Mermaid is supported — GitHub, Notion, or your docs.
stateDiagram-v2
[*] --> Detected
Detected --> Investigating: Alert escalated
Investigating --> Mitigating: Mitigation identified
Investigating --> Escalated: Severity increased
Escalated --> Mitigating: Lead assigned
Mitigating --> Resolved: Fix deployed
Resolved --> Postmortem: 24h follow-up
Postmortem --> Closed: Action items assigned
Closed --> [*]
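If you want tooling to enforce the diagram, the same transitions can be kept as a lookup table. The Python below is a sketch of one way to do that, with the transitions transcribed from the diagram and the labels omitted; `TRANSITIONS` and `advance` are hypothetical names, not part of the template.

```python
# Allowed transitions, transcribed from the diagram (labels omitted).
TRANSITIONS = {
    "Detected": {"Investigating"},
    "Investigating": {"Mitigating", "Escalated"},
    "Escalated": {"Mitigating"},
    "Mitigating": {"Resolved"},
    "Resolved": {"Postmortem"},
    "Postmortem": {"Closed"},
    "Closed": set(),  # terminal state
}


def advance(current, target):
    """Return `target` if the diagram allows current -> target, else raise."""
    if target not in TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal transition: {current} -> {target}")
    return target


state = "Detected"
for step in ["Investigating", "Escalated", "Mitigating", "Resolved"]:
    state = advance(state, step)
print(state)
```

An incident tracker built this way would reject a jump such as Detected to Resolved, which makes skipped steps visible instead of silently accepted.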
Frequently asked questions
- What is an incident state machine?
- It's a diagram showing every distinct phase an incident passes through, from the moment an alert fires through resolution and the postmortem. Each state represents who is doing what, and the transitions show what event or decision moves you to the next phase. It makes incident management predictable and trainable.
- Why does this template separate Investigating from Mitigating?
- Because they are fundamentally different: investigating means finding the root cause (which can take hours), while mitigating means reducing customer impact immediately (which should take minutes). Separating them clarifies whether your team is still hunting for the cause or already executing a fix.
- What if an incident gets downgraded during Escalated?
- Add a transition from Escalated back to Investigating or Mitigating — not every incident turns into a P1. A state machine is a policy, and it should match your real decision tree. If you sometimes stand down, draw that path.
- How do I adapt this for my incident severity levels?
- Expand the Investigating state into p1-investigating and p2-investigating, each with different escalation thresholds and personnel. Or add a Decision state after Detected that branches based on severity. The visual editor lets you add states and transitions without writing Mermaid syntax by hand, so you can build your exact process visually.