System failover and recovery
State transitions during primary failure, failover, and recovery to primary.
When a primary system fails, you need a clear picture of what happens next: how the failure is detected, when traffic switches to the backup, how data is synced, and when it is safe to return to the primary. This state diagram makes the recovery path explicit, so nobody has to guess when to fail back, whether to reconcile, or what state the system is in at 3 a.m. during an incident.
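The states are the key landmarks: Primary (normal operation), Detecting (has something failed, or is it just a blip?), Failover (the switch is in progress), Secondary_Live (the backup is handling traffic), Recovering (the primary is back but stale), and Reconcile (can we trust the data?). Intermediate states such as Sync and Secondary_Ready mark the steps in between. Each transition is labeled with the trigger that causes it, whether it fires automatically or waits for the on-call engineer.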
When to use this template
- Designing HA infrastructure: map out your failover policy before you build it, including manual vs. automatic detection, the data reconciliation strategy, and when you fail back.
- Incident playbooks: put this diagram in your runbook so on-call engineers know what state the system is in and what to do next.
- Disaster recovery drills: walk through the diagram during a drill, marking which transitions are automatic (detected and switched in 30 seconds) and which are manual (you approve failback after reviewing logs).
How to adapt it
Start by replacing "Primary" and "Secondary" with your real systems (e.g., US-East primary and US-West secondary, or a read-write master and a read-only replica). Then adjust the details:
- Automatic detection: if you use heartbeat or health-check monitoring, add a note on the Detecting state, such as "triggered by 3 consecutive failed health checks".
- Data consistency strategy: show whether you sync before failover (safer, slower) or after (faster, risk of data loss) by reordering or removing the Sync and Secondary_Ready states.
- Multiple secondaries: if you have two secondaries, add parallel branches showing which one is promoted to primary and which one stays a secondary.
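As a minimal sketch of the renaming and the Detecting note described above: the fragment below keeps the original state IDs and uses `state "..." as` aliases for display names, so existing transitions keep working. The region names come from the example above; the alias wording and the note placement are illustrative assumptions, not part of the template.

```mermaid
stateDiagram-v2
    %% Display names are illustrative; the IDs match the template
    state "US-East primary" as Primary
    state "US-West live" as Secondary_Live

    [*] --> Primary
    Primary --> Detecting: Primary fails
    Detecting --> Primary: False alarm
    Detecting --> Failover: Failure confirmed

    %% Documents the detection rule next to the state it applies to
    note right of Detecting
        Triggered by 3 consecutive
        failed health checks
    end note
```

Because only the display names change, the rest of the template's transitions can be pasted below this fragment unchanged.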
Mermaid code
Copy it anywhere Mermaid is supported — GitHub, Notion, or your docs.
stateDiagram-v2
    [*] --> Primary
    Primary --> Detecting: Primary fails
    Detecting --> Failover: Failure confirmed
    Detecting --> Primary: False alarm
    Failover --> Secondary: Switch to secondary
    Secondary --> Sync: Promote secondary
    Sync --> Secondary_Ready: Replication synced
    Secondary_Ready --> Secondary_Live: Traffic switched
    Secondary_Live --> Recovering: Primary recovered
    Recovering --> Reconcile: Data sync needed?
    Reconcile --> Primary_Sync: Sync from secondary
    Primary_Sync --> Primary: Primary ready
    Secondary_Live --> Primary: Failback if primary healthy
Frequently asked questions
- What is a failover state diagram?
- It shows how a system moves through states during and after a failure: running on the primary instance, detecting a failure, switching to a secondary, and recovering back to the primary. Each arrow is a trigger (failure detected, secondary synced, primary recovered), which makes the recovery procedure explicit so the team knows who acts when and in what order.
- Why do we need detection and reconciliation states — why not just switch instantly to secondary?
- Instant failover without confirmation can turn a network blip into a disaster: the primary might be healthy but unreachable, and promoting the secondary then creates a split-brain (two systems acting as primary at once). The Detecting state adds a confirmation check (e.g., three failed health checks). Reconciliation handles writes that were committed on the primary but not yet replicated to the secondary, as well as ordering conflicts caused by clock skew; resolving them before the primary takes traffic again prevents data loss.
- What triggers failback to primary after secondary has been live?
- Usually a manual decision: 'the primary is stable now, switch back.' This diagram shows the direct Failback arrow as optional because failback is riskier than failover: the arrow skips the Reconcile path, so writes accepted on the secondary while the primary was down are lost unless they have been synced back first. Some teams keep the secondary as the acting primary until a scheduled maintenance window. Add conditions like 'Failback approved by on-call' if your team requires manual sign-off.
- How do I adapt this for active-active systems or canary failover?
- For active-active (both primary and secondary handle traffic), add parallel states for both and show a gradual traffic shift instead of an instant switch. For canary failover (route 5% of traffic to the secondary first), add an intermediate 'Canary' state on the path into 'Secondary_Live' (for example, between 'Secondary_Ready' and 'Secondary_Live') to show the gradual traffic ramp. Both patterns are extensions of this basic state machine.
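The canary and sign-off adaptations mentioned in the answers above could look roughly like the fragment below. It is a sketch under assumptions: the Canary state is placed just before Secondary_Live because the 5% ramp happens before the secondary takes all traffic, and the rollback transition and label wording are hypothetical additions rather than part of the template.

```mermaid
stateDiagram-v2
    %% Canary ramp: replaces the direct Secondary_Ready --> Secondary_Live arrow
    Secondary_Ready --> Canary: Route 5% of traffic
    Canary --> Secondary_Live: Canary healthy, ramp to full traffic
    %% Hypothetical rollback path if the canary misbehaves
    Canary --> Secondary_Ready: Canary unhealthy, roll back

    %% Manual sign-off: replaces the "Failback if primary healthy" label
    Secondary_Live --> Primary: Failback approved by on-call
```

If you adopt this, remove the template's original `Secondary_Ready --> Secondary_Live` and `Secondary_Live --> Primary` lines so each pair of states has only one arrow.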
Related templates
Service degradation strategy
Detect failures, trigger graceful fallbacks, maintain partial service.
Incident state machine
States and transitions during an incident from detection to postmortem.
API error handling flow
Client-side error handling strategies for API requests and failures.