System failover and recovery

State transitions during primary failure, failover, and recovery to primary.

When a primary system fails, you need a picture of what happens: how you detect it, when you switch to the backup, how you sync data, and when you can safely go back to primary. This state diagram makes the recovery path explicit, so nobody has to guess when to fail back, whether to reconcile, or what state the system is in at 3 a.m. during an incident.

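The states are the key landmarks: Primary (normal operation), Detecting (a real failure or just a blip?), Failover (the switch is in progress), Secondary_Live (the backup is handling traffic), Recovering (primary is back but stale), and Reconcile (which data should we trust?). The intermediate states Secondary, Sync, Secondary_Ready, and Primary_Sync cover promotion and data sync. Each transition is labeled with the trigger that causes it, whether automation fires it or the on-call engineer does.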

When to use this template

  • Designing HA infrastructure — map out your failover policy before you build it: manual vs. automatic detection, data reconciliation strategy, and when you failback.
  • Incident playbooks — put this diagram in your runbook so on-call engineers know what state the system is in and what to do next.
  • Disaster recovery drills — walk through the diagram during a drill, marking which transitions are automatic (detected and switched in 30 seconds) vs. manual (you approve failback after reviewing logs).

How to adapt it

Start by replacing "Primary" and "Secondary" with your real systems (e.g., US-East primary, US-West secondary, or read-write master and read-only replica):

  • Automatic detection — if you use heartbeat or health-check monitoring, add a note on the Detecting state: "triggered by 3 consecutive failed health checks".
  • Data consistency strategy — show whether you sync before failover (safer, slower) or after (faster, risk of data loss) by adjusting the Sync and Secondary_Ready states.
  • Multiple secondaries — if you have two secondaries, add parallel branches showing which becomes primary and which becomes the new secondary.
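
As a rough sketch of the first two steps, the small diagram below renames states and documents the detection rule. It assumes the US-East / US-West example names from above and only covers the path up to the switch; the state IDs are the template's own, so the same lines can be merged into the full diagram.

```mermaid
stateDiagram-v2
    %% Display names replace the generic labels; the state IDs stay the same
    Primary: US-East (primary)
    Secondary: US-West (secondary)

    [*] --> Primary
    Primary --> Detecting: Primary fails
    Detecting --> Primary: False alarm
    Detecting --> Failover: Failure confirmed
    Failover --> Secondary: Switch to secondary

    %% Detection rule documented on the state itself
    note right of Detecting
        triggered by 3 consecutive failed health checks
    end note
```

Keeping the IDs unchanged means the existing transitions keep working; only the text shown in each box changes.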

Mermaid code

Copy it anywhere Mermaid is supported — GitHub, Notion, or your docs.

stateDiagram-v2
    [*] --> Primary
    
    Primary --> Detecting: Primary fails
    Detecting --> Failover: Failure confirmed
    Detecting --> Primary: False alarm
    
    Failover --> Secondary: Switch to secondary
    Secondary --> Sync: Promote secondary
    Sync --> Secondary_Ready: Replication synced
    Secondary_Ready --> Secondary_Live: Traffic switched
    
    Secondary_Live --> Recovering: Primary recovered
    Recovering --> Reconcile: Data sync needed?
    Reconcile --> Primary_Sync: Sync from secondary
    Primary_Sync --> Primary: Primary ready
    
    Secondary_Live --> Primary: Failback if primary healthy

Frequently asked questions

What is a failover state diagram?
It shows how a system moves through states during and after a failure: from running on the primary instance, through detecting a failure and switching to a secondary, to recovering back to primary. Each arrow is a trigger (failure detected, secondary synced, primary recovered), making the recovery procedure explicit so the team knows who acts when and in what order.
Why do we need detection and reconciliation states — why not just switch instantly to secondary?
Instant failover without confirmation can turn a network blip into a disaster: the primary might be healthy but unreachable, and promoting the secondary then creates a split-brain (two systems both acting as primary). The Detecting state adds a confirmation check (e.g., three consecutive failed health checks). The Reconcile state handles writes that were committed on primary but had not yet replicated to secondary when the failure hit, along with conflicting writes that clock skew makes hard to order. Reconciling them prevents data loss.
What triggers failback to primary after secondary has been live?
Usually a manual decision: 'the primary is stable now, switch back.' The diagram includes a direct failback arrow from Secondary_Live to Primary, but treat it as optional because failback is riskier than failover: writes accepted on secondary while primary was down are lost unless they are synced back to primary first. Some teams keep the secondary as the acting primary until a scheduled maintenance window. Add a condition such as 'Failback approved by on-call' if your team requires manual sign-off.
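
One way to draw that sign-off, as a sketch rather than part of the template: swap the direct failback arrow for a gate state. The state name Awaiting_Approval and the 'Deferred' label are hypothetical; the approval label is the one suggested above.

```mermaid
stateDiagram-v2
    %% Awaiting_Approval is a hypothetical gate state, not in the template
    Secondary_Live --> Awaiting_Approval: Primary healthy
    Awaiting_Approval --> Primary: Failback approved by on-call
    Awaiting_Approval --> Secondary_Live: Deferred to maintenance window
```

With the gate in place, a healthy primary alone is no longer enough to move traffic back.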
How do I adapt this for active-active systems or canary failover?
For active-active (both primary and secondary handle traffic), add parallel states for both and show a gradual traffic shift instead of an instant switch. For canary failover (route 5% of traffic to secondary first), add an intermediate 'Canary' state on the failover path, for example between 'Secondary_Ready' and 'Secondary_Live', to show the gradual traffic ramp. Both patterns are extensions of this basic state machine.
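
A minimal sketch of the canary variant, assuming the Canary state sits just before Secondary_Live on the failover path. The transition labels and the rollback arrow are illustrative assumptions, not part of the template.

```mermaid
stateDiagram-v2
    %% Canary is an added state; labels and rollback path are assumptions
    Secondary_Ready --> Canary: Route 5 percent of traffic
    Canary --> Secondary_Live: Canary healthy, ramp to full traffic
    Canary --> Secondary_Ready: Canary unhealthy, stop ramp
```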
