Question 1

What is a failover state diagram?

Accepted Answer

It shows how a system moves through states during and after a failure: running on the primary instance, detecting a failure, switching to a secondary, and recovering back to primary. Each arrow is labeled with its trigger (failure detected, secondary synced, primary recovered), making the recovery procedure explicit so the team knows who acts when and in what order.
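
As a rough illustration, the same diagram can be written down as a transition table. This is only a sketch: the state names (`Primary`, `Detecting`, `Secondary_Live`, `Reconciling`) and the trigger names are assumptions chosen for the example, so rename them to match your own diagram.

```python
# Hypothetical state and trigger names; adjust them to match your own diagram.
TRANSITIONS = {
    ("Primary", "health_check_failed"): "Detecting",
    ("Detecting", "health_check_passed"): "Primary",        # false alarm
    ("Detecting", "failure_confirmed"): "Secondary_Live",   # failover
    ("Secondary_Live", "primary_recovered"): "Reconciling",
    ("Reconciling", "writes_reconciled"): "Primary",        # failback
}


def next_state(state: str, trigger: str) -> str:
    """Return the state reached from `state` on `trigger`.

    Raises ValueError for any arrow that is not drawn in the diagram.
    """
    try:
        return TRANSITIONS[(state, trigger)]
    except KeyError:
        raise ValueError(f"no transition from {state!r} on {trigger!r}") from None


def run(triggers, state="Primary"):
    """Apply a sequence of triggers and return every state visited."""
    path = [state]
    for trigger in triggers:
        state = next_state(state, trigger)
        path.append(state)
    return path
```

Calling `run(["health_check_failed", "failure_confirmed"])` walks the failover path, while a trigger that has no arrow from the current state raises an error instead of silently moving the system.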

Question 2

Why do we need detection and reconciliation states — why not just switch instantly to secondary?

Accepted Answer

Instant failover without confirmation can turn a network blip into a disaster: the primary might be healthy but unreachable, and promoting the secondary then creates a split-brain (two systems both acting as primary). The Detecting state adds a confirmation check (e.g., three failed health checks) before the switch. Reconciliation handles clock skew and writes that were accepted on primary but not yet replicated to secondary; reconciling them prevents data loss.
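
A minimal sketch of both ideas, under stated assumptions: the detector treats the "three failed health checks" as three consecutive failures, and writes are identified by simple hashable IDs. The class, function, and state names are hypothetical and do not come from the template.

```python
class FailureDetector:
    """Confirms a failure only after `threshold` consecutive failed checks."""

    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.consecutive_failures = 0

    def record(self, check_passed: bool) -> str:
        """Record one health-check result and return the resulting state."""
        if check_passed:
            # A single success resets the count: the failure was a blip.
            self.consecutive_failures = 0
            return "Primary"
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.threshold:
            return "Failover_Confirmed"
        return "Detecting"


def unreplicated_writes(primary_log, secondary_log):
    """Return writes present on primary but missing from secondary, in order."""
    seen = set(secondary_log)
    return [write for write in primary_log if write not in seen]
```

With the default threshold, two failed checks followed by a success return the system to `Primary`; only a third failure in a row confirms the failover. `unreplicated_writes` gives the list that the reconciliation step would need to replay or review.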

Question 3

What triggers failback to primary after secondary has been live?

Accepted Answer

Usually a manual decision: 'the primary is stable now, switch back.' This diagram shows an optional Failback arrow because failback is riskier than failover: writes that happened on secondary while primary was down are lost unless they are synced back to primary first. Some teams keep the secondary serving as primary until a scheduled maintenance window. Add conditions like 'Failback approved by on-call' if your team requires manual sign-off.
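
The conditions on the Failback arrow can be sketched as a guard function. This is an illustration only: the field names and the `require_approval` flag are assumptions, and a real check would read these values from your monitoring and replication tooling.

```python
from dataclasses import dataclass


@dataclass
class FailbackRequest:
    primary_healthy: bool          # primary has passed its health checks
    pending_secondary_writes: int  # writes on secondary not yet copied to primary
    approved_by_on_call: bool      # manual sign-off, if your team requires it


def can_fail_back(req: FailbackRequest, require_approval: bool = True) -> bool:
    """Return True only when every failback condition holds."""
    if not req.primary_healthy:
        return False
    if req.pending_secondary_writes > 0:
        return False  # switching now would drop these writes
    if require_approval and not req.approved_by_on_call:
        return False
    return True
```

Keeping the guard separate from the switch itself makes it easy to show each condition as a label on the Failback arrow.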

Question 4

How do I adapt this for active-active systems or canary failover?

Accepted Answer

For active-active (both primary and secondary handle traffic), add parallel states for both instances and show a gradual traffic shift instead of an instant switch. For canary failover (route 5% of traffic to secondary first), add an intermediate 'Canary' state between 'Primary' and 'Secondary_Live' to show the gradual traffic ramp. Both patterns are extensions of this basic state machine.
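
One way to picture the canary ramp is as a sequence of traffic percentages, with the system staying in 'Canary' until all traffic has moved. The ramp steps (5, 25, 50, 100) and the modulo-based routing rule below are assumptions for the sketch, not part of the template.

```python
def canary_ramp(steps=(5, 25, 50, 100)):
    """Yield (state, percent_to_secondary) pairs for a gradual traffic shift."""
    yield ("Primary", 0)
    for percent in steps:
        state = "Secondary_Live" if percent >= 100 else "Canary"
        yield (state, percent)


def route(request_id: int, percent_to_secondary: int) -> str:
    """Deterministically send a fixed share of request IDs to secondary."""
    return "secondary" if request_id % 100 < percent_to_secondary else "primary"
```

At the first step, `route` sends 5 out of every 100 consecutive request IDs to secondary. For active-active, the same `route` function can hold a steady split (for example 50) rather than ramping toward 100.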

System failover and recovery

Related templates

Service degradation strategy

Incident state machine

API error handling flow