Database read replica failover strategy
Detect replica lag, promote a replica, re-replicate, and minimize downtime.
Database replication keeps your system online if the primary fails — but only if the replica is caught up. This template walks through the decision: monitor lag, detect when it's safe to promote, execute the promotion, test it, and rebuild the replica pipeline. Each step is a risk point: promoting a lagged replica means data loss, skipping tests means downtime when the new primary crashes, and a failed re-sync leaves you with one database instead of two.
When to use this template
- Designing high-availability infrastructure — talk through your team's RTO and RPO (recovery time and recovery point objectives) and agree on the lag thresholds that trigger promotion. This diagram is the policy.
- Incident runbooks — when the primary goes down at 2am, your on-call team needs a clear checklist. Print this or post it in your war room as the decision tree.
- Compliance and disaster recovery planning — regulators want to see your failover strategy. This diagram is auditable evidence that you have a tested, repeatable process.
How to adapt it
Tailor the thresholds and recovery steps to your infrastructure:
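- Adjust lag thresholds: if your SLA requires RPO = 0 (zero data loss), no lag threshold on an asynchronous replica can guarantee it, because even 100ms of lag can hold committed transactions the replica never received. Use synchronous replication, or promote only after confirming the replica has applied everything the primary committed. If you can tolerate up to 1 minute of lost transactions, a threshold such as lag < 5s leaves a comfortable margin inside the SLA.
- Add health checks: after "Accepting writes?" returns Yes and before "Launch new replica", add checks for CPU, memory, and query performance. If any check fails, route to "Rollback, investigate".
- Parallel replica re-sync: if you have multiple replicas, launch their re-sync steps in parallel instead of running them one after another.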
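The threshold choice in the list above can be sketched as a small decision function. This is a minimal sketch, not a definitive implementation: `FailoverPolicy`, `safety_factor`, and `decide` are hypothetical names, and it assumes lag is reported in seconds and that primary health is determined elsewhere.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailoverPolicy:
    """Hypothetical policy object; field names are illustrative only."""
    rpo_seconds: float          # maximum tolerable window of lost writes
    safety_factor: float = 0.5  # assumed margin: promote well inside the RPO

    @property
    def max_promotable_lag(self) -> float:
        # With rpo_seconds == 0 this is 0.0, so only a fully caught-up
        # replica qualifies for automatic promotion.
        return self.rpo_seconds * self.safety_factor


def decide(policy: FailoverPolicy, primary_healthy: bool,
           replica_lag_seconds: float) -> str:
    """Return 'monitor', 'promote', or 'escalate' (hand off to on-call)."""
    if primary_healthy:
        return "monitor"
    if replica_lag_seconds <= policy.max_promotable_lag:
        return "promote"
    # Lag is outside the budget: a human weighs downtime against data loss.
    return "escalate"
```

With `FailoverPolicy(rpo_seconds=60)`, a replica 5 seconds behind a failed primary is promoted, while one 45 seconds behind is escalated to the on-call team, which mirrors the "Promote replica?" diamond in the diagram.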
Because visual edits regenerate code, you can rename the decision diamonds, extend the recovery loop, and add your own testing steps — the result is always syntactically clean.
Mermaid code
Copy it anywhere Mermaid is supported — GitHub, Notion, or your docs.
flowchart TD
A[Monitor replica lag] --> B{Lag exceeds threshold?}
B -->|No| A
B -->|Yes| C[Alert on-call team]
C --> D{Promote replica?}
D -->|No| E[Wait and monitor]
E --> B
D -->|Yes| F[Promote replica to primary]
F --> G[Update connection strings]
G --> H[Test new primary]
H --> I{Accepting writes?}
I -->|No| J[Rollback, investigate]
I -->|Yes| K[Launch new replica]
K --> L[Re-sync from new primary]
L --> M{Sync complete?}
M -->|No| N[Re-sync in progress]
N --> M
M -->|Yes| O[Resume normal replication]
O --> P[End]
J --> P
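
The "Test new primary" and "Accepting writes?" steps can be automated with a write probe: insert a marker row, read it back, and clean up. The sketch below is illustrative only. It uses Python's built-in `sqlite3` module as a stand-in for your real database driver, and the table name `failover_probe` and function name `accepts_writes` are hypothetical.

```python
import sqlite3
import uuid


def accepts_writes(conn) -> bool:
    """Insert a marker row, read it back, then delete it.

    `conn` is a sqlite3 connection in this sketch. Other DB-API drivers
    use different placeholder styles (for example %s) and their own
    error classes, so adjust both for your database.
    """
    marker = uuid.uuid4().hex
    try:
        cur = conn.cursor()
        cur.execute(
            "CREATE TABLE IF NOT EXISTS failover_probe (marker TEXT PRIMARY KEY)"
        )
        cur.execute("INSERT INTO failover_probe (marker) VALUES (?)", (marker,))
        conn.commit()
        cur.execute(
            "SELECT COUNT(*) FROM failover_probe WHERE marker = ?", (marker,)
        )
        found = cur.fetchone()[0] == 1
        # Clean up so repeated probes do not accumulate rows.
        cur.execute("DELETE FROM failover_probe WHERE marker = ?", (marker,))
        conn.commit()
        return found
    except sqlite3.Error:
        # Any write or read failure means the node is not ready for traffic.
        return False
```

A `False` result corresponds to the "No" branch of "Accepting writes?", which leads to "Rollback, investigate".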
Frequently asked questions
- What is replica lag and why does it matter?
- Replica lag is the delay between when data is committed on the primary database and when it is applied on a read-only replica. If lag is high, queries on the replica return stale data: old order statuses, outdated session state, or wrong user counts. If the primary crashes while lag is high, any transactions the replica has not yet received are lost when it is promoted, so the new primary can be missing writes that clients already saw succeed.
- When do you promote a replica to become the new primary?
- Usually when the original primary has failed or is unreachable, and lag on the best replica is acceptably low (ideally near zero). If lag is high, you can wait briefly for the replica to finish applying the changes it has already received, or promote anyway, accept the loss of whatever it never received, and rebuild the failed primary afterward. The decision depends on your RTO (recovery time objective), your RPO (recovery point objective), and your consistency requirements.
- Why do you need to test the new primary before resuming traffic?
- After promotion, the former replica is accepting writes for the first time in its lifecycle. There can be subtle bugs: triggers that only fire on the primary, replication rules that no longer apply, or configuration mismatches. A quick sanity check (insert a row, read it back, and check replication lag once a new replica is attached) catches these before users hit them.
- How do I adapt this for multi-region failover?
- Add a decision after promotion: if the new primary is in a different region, check network latency and reconfigure the application load balancer to point to the new region. If you have multiple replicas, add a loop to re-replicate from the new primary in parallel rather than sequentially. Visual edits regenerate code, so you can expand the diagram with your own post-failover steps.
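
Re-replicating several replicas in parallel, as the last answer suggests, might look like the following sketch. It assumes a caller-supplied `resync_one` function that blocks until one replica has finished syncing and raises on failure. Both `resync_all` and `resync_one` are hypothetical names, not part of any real tool.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def resync_all(replicas, resync_one, max_workers=4):
    """Run resync_one(replica) for every replica concurrently.

    Returns (succeeded, failed): a sorted list of replicas that synced,
    and a dict mapping each failed replica to the exception it raised.
    """
    succeeded, failed = [], {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(resync_one, r): r for r in replicas}
        for fut in as_completed(futures):
            replica = futures[fut]
            try:
                fut.result()
                succeeded.append(replica)
            except Exception as exc:
                # One failed replica should not abort the others.
                failed[replica] = exc
    return sorted(succeeded), failed
```

Collecting failures instead of raising on the first one keeps the healthy replicas syncing, and the returned `failed` mapping tells you which replicas still need attention before the "Sync complete?" check can pass.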