Database backup and recovery process
Disaster recovery decision flow from incident to restore.
Data loss is not a question of if but when. When it happens — corruption, accidental deletion, ransomware — your team needs a decision tree: severity, backup freshness, acceptable data loss, and the fastest safe path back to serving customers. This template lays out the branching logic from incident detection through staging validation to the final promotion to production.
The diagram separates minor incidents (restore from backup) from critical incidents (activate failover replica), and shows the critical validation step that prevents restoring corrupted data back into production.
When to use this template
- Disaster recovery playbooks: document your actual backup schedule, replication lag, RTO (recovery time objective), and RPO (recovery point objective) so everyone knows what to expect.
- Incident response training: new ops engineers need to understand the backup → staging → validation → production flow before they make a mistake under pressure.
- Compliance audits: regulators and security assessors ask for exactly this: how long you keep backups, how quickly you can recover, and how much data loss is acceptable.
How to adapt it
Customize the severity classification to match your actual incident thresholds, and add your real backup ages and replication lag targets:
- Replace severity thresholds with your SLAs: e.g. "Critical if customer-facing data" or "Critical if any account data affected".
- Add the expected restore time to the staging path so teams know how long the restore and validation will take.
- Insert a change control approval step before the production swap if your org requires it.
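As a rough sketch, the three adaptations above might look like this on the minor-incident path. The severity wording, the 45-minute restore estimate, and the `CA` and `W` nodes are placeholders for illustration, not part of the original template; substitute your own values.

```mermaid
flowchart TD
    %% Severity labels rewritten to match an example SLA (placeholder wording)
    C{Incident severity?} -->|Minor - no customer-facing data| D[Review backup catalog]
    C -->|Critical - customer-facing data affected| E[Activate disaster recovery]
    D --> F{Backup age?}
    %% Restore estimate added to the staging node (45 min is a made-up figure)
    F -->|Recent - within SLA| G["Restore to staging (est. 45 min)"]
    G --> I[Validate restored data]
    I --> J{Data intact?}
    %% Hypothetical change control gate before the production swap
    J -->|Yes| CA{Change approval granted?}
    CA -->|Yes| K[Swap to production]
    CA -->|No| W[Hold in staging and escalate]
```

The node IDs `C` through `K` match the template, so these lines can replace the corresponding ones in the full diagram.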
Visual edits regenerate the Mermaid source, so you can keep your recovery process diagram in sync with your actual infrastructure as it evolves.
Mermaid code
Copy it anywhere Mermaid is supported — GitHub, Notion, or your docs.
flowchart TD
A[Data loss detected] --> B[Alert ops team]
B --> C{Incident severity?}
C -->|Minor - partial loss| D[Review backup catalog]
C -->|Critical - full loss| E[Activate disaster recovery]
D --> F{Backup age?}
F -->|Recent - within SLA| G[Restore to staging]
F -->|Stale - data loss risk| H[Determine acceptable loss]
G --> I[Validate restored data]
I --> J{Data intact?}
J -->|Yes| K[Swap to production]
J -->|No| L[Try older backup]
L --> G
E --> M[Activate failover replica]
M --> N{Replication lag?}
N -->|Minor| O[Switch traffic to replica]
N -->|Significant| P[Accept data loss]
H --> P
P --> Q[Notify affected customers]
K --> R[Resume operations]
O --> R
Q --> R
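If your team wants the validation step spelled out, the single "Validate restored data" node can be expanded into explicit checks. This is only a sketch: the `V1` to `V3` nodes and the specific checks (row counts, checksums, smoke tests) are assumptions for illustration, so replace them with whatever your runbook actually verifies.

```mermaid
flowchart TD
    G[Restore to staging] --> V1[Run database integrity checks]
    %% The checks below are example placeholders, not prescribed by the template
    V1 --> V2[Compare row counts and checksums with expected values]
    V2 --> V3[Run application smoke tests against staging]
    V3 --> J{Data intact?}
    J -->|Yes| K[Swap to production]
    J -->|No| L[Try older backup]
    %% The retry loops through the restore step so the older backup is restored before it is checked
    L --> G
```

The nodes `G`, `J`, `K`, and `L` keep their IDs from the template, so the expanded checks slot in where node `I` sits.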
Frequently asked questions
- What's the difference between backup and replication?
- Backup is a snapshot of data at a point in time, stored separately and restored when needed. Replication is a live copy kept synchronized with the primary. It is fast to switch to, but it lags the primary, so you may lose recent writes, and it also copies accidental deletions and corruption from the primary. Most production systems use both: replication for fast failover, backups for point-in-time and long-term disaster recovery.
- What does recovery point objective (RPO) mean?
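- RPO is how much data you're willing to lose, measured in time. If RPO is 1 hour, you can lose up to 1 hour of recent writes. If RPO is 1 minute, your most recent backup must be no more than 1 minute old, or your replication lag must stay under 1 minute. This diagram shows the choice: restore a recent backup (data lost is roughly the backup age) or switch to a replica (data lost is roughly the replication lag). For example, restoring a backup taken 6 hours before the incident loses up to 6 hours of writes, while a replica lagging by 30 seconds loses about 30 seconds.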
- Why must I validate restored data before promoting it to production?
- Because backups can contain corrupted data: silent corruption, schema incompatibilities, or records damaged by application bugs. Restoring corrupted data to production just spreads the corruption. Always restore to staging first, run integrity checks, and compare the results with what you expect before making it live.
- How do I extend this for multi-region failover?
- Add a region selection decision after detecting data loss: if the primary region is down entirely, fail over to a standby region running parallel replicas. If only the database is down, stay in the same region and restore from backups. Visual edits let you branch the diagram to show both paths and keep the recovery decision tree accurate as your infrastructure grows.
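One possible shape for the multi-region branch described in the last answer is sketched below. The `X` and `Y` nodes are hypothetical additions, and routing the standby-region path into the existing replication lag check is an assumption about how your cross-region replicas behave.

```mermaid
flowchart TD
    A[Data loss detected] --> B[Alert ops team]
    %% Hypothetical region selection decision inserted after detection
    B --> X{Primary region available?}
    X -->|No - region down| Y[Fail over to standby region]
    %% Assumes the standby replicas are subject to the same lag check as the template's failover replica
    Y --> N{Replication lag?}
    X -->|Yes - database only| D[Review backup catalog]
```

Because `N` and `D` reuse the template's node IDs, the rest of the recovery flow continues unchanged from those points.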
Related templates
Database migration flow
Safe schema changes with validation, rollback, and production cutover.
Auto-scaling decision tree
CPU, memory, request volume, and cost trade-off decisions.
Deployment rollback decision tree
Incident detection, severity assessment, and rollback trigger criteria.