Service degradation strategy
Detect failures, trigger graceful fallbacks, maintain partial service.
When a critical dependency fails — database, payment processor, third-party API — your first instinct might be to return a 500 error and page on-call. But if you can serve a degraded version of your feature, do that instead. This diagram models the decision tree: detect the failure, check for a fallback, decide which graceful-degradation path to take, alert ops, and auto-retry the dependency on a schedule. The goal is to keep your system working for as many users as possible while you fix the root cause.
Service degradation lives on a spectrum. If a payment gateway goes down, you might disable checkout but show a helpful banner. If an analytics service goes down, aggregate events in memory and log them once the service comes back online. If a search service goes down, fall back to sorting and filtering on your primary database, which is slower but still works.
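As a rough illustration of the search example, the sketch below tries the primary search call and falls back to a plain database-style filter when the call fails. The names `search_with_fallback`, `search_service`, `database_filter`, and `PRODUCTS` are hypothetical stand-ins, not part of any real API.

```python
from typing import Callable, List, Tuple

# Errors treated as "the dependency is down". This set is an assumption;
# adjust it to whatever your client library actually raises.
OUTAGE_ERRORS = (ConnectionError, TimeoutError)


def search_with_fallback(
    query: str,
    primary: Callable[[str], List[str]],
    fallback: Callable[[str], List[str]],
) -> Tuple[List[str], bool]:
    """Return (results, degraded); degraded is True when the fallback answered."""
    try:
        return primary(query), False
    except OUTAGE_ERRORS:
        return fallback(query), True


# Hypothetical stand-ins for a search service and a primary database.
PRODUCTS = ["red shoe", "blue shoe", "red hat"]


def search_service(query: str) -> List[str]:
    raise ConnectionError("search service unreachable")


def database_filter(query: str) -> List[str]:
    # Slower path: filter and sort directly on the primary data.
    return sorted(p for p in PRODUCTS if query in p)


results, degraded = search_with_fallback("red", search_service, database_filter)
print(results, degraded)  # ['red hat', 'red shoe'] True
```

Returning the `degraded` flag alongside the results lets the caller show a banner or log the degradation instead of hiding it.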
When to use this template
- Resilience planning: map out which services can degrade and which cannot, which fallbacks you already have in place (caching, replicas, read-only mode), and which are missing.
- Incident playbooks: annotate each decision point with real thresholds (how many retries before falling back? how long to wait between retries?) so on-call follows the same path every time.
- Dependency risk audit: for each external API or critical database, ask: "Can we degrade if this fails?" If not, build a fallback before it breaks in production.
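One way to make the audit concrete is a small dependency registry that records the fallback for each dependency and lists the ones that have none. This is a minimal sketch; the `Dependency` class, the `missing_fallbacks` helper, and every entry in `DEPENDENCIES` are hypothetical examples.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Dependency:
    name: str
    fallback: Optional[str] = None  # e.g. "cache", "read-only mode"; None = no fallback yet


def missing_fallbacks(deps: List[Dependency]) -> List[str]:
    """Names of dependencies for which "Can we degrade if this fails?" is still "no"."""
    return [d.name for d in deps if d.fallback is None]


# Hypothetical inventory for illustration only.
DEPENDENCIES = [
    Dependency("search-service", fallback="filter on primary database"),
    Dependency("payment-gateway", fallback="disable checkout, show banner"),
    Dependency("analytics", fallback="in-memory aggregation"),
    Dependency("orders-db"),
]

print(missing_fallbacks(DEPENDENCIES))  # ['orders-db']
```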
How to adapt it
Tune each decision to your architecture:
- Replace "Dependency healthy?" with your actual health check (ping the API endpoint, query the database, check the circuit breaker).
- Add your specific fallback types: cached results, stale data, read-only mode, simplified UI, or a hardcoded response.
- Customize the retry schedule — 30 seconds works for many systems, but you might want longer for batch services or shorter for request-critical paths.
- Add escalation steps — if retries fail N times, should you auto-rollback a recent deployment? Alert a human? Switch to backup infrastructure?
Visual edits regenerate clean code as you adapt the diagram to your dependency map.
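If the "Dependency healthy?" check is backed by a circuit breaker, it might look roughly like the sketch below. The class name, the threshold of 3 failures, and the injectable `clock` parameter are assumptions made for illustration; the 30-second retry window mirrors the diagram.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Minimal breaker: opens after N consecutive failures, allows a trial call later."""

    def __init__(
        self,
        failure_threshold: int = 3,      # assumed value, tune per dependency
        retry_after: float = 30.0,       # seconds before a trial call is allowed
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.retry_after = retry_after
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def healthy(self) -> bool:
        """True if a call to the dependency should be attempted."""
        if self.opened_at is None:
            return True
        # Once retry_after has elapsed, let one trial call through.
        return self.clock() - self.opened_at >= self.retry_after

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            # Open (or re-open) the breaker and restart the retry timer.
            self.opened_at = self.clock()
```

A caller would check `healthy()` before each request, then report the outcome with `record_success()` or `record_failure()`. Passing a fake `clock` makes the retry window easy to test without sleeping.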
Mermaid code
Copy it anywhere Mermaid is supported — GitHub, Notion, or your docs.
flowchart TD
A[Health check runs] --> B{Dependency healthy?}
B -->|Yes| C[Serve full feature]
B -->|No| D{Fallback available?}
D -->|No| E[Return 503 Unavailable]
D -->|Yes| F{Fallback type?}
F -->|Cache| G[Serve stale data]
F -->|Readonly| H[Disable writes only]
F -->|Degrade| I[Simplify feature]
G --> J[Log degradation]
H --> J
I --> J
J --> K[Alert ops team]
K --> L{Manual recovery?}
L -->|Yes| M[Restore full service]
L -->|No| N[Auto-retry in 30s]
M --> O[Update health status]
N --> A
O --> A
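The routing part of the flowchart (nodes B through I) can also be expressed in application code. The sketch below is one possible translation, not generated from the diagram; the `Fallback` enum and the `decide` function are hypothetical names, and the returned strings simply reuse the node labels.

```python
from enum import Enum
from typing import Optional


class Fallback(Enum):
    CACHE = "cache"
    READ_ONLY = "readonly"
    DEGRADE = "degrade"


def decide(healthy: bool, fallback: Optional[Fallback]) -> str:
    """Map the health check result and the available fallback to an action."""
    if healthy:
        return "Serve full feature"
    if fallback is None:
        return "Return 503 Unavailable"
    return {
        Fallback.CACHE: "Serve stale data",
        Fallback.READ_ONLY: "Disable writes only",
        Fallback.DEGRADE: "Simplify feature",
    }[fallback]


print(decide(True, None))             # Serve full feature
print(decide(False, None))            # Return 503 Unavailable
print(decide(False, Fallback.CACHE))  # Serve stale data
```

Logging, alerting, and the retry loop (nodes J through O) are left out here; they would wrap whichever action `decide` returns.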
Frequently asked questions
- What is a service degradation strategy?
- It's a plan to keep your system partially working when a critical dependency fails. Instead of returning an error to users, you degrade gracefully: serve cached data, disable certain features, or simplify the UI. Partial availability beats total unavailability, because most users can keep working while engineers fix the root cause.
- What are the most common fallback patterns?
- Read-only mode (disable writes, keep serving existing data), cache fallback (serve a previously cached response if the backend is slow or down), feature simplification (strip down to essentials), and circuit breaker (stop calling the dependency after N failures to reduce load on it). Choose based on your failure scenario: if the database is overloaded, reduce traffic; if an external API is down, use cached results.
- How do I decide between graceful degradation and returning an error?
- Degrade if users can still get value without the dependency. Return an error if the feature cannot work at all. A maps app that loses real-time traffic data can degrade to cached routes; a payment cannot be processed without the payment processor, so you fail fast and escalate. Know your non-negotiables.
- How long should I retry before giving up?
- Exponential backoff is standard: 1s, 2s, 4s, 8s, and so on, with each delay capped at a maximum such as 60 seconds or 5 minutes. For most systems, about 30 seconds of retrying is reasonable; if the dependency is still down after that, the outage is likely broader and manual intervention is needed. Set a ceiling so you don't mask a real problem with endless retries.
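The retry schedule from the last answer can be sketched as a small generator. It assumes the 30 seconds is a total retry budget and the 60 seconds is a per-delay cap; the function name `backoff_delays` and its parameter names are hypothetical. Production code often adds random jitter to the delays, which is omitted here for clarity.

```python
from typing import Iterator


def backoff_delays(
    base: float = 1.0,
    factor: float = 2.0,
    max_delay: float = 60.0,  # cap on any single delay
    budget: float = 30.0,     # total seconds of waiting before giving up
) -> Iterator[float]:
    """Yield exponentially growing delays until the next one would exceed the budget."""
    elapsed = 0.0
    delay = base
    while elapsed + delay <= budget:
        yield delay
        elapsed += delay
        delay = min(delay * factor, max_delay)


print(list(backoff_delays()))  # [1.0, 2.0, 4.0, 8.0]
```

With the defaults, the next delay after 8 seconds would be 16 seconds, which would bring the total wait to 31 seconds, so the generator stops and the caller can escalate to manual intervention.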