Service degradation strategy

Detect failures, trigger graceful fallbacks, maintain partial service.

When a critical dependency fails — database, payment processor, third-party API — your first instinct might be to return a 500 error and page on-call. But if you can serve a degraded version of your feature, do that instead. This diagram models the decision tree: detect the failure, check for a fallback, decide which graceful-degradation path to take, alert ops, and auto-retry the dependency on a schedule. The goal is to keep your system working for as many users as possible while you fix the root cause.

Service degradation lives on a spectrum. A payment gateway goes down? Disable checkout but show a helpful banner. An analytics service goes down? Aggregate events in memory and write them out once the service comes back online. A search service goes down? Fall back to sorting and filtering on your primary database: slower, but working.
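The search example can be sketched as a small wrapper that tries the primary path and switches to the fallback on failure. This is a minimal sketch, not a definitive implementation: `with_fallback`, `search_service`, and `database_filter` are hypothetical names, and the failing search service is simulated.

```python
from typing import Callable, List, Tuple, TypeVar

T = TypeVar("T")


def with_fallback(primary: Callable[[], T], fallback: Callable[[], T]) -> Tuple[T, bool]:
    """Run primary; on any exception run fallback instead.

    Returns (result, degraded) so the caller can log it or show a banner.
    """
    try:
        return primary(), False
    except Exception:
        return fallback(), True


# Hypothetical stand-ins for a search service and a primary-database query.
def search_service(query: str) -> List[str]:
    raise ConnectionError("search service unavailable")  # simulated outage


def database_filter(query: str) -> List[str]:
    products = ["red shoes", "blue shoes", "red hat"]
    return sorted(p for p in products if query in p)


results, degraded = with_fallback(
    lambda: search_service("red"),
    lambda: database_filter("red"),
)
print(results, degraded)  # ['red hat', 'red shoes'] True
```

Returning the `degraded` flag alongside the result lets the caller decide how visible the degradation should be, for example a banner for users and a log line for ops.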

When to use this template

  • Resilience planning: map out which services can degrade and which cannot, then list which fallbacks you already have in place (caching, replicas, read-only mode) and which are missing.
  • Incident playbooks: annotate each decision point with real thresholds (how many retries before fallback? how long to wait between retries?) so on-call follows the same path every time.
  • Dependency risk audit: for each external API or critical database, ask: "Can we degrade if this fails?" If not, build a fallback before it breaks in production.

How to adapt it

Tune each decision to your architecture:

  • Replace "Dependency healthy?" with your actual health check (ping the API endpoint, query the database, check the circuit breaker).
  • Add your specific fallback types: cached results, stale data, read-only mode, simplified UI, or a hardcoded response.
  • Customize the retry schedule — 30 seconds works for many systems, but you might want longer for batch services or shorter for request-critical paths.
  • Add escalation steps — if retries fail N times, should you auto-rollback a recent deployment? Alert a human? Switch to backup infrastructure?

Visual edits regenerate clean code as you adapt the diagram to your dependency map.
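One way to keep these adaptation points in one place is a small per-dependency policy object. The sketch below is illustrative only: `DegradationPolicy`, its field names, and the action strings are assumptions, not part of the template.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class DegradationPolicy:
    """Per-dependency settings; every field name here is illustrative."""
    name: str
    health_check: Callable[[], bool]  # stands in for "Dependency healthy?"
    fallback_type: str                # "cache", "readonly", "degrade", or "none"
    retry_interval_s: float = 30.0    # the template's default retry delay
    max_failed_retries: int = 5       # N: escalate after this many consecutive failures
    failed_retries: int = 0

    def record_check(self) -> str:
        """Run the health check once and return the next action."""
        if self.health_check():
            self.failed_retries = 0
            return "serve_full"
        self.failed_retries += 1
        if self.failed_retries >= self.max_failed_retries:
            return "escalate"
        return "retry"


# Example: a dependency that stays down, escalating on the third failure.
policy = DegradationPolicy(
    name="search",
    health_check=lambda: False,
    fallback_type="degrade",
    max_failed_retries=3,
)
actions = [policy.record_check() for _ in range(3)]
print(actions)  # ['retry', 'retry', 'escalate']
```

What "escalate" means (rollback, paging a human, switching to backup infrastructure) is left to the caller, since that choice depends on your architecture.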

Mermaid code

Copy it anywhere Mermaid is supported — GitHub, Notion, or your docs.

flowchart TD
    A[Health check runs] --> B{Dependency healthy?}
    B -->|Yes| C[Serve full feature]
    B -->|No| D{Fallback available?}
    D -->|No| E[Return 503 Unavailable]
    D -->|Yes| F{Fallback type?}
    F -->|Cache| G[Serve stale data]
    F -->|Readonly| H[Disable writes only]
    F -->|Degrade| I[Simplify feature]
    G --> J[Log degradation]
    H --> J
    I --> J
    J --> K[Alert ops team]
    K --> L{Manual recovery?}
    L -->|Yes| M[Restore full service]
    L -->|No| N[Auto-retry in 30s]
    M --> O[Update health status]
    N --> A
    O --> A
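
If you want to check the diagram's logic in code, one pass through the flowchart can be written as a plain function. This is a hedged sketch: `run_cycle` and `FALLBACK_ACTIONS` are hypothetical names, and the returned strings simply echo the node labels above.

```python
from typing import List, Optional

# Maps the "Fallback type?" branches to the action each one takes.
FALLBACK_ACTIONS = {
    "cache": "serve stale data",
    "readonly": "disable writes only",
    "degrade": "simplify feature",
}


def run_cycle(healthy: bool, fallback: Optional[str],
              manual_recovery: bool = False) -> List[str]:
    """Walk the flowchart once and return the steps taken, in order."""
    if healthy:
        return ["serve full feature"]
    if fallback not in FALLBACK_ACTIONS:
        return ["return 503 Unavailable"]
    steps = [FALLBACK_ACTIONS[fallback], "log degradation", "alert ops team"]
    if manual_recovery:
        steps += ["restore full service", "update health status"]
    else:
        steps.append("auto-retry in 30s")
    return steps


print(run_cycle(False, "cache"))
# ['serve stale data', 'log degradation', 'alert ops team', 'auto-retry in 30s']
```

Both the manual-recovery and auto-retry paths lead back to the health check in the diagram, so in practice you would call a function like this again on the next check.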

Frequently asked questions

What is a service degradation strategy?
It's a plan to keep your system partially working when a critical dependency fails. Instead of returning an error to every user, you degrade gracefully: serve cached data, disable certain features, or simplify the UI. Partial availability beats total unavailability, because most users can keep working while engineers fix the root cause.
What are the most common fallback patterns?
Read-only mode (disable writes, serve existing data), cache fallback (serve a recently cached response if the backend is slow or down), feature simplification (strip down to essentials), and circuit breaker (stop calling the backend after N failures to reduce its load). Choose based on your failure scenario: if the database is overloaded, reduce traffic; if an external API is down, use cached results.
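Of these patterns, the circuit breaker is the one that most benefits from a concrete example. Below is a minimal sketch, assuming a breaker that opens after N consecutive failures and allows a trial request after a cooldown; the class and parameter names are hypothetical, and the clock is injectable so the behavior can be shown without waiting.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Open after N consecutive failures; allow a trial call after a cooldown."""

    def __init__(self, max_failures: int = 3, cooldown_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_failures = max_failures
        self.cooldown_s = cooldown_s
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def allow_request(self) -> bool:
        """True if the backend may be called; False means use the fallback."""
        if self.opened_at is None:
            return True
        # After the cooldown, let a trial request through (half-open state).
        return self.clock() - self.opened_at >= self.cooldown_s

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = self.clock()  # (re)open and restart the cooldown


# Usage with a fake clock so the example runs instantly.
now = [0.0]
breaker = CircuitBreaker(max_failures=2, cooldown_s=30.0, clock=lambda: now[0])
breaker.record_failure()
breaker.record_failure()
print(breaker.allow_request())  # False: circuit is open, serve the fallback
now[0] = 30.0
print(breaker.allow_request())  # True: cooldown elapsed, trial request allowed
```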
How do I decide between graceful degradation and returning an error?
Degrade if users can still get value without the dependency. Return an error if the feature cannot work at all. A maps app that loses real-time traffic data can degrade to cached routes; a checkout whose payment processor is down cannot take payments, so that step fails fast and escalates while the rest of the site stays up. Know your non-negotiables.
How long should I retry before giving up?
Exponential backoff is standard: wait 1s, 2s, 4s, 8s, and so on, capping each delay at a maximum such as 60 seconds or 5 minutes. For most systems, about 30 seconds of retries is reasonable; if the dependency is still down after that, the outage is likely broader and manual intervention is needed. Set a ceiling on total retry time so you don't mask a real problem with endless retries.
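The backoff schedule and the ceiling can be sketched as a generator. This is an illustration under assumed parameter names (`base_s`, `cap_s`, `budget_s`); real systems often add random jitter to each delay so that many clients don't retry at the same moment, which is left out here for clarity.

```python
from typing import Iterator


def backoff_delays(base_s: float = 1.0, cap_s: float = 60.0,
                   budget_s: float = 300.0) -> Iterator[float]:
    """Yield delays that double each time, capped at cap_s per delay.

    Stops once the next delay would push the total wait past budget_s,
    which acts as the ceiling on how long to keep retrying.
    """
    delay, total = base_s, 0.0
    while total + delay <= budget_s:
        yield delay
        total += delay
        delay = min(delay * 2, cap_s)


# With a 30-second budget, the schedule stops after four retries (15s total).
print(list(backoff_delays(budget_s=30.0)))  # [1.0, 2.0, 4.0, 8.0]
```

When the generator is exhausted, that is the signal to stop retrying and hand off to a human.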
