
Deployment rollback decision tree

Incident detection, severity assessment, and rollback trigger criteria.

Every deployment could be the one that breaks production, so every team needs a clear decision tree for whether to roll back or attempt a hotfix. This diagram captures the moment of truth: metrics alert you to a problem, the on-call team assembles, you gather facts — is it a known issue? Can it be fixed in minutes? — and decide to either ship a hotfix or revert to the last known-good version.

The diagram reflects how real incidents work. You don't have perfect information in the first 5 minutes. Error rates might spike from a cache miss, latency might jump from a load spike that resolves itself, or a customer might report something that turns out to be on their end. The decision points force you to ask the right questions fast: Is it real? Do we understand it? Can we fix it in time?

When to use this template

  • Incident response runbooks — customize the metrics and thresholds, then post it where on-call engineers see it before they need it.
  • Deployment safety reviews — walk the team through a few hypothetical scenarios so everyone knows when to escalate to rollback.
  • Postmortem templates — use the decision path to structure your "what could we have detected earlier" discussion after an incident.

How to adapt it

Tune the decision nodes to your actual production signals:

  • Replace "Error rate spike" with your actual threshold — 5%? 10%? double the baseline?
  • Add domain-specific metrics — payment systems check transaction success, APIs check 5xx counts, frontends check JavaScript errors.
  • Annotate the hotfix time limit (e.g., "if not resolved in 10 minutes, roll back") so the decision is pre-made under stress.
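
The pre-made hotfix time limit can also be written down as a small check. The following is a minimal Python sketch, not part of the template: `HotfixPolicy`, `hotfix_window_expired`, and the 10-minute default are illustrative assumptions based on the example annotation above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class HotfixPolicy:
    # Hypothetical default; use the limit your own runbook agrees on.
    time_limit: timedelta = timedelta(minutes=10)


def hotfix_window_expired(
    started_at: datetime, now: datetime, policy: HotfixPolicy
) -> bool:
    """Return True once the hotfix attempt has used up the agreed time limit."""
    return now - started_at >= policy.time_limit
```

Agreeing on the limit in advance means the on-call engineer only has to read a clock, not argue about whether to keep trying.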

Visual edits regenerate clean Mermaid code as you customize, so you can turn this into your actual incident runbook by editing the diagram directly in the editor.

Mermaid code

Copy it anywhere Mermaid is supported — GitHub, Notion, or your docs.

flowchart TD
    A[Deployment completes] --> B[Monitor key metrics]
    B --> C{Error rate spike?}
    C -->|No| D{Latency increase?}
    C -->|Yes| E[Page on-call team]
    D -->|No| F{Customer complaints?}
    D -->|Yes| E
    F -->|Yes| E
    F -->|No| G[Continue monitoring]
    E --> H{Known issue?}
    H -->|Yes| I{Fixable in minutes?}
    H -->|No| J{Severity critical?}
    I -->|Yes| K[Deploy hotfix]
    I -->|No| L[Initiate rollback]
    J -->|Yes| L
    J -->|No| K
    K --> M{Issue resolved?}
    M -->|No| L
    M -->|Yes| N[Close incident]
    L --> O[Revert to last known good]
    O --> P[Verify system healthy]
    P --> Q[Schedule postmortem]
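
For teams that want to unit-test the decision path, the same branches can be mirrored in code. This is a rough Python sketch under stated assumptions: the `Action` enum and the function and parameter names are hypothetical, and each input is reduced to a boolean that a human or a monitoring system would supply.

```python
from enum import Enum


class Action(Enum):
    CONTINUE_MONITORING = "continue monitoring"
    DEPLOY_HOTFIX = "deploy hotfix"
    INITIATE_ROLLBACK = "initiate rollback"
    CLOSE_INCIDENT = "close incident"


def first_response(
    error_rate_spike: bool,
    latency_increase: bool,
    customer_complaints: bool,
    known_issue: bool = False,
    fixable_in_minutes: bool = False,
    severity_critical: bool = False,
) -> Action:
    """Mirror nodes C through J of the flowchart."""
    # Nodes C, D, F: with no alarming signal, keep monitoring (node G).
    if not (error_rate_spike or latency_increase or customer_complaints):
        return Action.CONTINUE_MONITORING
    # Node E has paged on-call. Nodes H and I: a known issue gets a
    # hotfix only if it is fixable in minutes.
    if known_issue:
        return Action.DEPLOY_HOTFIX if fixable_in_minutes else Action.INITIATE_ROLLBACK
    # Node J: an unknown issue is rolled back when severity is critical.
    return Action.INITIATE_ROLLBACK if severity_critical else Action.DEPLOY_HOTFIX


def after_hotfix(issue_resolved: bool) -> Action:
    """Mirror node M: a hotfix that does not resolve the issue falls back to rollback."""
    return Action.CLOSE_INCIDENT if issue_resolved else Action.INITIATE_ROLLBACK
```

If you change a branch in the diagram, change the matching branch here so the two stay in sync.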

Frequently asked questions

What metrics should trigger a rollback?
Error rate is the strongest signal — if it jumps 2-3x within minutes of deploy, that's a rollback trigger. Latency spikes also matter, especially for user-facing APIs. Customer complaints are a lagging indicator; you want metrics to catch it first. Set thresholds before a problem happens so you're not debating percentages in the chaos.
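
As a rough illustration of a threshold set ahead of time, the Python check below compares the post-deploy error rate with the pre-deploy baseline. The names `error_rate_spiked`, `multiplier`, and `floor` are hypothetical, and the defaults (2x the baseline, a 1% floor) are placeholders to tune, not recommendations.

```python
def error_rate_spiked(
    baseline: float,
    current: float,
    multiplier: float = 2.0,
    floor: float = 0.01,
) -> bool:
    """Return True when the current error rate is at least `multiplier` times the baseline.

    Rates are fractions of requests (0.02 means 2%). `floor` is an assumed
    guard so that a near-zero baseline does not turn tiny blips into triggers.
    """
    return current >= floor and current >= multiplier * baseline
```

For example, with a 1% baseline, a 3% error rate after deploy counts as a spike, while 1.5% does not.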
How fast should I roll back?
As fast as your automation allows, ideally under 5 minutes from decision to revert. Time spent reverting adds directly to the length of the incident, so a rollback that takes seconds instead of 15 minutes can turn a 30-minute incident into a 15-minute one. This is why teams build rollback automation and run drills; a manual rollback at 3 AM is a recipe for mistakes. Fast rollback also means you can deploy more often without fear.
What is a postmortem and why schedule one after rollback?
A postmortem is a blameless review of what went wrong, what signals you missed, and how to prevent it next time. It's separate from the incident itself — the goal during a rollback is to restore service fast, not to learn. Schedule the postmortem for the next business day when the adrenaline has worn off.
Should I roll back or hotfix?
Roll back if the root cause is unclear or if you need a known-good state fast. Hotfix if you have identified the specific change that broke and can fix it in under 10 minutes. The diagram shows the decision: if the issue is fixable fast, try a hotfix; if that doesn't work or you can't diagnose the cause, rolling back is your safe bet.
