Question 1

What metrics should trigger a rollback?

Accepted Answer

Error rate is the strongest signal: if it jumps to 2-3x its pre-deploy baseline within minutes of a deploy, that's a rollback trigger. Latency spikes also matter, especially for user-facing APIs. Customer complaints are a lagging indicator; you want metrics to catch the problem first. Set thresholds before a problem happens so you're not debating percentages in the middle of an incident.
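One way to pin the thresholds down ahead of time is to write them as a small check. This is a minimal sketch, not a reference implementation: the names `RollbackThresholds` and `should_roll_back` are hypothetical, the 2x error multiplier is the lower end of the range above, and the latency multiplier and baseline floor are assumed values that the answer does not specify.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RollbackThresholds:
    # Hypothetical defaults; agree on your own values before an incident.
    error_rate_multiplier: float = 2.0       # lower end of the 2-3x jump
    latency_multiplier: float = 1.5          # assumed, not from the answer
    min_baseline_error_rate: float = 0.001   # floor so a near-zero baseline ignores noise


def should_roll_back(baseline_error_rate, current_error_rate,
                     baseline_p95_ms, current_p95_ms,
                     thresholds=RollbackThresholds()):
    """Return the names of the tripped signals; an empty list means no trigger."""
    reasons = []
    # Compare against the pre-deploy baseline, never below the noise floor.
    error_baseline = max(baseline_error_rate, thresholds.min_baseline_error_rate)
    if current_error_rate >= error_baseline * thresholds.error_rate_multiplier:
        reasons.append("error_rate")
    if current_p95_ms >= baseline_p95_ms * thresholds.latency_multiplier:
        reasons.append("latency")
    return reasons


# Error rate tripled after the deploy while latency stayed flat.
print(should_roll_back(0.01, 0.03, 200, 210))
```

Returning the list of tripped signals, rather than a bare boolean, leaves a record of why the rollback fired for the later review.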

Question 2

How fast should I roll back?

Accepted Answer

As fast as your automation allows, ideally under 5 minutes from decision to revert. A 30-minute incident can shrink to 15 minutes if the rollback itself is near-instant. This is why teams build rollback automation and run drills; a manual rollback at 3 AM is a recipe for mistakes. Fast rollback also means you can deploy more often without fear.
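The arithmetic behind that claim is easy to sketch. The breakdown below is illustrative only: the split into detect, decide, and revert phases and all of the minute values are assumptions, chosen so the totals match the 30-minute and 15-minute figures above.

```python
def incident_minutes(detect, decide, revert):
    """Total incident length as the sum of its phases, in minutes."""
    return detect + decide + revert


# Hypothetical numbers: only the revert phase changes with automation.
manual = incident_minutes(detect=5, decide=5, revert=20)
automated = incident_minutes(detect=5, decide=5, revert=5)

print(manual, automated)
```

Under these assumptions, the revert step is the largest part of the manual incident, which is why it is the part teams automate and rehearse.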

Question 3

What is a postmortem and why schedule one after a rollback?

Accepted Answer

A postmortem is a blameless review of what went wrong, what signals you missed, and how to prevent a repeat. It's separate from the incident itself: the goal during a rollback is to restore service fast, not to learn. Schedule the postmortem for the next business day, when the adrenaline has worn off.

Question 4

Should I roll back or hotfix?

Accepted Answer

Roll back if the root cause is unclear or if you need a known-good state fast. Hotfix if you have identified the specific line of code that broke and can fix it in under 10 minutes. The diagram shows the decision: if it's fixable fast, try a hotfix; if that doesn't work or you can't diagnose the cause, a rollback is your safe bet.
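The same decision can be written as a small function. This is a sketch under stated assumptions: `choose_response` and its parameter names are hypothetical, and the 10-minute budget is taken from the answer above rather than from any standard.

```python
from typing import Optional


def choose_response(root_cause_known: bool,
                    estimated_fix_minutes: Optional[float],
                    hotfix_failed: bool = False,
                    hotfix_budget_minutes: float = 10) -> str:
    """Return "hotfix" or "rollback" following the decision described above."""
    # Unknown cause: go straight to the known-good state.
    if not root_cause_known:
        return "rollback"
    # A hotfix that was already tried and did not work falls back to rollback.
    if hotfix_failed:
        return "rollback"
    # No estimate, or an estimate at or over the budget, is not "fixable fast".
    if estimated_fix_minutes is None or estimated_fix_minutes >= hotfix_budget_minutes:
        return "rollback"
    return "hotfix"


print(choose_response(root_cause_known=True, estimated_fix_minutes=5))
print(choose_response(root_cause_known=False, estimated_fix_minutes=5))
```

Every uncertain branch resolves to `"rollback"`, which mirrors the answer's point that rollback is the safe default.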

Deployment rollback decision tree

