Request retry strategy
State machine for exponential backoff, jitter, and circuit breaker integration.
Every production system talks to other services: APIs, databases, message queues. They fail. Your job is not to prevent failure (impossible) but to recover from it intelligently. This template shows the state machine: send a request; on a retriable error (5xx or timeout), wait with exponential backoff and jitter, then retry; once the maximum number of retries is reached, stop and fall back to a cache or sensible default (the circuit breaker path).
The key insight: naive retry logic (retry immediately, retry forever) makes outages worse by amplifying load. Exponential backoff and jitter give the broken service time to recover. The circuit breaker stops your code from burning CPU and memory on a service that is already down.
When to use this template
- Resilience design reviews — sketch your retry strategy for external APIs, payment processors, or analytics services so the team agrees on how long to wait and when to give up.
- Incident postmortems — when an outage cascades, trace whether your retry logic made it worse (hammering an already-dying service) or better (backing off and letting it recover).
- Onboarding distributed systems — show new engineers the state machine so they understand that retry logic is not just "try again", it is a careful dance of timeouts, backoff, and fallbacks.
How to adapt it
Customize the backoff formula and thresholds to your SLAs:
- Change the backoff formula from `2^n` to `min(max_backoff, base * 2^n)` to cap the wait time (e.g., never wait longer than 30 seconds).
- Add a deadline or timeout on total retry time (e.g., keep retrying for up to 5 minutes, then fail permanently).
- Branch the circuit breaker into degraded mode — return stale cached data instead of a full error, and keep retrying in the background.
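As a rough illustration of the first two adaptations, the sketch below computes a capped backoff and lists the waits that fit inside a total retry deadline. The function names and the defaults (1 second base, 30 second cap, 5 minute deadline) are assumptions chosen to match the examples in the list, and the deadline check ignores the time spent on the requests themselves.

```python
def capped_backoff(attempt, base=1.0, max_backoff=30.0):
    """Wait before retry number `attempt` (0-based): min(max_backoff, base * 2^n)."""
    return min(max_backoff, base * 2 ** attempt)


def waits_within_deadline(deadline=300.0, base=1.0, max_backoff=30.0):
    """List the successive waits whose running total stays within `deadline` seconds.

    Simplification: only the sleep time is counted, not request latency.
    """
    waits = []
    elapsed = 0.0
    attempt = 0
    while True:
        wait = capped_backoff(attempt, base, max_backoff)
        if elapsed + wait > deadline:
            break  # the next wait would overshoot the deadline: fail permanently
        waits.append(wait)
        elapsed += wait
        attempt += 1
    return waits
```

With these assumed defaults the waits grow as 1, 2, 4, 8, 16 seconds and then stay at the 30 second cap until the 5 minute budget is used up.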
Visual edits regenerate clean Mermaid code, so you can model your exact recovery strategy without hand-editing YAML or JSON config files.
Mermaid code
Copy it anywhere Mermaid is supported — GitHub, Notion, or your docs.
stateDiagram-v2
[*] --> Ready
Ready --> Attempting: Send request
Attempting --> Success: 2xx response
Attempting --> Retriable: 5xx or timeout
Attempting --> Failed: 4xx or non-retriable
Success --> [*]
Failed --> [*]
Retriable --> WaitExponential: Calculate backoff
WaitExponential --> WaitJitter: Add jitter
WaitJitter --> WaitTimer: Sleep 2^n plus jitter
WaitTimer --> CircuitOpen: Max retries reached
CircuitOpen --> Fallback: Use cache or default
Fallback --> [*]
WaitTimer --> Attempting: Retry
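To make the diagram concrete, here is a minimal Python sketch of the same states as a retry loop. It is an illustration under assumptions, not a reference implementation: `send`, `fallback`, `sleep`, and `rng` are hypothetical injected callables, `send` is assumed to return an HTTP status code or raise `TimeoutError`, and the jitter keeps half of the delay fixed and randomizes the other half (a variant sometimes called equal jitter). Unlike the diagram, the sketch checks the retry budget before sleeping, so it never waits ahead of a fallback.

```python
import random
import time


def equal_jitter(delay, rng=random.random):
    """Keep half of the delay fixed and randomize the other half."""
    return delay / 2 + rng() * (delay / 2)


def request_with_retry(send, fallback, max_retries=3, base=1.0,
                       sleep=time.sleep, rng=random.random):
    """Walk the states of the diagram; returns (final_state, value)."""
    attempt = 0
    while True:
        # Attempting: a timeout is treated like a 5xx response.
        try:
            status = send()
        except TimeoutError:
            status = None

        if status is not None and 200 <= status <= 299:
            return ("Success", status)
        if status is not None and not 500 <= status <= 599:
            return ("Failed", status)

        # Retriable: give up once the retry budget is spent.
        if attempt >= max_retries:
            return ("Fallback", fallback())

        # WaitExponential -> WaitJitter -> WaitTimer
        delay = equal_jitter(base * 2 ** attempt, rng)
        sleep(delay)
        attempt += 1
```

Passing `sleep` and `rng` as parameters is a design choice for testability: a test can record the computed delays instead of actually waiting.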
Frequently asked questions
- What is exponential backoff and why use it?
- Exponential backoff means doubling the wait between retries: 1 second, 2 seconds, 4 seconds, 8 seconds, up to a maximum. If a service is temporarily overloaded, hammering it with 100 retries per second makes it worse. Exponential backoff retries quickly at first, when a brief blip may already have cleared, and then backs off so a struggling service gets progressively more time to recover.
- What is jitter and why add randomness to backoff?
- Jitter adds random variation to the wait time — e.g., 2–4 seconds instead of exactly 4 seconds. Without jitter, all clients hit the server at the same moment after waiting 4 seconds, causing a 'thundering herd' that crashes it again. With jitter, requests spread out, giving the server breathing room.
- When should I integrate a circuit breaker?
- When failures cross a threshold (e.g., 5 retries have failed, even at the maximum backoff), stop retrying and fail fast. That is the circuit breaker: it protects your service from wasting resources on a dead downstream and gives that downstream time to recover. Return a cached response or a sensible default instead.
- How do I implement this without writing boilerplate?
- Use a library: Python has `tenacity`, Node.js has `p-retry` or `async-retry`, Java has Resilience4j, Go has `go-retry`. Some HTTP clients (okhttp, httpClient) have built-in retry behavior, while others rely on a plugin (e.g., `axios-retry` for axios), so check what your client retries by default. Visual edits regenerate clean code, so you can sketch your strategy here before picking a library.
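For readers who want to see roughly what such libraries do under the hood, the circuit breaker from the FAQ above might be sketched as follows. This is a simplified illustration, not any library's API: the class, its method names, and the `reset_timeout` cool-down are assumptions. The template's diagram only shows the open state leading to a fallback; letting requests through again after a cool-down is a common addition included here as an assumption.

```python
import time


class CircuitBreaker:
    """Opens after `failure_threshold` consecutive failures, then fails fast."""

    def __init__(self, failure_threshold=5, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout  # assumed cool-down, in seconds
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def allow_request(self):
        if self.opened_at is None:
            return True
        # Open: let requests through again only after the cool-down has passed.
        return self.clock() - self.opened_at >= self.reset_timeout

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            # Opens the circuit, or re-opens it if a post-cool-down probe failed.
            self.opened_at = self.clock()


def call_with_breaker(breaker, send, fallback):
    """Fail fast to `fallback` while the circuit is open; otherwise try `send`."""
    if not breaker.allow_request():
        return fallback()
    try:
        result = send()
    except Exception:
        breaker.record_failure()
        return fallback()
    breaker.record_success()
    return result
```

While the circuit is open, `send` is never called, which is what keeps a caller from spending resources on a downstream that is already failing.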