Circuit breaker pattern
Fail fast and recover gracefully when downstream services fail.
The circuit breaker is a design pattern that prevents cascading failures in microservices. When a downstream service (like a payment database) starts failing, the breaker detects the run of failures, stops sending requests to it, and returns errors immediately instead. This gives the failing service time to recover and prevents your service from wasting resources on doomed requests.
The pattern has three states: CLOSED (normal operation, requests flow through), OPEN (too many failures, requests are rejected immediately), and HALF_OPEN (probing for recovery). Once the failure count exceeds a threshold, the breaker opens. After a timeout, it moves to HALF_OPEN and lets a test request through: if the request succeeds, the breaker closes and normal operation resumes; if it fails, the breaker reopens and waits out another timeout.
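To make the three states concrete, here is a minimal single-threaded sketch in Python. The class name, the `failure_threshold` and `reset_timeout` parameters, and their defaults are illustrative assumptions, not part of this template or of any particular library; a production breaker would also need locking and would usually count failures over a time window.

```python
import time
from enum import Enum


class State(Enum):
    CLOSED = "CLOSED"
    OPEN = "OPEN"
    HALF_OPEN = "HALF_OPEN"


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying the dependency."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # failures needed to trip (assumed default)
        self.reset_timeout = reset_timeout          # seconds to stay OPEN (assumed default)
        self.clock = clock                          # injectable so tests can control time
        self.state = State.CLOSED
        self.failures = 0
        self.opened_at = 0.0

    def call(self, func, *args, **kwargs):
        if self.state is State.OPEN:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit is OPEN")  # fast-fail
            self.state = State.HALF_OPEN  # timeout elapsed: let one probe through
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_success(self):
        # Any success (including a HALF_OPEN probe) closes the breaker.
        self.failures = 0
        self.state = State.CLOSED

    def _on_failure(self):
        self.failures += 1
        # A failed probe reopens immediately; otherwise trip once the count
        # reaches the threshold (use > if you prefer "exceeds").
        if self.state is State.HALF_OPEN or self.failures >= self.failure_threshold:
            self.state = State.OPEN
            self.opened_at = self.clock()
```

You would wrap each downstream call as `breaker.call(execute_transaction, order)` (both names hypothetical) and treat `CircuitOpenError` as the fast-fail path.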
When to use this template
- Designing resilience — document how your service handles database outages, API failures, and cascading problems before they happen.
- Explaining service dependencies to engineers — this diagram makes clear that every external call can fail and shows how your service responds.
- Post-incident reviews — annotate the state transitions with what actually happened during an outage (how many requests queued up, when the service recovered).
How to adapt it
Rename "Payment DB" to your actual dependency (Stripe API, Elasticsearch cluster, another microservice) and adjust the threshold and timeout to your system. From there, a few common extensions:
- Add a fallback path: after the circuit opens, show a degraded-mode response (e.g. queue the payment for asynchronous processing instead of handling it synchronously) so your service still works with reduced features.
- Layer in logging and alerts: annotate the OPEN transition with a monitoring alert so on-call engineers know the circuit tripped and can investigate the root cause.
- Show a cache fallback: if the circuit opens, the service might read from a cache instead of the database, trading freshness for availability.
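If you want to see what a fallback path might look like in code, here is a small Python sketch of the cache idea. It assumes `fetch` is a breaker-wrapped call that raises a hypothetical `CircuitOpenError` while the circuit is OPEN, and that `cache` is a plain dict; every name here is illustrative rather than taken from a real library.

```python
class CircuitOpenError(Exception):
    """Hypothetical error a breaker raises while it is OPEN."""


def get_with_fallback(key, fetch, cache):
    """Return (value, source). Prefer live data; fall back to the cache."""
    try:
        value = fetch(key)
    except CircuitOpenError:
        if key in cache:
            return cache[key], "cache"  # possibly stale, but available
        raise                           # nothing cached: surface the failure
    cache[key] = value                  # refresh the cache on every success
    return value, "live"
```

Returning the source alongside the value lets the caller flag degraded responses, for example by logging them or adding a "data may be stale" hint to the reply.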
Visual edits regenerate clean code, so you can adjust failure thresholds and response behavior without rewriting the pattern.
Mermaid code
Copy it anywhere Mermaid is supported — GitHub, Notion, or your docs.
sequenceDiagram
participant Client
participant Service as Payment Service
participant Circuit as Circuit Breaker
participant DB as Payment DB
Client->>Service: Process payment
Service->>Circuit: Check state
Circuit-->>Service: CLOSED (pass through)
Service->>DB: Execute transaction
DB-->>Service: Success
Service-->>Client: 200 OK
Note over Circuit: Requests succeed, breaker stays CLOSED
Client->>Service: Process payment
Service->>Circuit: Check state
Circuit-->>Service: CLOSED (pass through)
Service->>DB: Execute transaction
DB-->>Service: Timeout!
Service->>Circuit: Record failure
Circuit-->>Service: Still CLOSED
Service-->>Client: 500 Error
Note over Circuit: Failures accumulate...
Circuit->>Circuit: Failure count > threshold
Circuit-->>Service: OPEN (fast-fail)
Client->>Service: Process payment
Service->>Circuit: Check state
Circuit-->>Service: OPEN (reject immediately)
Service-->>Client: 503 Service Unavailable
Note over Circuit: Circuit is OPEN, DB gets breathing room
Circuit->>Circuit: Wait timeout seconds
Circuit-->>Service: HALF_OPEN (probe)
Service->>DB: Test request
DB-->>Service: Success!
Service->>Circuit: Report probe success
Circuit-->>Service: CLOSED (recovered)
Frequently asked questions
- What is the circuit breaker pattern?
- It's a resilience pattern that wraps calls to an unreliable service. When failures spike (e.g. the database times out 5 times in a row), the circuit breaker opens and stops sending requests, returning errors immediately instead. After a timeout, it probes the service; if the probe succeeds, it closes and resumes normal operation, and if the probe fails, it stays open for another timeout.
- Why not just let requests time out on a broken service?
- Timeouts waste resources and slow down the client. If your database is down, hammering it with requests can keep it down longer and ties up threads in your service while they wait. A circuit breaker detects the problem, stops sending requests, and gives the database time to recover. Clients get fast failures instead of slow timeouts.
- What do CLOSED, OPEN, and HALF_OPEN mean?
- CLOSED: normal operation, all requests go through. OPEN: the service is failing, so requests are rejected immediately (fast-fail). HALF_OPEN: the service might be recovering, so the breaker sends a test request. If the request succeeds, the breaker transitions to CLOSED; if it fails, the breaker returns to OPEN. Probing with a single request before closing keeps the breaker from flapping between OPEN and CLOSED while the service is still unstable.
- How do I set the failure threshold and timeout?
- The threshold depends on how much failure you can tolerate: 5 failures in 10 seconds is a common starting point for APIs (fast detection). The timeout here is how long the breaker stays OPEN before probing, usually 30–60 seconds (enough for the service to restart). Start conservative and adjust based on your incident reviews: if a database restart takes 2 minutes, the timeout should be longer than that.
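A rule like "5 failures in 10 seconds" is usually implemented as a sliding window over failure timestamps. The Python sketch below shows one way to do that; the class name, parameter names, and defaults are assumptions chosen to mirror the numbers in the answer above, not a reference implementation.

```python
import time
from collections import deque


class SlidingWindowFailures:
    def __init__(self, max_failures=5, window_seconds=10.0, clock=time.monotonic):
        self.max_failures = max_failures      # assumed default: trip at 5 failures
        self.window_seconds = window_seconds  # assumed default: within 10 seconds
        self.clock = clock                    # injectable so tests can control time
        self._timestamps = deque()

    def record_failure(self):
        """Record a failure; return True if the breaker should trip."""
        now = self.clock()
        self._timestamps.append(now)
        # Drop failures that have aged out of the window.
        while self._timestamps and now - self._timestamps[0] > self.window_seconds:
            self._timestamps.popleft()
        return len(self._timestamps) >= self.max_failures
```

Because old failures age out, a slow trickle of errors (say, one every few seconds) never trips the breaker, while a burst does. Tuning `max_failures` and `window_seconds` is how you trade detection speed against false alarms.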