Message queue retry logic
Publish-subscribe with exponential backoff and dead-letter queues.
Most production systems need asynchronous work: sending emails, processing payments, updating analytics. This sequence diagram shows a reliable way to do it with a message queue: a producer publishes an event, the queue delivers it to a consumer, and if the consumer fails, the queue retries with exponential backoff. If all retries fail, the message moves to a dead-letter queue (DLQ) where engineers can inspect and replay it.
Without this pattern, you can lose messages silently. A consumer crashes mid-processing, the message is gone, and a customer's email never arrives. With a queue and retries, transient failures (database timeout, network blip) recover automatically, and permanent failures (bad code, malformed messages) end up in the dead-letter queue for debugging instead of disappearing.
When to use this template
- Event-driven architecture — design your retry and DLQ strategy before building the consumers, so you know how many retries to allow and who will monitor the DLQ.
- Incident response — when a consumer bug causes messages to pile up in the DLQ, this diagram explains the workflow for fixing the bug and replaying the queue.
- Consumer documentation — show new developers what the queue expects (message format, error codes, when to throw retryable vs permanent exceptions).
How to adapt it
Extend the diagram to match your queue topology and business SLAs:
- Add a second consumer (or multiple consumers reading from the same queue) to show how competing consumers each process a subset of messages.
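- Insert a circuit breaker step before the database call, so the consumer fails fast when downstream is clearly unavailable instead of using up its retries and pushing otherwise healthy messages into the DLQ.
- Add a priority queue alongside the backoff, so critical events (payments) jump ahead of bulk work (analytics). Priority decides which message is delivered first; backoff still decides how long a failed message waits before redelivery.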
Visual edits regenerate clean Mermaid code, so you can adapt this template to your actual queue configuration by editing participant names, message labels, and backoff timings in the editor.
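The circuit breaker adaptation listed above can be sketched in a few lines. This is a minimal illustration, not a production library: the `CircuitBreaker` class, its `failure_threshold` and `reset_after` parameters, and `CircuitOpenError` are all hypothetical names chosen for this example.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is skipped."""


class CircuitBreaker:
    """Hypothetical minimal breaker: opens after N consecutive failures."""

    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after  # seconds the breaker stays open
        self.clock = clock              # injectable so tests need not sleep
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Fail fast: do not touch the downstream service at all.
                raise CircuitOpenError("downstream marked unavailable")
            # Cool-down elapsed: allow one trial call (half-open state).
            self.opened_at = None
            self.failures = self.failure_threshold - 1
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        self.failures = 0  # any success closes the breaker fully
        return result
```

In the diagram's terms, the consumer would wrap its database write in `breaker.call(...)` and answer a `CircuitOpenError` with a retryable NACK, so the message waits out the backoff without adding load to a database that is already down.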
Mermaid code
Copy it anywhere Mermaid is supported — GitHub, Notion, or your docs.
sequenceDiagram
    actor Producer
    participant Queue as Message Queue
    participant Consumer
    participant DB as Database
    Producer->>Queue: Publish order.created event
    Queue-->>Producer: Event enqueued
    Queue->>Consumer: Deliver message (attempt 1)
    Consumer->>DB: Process order (write to DB)
    alt Success
        DB-->>Consumer: Order created
        Consumer-->>Queue: ACK (acknowledge)
        Queue->>Queue: Delete message
    else Failure - retryable
        Consumer-->>Queue: NACK with retry
        Queue->>Queue: Increment retry count, backoff 2s
        Note over Queue: Wait 2 seconds
        Queue->>Consumer: Redeliver (attempt 2)
        Consumer->>DB: Process order
        DB-->>Consumer: Success
        Consumer-->>Queue: ACK
        Queue->>Queue: Delete message
    else Failure - max retries exceeded
        Consumer-->>Queue: NACK final
        Queue->>Queue: Move to dead-letter queue
    end
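To make the three branches of the diagram concrete, here is a small in-memory simulation in Python. It is a sketch under stated assumptions: `SimulatedQueue`, `Message`, `max_attempts`, and `base_delay` are hypothetical names, not the API of any real broker, and the delays are recorded in a list rather than slept so the example runs instantly.

```python
from dataclasses import dataclass, field


@dataclass
class Message:
    body: str
    attempts: int = 0


@dataclass
class SimulatedQueue:
    """In-memory stand-in for a broker; not a real queue client."""
    max_attempts: int = 3
    base_delay: float = 2.0  # seconds, matching "backoff 2s" in the diagram
    dead_letters: list = field(default_factory=list)
    waits: list = field(default_factory=list)  # delays a real broker would wait

    def deliver(self, message, handler):
        """Deliver until the handler succeeds or attempts run out."""
        while message.attempts < self.max_attempts:
            message.attempts += 1
            try:
                handler(message)
            except Exception:
                if message.attempts == self.max_attempts:
                    break  # "NACK final" branch
                # "NACK with retry": the delay doubles after each failure.
                self.waits.append(self.base_delay * 2 ** (message.attempts - 1))
                continue
            return "acked"  # "ACK" branch: the broker deletes the message
        self.dead_letters.append(message)  # "Move to dead-letter queue"
        return "dead-lettered"


def flaky(message):
    """Fails on the first attempt only, like a transient database timeout."""
    if message.attempts < 2:
        raise TimeoutError("database timeout")


queue = SimulatedQueue()
outcome = queue.deliver(Message("order.created"), flaky)
# outcome is "acked" on attempt 2, and queue.waits holds the single 2.0 s delay
```

A handler that always raises would instead produce waits of 2.0 and 4.0 seconds and end with the message in `queue.dead_letters`, which corresponds to the last branch of the diagram.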
Frequently asked questions
- What is a message queue and why do I need one?
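- A message queue (RabbitMQ, Amazon SQS, or a log-based system such as Kafka) lets services communicate asynchronously: a producer publishes an event, and a consumer processes it later. This decouples the services, because the producer doesn't wait for the consumer to finish. If the consumer crashes before acknowledging a message, the message stays in the queue and is delivered again. This diagram shows the key feature: automatic retries with exponential backoff, so transient failures (a database hiccup) recover without operator intervention. How much of the retry and backoff behavior the broker provides out of the box varies by product, so check what yours supports before relying on it.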
- What is exponential backoff and when should I use it?
- Exponential backoff means doubling the wait between retries: 1 second, then 2, then 4, then 8. This is better than retrying immediately because it gives the downstream service time to recover. If a database is overloaded and every consumer retries immediately, the retries add to the load and make the outage worse. This diagram uses a 2-second first delay as a starting point; adjust it based on your service's typical recovery time.
- What is a dead-letter queue (DLQ)?
- When a message fails all retries (the limit is configurable, often 5 to 10 attempts), it is treated as a poison message: something is probably permanently broken. Rather than looping forever or dropping it, the queue moves the message to a dead-letter queue where engineers can inspect it, fix the bug, and replay it. This diagram shows that workflow: retries until the maximum count, then the DLQ. The DLQ is your escape hatch for hard-to-debug production issues.
- How do I know if a failure is retryable or permanent?
- Retryable: database timeout, network hiccup, service temporarily unavailable. Permanent: malformed message, logic error in the consumer, missing configuration. A good consumer returns different status codes or exceptions for each case. If you get a 503 (Service Unavailable), retry. If you get a 400 (Bad Request), send to the DLQ. This diagram assumes the consumer makes that distinction.
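The retryable-versus-permanent distinction in the last answer can be sketched as a small classifier. The 503 and 400 cases come from the text above; the rest of the `RETRYABLE_STATUS` set, the two exception classes, and the `handle` and `call_downstream` names are assumptions made for this example and should be adjusted to what your downstream services actually return.

```python
class RetryableError(Exception):
    """Transient failure: NACK and let the queue redeliver after backoff."""


class PermanentError(Exception):
    """Non-recoverable failure: route the message straight to the DLQ."""


# Assumed mapping. Only 503 is named in the text; the others are common
# "try again later" statuses included here for illustration.
RETRYABLE_STATUS = {408, 429, 502, 503, 504}


def classify_status(status):
    """Translate a downstream HTTP status into a queue decision."""
    if 200 <= status < 300:
        return "ack"
    if status in RETRYABLE_STATUS:
        return "retry"
    return "dead-letter"  # e.g. 400 Bad Request: retrying cannot help


def handle(message, call_downstream):
    """Hypothetical consumer entry point; call_downstream returns a status."""
    status = call_downstream(message)
    decision = classify_status(status)
    if decision == "retry":
        raise RetryableError(f"downstream returned {status}")
    if decision == "dead-letter":
        raise PermanentError(f"downstream returned {status}")
    # Falling through means success, and the caller can ACK the message.
```

Raising two distinct exception types keeps the decision in one place: the code that talks to the broker only needs to map `RetryableError` to a NACK with retry and `PermanentError` to a move to the DLQ.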