Async job queue pattern
Producer queues work, consumer processes it, with retry and dead-letter paths.
An async job queue lets you accept a request, respond immediately, and do the expensive work later. A producer (your request handler) enqueues a job; a consumer (your worker process) dequeues and processes it. If processing fails, the job is retried. If retries are exhausted, the job lands in the dead-letter queue (DLQ) for a human to investigate.
This template shows the happy path (success), the retry path (transient failure), and the dead-letter path (permanent failure). Use it to design your queue behavior before you build it — especially the retry count, backoff strategy, and what "permanent failure" means in your domain.
When to use this template
- Email/notification system design — user submits a form, gets an instant confirmation, and the email sends in the background. Retries handle temporary mail-server outages.
- Batch processing — a report takes 30 seconds to generate. Queue it, return a "report pending" response, generate it in a worker, and email the result.
- Third-party API integration — calling Stripe, Twilio, or a map service can time out or fail transiently. The queue + retry pattern isolates your app from their blips.
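The producer side of all three cases has the same shape: accept the request, enqueue a job, and answer right away with a job ID the client can check later. The following is a minimal in-memory sketch, not a production design. The names `JobStore` and `submit_report_request` are hypothetical, and `queue.Queue` stands in for whatever broker you actually use.

```python
import queue
import uuid
from dataclasses import dataclass, field


@dataclass
class JobStore:
    """In-memory stand-in for a real broker plus a status table (assumption)."""
    pending: queue.Queue = field(default_factory=queue.Queue)
    status: dict = field(default_factory=dict)

    def enqueue(self, kind: str, payload: dict) -> str:
        job_id = str(uuid.uuid4())
        self.status[job_id] = "pending"
        self.pending.put({"id": job_id, "kind": kind, "payload": payload})
        return job_id


def submit_report_request(store: JobStore, user_email: str) -> dict:
    """Hypothetical request handler: enqueue the work and return right away."""
    job_id = store.enqueue("generate_report", {"email": user_email})
    # The slow work has not run yet; the caller only gets a handle to it.
    return {"job_id": job_id, "status": store.status[job_id]}


store = JobStore()
response = submit_report_request(store, "user@example.com")
```

The handler never touches the report generator itself, which is what keeps the response time independent of how long the job takes.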
How to adapt it
Customize the failure recovery and feedback paths:
- Exponential backoff — show the retry delays (1s, 2s, 4s, 8s) between attempts so teams understand how long failed jobs can block a user's workflow.
- Webhook callback — after success, the worker calls back to your app with the result, or writes to a result table. Show that feedback loop so you know when work completes.
- Monitoring and alerting — add a step where the worker publishes metrics (job duration, error rate) so you can alert if the DLQ starts growing or workers are slow.
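To put numbers on the exponential backoff item, it can help to compute the delay schedule and the total time a job may spend waiting before it is given up on. This is a small sketch; the function name `backoff_delays` and its parameters are illustrative assumptions, and the 1s base with doubling matches the example delays listed above.

```python
def backoff_delays(max_retries: int, base_seconds: float = 1.0, factor: float = 2.0) -> list[float]:
    """Delay before each retry: base, base*factor, base*factor**2, ..."""
    return [base_seconds * factor ** attempt for attempt in range(max_retries)]


delays = backoff_delays(4)       # [1.0, 2.0, 4.0, 8.0]
worst_case_wait = sum(delays)    # 15.0 seconds of waiting, excluding processing time
```

With four retries the job can sit in backoff for 15 seconds in total, so any user-facing status message should assume at least that much delay on the failure path.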
Mermaid code
Copy it anywhere Mermaid is supported — GitHub, Notion, or your docs.
sequenceDiagram
participant Producer as Request Handler
participant Queue as Job Queue
participant Worker as Worker Process
participant Service as External Service
participant DLQ as Dead-Letter Queue
Producer->>Queue: Enqueue job (payload)
Queue-->>Producer: Job ID
Worker->>Queue: Poll for jobs
Queue-->>Worker: Dequeue job
Worker->>Service: Process (call API, compute, etc.)
alt Success
Service-->>Worker: Result
Worker->>Queue: Mark job complete
else Timeout/Failure
Worker->>Queue: Increment retry count
alt Max retries exceeded
Queue->>DLQ: Move to dead-letter queue
Note over DLQ: Held for manual review
else Retry available
Queue->>Queue: Re-enqueue job after backoff delay
end
end
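The worker side of the diagram can be sketched in a few lines of Python. This is an in-memory illustration under stated assumptions, not a reference implementation: `run_worker`, `flaky_service`, and `MAX_RETRIES = 3` are hypothetical, a `deque` stands in for the job queue, a list stands in for the dead-letter queue, and the backoff wait is omitted so the example runs instantly.

```python
from collections import deque

MAX_RETRIES = 3  # assumption: pick a value that fits your domain


def run_worker(jobs: deque, dead_letters: list, process) -> list:
    """Drain the queue, retrying failed jobs and dead-lettering exhausted ones."""
    completed = []
    while jobs:
        job = jobs.popleft()                         # Dequeue job
        try:
            job["result"] = process(job["payload"])  # Process (call API, compute, etc.)
            completed.append(job)                    # Mark job complete
        except Exception as exc:
            job["retries"] += 1                      # Increment retry count
            job["last_error"] = repr(exc)
            if job["retries"] > MAX_RETRIES:         # Max retries exceeded
                dead_letters.append(job)             # Move to dead-letter queue
            else:
                jobs.append(job)                     # Retry available: back to the queue
    return completed


calls = {}


def flaky_service(payload):
    """Hypothetical external call: fails twice for 'flaky', always for 'broken'."""
    calls[payload] = calls.get(payload, 0) + 1
    if payload == "broken" or (payload == "flaky" and calls[payload] <= 2):
        raise TimeoutError(f"{payload} timed out")
    return f"{payload} done"


jobs = deque({"payload": p, "retries": 0} for p in ["ok", "flaky", "broken"])
dlq = []
done = run_worker(jobs, dlq, flaky_service)
```

Here the counter is incremented on every failure, so a job gets one initial attempt plus `MAX_RETRIES` retries before it is dead-lettered. A real worker would also wait out the backoff delay before a re-enqueued job becomes visible again.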
Frequently asked questions
- What is the job queue pattern and why do I need it?
- It decouples work producers from workers: a request handler enqueues a job and returns to the user immediately; a separate worker process dequeues and completes it. This pattern lets you scale workers independently, handle failures gracefully with retries, and prevent a slow task from blocking the whole application. It suits work such as sending emails, generating reports, or calling slow or unreliable APIs.
- What should go in the dead-letter queue?
- Jobs that have exhausted their retries go to the DLQ so a human can investigate. Common causes: the external service is permanently broken, the input data is malformed, or a secret or API key has expired. The DLQ is your early-warning system for systemic failures, so alert on its size: if it is growing, something is wrong.
- How many times should a job retry before giving up?
- Typically 3–7 retries with exponential backoff (wait 1s, then 2s, then 4s, and so on). If the cause was transient (a network blip, a slow service), a retry usually succeeds. If the cause is systemic (the service is down for hours, the data is corrupt), every retry fails and the job lands in the DLQ for manual review. Exponential backoff also avoids hammering a service that is recovering.
- Can I use this pattern for critical work like payment processing?
- Yes, but be careful: if a payment job lands in the DLQ, a human must investigate to confirm the customer was neither double-charged nor left uncharged. Make jobs idempotent (every job carries a unique key, and reprocessing with the same key is safe) so that retries don't duplicate work. Use this pattern for critical work only when you have monitoring and runbooks in place.
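The idempotency idea from the last answer can be illustrated with a small sketch. Everything here is hypothetical: `PaymentProcessor`, the `order-1001` key, and the in-memory dictionary are stand-ins, and a real system would store keys in a database with an atomic check-and-insert rather than a plain dict.

```python
class PaymentProcessor:
    """Hypothetical payment worker that is safe to re-run for the same job."""

    def __init__(self):
        self.results = {}        # idempotency key -> stored outcome (a DB table in practice)
        self.charges_made = 0    # counts real side effects, for demonstration only

    def charge(self, idempotency_key: str, amount_cents: int) -> dict:
        # A retry with a key we have already handled returns the stored outcome
        # instead of charging the customer again.
        if idempotency_key in self.results:
            return self.results[idempotency_key]
        self.charges_made += 1   # stand-in for the real call to a payment provider
        outcome = {"key": idempotency_key, "amount_cents": amount_cents, "status": "charged"}
        self.results[idempotency_key] = outcome
        return outcome


processor = PaymentProcessor()
first = processor.charge("order-1001", 2500)
retry = processor.charge("order-1001", 2500)  # e.g. the worker crashed before marking the job complete
```

Both calls return the same outcome and only one charge is recorded, which is the property that makes queue retries safe for this kind of work.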