Service health check loop
Monitoring, alerting, and recovery when services become unhealthy.
Every production system needs to know when its services are down. This diagram shows the health check loop that keeps infrastructure resilient: a monitor pings your service on a schedule, counts failures, and alerts the team when the count exceeds a threshold. The key insight is the threshold — one failed ping doesn't mean your service is broken, but three in a row probably does. The recovery step shows what happens next: automatic remediation (restart, failover, switch traffic) before escalating to a human.
Teams that skip health checks ship services that fail silently. A database connection pool exhausts, or a memory leak causes slowness, or a downstream dependency becomes unavailable — and nobody knows until a customer's transaction times out. This diagram is how you automate visibility: you catch the failure, alert the team, and attempt recovery before customer impact.
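The threshold idea can be made concrete with a small counter. What follows is a minimal sketch, not part of the template: the `FailureTracker` name, the default threshold of 3, and the choice to fire when the streak reaches the threshold (so "three in a row" triggers the alert) are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class FailureTracker:
    """Counts consecutive failed checks (hypothetical helper, not from the template)."""

    threshold: int = 3             # assumed: three failures in a row mark the service unhealthy
    consecutive_failures: int = 0

    def record(self, check_passed: bool) -> bool:
        """Record one check result; return True when the service should be marked unhealthy."""
        if check_passed:
            self.consecutive_failures = 0   # a healthy response resets the streak
            return False
        self.consecutive_failures += 1
        return self.consecutive_failures >= self.threshold


tracker = FailureTracker(threshold=3)
results = [False, False, True, False, False, False]   # simulated check outcomes
alerts = [tracker.record(passed) for passed in results]
print(alerts)  # [False, False, False, False, False, True]
```

The single success in the middle resets the streak, so two isolated failures never alert; only the final run of three does.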
When to use this template
- Incident prevention — sketch your monitoring topology before you go live, so you know which services are watched, what counts as unhealthy, and who gets paged.
- On-call runbooks — show new on-call engineers what the health check alerts mean and what the remediation logic tries before they need to wake up and investigate.
- Architecture reviews — in a Kubernetes or microservices environment, confirm that your orchestrator's health checks align with your application's actual failure modes.
How to adapt it
Extend the diagram to match your monitoring stack and SLAs:
- Add a parallel check for database connectivity or external API availability, so you see not just the service but its dependencies.
- Insert a metrics collection step (CPU, memory, disk) that feeds into a decision: if CPU usage is at or above 95%, skip remediation and go straight to escalation.
- Replace the 30-second interval with one derived from your SLA: a financial transaction service might check every 5 seconds, a batch job every 5 minutes.
Visual edits regenerate clean Mermaid code, so you can turn this template into your team's actual monitoring policy by renaming nodes and adjusting timings in the editor.
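One way to picture the adapted policy is as a few tunable numbers plus a decision function. This is a hedged sketch only: `CheckPolicy`, `next_action`, and the field names are hypothetical, and the defaults simply mirror the figures used on this page (30-second interval, three failures, 95% CPU).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CheckPolicy:
    """Tunable values for the loop (names and defaults are illustrative assumptions)."""

    interval_seconds: float = 30.0        # how long to wait between checks
    failure_threshold: int = 3            # consecutive failures before acting
    cpu_escalation_percent: float = 95.0  # at or above this, skip remediation


def next_action(policy: CheckPolicy, consecutive_failures: int, cpu_percent: float) -> str:
    """Return 'wait', 'remediate', or 'escalate' for the current observations."""
    if consecutive_failures < policy.failure_threshold:
        return "wait"        # not enough evidence yet; keep polling
    if cpu_percent >= policy.cpu_escalation_percent:
        return "escalate"    # metrics branch: go straight to a human
    return "remediate"       # default branch: try automatic recovery first


policy = CheckPolicy()
print(next_action(policy, 1, 40.0))  # wait
print(next_action(policy, 3, 40.0))  # remediate
print(next_action(policy, 3, 97.0))  # escalate
```

Keeping the numbers in one object makes it easy to give a payment service a 5-second interval and a batch job a 5-minute one without touching the decision logic.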
Mermaid code
Copy it anywhere Mermaid is supported — GitHub, Notion, or your docs.
flowchart TD
A[Health check scheduler runs] --> B[Send ping to service endpoint]
B --> C{Endpoint responds within timeout?}
C -->|Yes| D{Status code 200?}
D -->|Yes| E[Record healthy state and reset failure count]
E --> F[Wait 30 seconds]
F --> A
D -->|No| G[Increment failure count]
C -->|No| G
G --> H{Failures exceed threshold?}
H -->|No| F
H -->|Yes| I[Mark service unhealthy]
I --> J[Send alert to on-call]
J --> K[Trigger remediation]
K --> L{Recovery attempt successful?}
L -->|Yes| E
L -->|No| M[Page on-call engineer via PagerDuty]
M --> N[Engineer investigates]
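The remediation branch of the flowchart (trigger remediation, check recovery, page if it fails) can be sketched in a few lines of Python. This is an illustration under assumptions, not a reference implementation: `recover_or_escalate` and the callables passed to it are hypothetical, and trying several remediation steps in order is one possible reading of the single "Trigger remediation" node.

```python
from typing import Callable, List, Sequence


def recover_or_escalate(
    remediations: Sequence[Callable[[], None]],
    is_healthy: Callable[[], bool],
    page_engineer: Callable[[str], None],
) -> bool:
    """Try each remediation in order; page a human only if none restores health."""
    for remediate in remediations:
        remediate()
        if is_healthy():
            return True      # recovery succeeded, back to the healthy state
    page_engineer("automatic remediation did not restore the service")
    return False


# Simulated environment: the restart does nothing, the failover works.
state = {"healthy": False}
pages: List[str] = []


def restart() -> None:
    pass                     # stand-in for a real restart command


def failover() -> None:
    state["healthy"] = True  # stand-in for switching traffic to a replica


recovered = recover_or_escalate([restart, failover], lambda: state["healthy"], pages.append)
print(recovered, pages)  # True []
```

Passing the actions in as callables keeps the escalation logic independent of whichever restart, failover, or paging tools a team actually uses.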
Frequently asked questions
- What is a health check and why do I need one?
- A health check is a lightweight request (usually an HTTP GET) that asks 'is this service working?' If the check fails repeatedly, the monitoring system assumes something is broken and alerts the team. This diagram shows the full loop: periodic checks, threshold counting, alerting, and recovery attempts. Without health checks, your service can fail silently for hours before a customer complains.
- How often should I run health checks?
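- It depends on your tolerance for downtime. For critical services, run checks every 10–30 seconds. For less critical services, every 2–5 minutes is reasonable. Check too often and the probes themselves add load and noise; check too rarely and failures go unnoticed for longer. Detection time is roughly the interval multiplied by the failure threshold, so a 30-second interval with a threshold of three means an outage can take up to about 90 seconds to trigger an alert. The diagram uses 30 seconds as a typical starting point; adjust based on your SLA.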
- Should my health check endpoint do real work, or just return OK?
- A deep health check queries the database, calls a downstream dependency, and verifies the service can actually do its job. A shallow health check only confirms that the process is running and can answer requests. Use deep checks for critical paths and shallow checks for high-volume services where a deep check would add too much latency. The diagram works with either: the endpoint just needs to return 200 quickly when the service is healthy.
- What happens if a health check alerts while the service is still recovering?
- That's exactly what the failure threshold is for. If your service is restarting and takes 5 seconds to come up, set your threshold to 2–3 consecutive failures so a single brief blip doesn't fire an alert. The 'recovery attempt' step in this diagram gives remediation (restart, failover) a chance to work before waking up the on-call engineer.
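To make the shallow versus deep distinction from the questions above more tangible, here is a minimal sketch of the two kinds of check as plain functions that return a status code and a body. Everything in it is assumed for illustration: the function names, the dependency names, and the use of 503 for an unhealthy result (the diagram itself only requires that a healthy service returns 200).

```python
from typing import Callable, Dict, Tuple

HealthResult = Tuple[int, str]  # (HTTP status code, response body)


def shallow_check() -> HealthResult:
    """Only proves the process is up and able to answer."""
    return 200, "ok"


def deep_check(dependencies: Dict[str, Callable[[], bool]]) -> HealthResult:
    """Run each dependency probe; report 503 if any probe fails or raises."""
    failed = []
    for name, dependency_ok in dependencies.items():
        try:
            if not dependency_ok():
                failed.append(name)
        except Exception:
            failed.append(name)   # a crashing probe counts as a failed dependency
    if failed:
        return 503, "unhealthy: " + ", ".join(failed)
    return 200, "ok"


# Hypothetical dependency probes; real ones would query the database or call an API.
status, body = deep_check({"database": lambda: True, "payments_api": lambda: False})
print(status, body)  # 503 unhealthy: payments_api
```

In a real service these functions would sit behind the health endpoint, with each probe given a short timeout so that the deep check still answers quickly.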