Question 1

What is a health check and why do I need one?

Accepted Answer

A health check is a lightweight request (usually an HTTP GET) that asks 'is this service working?' If the check fails repeatedly, the monitoring system assumes something is broken and alerts the team. This diagram shows the full loop: periodic checks, threshold counting, alerting, and recovery attempts. Without health checks, your service can fail silently for hours before a customer complains.
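As a rough illustration of the "lightweight request" described above, here is a minimal probe sketch using only the Python standard library. The function name `probe`, the 2-second default timeout, and the "any 2xx counts as healthy" rule are assumptions made for this example, not something the template prescribes.

```python
import http.client
import urllib.request


def probe(url: str, timeout: float = 2.0) -> bool:
    """Return True if a GET to `url` answers with a 2xx status within `timeout` seconds."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (OSError, http.client.HTTPException):
        # Covers refused connections, timeouts, DNS failures, and non-2xx
        # responses (urllib raises HTTPError, an OSError subclass, for 4xx/5xx).
        return False
```

A scheduler would call `probe(...)` once per interval and feed the boolean result into whatever counts failures; the probe itself keeps no state.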

Question 2

How often should I run health checks?

Accepted Answer

It depends on your tolerance for downtime. For critical services, run checks every 10–30 seconds. For less critical services, every 2–5 minutes is reasonable. Check too often and the probes themselves add load and log noise; check too rarely and a failure can go unnoticed for minutes. The diagram uses 30 seconds as a typical starting point; adjust it based on your SLA.
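One way to reason about the trade-off above is to estimate how long a failure could go unnoticed at a given interval, and how many probes that interval costs. The sketch below assumes checks start at a fixed rate and that an alert fires after a number of consecutive failures. The function names and the 5-second timeout are illustrative assumptions.

```python
def worst_case_detection_seconds(interval_s: float, threshold: int, timeout_s: float) -> float:
    """Upper bound on the time from failure to alert.

    Assumes the service breaks just after a passing check, a new check starts
    every `interval_s` seconds, and the alert fires once `threshold`
    consecutive checks have failed, each taking up to `timeout_s` to fail.
    """
    return threshold * interval_s + timeout_s


def probes_per_day(interval_s: float) -> float:
    """Number of checks a single checker sends in 24 hours."""
    return 86_400 / interval_s


print(worst_case_detection_seconds(30, 3, 5))   # 95 seconds
print(worst_case_detection_seconds(300, 3, 5))  # 905 seconds, about 15 minutes
print(probes_per_day(30))                       # 2880.0
```

Under these assumptions, a 30-second interval with a threshold of 3 detects a failure in about a minute and a half, while a 5-minute interval can take roughly 15 minutes.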

Question 3

Should my health check endpoint do real work, or just return OK?

Accepted Answer

A deep health check queries the database, calls a downstream dependency, and verifies the service can actually do its job. A shallow health check only confirms that the process is running and able to respond. Use deep checks for critical paths and shallow checks for high-volume services where a deep check would add too much latency. This diagram works with either kind; the endpoint only needs to return 200 quickly when the service is healthy.
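The difference between the two kinds of check can be sketched with Python's standard library. This is an illustration under stated assumptions: the paths `/health/shallow` and `/health/deep` are hypothetical, an in-memory SQLite `SELECT 1` stands in for whatever database the service really uses, and returning 503 on failure is a common convention rather than something the template requires.

```python
import sqlite3
from http.server import BaseHTTPRequestHandler


def shallow_check() -> tuple[int, str]:
    # If this code runs at all, the process is up and able to respond.
    return 200, "ok"


def deep_check(connect=lambda: sqlite3.connect(":memory:")) -> tuple[int, str]:
    # `connect` is a stand-in for the service's real database connection factory.
    try:
        conn = connect()
        try:
            conn.execute("SELECT 1").fetchone()
        finally:
            conn.close()
    except Exception as exc:
        return 503, f"dependency check failed: {exc}"
    return 200, "ok"


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/health/shallow":
            status, body = shallow_check()
        elif self.path == "/health/deep":
            status, body = deep_check()
        else:
            status, body = 404, "not found"
        payload = body.encode()
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)
```

To try it locally, pass `HealthHandler` to `http.server.HTTPServer` and call `serve_forever()`. A real deep check would also put a short timeout on each dependency call so the endpoint still answers quickly when a dependency hangs.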

Question 4

What happens if a health check alerts while the service is still recovering?

Accepted Answer

That's exactly what the failure threshold is for. If your service takes 5 seconds to restart, it can fail at most one check at an interval of 10 seconds or more, so a threshold of 2–3 consecutive failures keeps that brief blip from firing an alert. The 'recovery attempt' step in this diagram gives remediation (restart, failover) a chance to work before waking up the on-call engineer.
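The threshold-then-recovery behaviour described above can be sketched as a small state tracker. The class name `FailureTracker`, the action strings, and the choice to restart the failure count after a recovery attempt are assumptions made for this illustration; the diagram itself does not fix these details.

```python
from dataclasses import dataclass


@dataclass
class FailureTracker:
    threshold: int = 3            # consecutive failures needed before acting
    failures: int = 0
    recovery_tried: bool = False

    def record(self, healthy: bool) -> str:
        """Feed in one check result; get back the action to take."""
        if healthy:
            self.failures = 0
            self.recovery_tried = False
            return "ok"
        self.failures += 1
        if self.failures < self.threshold:
            return "wait"         # a brief blip: keep checking, no alert yet
        if not self.recovery_tried:
            self.recovery_tried = True
            self.failures = 0     # give the restart or failover a fresh window
            return "recover"
        return "alert"            # remediation did not help: page the on-call


tracker = FailureTracker(threshold=3)
results = [False, True, False, False, False, False, False, False, True]
print([tracker.record(r) for r in results])
# ['wait', 'ok', 'wait', 'wait', 'recover', 'wait', 'wait', 'alert', 'ok']
```

In this sketch a single failed check followed by a success never escalates, and the on-call engineer is paged only when the service is still failing a full threshold window after the recovery attempt.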

Service health check loop

