Observability stack architecture
Metrics, logs, traces, dashboards, and alerting for production systems.
Production systems fail in surprising ways, and without observability, you are troubleshooting blind. This template maps the layers of an observability stack: application code emitting metrics, logs, and traces; collectors (Prometheus, Fluentd, Jaeger) gathering signals; aggregators (time-series databases, log indexers, trace stores) indexing them; query engines (driven by query languages such as PromQL and LogQL) making them searchable; dashboards for exploration; and alerting rules that notify on-call when something breaks.
Each layer has a purpose. Metrics answer "Is the system slow?" Logs answer "What error message did we see?" Traces answer "Which microservice is the bottleneck?" Together, they give you enough signal to diagnose and fix incidents before they become outages.
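As a rough illustration of how one request can produce all three signals, here is a minimal Python sketch. It uses only the standard library; the `metrics` dictionary, the `spans` list, and `handle_request` are hypothetical stand-ins for a real metrics client and tracer, not the API of any particular tool.

```python
import logging
import time
import uuid
from collections import defaultdict

# Hypothetical in-memory stand-ins for a metrics client and a tracer.
metrics = defaultdict(list)   # metric name -> list of observed values
spans = []                    # finished spans: (trace_id, name, seconds)

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("checkout")


def handle_request(fail=False):
    """Handle one request and emit a metric, a log line (on error), and a span."""
    trace_id = uuid.uuid4().hex
    start = time.perf_counter()
    try:
        if fail:
            raise TimeoutError("Payment processor timeout")
        status = "ok"
    except TimeoutError as exc:
        status = "error"
        # Log: what happened, tagged with the trace ID for correlation.
        log.error("%s trace_id=%s", exc, trace_id)
    elapsed = time.perf_counter() - start
    # Metrics: numbers per request, meant to be aggregated over time.
    metrics["request_seconds"].append(elapsed)
    metrics["requests_" + status].append(1)
    # Trace: where the time went for this specific request.
    spans.append((trace_id, "handle_request", elapsed))
    return status
```

Putting the trace ID in the log line is one common way to move from a log entry to the matching trace during an incident.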
When to use this template
- Designing observability infrastructure — map out which signals you collect (metrics only? logs too? traces for latency debugging?), where they go (vendor SaaS or self-hosted), and how you query them before you start shipping terabytes of logs.
- Incident response runbooks — reference this diagram in your runbooks so on-call knows to check the dashboard first, then drill into logs, then trace a specific request for latency profiling.
- Engineering team onboarding — show new hires how to instrument their code (emit metrics) and where to look when something breaks (dashboard first, then logs, then traces).
How to adapt it
Customize the stack to your architecture and budget:
- Add log aggregation filtering (send only errors and warnings to the log store, sample debug logs) to reduce storage costs while keeping high-fidelity data for incidents.
- Insert anomaly detection between metrics and alerts, so the system learns normal baselines and alerts only on statistical outliers, not fixed thresholds.
- Extend alerting to show escalation (page on-call after 5 minutes of high error rate; escalate if no ack after 15 minutes) so incidents are routed to the right team fast.
Visual edits regenerate the class diagram, so you can sketch your stack topology and review it with your infrastructure and platform teams.
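The three adaptations listed above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a recommended implementation: the function names, the 1% sample rate, and the z-score threshold of 3 are hypothetical, and the anomaly check is a plain z-score test standing in for whatever detection method you actually choose. The 5-minute and 15-minute values come from the escalation example above; counting the 15 minutes from the moment the page went out is an assumption.

```python
import random
import statistics


def should_store(level, sample_rate=0.01, rng=random.random):
    """Keep every warning and error; keep only a sample of lower-severity logs.

    Sampling INFO the same way as DEBUG is an assumption of this sketch.
    """
    if level in ("WARNING", "ERROR", "CRITICAL"):
        return True
    return rng() < sample_rate


def is_anomaly(history, value, z_threshold=3.0):
    """Flag a value more than z_threshold standard deviations from the baseline."""
    if len(history) < 2:
        return False  # not enough data to learn a baseline
    mean = statistics.fmean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return value != mean
    return abs(value - mean) / stdev > z_threshold


def escalation_action(firing_minutes, unacked_minutes=0):
    """Page after 5 minutes of high error rate; escalate after 15 minutes without an ack.

    unacked_minutes counts time since the page went out (assumption).
    """
    if firing_minutes < 5:
        return "none"
    if unacked_minutes >= 15:
        return "escalate"
    return "page_on_call"
```

For example, `is_anomaly([100, 102, 98, 101, 99], 500)` flags the spike, while a value of 101 stays within the learned baseline.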
Mermaid code
Copy it anywhere Mermaid is supported — GitHub, Notion, or your docs.
classDiagram
class Application {
emit metrics
emit logs
emit traces
}
class Collector {
Prometheus
Fluentd
Jaeger agent
}
class Aggregator {
Time-series DB
Log indexer
Trace store
}
class QueryEngine {
PromQL
LogQL
Trace queries
}
class Dashboard {
Grafana
Real-time charts
Heatmaps
}
class Alerting {
Alert rules
Email & PagerDuty
Escalation
}
class Runbook {
Incident docs
Recovery steps
}
Application --> Collector : push/pull
Collector --> Aggregator : forward
Aggregator --> QueryEngine : index
QueryEngine --> Dashboard : query
QueryEngine --> Alerting : query
Alerting --> Runbook : trigger
Frequently asked questions
- What is an observability stack and why does my system need one?
- An observability stack collects signals from your application — metrics (latency, error rate, CPU), logs (what happened when), and traces (request path through microservices) — and makes them queryable. Without it, you are flying blind: when users complain, you have no data to diagnose what went wrong. With it, you see problems in real time and have the evidence to fix them fast.
- What is the difference between metrics, logs, and traces?
- Metrics are time-series numbers: request latency, memory usage, error count. Logs are timestamped events: '[ERROR] Payment processor timeout'. Traces are request journeys: 'request entered gateway, called auth service, called payment service, returned'. Each answers different questions. Metrics show trends; logs show what broke; traces show where time is spent.
- Why do dashboards and alerting query the same engine?
- Dashboards let you explore the data; alerting lets the system notify you when something is wrong. Both are queries. A dashboard might ask 'Show me p99 latency over the last hour'. An alert asks 'Is p99 latency above 500ms?' Using the same query engine means your alert thresholds are consistent with what you see on the dashboard.
- How do I adapt this for my cloud provider (AWS, GCP, Azure)?
- Replace the tools with your provider's native services: AWS uses CloudWatch for metrics and logs; GCP uses Cloud Monitoring and Cloud Logging; Azure uses Azure Monitor. Third-party platforms such as Datadog or New Relic offer unified observability across providers. The architecture is the same: collect from your app, aggregate, query, visualize, alert. Visual edits regenerate clean Mermaid, so you can diagram your specific stack.
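To make the FAQ answer about dashboards and alerting sharing one query engine concrete, here is a small Python sketch. The `p99` function plays the role of the shared query; `dashboard_panel` and `alert_firing` are hypothetical names, and the 500 ms threshold is taken from the example above. A real stack would express this as a PromQL or similar query rather than application code.

```python
def p99(latencies_ms):
    """Nearest-rank 99th percentile of a non-empty list of latencies."""
    ordered = sorted(latencies_ms)
    # Integer form of ceil(0.99 * n), avoiding floating-point rounding.
    rank = (99 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def dashboard_panel(latencies_ms):
    """Dashboard view: show the current p99 value."""
    return {"title": "p99 latency (ms)", "value": p99(latencies_ms)}


def alert_firing(latencies_ms, threshold_ms=500):
    """Alert rule: the same query, compared against a threshold."""
    return p99(latencies_ms) > threshold_ms
```

Because both functions call the same `p99`, the number on the dashboard is by construction the number the alert evaluates.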