Auto-scaling decision tree

CPU, memory, request latency, and cost trade-off decisions.

Infrastructure scaling is a series of trade-offs: add capacity and you burn money; skimp on capacity and your API melts under load. This template documents the decision tree: monitor metrics (CPU, memory, latency, cost), then decide whether to scale horizontally (more instances), vertically (bigger instances), or optimize the code. Each path has different cost and complexity implications.

The key insight is that scaling is not always the right answer. High CPU can mean a noisy neighbor (one user's expensive query consuming shared resources), a code bug, or genuine growth. Adding capacity only addresses the last of these; the first two call for throttling or a fix. Checking cost and feasibility before scaling prevents runaway infrastructure bills.

When to use this template

  • Runbook design: operationalize your scaling policy by recording which metrics trigger which actions, and which manual approvals large capacity changes require.
  • Architecture review: use this diagram to discuss whether you scale horizontally, vertically, or through code optimization, and which is cheapest for your workload.
  • Incident playbook: when a service melts under load, walk through this diagram to decide whether immediate scaling is safe or whether you should cap traffic first.

How to adapt it

Customize metrics and thresholds for your service:

  • SLA-driven scaling: replace the latency check with p99 or p95 latency targets. Scale out when p99 latency approaches your SLA limit (e.g., 500 ms for an API).
  • Batch job scaling: for background jobs, replace CPU with queue depth. If the number of queued jobs exceeds a threshold, add workers; if the queue is empty, scale down.
  • Multi-region failover: after "Scale out", check regional capacity. If the local region is saturated, route traffic to a secondary region or queue requests.
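
The first two adaptations can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the 90% headroom factor, and the worker limits are hypothetical and do not come from any particular autoscaler.

```python
import math


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty list of samples."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def should_scale_out(latencies_ms: list[float], sla_ms: float,
                     headroom: float = 0.9) -> bool:
    """Scale out once p99 reaches `headroom` of the SLA limit.

    The 0.9 default is an assumed safety margin, not a recommendation.
    """
    return percentile(latencies_ms, 99) >= headroom * sla_ms


def desired_workers(queue_depth: int, jobs_per_worker: int,
                    min_workers: int = 0, max_workers: int = 20) -> int:
    """Size a worker pool from queue depth, clamped to the allowed range."""
    if jobs_per_worker <= 0:
        raise ValueError("jobs_per_worker must be positive")
    needed = math.ceil(queue_depth / jobs_per_worker)
    return max(min_workers, min(max_workers, needed))
```

With an empty queue, `desired_workers(0, 10)` returns the minimum, which corresponds to the "scale down" branch for batch jobs.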

Visual edits regenerate clean code, so you can tailor metrics and thresholds to your service without rewriting the structure.

Mermaid code

Copy it anywhere Mermaid is supported — GitHub, Notion, or your docs.

flowchart TD
    A[Monitor metrics] --> B{CPU > 70%?}
    B -->|No| C{Memory > 80%?}
    B -->|Yes| D[Scale out pods]
    C -->|No| E{Request latency > SLA?}
    C -->|Yes| D
    E -->|No| F{Cost increasing?}
    E -->|Yes| D
    F -->|No| G[Hold current capacity]
    F -->|Yes| H{Vertical scale viable?}
    H -->|No| I[Scale out, optimize code]
    H -->|Yes| J[Increase instance size]
    D --> K[Apply new capacity]
    I --> K
    J --> K
    G --> L[Continue monitoring]
    K --> L
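
For readers who prefer code to diagrams, the same branches can be written as one function. This is a minimal sketch that mirrors the flowchart above; the `Metrics` fields, including the two boolean flags, are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    cpu_pct: float          # CPU utilization, 0-100
    memory_pct: float       # memory utilization, 0-100
    latency_ms: float       # observed request latency
    sla_ms: float           # latency limit from your SLA
    cost_increasing: bool   # hypothetical flag derived from billing data
    vertical_viable: bool   # hypothetical flag: can the instance size grow?


def decide(m: Metrics) -> str:
    """Return the action the flowchart reaches for these metrics."""
    # Any breached performance threshold leads straight to "Scale out pods".
    if m.cpu_pct > 70 or m.memory_pct > 80 or m.latency_ms > m.sla_ms:
        return "scale_out"
    # Performance is within limits, so the remaining question is cost.
    if not m.cost_increasing:
        return "hold"
    if m.vertical_viable:
        return "increase_instance_size"
    return "scale_out_and_optimize"
```

Every result except `"hold"` corresponds to a path through "Apply new capacity" in the diagram; all paths end at "Continue monitoring".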

Frequently asked questions

What is an auto-scaling decision tree?
It models when and how to scale infrastructure in response to load: checking CPU, memory, latency, and cost, then deciding whether to scale horizontally (more instances) or vertically (bigger instances). It makes the trade-offs between performance, cost, and complexity explicit.
Why check cost before scaling?
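Because scaling fixes the symptom (high CPU) but might not fix the root cause (an inefficient code path or database query). If cost is rising, vertical scaling or code optimization might be cheaper than horizontal scaling. In this diagram the cost check runs once CPU, memory, and latency are within their thresholds, which forces teams to weigh those alternatives instead of adding instances by default.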
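
One way to make that comparison concrete is to list the candidate changes with their cost and capacity, then pick the cheapest one that still meets demand. The sketch below is only an illustration: the `ScalingOption` type is hypothetical and every number in it is made up.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ScalingOption:
    name: str
    hourly_cost: float    # fleet cost per hour after the change (illustrative)
    capacity_rps: float   # requests per second the fleet could then serve


def cheapest_viable(options: list[ScalingOption],
                    required_rps: float) -> Optional[ScalingOption]:
    """Return the lowest-cost option that still meets the required throughput."""
    viable = [o for o in options if o.capacity_rps >= required_rps]
    return min(viable, key=lambda o: o.hourly_cost) if viable else None


# Made-up numbers, for illustration only.
options = [
    ScalingOption("scale out: 6 small instances", 1.20, 600),
    ScalingOption("scale up: 2 large instances", 0.90, 550),
    ScalingOption("optimize the slow query, keep 4 small", 0.80, 520),
]
best = cheapest_viable(options, required_rps=500)
```

With these invented figures the optimization option wins at 500 requests per second, while only scaling out remains viable at 580.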
What metrics should trigger auto-scaling in production?
Use multiple signals: CPU > 70%, memory > 80%, or request latency above your SLA (e.g., p99 > 500ms). Scale before utilization reaches 100%, because new capacity takes time to come online and requests can be dropped in the meantime. For proactive scaling, use predictions: if the trend suggests CPU will hit 90% in 5 minutes, scale now.
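
A simple way to approximate the predictive check is a straight-line fit over recent samples. The sketch below assumes a linear trend, which real workloads often violate, and its function names and defaults (300 seconds, 90%) are illustrative only.

```python
def predict_cpu(samples: list[tuple[float, float]], horizon_s: float) -> float:
    """Least-squares line through (timestamp_s, cpu_pct) samples,
    extrapolated horizon_s seconds past the latest sample."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_c = sum(c for _, c in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples must span more than one timestamp")
    slope = sum((t - mean_t) * (c - mean_c) for t, c in samples) / var_t
    last_t = max(t for t, _ in samples)
    return mean_c + slope * (last_t + horizon_s - mean_t)


def should_scale_now(samples: list[tuple[float, float]],
                     horizon_s: float = 300, limit_pct: float = 90) -> bool:
    """True if the fitted trend reaches limit_pct within the horizon."""
    return predict_cpu(samples, horizon_s) >= limit_pct
```

For example, samples of 60%, 65%, and 70% taken one minute apart extrapolate to roughly 95% five minutes later, so `should_scale_now` returns `True`.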
How do I model pod eviction and graceful shutdown in this diagram?
After 'Scale out pods', add a decision: 'Draining existing pods?' If yes, route through graceful shutdown (wait for in-flight requests to complete before removing the pod). If no, force-terminate. Add a feedback loop: if scale-out fails due to resource limits, escalate to ops or alert the on-call engineer.
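
The graceful-shutdown branch can be sketched as a drain loop. This is not how any particular orchestrator implements it: the `in_flight` callback is a hypothetical hook into your server, and the grace period with a forced fallback is an assumption added here so the loop cannot wait forever.

```python
import time
from typing import Callable


def drain(in_flight: Callable[[], int], grace_period_s: float,
          poll_interval_s: float = 0.1) -> str:
    """Wait for in-flight requests to finish before removing an instance.

    Returns "graceful" if the count reaches zero in time, otherwise
    "force_terminate" once the grace period has elapsed.
    """
    deadline = time.monotonic() + grace_period_s
    while in_flight() > 0:
        if time.monotonic() >= deadline:
            return "force_terminate"
        time.sleep(poll_interval_s)
    return "graceful"
```

The two return values map onto the two outcomes of the 'Draining existing pods?' decision, so the caller can alert or escalate when a forced termination occurs.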
