A/B test result analysis and rollout

Evaluate test data, check statistical significance, then roll out or roll back.

An A/B test ends and you have results, but results alone don't mean you should ship the change. The data might be noisy, the sample might be too small, or the difference might not matter for your business. This diagram walks you through the decision: test validity (was the sample large enough?), statistical significance (is the difference real?), business impact (does it matter?), and production readiness (does it still work when you roll it out to all users?). Each decision point reflects a question that product and engineering teams face every day.

The gradual rollout path (10% → 50% → 100%) is where most lessons are learned. A variant that wins in the test often behaves differently in production: it's slower at scale, it interacts badly with other code, or it exposes an edge case. Staged rollout lets you catch these surprises before they affect 100% of users.

When to use this template

  • Experimentation playbooks — customize for your metrics and thresholds, then share with the team so everyone follows the same decision process.
  • Product review meetings — walk through hypothetical test results and decision paths so stakeholders understand what constitutes a "win".
  • Incident post-mortems — if a bad rollout made it to production, use this diagram to audit where the process broke (sample size? significance check? production validation?).

How to adapt it

Tailor each decision gate to your business:

  • Replace "Sample size sufficient?" with your actual calculation — most A/B testing tools compute this; know your minimum before you start.
  • Replace "Statistically significant?" with your chosen confidence level (95% is standard, but you might want 99% for high-risk features).
  • Customize the "Business impact?" thresholds, since a 2% conversion lift matters for a marketplace but may not matter for an internal tool.
  • Add domain-specific gates — if you're testing a payment flow, add a fraud-rate check before production rollout.
  • Adjust rollout percentages — high-risk features might start at 1%, low-risk at 25%.
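
To make the "Sample size sufficient?" gate concrete, here is a minimal sketch of the textbook sample-size approximation for comparing two conversion rates. The function name, the 80% power default, and the example rates are assumptions for illustration. Your A/B testing tool may use a different method, so treat its number as authoritative.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(baseline_rate, relative_lift, alpha=0.05, power=0.80):
    """Approximate users needed in EACH arm of a two-sided two-proportion test.

    baseline_rate: control conversion rate, e.g. 0.10 for 10%
    relative_lift: smallest relative lift worth detecting, e.g. 0.02 for 2%
    alpha:         significance level (0.05 matches 95% confidence)
    power:         chance of detecting the lift if it exists (assumed default)
    """
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2
    return math.ceil(n)


# Hypothetical inputs: 10% baseline, 2% relative lift.
needed_at_95 = sample_size_per_arm(0.10, 0.02)
needed_at_99 = sample_size_per_arm(0.10, 0.02, alpha=0.01)
```

Moving from 95% to 99% confidence raises the required sample, which is why it helps to settle both the confidence level and the minimum lift before the test starts.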

Visual edits regenerate clean Mermaid code as you adapt this to your test process.

Mermaid code

Copy it anywhere Mermaid is supported — GitHub, Notion, or your docs.

flowchart TD
    A[A/B test ends] --> B[Collect results data]
    B --> C{Sample size sufficient?}
    C -->|No| D[Extend test or increase traffic]
    D --> A
    C -->|Yes| E{Statistically significant?}
    E -->|No| F[Declare inconclusive]
    F --> G[Archive and move on]
    E -->|Yes| H{Which variant won?}
    H -->|Control wins| I[Keep status quo]
    H -->|Variant wins| J{Business impact?}
    J -->|Minimal| K[Archive as low-ROI win]
    J -->|Significant| L[Schedule rollout]
    I --> G
    K --> G
    L --> M[Roll out to 10% of traffic]
    M --> N{Metrics hold up?}
    N -->|No| O[Roll back to control]
    N -->|Yes| P[Expand to 50%]
    O --> G
    P --> Q{Issues observed?}
    Q -->|Yes| O
    Q -->|No| R[Roll out to 100%]
    R --> G
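
If you want to enforce the same gates in a script or review checklist, the pre-rollout half of the diagram can be sketched as a small function. This is only an illustration: the function name, parameter names, and returned labels are hypothetical, and each boolean stands in for whatever check your team defines.

```python
def next_step(sample_ok, significant, variant_won, impact_significant):
    """Walk the pre-rollout gates in the same order as the flowchart."""
    if not sample_ok:
        return "extend_test"        # extend the test or increase traffic
    if not significant:
        return "inconclusive"       # declare inconclusive and archive
    if not variant_won:
        return "keep_control"       # control wins, keep the status quo
    if not impact_significant:
        return "archive_low_roi"    # a real win, but not worth shipping
    return "schedule_rollout"       # proceed to the staged rollout


# Example: a valid, significant variant win with meaningful impact.
decision = next_step(True, True, True, True)
```

The order of the checks matters: a significance result is not meaningful until the sample-size gate has passed, so that gate is evaluated first.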

Frequently asked questions

What does statistical significance mean in an A/B test?
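It means the observed difference would be unlikely if the variants truly performed the same. A p-value under 0.05 is the common threshold: if there were no real difference, a gap at least as large as the one you measured would appear less than 5% of the time. If you see variant A perform 10% better, significance tells you whether that 10% is plausibly real or could easily be random noise. With a small sample or short test, you might see big differences that vanish when you run longer. Statistical significance is your gatekeeper: it guards against chasing false winners.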
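
As an illustration of the p-value check described above, here is a minimal sketch of a two-sided, pooled two-proportion z-test using only the Python standard library. The function name and the conversion counts are hypothetical. A real analysis would normally rely on your testing platform or a statistics library.

```python
import math
from statistics import NormalDist


def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for the difference between two conversion rates."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both arms convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


# The same 10% relative lift (10% vs 11% conversion) at two sample sizes.
p_large = two_proportion_p_value(1000, 10_000, 1100, 10_000)
p_small = two_proportion_p_value(10, 100, 11, 100)
```

With 10,000 users per arm the p-value falls below 0.05, while with 100 users per arm it stays well above it, even though the observed lift is identical.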
How long should I run an A/B test?
Until you reach the sample size needed to detect the effect you care about, typically 1-2 weeks for high-traffic features. Avoid stopping the moment the result first looks significant, because repeatedly checking and stopping early inflates false positives. Longer tests capture weekly and seasonal patterns; shorter tests can miss whole user cohorts. A/B testing platforms calculate confidence intervals and sample-size requirements, so use those numbers rather than guessing. If you still lack significance after reaching that sample size (for example, after 2 weeks), the effect might be too small to matter.
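
A rough way to turn a required sample size into a run length is to divide by daily eligible traffic and round up to whole weeks, so every day of the week is represented equally. The sketch below assumes an even traffic split across arms. The function name and the traffic figures are hypothetical.

```python
import math


def estimate_duration_days(n_per_arm, daily_users, arms=2):
    """Days needed to fill every arm, rounded up to full weeks."""
    days = math.ceil(n_per_arm * arms / daily_users)
    return math.ceil(days / 7) * 7


# Hypothetical: 20,000 users per arm with 5,000 eligible users per day.
duration = estimate_duration_days(20_000, 5_000)  # 8 days of traffic -> 14 days
```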
Why roll out gradually instead of all at once?
Gradual rollout (10% → 50% → 100%) catches issues the test environment missed: unexpected interactions with real data, performance in production at scale, edge cases in specific user segments. If metrics break at 10%, you've only affected 10% of users. If you discover an issue at 100%, the damage is much larger. Staged rollout is cheap insurance.
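
One common way to implement the 10% → 50% → 100% stages is deterministic hash bucketing, so a user who sees the variant at 10% keeps seeing it at 50%. This is a minimal sketch under that assumption. The function name and the feature key are hypothetical, and a feature-flag service would typically handle this for you.

```python
import hashlib


def in_rollout(user_id: str, feature: str, percent: int) -> bool:
    """Return True if this user falls inside the current rollout percentage."""
    # Hash feature and user together so each feature gets its own bucketing.
    digest = hashlib.sha256(f"{feature}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) % 100  # stable bucket in the range 0..99
    return bucket < percent


# Raising the percentage only adds users; nobody already included drops out.
users = [f"user-{i}" for i in range(1000)]
at_10 = {u for u in users if in_rollout(u, "new-checkout", 10)}
at_50 = {u for u in users if in_rollout(u, "new-checkout", 50)}
```

Because the bucket depends only on the feature and user ID, the set of users exposed at 10% is always a subset of the set exposed at 50%, which keeps the experience consistent as you expand.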
What should I do if the variant performs worse in production than in the test?
Roll back immediately. Production at full scale can differ from the test in load, data distribution, and user behavior. Document what broke and whether it reflected real user impact or a measurement issue. Sometimes a test win does not generalize and the variant genuinely is worse in the wild; that's valuable data. Move on to the next test rather than forcing a bad variant.
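
The "Metrics hold up?" gate can be sketched as a simple guardrail that compares the rolled-out metric against control. The function name and the 5% tolerance are assumptions chosen for illustration. Pick a threshold that fits your own metric and risk level.

```python
def should_roll_back(control_rate, rollout_rate, max_relative_drop=0.05):
    """Return True if the rollout metric fell more than the allowed tolerance."""
    if control_rate <= 0:
        raise ValueError("control_rate must be positive")
    relative_change = (rollout_rate - control_rate) / control_rate
    return relative_change < -max_relative_drop


# Hypothetical: control converts at 10%, the rollout cohort at 9%.
roll_back = should_roll_back(0.10, 0.09)  # a 10% relative drop exceeds 5%
```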
