Research Study

False Positive Rates in A/B Testing: The Hidden Problem

Updated December 2025
12 min read
TL;DR

The problem: With a typical threshold (α = 0.05), some false positives are expected by design. Bad practices (peeking, many metrics, changing the plan) can inflate the risk dramatically. The fix isn’t “more math”—it’s better process: one primary metric, a real stopping rule, and discipline.

Important note

Any specific percentages you see quoted online ("X% of winners are false") depend heavily on the experiment design and on how the team actually runs its tests. This article focuses on the mechanisms that create false positives and the practical fixes you can adopt immediately.

The Base Rate Problem

At 95% confidence (p < 0.05), you accept a 5% false positive rate by design. This means:

If you run 20 tests where there's no real difference, on average 1 will show up as "significant" by pure chance.
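To make that concrete, here is a minimal simulation sketch in Python (using numpy and scipy; the conversion rate, sample sizes, and experiment count are arbitrary assumptions). It runs many A/A-style null experiments and counts how often a standard two-proportion test declares significance at α = 0.05.

```python
# Simulate many null A/B tests (both arms share the same true conversion rate)
# and count how often a two-proportion z-test calls the result "significant".
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(42)

def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided pooled z-test for a difference in conversion rates."""
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (conv_a / n_a - conv_b / n_b) / se
    return 2 * norm.sf(abs(z))

n_experiments = 10_000   # number of simulated A/A tests (assumption)
n_per_arm = 5_000        # visitors per arm (assumption)
true_rate = 0.10         # identical in both arms, so every "win" is false
alpha = 0.05

false_positives = sum(
    two_proportion_p_value(
        rng.binomial(n_per_arm, true_rate), n_per_arm,
        rng.binomial(n_per_arm, true_rate), n_per_arm,
    ) < alpha
    for _ in range(n_experiments)
)
print(f"False positive rate: {false_positives / n_experiments:.3f}")  # ~0.05
```

With one fixed-horizon look per experiment, the simulated rate lands near the nominal 5%: the base rate you accept by design.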

What Inflates False Positive Rates

Peeking at results (impact: high)

Checking results daily and stopping as soon as something looks "significant" inflates the false positive rate well beyond the nominal 5% (a simulation after this list shows how large the inflation can be).

Multiple metrics (impact: high)

Looking at many metrics increases the chance that at least one looks "significant" by chance alone.

Small samples / noisy metrics (impact: medium)

Underpowered tests produce unstable estimates and apparent winners that flip between checks.

Changing the plan mid-test (impact: high)

Tweaking targeting, variants, or goals mid-run invalidates the inference you planned for.
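The peeking entry deserves a number. The sketch below reuses the same null A/A setup as before but checks the p-value at 20 interim looks and stops at the first one under 0.05 (the look count, sample sizes, and rates are arbitrary assumptions).

```python
# Same null setup, but now the p-value is checked at 20 interim "looks"
# and the test stops at the first p < 0.05. With a single look this should
# happen 5% of the time; with peeking it happens far more often.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(7)

n_experiments = 2_000    # simulated A/A tests (assumption)
n_per_arm = 10_000       # final visitors per arm (assumption)
n_looks = 20             # times someone "checks the dashboard" (assumption)
true_rate = 0.10
alpha = 0.05

look_points = np.linspace(n_per_arm / n_looks, n_per_arm, n_looks).astype(int)

def p_value(conv_a, n_a, conv_b, n_b):
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (conv_a / n_a - conv_b / n_b) / se
    return 2 * norm.sf(abs(z))

false_wins = 0
for _ in range(n_experiments):
    a = rng.binomial(1, true_rate, n_per_arm)   # per-visitor conversions, arm A
    b = rng.binomial(1, true_rate, n_per_arm)   # per-visitor conversions, arm B
    for n in look_points:
        if p_value(a[:n].sum(), n, b[:n].sum(), n) < alpha:
            false_wins += 1
            break

print(f"False 'wins' with peeking: {false_wins / n_experiments:.2f}")
# Roughly 0.2-0.3 in this setup, versus ~0.05 for a single fixed-horizon look.
```

The inflation grows with the number of looks, which is why a real stopping rule (or a proper sequential method) matters.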

The Cumulative Effect

These effects compound. A typical “bad practice” workflow might look like:

  • Run the test without a planned sample size
  • Check results daily and share screenshots in Slack
  • Look at 5+ metrics until one looks good
  • Stop the moment something looks "significant"

This workflow can create “winners” that disappear on re-test or fail to replicate in production.

How to Reduce False Positives

  • Calculate the required sample size upfront and commit to it (a sketch of the calculation follows this list)
  • Don't peek at results until you reach the planned sample size
  • Define one primary metric before the test starts
  • Use sequential testing if you must peek (Bayesian or SPRT)
  • Apply a Bonferroni correction when you evaluate multiple metrics
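Here is a rough planning sketch covering the first and last items (sample size upfront, Bonferroni for multiple metrics), using statsmodels. The baseline rate, minimum detectable effect, power, and metric count are illustrative assumptions, not recommendations.

```python
# Planning sketch: choose a baseline rate, a minimum detectable effect, power,
# and the number of metrics you will judge, then commit to the resulting
# sample size before launch. All numbers below are illustrative assumptions.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.10          # current conversion rate
target = 0.11            # smallest lift worth detecting (10% -> 11%)
alpha = 0.05
power = 0.80

# If several metrics will be judged, split alpha across them (Bonferroni).
n_metrics = 3
alpha_per_metric = alpha / n_metrics

effect = proportion_effectsize(target, baseline)   # Cohen's h for two proportions
n_per_arm = NormalIndPower().solve_power(
    effect_size=effect,
    alpha=alpha_per_metric,
    power=power,
    ratio=1.0,                                     # equal-sized arms
    alternative="two-sided",
)
print(f"Required visitors per arm: {int(round(n_per_arm)):,}")
```

The point is less the exact number than the commitment: whatever the calculation says, you run to that sample size and read the primary metric once.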

False-positive prevention checklist

Pick one primary metric

Decide the one metric that determines “win/lose” before you launch.

Commit to a stopping rule

Fixed-horizon (sample size upfront) or a true sequential method. Avoid "stop when p < 0.05."

Limit multiple comparisons

If you must check multiple metrics/variants, apply a correction or use hierarchical metrics.

Run A/A tests occasionally

A/A tests (the same experience served to both arms) help validate your pipeline and reveal bias or bugs.

If you’re seeing lots of “wins” that don’t replicate, also check for sample ratio mismatch (SRM).
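For the SRM point, a quick goodness-of-fit check is enough to flag a suspicious split. Here is a sketch using scipy's chi-square test, with made-up visitor counts for a planned 50/50 allocation.

```python
# SRM sketch: with a planned 50/50 split, a chi-square goodness-of-fit test
# flags traffic splits too lopsided to be chance. Counts are made up.
from scipy.stats import chisquare

visitors_a = 50_712      # visitors actually assigned to A (assumption)
visitors_b = 49_288      # visitors actually assigned to B (assumption)
total = visitors_a + visitors_b
expected = [total / 2, total / 2]   # the intended 50/50 allocation

stat, p_value = chisquare([visitors_a, visitors_b], f_exp=expected)
if p_value < 0.001:                 # a strict threshold is common for SRM checks
    print(f"Possible SRM (p = {p_value:.2g}): investigate before trusting results")
else:
    print(f"No SRM detected (p = {p_value:.2g})")
```

A failed SRM check usually points at the assignment or logging pipeline rather than the metric itself, and any "win" from that test should be treated as suspect.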

Run Valid Tests

ExperimentHQ uses proper statistical methods and warns you about peeking. Get results you can trust.
