
Statistical Significance in A/B Testing Explained

April 5, 2025
7 min read

You ran an A/B test. Variant B is up 12%. Is that real, or just noise? Statistical significance answers that question. Here is what it means, how it works, and the mistakes that trip up most teams.

What Statistical Significance Actually Means

A result is statistically significant when it is unlikely to have occurred by random chance alone. In A/B testing, the standard threshold is 95% confidence: if there were truly no difference between your variants, you would see a gap this large less than 5% of the time.

Think of it like flipping a coin. If you flip heads 6 times out of 10, that could easily be random. But if you flip heads 90 times out of 100, something is probably going on. Statistical significance quantifies that "probably."
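The coin-flip intuition can be made exact with a small binomial calculation. A sketch in Python (not tied to any particular testing tool):

```python
from math import comb

def binom_two_sided_p(heads: int, flips: int, p: float = 0.5) -> float:
    """Two-sided p-value for observing `heads` out of `flips` under a fair coin."""
    # Probability of each possible outcome k under the null hypothesis (fair coin)
    probs = [comb(flips, k) * p**k * (1 - p)**(flips - k) for k in range(flips + 1)]
    observed = probs[heads]
    # Sum the probabilities of every outcome at least as unlikely as the observed one
    return sum(pr for pr in probs if pr <= observed + 1e-12)

binom_two_sided_p(6, 10)    # well above 0.05 -- easily chance
binom_two_sided_p(90, 100)  # far below 0.05 -- something is going on
```

Six heads out of ten gives a p-value around 0.75, while ninety out of a hundred gives one vanishingly close to zero, which is exactly the "probably" being quantified.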

P-Values: The Simple Explanation

The p-value is the probability of seeing a result as extreme as yours if there were truly no difference between your variants.

p = 0.03

If there were truly no difference between the variants, a result this extreme would show up only 3% of the time. That is below the 5% cutoff, so it is statistically significant at the 95% level.

p = 0.15

If there were truly no difference, a result this extreme would show up 15% of the time. Not significant. Keep running until you hit your sample size target, or end the test and keep the original.

A low p-value does not tell you how big the effect is. It only tells you the result is unlikely to be random. Always look at effect size alongside significance.
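For conversion-rate comparisons, the usual workhorse is a two-proportion z-test. A minimal sketch with illustrative numbers (not any particular tool's implementation):

```python
from math import sqrt, erf

def two_proportion_p(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """Two-sided p-value for the difference between two conversion rates
    (two-proportion z-test with a pooled standard error)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = abs(p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution
    return 2 * (1 - 0.5 * (1 + erf(z / sqrt(2))))

# Illustrative: 300/10,000 (3.0%) vs 340/10,000 (3.4%)
p = two_proportion_p(300, 10_000, 340, 10_000)
```

With these numbers the relative lift looks healthy (about 13%), yet p comes out around 0.11, above the 0.05 cutoff, which is exactly why effect size and significance must be read together.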

How Many Visitors Do You Need?

The required sample size depends on three things: your baseline conversion rate, the minimum effect you want to detect, and the confidence level you require.

For example:

- Baseline conversion rate: 3%
- Minimum detectable effect (relative): 10%
- Visitors needed per variation: ~25k

Lower baseline rates and smaller effects require dramatically more traffic. If your site converts at 1% and you want to detect a 5% relative lift, you may need hundreds of thousands of visitors per variation.

Always calculate your required sample size before you start the test. Running a test without enough traffic is the most common reason tests fail to reach significance.
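The calculation itself is short. Here is a sketch of the standard two-proportion sample-size formula, assuming a two-sided test at 95% confidence and 80% power; different calculators bake in different defaults (power, one- vs two-sided), so their numbers will vary:

```python
from statistics import NormalDist

def sample_size_per_variant(baseline: float, relative_mde: float,
                            alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate visitors needed per variant to detect a relative lift
    over the baseline conversion rate (two-sided two-proportion z-test)."""
    p1 = baseline
    p2 = baseline * (1 + relative_mde)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for significance
    z_beta = NormalDist().inv_cdf(power)           # critical value for power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2
    return int(n) + 1

n = sample_size_per_variant(0.03, 0.10)  # 3% baseline, 10% relative lift
```

Try it with a 1% baseline and a 5% relative lift and the answer jumps into the hundreds of thousands per variation, which is why low-traffic sites struggle to detect small effects.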

5 Common Mistakes

1. Peeking at results

Checking your test daily and stopping when it looks good inflates your false positive rate. A test that shows p = 0.04 on day 3 may be p = 0.20 by day 14. Set a sample size target and wait.
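The inflation from peeking is easy to demonstrate with a simulation: run many A/A tests (no real difference between arms) and compare the false-positive rate when you check daily versus looking only once at the end. A sketch with illustrative traffic numbers:

```python
import random
from math import sqrt, erf

def p_value(c_a: int, n_a: int, c_b: int, n_b: int) -> float:
    """Two-sided two-proportion z-test p-value."""
    pooled = (c_a + c_b) / (n_a + n_b)
    if pooled in (0.0, 1.0):
        return 1.0  # no variation yet: nothing to test
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = abs(c_b / n_b - c_a / n_a) / se
    return 2 * (1 - 0.5 * (1 + erf(z / sqrt(2))))

def false_positive_rate(peek: bool, trials: int = 500, days: int = 14,
                        daily: int = 100, rate: float = 0.05) -> float:
    """Fraction of A/A tests (no real difference) declared 'significant'.
    peek=True stops at the first day the p-value dips below 0.05."""
    random.seed(0)
    hits = 0
    for _ in range(trials):
        c_a = c_b = n = 0
        significant = False
        for _ in range(days):
            n += daily
            c_a += sum(random.random() < rate for _ in range(daily))
            c_b += sum(random.random() < rate for _ in range(daily))
            if peek and p_value(c_a, n, c_b, n) < 0.05:
                significant = True
                break
        if not peek:
            significant = p_value(c_a, n, c_b, n) < 0.05
        hits += significant
    return hits / trials
```

With these settings, checking daily yields noticeably more false positives than a single final look, even though the two arms are identical by construction.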

2. Stopping too early

Early results are noisy. Small samples produce wild swings. Always run until you hit your pre-calculated sample size, and for at least two full weeks.

3. Ignoring statistical power

Power is the probability of detecting a real effect. Standard is 80%. Low power means you will miss real winners and declare "no difference" prematurely.

4. Running too many variants

Each additional variant increases the chance of a false positive. If you test 5 variants, you need to adjust your significance threshold (Bonferroni correction) or use a method designed for multiple comparisons.
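The Bonferroni correction itself is one line: divide your overall significance threshold by the number of comparisons. A sketch:

```python
def bonferroni_threshold(alpha: float, num_variants: int) -> float:
    """Adjusted per-comparison significance threshold when testing several
    variants against the same control (Bonferroni correction)."""
    comparisons = num_variants  # one comparison per variant vs. control
    return alpha / comparisons

# Testing 5 variants against control at an overall alpha of 0.05:
threshold = bonferroni_threshold(0.05, 5)  # each variant must beat p < 0.01
```

Bonferroni is conservative; it controls the overall false-positive rate at the cost of requiring more traffic per variant to reach the stricter threshold.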

5. Cherry-picking metrics

Decide on your primary metric before the test starts. If you check 20 metrics after the fact, one will be significant by pure chance.
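The "20 metrics" warning follows directly from the arithmetic: assuming the metrics are independent, the chance that at least one clears the 5% bar by luck alone is 1 − 0.95^20, roughly 64%.

```python
alpha = 0.05
metrics = 20
# Probability that at least one of 20 independent null metrics
# comes up "significant" by pure chance
p_any_false_positive = 1 - (1 - alpha) ** metrics  # ~0.64
```

In other words, checking 20 metrics after the fact makes a spurious "winner" more likely than not.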

When to Call a Test

1. You reached your pre-calculated sample size

This is the most important rule. No sample size, no conclusion.

2. The p-value is below 0.05 (or your chosen threshold)

You have a statistically significant winner. Implement it.

3. The p-value is above 0.05 at full sample size

No significant difference found. Keep the original and test something else.

4. The test has run for at least 1-2 full weeks

This accounts for day-of-week effects. Weekend traffic behaves differently from weekday traffic.
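The rules above collapse into a small checklist function. A sketch (the names and the 14-day minimum are illustrative, encoding the two-full-weeks rule):

```python
def can_call_test(n_per_variant: int, required_n: int,
                  p_value: float, days_run: int, alpha: float = 0.05) -> str:
    """Apply the checklist: enough sample and enough calendar time first,
    then read the p-value. Returns a short verdict."""
    if n_per_variant < required_n or days_run < 14:
        return "keep running"
    if p_value < alpha:
        return "significant: ship the winner"
    return "no significant difference: keep the original"

can_call_test(30_000, 25_000, p_value=0.03, days_run=14)
```

Note the ordering: sample size and duration are checked before the p-value is even looked at, which is what protects you from the peeking problem above.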

Stop Guessing, Start Knowing

ExperimentHQ calculates statistical significance automatically. You see a clear indicator when your test reaches confidence — no spreadsheets, no manual math, no guessing when to stop. Just reliable results you can act on.
