You ran an A/B test. Variant B is up 12%. Is that real, or just noise? Statistical significance answers that question. Here is what it means, how it works, and the mistakes that trip up most teams.
What Statistical Significance Actually Means
A result is statistically significant when it is unlikely to have occurred by random chance alone. In A/B testing, the standard threshold is 95% confidence — meaning a difference this large would appear less than 5% of the time if there were no real effect at all.
Think of it like flipping a coin. If you flip heads 6 times out of 10, that could easily be random. But if you flip heads 90 times out of 100, something is probably going on. Statistical significance quantifies that "probably."
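The coin-flip intuition can be checked directly with an exact binomial tail probability. A small illustrative script (the flip counts are just the examples above, not real data):

```python
from math import comb

def prob_at_least(k, n, p=0.5):
    """Probability of k or more successes in n fair-coin flips."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

print(round(prob_at_least(6, 10), 3))  # 6+ heads in 10 flips: easily happens by chance
print(prob_at_least(90, 100))          # 90+ heads in 100 flips: vanishingly rare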
P-Values: The Simple Explanation
The p-value is the probability of seeing a result as extreme as yours if there were truly no difference between your variants.
p = 0.03
If there were truly no difference, a result this extreme would show up only 3% of the time. That is below 5%, so it is statistically significant at the 95% level.
p = 0.15
If there were truly no difference, a result this extreme would still show up about 15% of the time. Not significant. Keep the test running until you reach your sample size target, or call it inconclusive.
A low p-value does not tell you how big the effect is. It only tells you the result is unlikely to be random. Always look at effect size alongside significance.
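Here is a minimal sketch of how a p-value for an A/B conversion test is typically computed, using a pooled two-proportion z-test (the visitor and conversion counts below are hypothetical):

```python
from math import sqrt, erfc

def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for the difference between two conversion rates,
    using a pooled two-proportion z-test (normal approximation)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    return erfc(abs(z) / sqrt(2))  # equals 2 * (1 - normal_cdf(|z|))

# Hypothetical test: 10,000 visitors per arm, 300 vs 360 conversions
p = two_proportion_p_value(300, 10_000, 360, 10_000)
```

Note that the function returns only the p-value; the effect size (here, 3.0% vs 3.6%) still has to be judged separately, as the paragraph above warns.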
How Many Visitors Do You Need?
The required sample size depends on three things: your baseline conversion rate, the minimum effect you want to detect, and the confidence level you require.
Baseline conversion rate: 3%
Minimum detectable effect (relative): 10%
Visitors per variation needed: ~50k (at 95% confidence and 80% power)
Lower baseline rates and smaller effects require dramatically more traffic, because the required sample size grows with the inverse square of the effect you want to detect. If your site converts at 1% and you want to detect a 5% relative lift, you may need hundreds of thousands of visitors per variation.
Always calculate your required sample size before you start the test. Running a test without enough traffic is the most common reason tests fail to reach significance.
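The calculation itself is straightforward. A sketch using the standard two-proportion sample-size formula, with z-constants hardcoded for a two-sided 95% confidence level and 80% power (the conventional defaults, as an assumption):

```python
from math import sqrt, ceil

def sample_size_per_arm(baseline, relative_lift, z_alpha=1.96, z_beta=0.8416):
    """Visitors needed per variation to detect a relative lift in conversion
    rate at 95% confidence (two-sided) and 80% power."""
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    delta = p2 - p1
    p_bar = (p1 + p2) / 2
    term = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
            + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2)))
    return ceil((term / delta) ** 2)

n = sample_size_per_arm(0.03, 0.10)  # 3% baseline, 10% relative lift ≈ 53,000
```

Halving the detectable effect to 5% roughly quadruples the required sample, which is why small lifts on low-traffic pages are so hard to verify.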
5 Common Mistakes
1. Peeking at results
Checking your test daily and stopping when it looks good inflates your false positive rate. A test that shows p = 0.04 on day 3 may be p = 0.20 by day 14. Set a sample size target and wait.
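To see why peeking inflates false positives, here is a small Monte Carlo sketch (all parameters hypothetical). It simulates A/A tests — both arms share the same true rate, so every "significant" result is a false positive — checks significance after each simulated day, and compares the ever-significant-while-peeking rate to the fixed-horizon rate:

```python
import random
from math import sqrt

random.seed(42)

def simulate_aa_test(days=14, visitors_per_day=2_000, rate=0.05):
    """One A/A test. Returns (significant at any daily peek, significant at the end)."""
    conv, n = [0.0, 0.0], [0, 0]
    peeked_significant = False
    z = 0.0
    for _ in range(days):
        for arm in (0, 1):
            # Normal approximation to the day's binomial conversion count
            mu = visitors_per_day * rate
            sigma = sqrt(visitors_per_day * rate * (1 - rate))
            conv[arm] += random.gauss(mu, sigma)
            n[arm] += visitors_per_day
        pooled = (conv[0] + conv[1]) / (n[0] + n[1])
        se = sqrt(pooled * (1 - pooled) * (1 / n[0] + 1 / n[1]))
        z = (conv[1] / n[1] - conv[0] / n[0]) / se
        if abs(z) > 1.96:
            peeked_significant = True
    return peeked_significant, abs(z) > 1.96

results = [simulate_aa_test() for _ in range(1_000)]
peek_rate = sum(p for p, _ in results) / len(results)
fixed_rate = sum(f for _, f in results) / len(results)
# fixed_rate stays near the nominal 5%; peek_rate is several times higher.
```

The fixed-horizon false positive rate lands near the nominal 5%, while the daily-peeking rate climbs well above it — the same data, judged repeatedly, finds "winners" in pure noise.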
2. Stopping too early
Early results are noisy. Small samples produce wild swings. Always run until you hit your pre-calculated sample size, and for at least one to two full weeks.
3. Ignoring statistical power
Power is the probability of detecting a real effect. Standard is 80%. Low power means you will miss real winners and declare "no difference" prematurely.
4. Running too many variants
Each additional variant increases the chance of a false positive. If you test 5 variants, you need to adjust your significance threshold (Bonferroni correction) or use a method designed for multiple comparisons.
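The inflation is easy to quantify for independent comparisons, along with the Bonferroni fix:

```python
def family_wise_error_rate(num_variants, alpha=0.05):
    """Chance of at least one false positive across independent comparisons."""
    return 1 - (1 - alpha) ** num_variants

def bonferroni_threshold(num_variants, alpha=0.05):
    """Per-comparison p-value threshold keeping the overall error rate near alpha."""
    return alpha / num_variants

fwer = family_wise_error_rate(5)     # ~0.226: nearly a 1-in-4 chance of a fluke "winner"
threshold = bonferroni_threshold(5)  # 0.01: require p < 0.01 for each of the 5 variants
```

Bonferroni is conservative; with many variants, a method built for multiple comparisons (as the paragraph above suggests) costs less statistical power.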
5. Cherry-picking metrics
Decide on your primary metric before the test starts. If you check 20 metrics after the fact, one will be significant by pure chance.
When to Call a Test
You reached your pre-calculated sample size
This is the most important rule. No sample size, no conclusion.
P-value is below 0.05 (or your chosen threshold)
You have a statistically significant winner. Implement it.
P-value is above 0.05 at full sample size
No significant difference found. Keep the original and test something else.
The test has run for at least 1-2 full weeks
This accounts for day-of-week effects. Weekend traffic behaves differently from weekday traffic.
Stop Guessing, Start Knowing
ExperimentHQ calculates statistical significance automatically. You see a clear indicator when your test reaches confidence — no spreadsheets, no manual math, no guessing when to stop. Just reliable results you can act on.