"We reached 95% statistical significance!" But what does that actually mean? Most marketers and product managers use this term without fully understanding the statistics behind it. This guide will make you fluent in the language of experimentation statistics.
What is Statistical Significance?
Statistical significance tells you how unlikely your observed results would be if there were no real difference between variants. When we say a result is "statistically significant at 95%," we mean:
"There's only a 5% probability that we'd see a difference this large (or larger) if there were actually no real difference between variants."
That 5% is the significance threshold (α = 0.05). The p-value is the probability your test actually computes; when it falls below the threshold, the result is declared significant, and a lower p-value means stronger evidence that your result is real.
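To make that concrete, here's a minimal sketch in Python (statsmodels, with made-up conversion counts) of a standard two-proportion z-test of the kind many frequentist A/B testing tools use:

```python
# A minimal sketch: p-value for a two-variant conversion test.
# The counts below are hypothetical, purely for illustration.
from statsmodels.stats.proportion import proportions_ztest

conversions = [1_000, 1_100]     # variant A, variant B
visitors = [20_000, 20_000]      # 5.0% vs 5.5% observed conversion

z_stat, p_value = proportions_ztest(conversions, visitors, alternative="two-sided")
print(f"z = {z_stat:.2f}, p-value = {p_value:.4f}")
# For these made-up counts the two-sided p-value lands around 0.025, under the
# 0.05 bar: a gap this large would be unusual if the variants truly performed
# the same, so we call the result significant at 95%.
```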
The Key Statistical Concepts
1. Confidence Level (typically 95%)
The probability that your confidence interval contains the true effect: if you ran the experiment many times, 95% of the intervals built this way would capture it. 95% is standard, but some teams use 90% for faster decisions or 99% for high-stakes tests.
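For illustration, here's a rough sketch (hypothetical numbers, a hand-rolled Wald interval via scipy) of a 95% confidence interval for the lift, and how the confidence level changes its width:

```python
# A rough sketch: 95% Wald confidence interval for the difference in
# conversion rates between two variants (hypothetical numbers).
from scipy.stats import norm

conv_a, n_a = 1_000, 20_000   # baseline: 5.0%
conv_b, n_b = 1_100, 20_000   # variant:  5.5%

p_a, p_b = conv_a / n_a, conv_b / n_b
diff = p_b - p_a
se = (p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b) ** 0.5

z = norm.ppf(0.975)           # 1.96 for a 95% confidence level
low, high = diff - z * se, diff + z * se
print(f"lift: {diff:.2%}, 95% CI: [{low:.2%}, {high:.2%}]")
# With a 90% confidence level (z = norm.ppf(0.95)) the interval narrows,
# trading certainty for speed; 99% widens it for high-stakes calls.
```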
2. Statistical Power (typically 80%)
The probability of detecting a real effect when one exists, of at least the size you planned for. 80% power means you have a 20% chance of missing a real winner (a false negative).
3. Minimum Detectable Effect (MDE)
The smallest improvement you want to be able to detect. Smaller MDE = more samples needed. Typically 10-20% relative improvement.
4. Baseline Conversion Rate
Your current conversion rate. Lower baseline = more samples needed to detect the same relative improvement.
Sample Size Reference Table
Here's roughly how many visitors you need in total (both variants combined, 50/50 split) at 95% confidence and 80% power:
| Baseline Rate | MDE (Relative) | Total Sample Size (both variants) |
|---|---|---|
| 1% | 20% | ~78,000 |
| 2% | 20% | ~38,000 |
| 5% | 20% | ~15,000 |
| 10% | 20% | ~7,200 |
| 5% | 10% | ~62,000 |
| 5% | 30% | ~6,700 |
* Two-tailed test with a 50/50 split. Divide by 2 for the approximate sample size per variant.
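If you'd rather compute your own numbers, here's a sketch using statsmodels' power utilities for a standard fixed-horizon two-proportion test. It returns a per-variant count (so compare it against half the totals above), and it won't exactly match every calculator, since tools differ slightly in the formulas they use.

```python
# A sketch of a per-variant sample size calculation with statsmodels.
# Exact numbers vary a little across calculators depending on the formula used.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.05          # current conversion rate
mde_relative = 0.20      # smallest lift worth detecting (20% relative)
target = baseline * (1 + mde_relative)

effect_size = proportion_effectsize(target, baseline)   # Cohen's h
n_per_variant = NormalIndPower().solve_power(
    effect_size=effect_size,
    alpha=0.05,            # 95% confidence, two-tailed
    power=0.80,            # 80% power
    ratio=1.0,             # 50/50 split
    alternative="two-sided",
)
print(f"~{n_per_variant:,.0f} visitors per variant")
```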
5 Statistical Mistakes That Ruin Tests
1. Peeking and Stopping Early
Checking results daily and stopping when you hit significance inflates your false positive rate from 5% to 30%+. Set a sample size and stick to it.
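Here's a small simulation sketch (numpy + statsmodels, hypothetical traffic numbers) of why peeking hurts: two identical variants, checked after every batch of visitors, stopped at the first "significant" look.

```python
# A quick simulation sketch of the peeking problem (hypothetical setup):
# an A/A test with no true difference, checked after every 1,000 visitors.
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

rng = np.random.default_rng(42)
n_experiments, looks, batch, rate = 2_000, 10, 1_000, 0.05

false_positives = 0
for _ in range(n_experiments):
    conv_a = rng.binomial(batch, rate, size=looks).cumsum()
    conv_b = rng.binomial(batch, rate, size=looks).cumsum()
    visitors = batch * np.arange(1, looks + 1)
    for ca, cb, n in zip(conv_a, conv_b, visitors):
        _, p = proportions_ztest([ca, cb], [n, n])
        if p < 0.05:            # stop the moment we "hit significance"
            false_positives += 1
            break

print(f"false positive rate with peeking: {false_positives / n_experiments:.1%}")
# With no true difference, a fixed-horizon test is wrong ~5% of the time;
# stopping at the first significant look pushes that well above 5%,
# and more frequent looks push it higher still.
```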
2. Ignoring Multiple Comparisons
Testing 20 metrics? At 95% confidence you should expect about one false positive on average, and there's a roughly 64% chance of at least one. Use a Bonferroni correction or designate one primary metric.
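Here's a sketch of what a Bonferroni correction looks like in practice, with made-up p-values for a handful of secondary metrics:

```python
# A sketch of a Bonferroni correction across several metrics.
# The p-values are hypothetical, purely for illustration.
from statsmodels.stats.multitest import multipletests

p_values = [0.03, 0.20, 0.004, 0.45, 0.06]   # one per secondary metric

reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method="bonferroni")
for p, p_adj, sig in zip(p_values, p_adjusted, reject):
    print(f"raw p={p:.3f}  adjusted p={p_adj:.3f}  significant: {sig}")
# Only metrics that survive the adjusted threshold (0.05 / number of tests)
# should be treated as wins; better yet, pre-register one primary metric.
```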
3. Confusing Statistical and Practical Significance
A 0.1% improvement can be statistically significant with enough data. Ask: "Is this improvement worth implementing?"
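A toy illustration of the gap: with enough (hypothetical) traffic, even a 0.1% relative lift clears the significance bar.

```python
# A toy illustration: a 0.1% relative lift can clear p < 0.05 given enough traffic.
# The traffic figures are hypothetical.
from scipy.stats import norm

n = 150_000_000                       # visitors per variant
p_a, p_b = 0.1000, 0.1001             # 10.00% vs 10.01% conversion (0.1% relative lift)

se = (p_a * (1 - p_a) / n + p_b * (1 - p_b) / n) ** 0.5
z = (p_b - p_a) / se
p_value = 2 * norm.sf(z)
print(f"z = {z:.2f}, p = {p_value:.4f}")
# Statistically significant, yet a 0.01-percentage-point lift may not be worth shipping.
```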
4. Not Running Full Business Cycles
Weekend visitors behave differently than weekday visitors. Run tests for at least 1-2 full weeks to capture all user segments.
5. Underpowered Tests
Running a test with 500 visitors when you need 5,000 means you'll miss real effects. Calculate sample size before starting.
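A sketch of that pre-launch power check, assuming a hypothetical 8% baseline and a 20% relative lift:

```python
# A sketch of checking power before launch: the same 20% relative lift,
# evaluated at 500 vs. 5,000 visitors per variant (hypothetical 8% baseline).
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

effect_size = proportion_effectsize(0.096, 0.08)   # 8% baseline, 20% relative lift
for n in (500, 5_000):
    power = NormalIndPower().power(effect_size=effect_size, nobs1=n,
                                   alpha=0.05, ratio=1.0, alternative="two-sided")
    print(f"n = {n:>5} per variant -> power = {power:.0%}")
# At 500 per variant the test catches a real 20% lift only ~15% of the time;
# at 5,000 it reaches roughly the 80% power you planned for.
```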
Frequentist vs. Bayesian Statistics
Frequentist (Traditional)
- Fixed sample size, then analyze
- "95% confident the true effect is in this range"
- Industry standard, well understood
Bayesian
- More tolerant of peeking at interim results (given a pre-agreed decision rule)
- "90% probability B is better than A"
- More intuitive interpretation
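For the curious, here's a minimal sketch of where a statement like "90% probability B is better than A" comes from: Beta posteriors with uniform priors over each variant's conversion rate, compared by Monte Carlo sampling (hypothetical counts).

```python
# A minimal Bayesian sketch: probability that B beats A, using Beta posteriors
# with uniform priors and hypothetical conversion counts.
import numpy as np

rng = np.random.default_rng(7)
conv_a, n_a = 1_000, 20_000
conv_b, n_b = 1_100, 20_000

# Posterior for each conversion rate: Beta(1 + conversions, 1 + non-conversions)
samples_a = rng.beta(1 + conv_a, 1 + n_a - conv_a, size=200_000)
samples_b = rng.beta(1 + conv_b, 1 + n_b - conv_b, size=200_000)

prob_b_beats_a = (samples_b > samples_a).mean()
print(f"P(B > A) = {prob_b_beats_a:.1%}")
```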
Our recommendation: For most teams, frequentist statistics work well. The key is picking one approach and using it consistently. ExperimentHQ uses frequentist statistics with built-in guardrails against common mistakes.
When to Call a Winner: The Checklist
Before you declare a variant the winner, check that:
- You calculated the required sample size up front and actually reached it (no stopping at the first significant peek).
- The test ran for at least 1-2 full business cycles, so weekday and weekend visitors are both represented.
- Your primary metric is significant at the confidence level you chose before the test started.
- Secondary metrics were either corrected for multiple comparisons or treated as directional only.
- The lift is large enough to be practically significant, not just statistically significant.
Master the Basics, Trust the Process
Statistical significance isn't magic—it's a framework for making decisions under uncertainty. By understanding these concepts, you'll avoid the most common pitfalls and run experiments that actually produce reliable insights.
Remember: a well-designed experiment with proper statistical rigor will always beat gut instinct. The numbers don't lie—but only if you collect and interpret them correctly.