What is Statistical Significance?
Statistical significance is a way of deciding whether an observed result — a difference between two groups, a treatment effect, a conversion rate improvement — is likely to be real, or whether it could easily have happened by random chance even if there were no real effect.
When we run an A/B test, clinical trial, or any experiment, we collect data from samples. Samples always vary by chance. The question is: is the difference we see large enough, relative to how much random variation we expect, that we can conclude the effect is real? That is what statistical significance tells us.
The conventional threshold is a p-value below 0.05 (5%). This means that if the null hypothesis were true (no real difference), there would be less than a 5% chance of seeing a result this extreme just by chance. By convention, 5% is considered small enough that we are willing to reject the null hypothesis and declare the result significant.
What is a p-value? Plain English Explanation
The p-value is the probability of seeing your data (or more extreme data) if there were truly no difference between the groups. It does NOT tell you the probability that your hypothesis is true — a common misconception.
- p = 0.03: Only a 3% chance of seeing this result if there were really no effect. Strong evidence against the null hypothesis. Significant at the 95% level.
- p = 0.08: An 8% chance of seeing this by chance. Not significant at p<0.05, but might be significant at p<0.10 depending on your chosen threshold.
- p = 0.45: A 45% chance of seeing this by chance. No meaningful evidence of a real difference.
The p-value alone does not tell you how large the effect is, how important it is in practice, or whether your study was designed well. It is one piece of evidence in a broader statistical analysis.
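To make the definition concrete, here is a rough Monte Carlo sketch (plain Python, standard library only) that estimates a p-value by brute force, using the numbers from the worked example further down: it assumes both groups truly share the pooled 5.5% conversion rate, re-runs the experiment many times under that null hypothesis, and counts how often a difference of at least 1 percentage point appears by chance alone.

```python
import random

# Monte Carlo illustration of what a p-value means: re-run the experiment many
# times under the null hypothesis (both groups share the same conversion rate)
# and count how often a difference at least as large as the observed one appears.
# Numbers come from the worked example below: 10,000 visitors per group,
# pooled rate 5.5%, observed difference of 1 percentage point.
random.seed(42)

n = 10_000            # visitors per group
p_null = 0.055        # shared conversion rate under the null hypothesis
observed_diff = 0.01  # observed difference (6.0% - 5.0%)

simulations = 2_000   # kept modest so the script finishes in seconds
extreme = 0
for _ in range(simulations):
    conv_a = sum(random.random() < p_null for _ in range(n))
    conv_b = sum(random.random() < p_null for _ in range(n))
    if abs(conv_b - conv_a) / n >= observed_diff:
        extreme += 1

print(f"Simulated two-tailed p-value: {extreme / simulations:.4f}")
# Expect a small value near 0.002 (noisy, since only 2,000 simulations are used).
```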
The Test Statistic Formula (Two-Proportion Z-Test)
For A/B testing with conversion rates (proportions), the two-proportion z-test is the standard approach. The formula for the test statistic (z-score) is:

z = (p₁ − p₂) ÷ √[ p̂(1 − p̂)(1/n₁ + 1/n₂) ]

where p̂ = (x₁ + x₂) ÷ (n₁ + n₂) (pooled proportion)
Where:
- p₁, p₂ = conversion rates of group A and group B
- n₁, n₂ = sample sizes of each group
- x₁, x₂ = number of conversions in each group
- p̂ = pooled (combined) conversion rate under the null hypothesis
The z-score is then compared to the standard normal distribution critical values:
- |z| ≥ 1.645: Significant at 90% confidence (two-tailed)
- |z| ≥ 1.960: Significant at 95% confidence (two-tailed)
- |z| ≥ 2.576: Significant at 99% confidence (two-tailed)
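Translated into code, the formula above fits in a few lines. The sketch below (plain Python, standard library only) defines a hypothetical helper, two_proportion_z_test, that returns the z-score and a two-tailed p-value computed from the normal distribution via math.erfc; the function name and structure are illustrative rather than taken from any particular library.

```python
import math

def two_proportion_z_test(x1: int, n1: int, x2: int, n2: int):
    """Pooled two-proportion z-test. Returns (z, two-tailed p-value)."""
    p1, p2 = x1 / n1, x2 / n2
    p_pooled = (x1 + x2) / (n1 + n2)      # p-hat under the null hypothesis
    se = math.sqrt(p_pooled * (1 - p_pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-tailed p-value: P(|Z| >= |z|) for a standard normal Z.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Compare |z| against the two-tailed critical values listed above:
# |z| >= 1.645 (90%), |z| >= 1.960 (95%), |z| >= 2.576 (99%)
```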
Worked Example: A/B Test Calculation
Control (A): 10,000 visitors, 500 conversions (5.0%). Test (B): 10,000 visitors, 600 conversions (6.0%).
Step 1: Calculate pooled proportion: p̂ = (500 + 600) ÷ (10,000 + 10,000) = 1,100 ÷ 20,000 = 0.055
Step 2: Calculate standard error: SE = √[0.055 × 0.945 × (1/10,000 + 1/10,000)] ≈ 0.00322
Step 3: Calculate z-score: z = (0.060 − 0.050) ÷ 0.00322 ≈ 3.10
Step 4: Find p-value. |z| = 3.10 > 2.576 → Significant at 99% confidence (two-tailed p ≈ 0.002)
Relative uplift: (6.0% − 5.0%) / 5.0% = +20%
95% CI for difference: 0.01 ± 1.96 × 0.003225 = [0.0037, 0.0163] = [0.37 pp, 1.63 pp]
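For readers who want to check the arithmetic, here is a short verification sketch in Python (standard library only) that reproduces the numbers above; the confidence interval uses the pooled standard error from Step 2, as in the article.

```python
import math

x_a, n_a = 500, 10_000   # control: 500 conversions out of 10,000 visitors
x_b, n_b = 600, 10_000   # test: 600 conversions out of 10,000 visitors

p_a, p_b = x_a / n_a, x_b / n_b
p_pooled = (x_a + x_b) / (n_a + n_b)                          # 0.055
se = math.sqrt(p_pooled * (1 - p_pooled) * (1/n_a + 1/n_b))   # ~0.00322
z = (p_b - p_a) / se                                          # ~3.10
p_value = math.erfc(abs(z) / math.sqrt(2))                    # ~0.002

uplift = (p_b - p_a) / p_a                                    # +20% relative uplift
ci_low = (p_b - p_a) - 1.96 * se                              # ~0.0037
ci_high = (p_b - p_a) + 1.96 * se                             # ~0.0163

print(f"z = {z:.2f}, p = {p_value:.4f}, uplift = {uplift:.0%}")
print(f"95% CI for the difference: [{ci_low:.4f}, {ci_high:.4f}]")
```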
95% Confidence Level Explained
A 95% confidence level means: if we ran the same experiment 100 times, 95 of those experiments would produce a confidence interval that contains the true population value. It does NOT mean there is a 95% probability the true value is in this specific interval — it either is or is not.
The 95% confidence level is the industry standard for most A/B testing, scientific research, and medical trials. Some industries use 99% (e.g. pharmaceutical trials) for higher certainty; some use 90% (e.g. early-stage product testing) to accept more risk.
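The "repeat the experiment" interpretation can be checked by simulation. The sketch below, a rough illustration in standard-library Python with an assumed true conversion rate of 5% and 2,000 visitors per simulated experiment, builds a 95% confidence interval for the observed rate in each of 1,000 runs and counts how often it covers the true value; the coverage should come out close to 95%.

```python
import math
import random

random.seed(0)
true_rate = 0.05      # assumed true conversion rate (illustrative)
n = 2_000             # visitors per simulated experiment
experiments = 1_000
covered = 0

for _ in range(experiments):
    conversions = sum(random.random() < true_rate for _ in range(n))
    p_hat = conversions / n
    se = math.sqrt(p_hat * (1 - p_hat) / n)          # normal-approximation SE
    low, high = p_hat - 1.96 * se, p_hat + 1.96 * se
    if low <= true_rate <= high:
        covered += 1

print(f"Intervals covering the true rate: {covered / experiments:.1%}")
# Expect a value near 95% (this simple Wald interval can run slightly low
# for small conversion rates).
```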
Type I and Type II Errors
There are two ways a significance test can lead you astray:
| Error Type | What Happened | Consequence | Controlled by |
|---|---|---|---|
| Type I Error (False Positive) | Declared significant, but no real effect | Implement a change that does not help | Significance level (α). At p<0.05, 5% Type I error rate. |
| Type II Error (False Negative) | Not significant, but a real effect exists | Miss a winning improvement | Statistical power (1−β). Standard is 80% power = 20% Type II rate. |
Lowering your significance threshold (e.g. using p<0.01 instead of p<0.05) reduces Type I errors but increases Type II errors, requiring larger sample sizes to compensate. There is always a trade-off.
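To put rough numbers on the trade-off, the sketch below uses the normal approximation and, as an illustrative scenario, the worked example's rates (5.0% vs 6.0% with 10,000 visitors per group) to estimate power, and hence the Type II error rate, at α = 0.05 versus α = 0.01.

```python
from math import sqrt
from statistics import NormalDist

norm = NormalDist()

def approx_power(p1, p2, n, alpha):
    """Approximate power of a two-tailed two-proportion z-test (normal approximation)."""
    se = sqrt(p1 * (1 - p1) / n + p2 * (1 - p2) / n)
    z_crit = norm.inv_cdf(1 - alpha / 2)
    # Probability of exceeding the critical value when the true difference is p2 - p1
    # (the contribution of the opposite tail is negligible and ignored here).
    return 1 - norm.cdf(z_crit - abs(p2 - p1) / se)

for alpha in (0.05, 0.01):
    power = approx_power(0.05, 0.06, 10_000, alpha)
    print(f"alpha = {alpha}: power ~ {power:.1%}, Type II error ~ {1 - power:.1%}")
# Lowering alpha from 0.05 to 0.01 reduces power (raises the Type II error rate)
# unless the sample size is increased to compensate.
```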
Statistical Significance in A/B Testing
In digital marketing and product development, A/B testing (split testing) is used to determine whether a change to a webpage, email, or product feature improves a key metric such as conversion rate, click-through rate, or revenue per user.
Best practices for statistically valid A/B tests:
- Determine the minimum sample size before running the test using a power calculation
- Do not stop the test early just because you see significance — this inflates false positive rates (the peeking problem; see the simulation sketch after this list)
- Run the test for at least one full business cycle (typically one or two weeks to account for weekday/weekend differences)
- Use 95% confidence (p<0.05) as the minimum threshold; 99% for high-stakes decisions
- Consider practical significance: a statistically significant 0.01% uplift may not justify the cost of implementation
- Segment your results: a test may be significant overall but driven by one user segment
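The peeking problem mentioned in the list above can be demonstrated with a short simulation (standard-library Python; the traffic per interim check and the number of looks are illustrative assumptions). Both variants are given the same true conversion rate, so every "significant" result is a false positive, and checking repeatedly pushes the false positive rate well above the nominal 5%.

```python
import math
import random

random.seed(1)

def two_tailed_p(x1, n1, x2, n2):
    """Pooled two-proportion z-test, two-tailed p-value."""
    p_pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(p_pooled * (1 - p_pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        return 1.0
    z = (x1 / n1 - x2 / n2) / se
    return math.erfc(abs(z) / math.sqrt(2))

true_rate = 0.05           # both variants identical: any "win" is a false positive
visitors_per_look = 2_000  # traffic added between interim checks (per variant)
looks = 5                  # number of times the experimenter "peeks"
simulations = 1_000

false_positives = 0
for _ in range(simulations):
    xa = xb = na = nb = 0
    for _ in range(looks):
        xa += sum(random.random() < true_rate for _ in range(visitors_per_look))
        xb += sum(random.random() < true_rate for _ in range(visitors_per_look))
        na += visitors_per_look
        nb += visitors_per_look
        if two_tailed_p(xa, na, xb, nb) < 0.05:
            false_positives += 1   # stopped early and declared a winner
            break

print(f"False positive rate with {looks} peeks: {false_positives / simulations:.1%}")
# Expect noticeably more than 5% -- typically above 10% for this setup.
```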
Statistical Significance in Medicine and Clinical Trials
In clinical research, statistical significance has life-or-death implications. Regulators such as the FDA and EMA (European Medicines Agency) generally require p<0.05 for primary endpoints, and typically expect the finding to be replicated, for example across two independent pivotal trials, before approval. Clinical significance (the size of the effect and its clinical meaningfulness) is given equal or greater weight.
The concept of statistical meaningfulness — also called clinical relevance or practical significance — asks: is the effect large enough to make a real difference to patients? A blood pressure drug that lowers systolic pressure by 0.5 mmHg may be statistically significant in a large trial but clinically meaningless. A 20 mmHg reduction would be both statistically and clinically significant.
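As a rough numerical sketch of that point (the sample sizes and the assumed systolic-pressure standard deviation of about 15 mmHg are illustrative assumptions, not data from any real trial), a two-sample z-test for means shows how a 0.5 mmHg reduction can cross p < 0.05 once the trial is large enough, even though the effect is clinically negligible:

```python
from math import sqrt
from statistics import NormalDist

norm = NormalDist()

def two_sample_p(diff_mmhg, sd_mmhg, n_per_arm):
    """Two-tailed p-value for a difference in means (known-variance z-test)."""
    se = sd_mmhg * sqrt(2 / n_per_arm)
    z = diff_mmhg / se
    return 2 * (1 - norm.cdf(abs(z)))

sd = 15.0  # assumed SD of systolic blood pressure in mmHg (illustrative)
for n in (200, 2_000, 20_000):
    p = two_sample_p(0.5, sd, n)
    print(f"n = {n:>6} per arm: 0.5 mmHg reduction -> p = {p:.3f}")
# With ~20,000 patients per arm the tiny 0.5 mmHg effect becomes "significant",
# while a 20 mmHg effect would be significant at any realistic sample size.
```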
Power Calculation: How Many Participants Do You Need?
Statistical power is the probability of detecting a real effect when one exists. Low-powered studies frequently fail to detect real effects (Type II error) and waste resources. The standard is 80% power, meaning 80% chance of detecting the effect if it is real.
Power depends on four factors: the effect size (how large the difference is), the sample size (more participants = more power), the significance threshold (α), and the variability of the outcome measure. The power calculation table above shows minimum sample sizes for common scenarios.
Rule of thumb for A/B tests (proportions): To detect a relative lift of 10% on a 5% base conversion rate (i.e. 5% to 5.5%), you need approximately 30,000 visitors per variant at 95% confidence and 80% power. Much smaller lifts or low base rates require very large samples.
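That rule of thumb can be reproduced with the standard sample-size approximation for two proportions. The sketch below (standard-library Python) implements the usual normal-approximation formula; for the 5.0% → 5.5% scenario it returns a figure close to the roughly 30,000 visitors per variant quoted above.

```python
from math import ceil, sqrt
from statistics import NormalDist

norm = NormalDist()

def sample_size_per_variant(p1, p2, alpha=0.05, power=0.80):
    """Approximate n per variant for a two-tailed two-proportion z-test."""
    z_alpha = norm.inv_cdf(1 - alpha / 2)   # 1.96 for 95% confidence
    z_beta = norm.inv_cdf(power)            # 0.84 for 80% power
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / (p1 - p2) ** 2)

# 5.0% baseline, 10% relative lift (to 5.5%), 95% confidence, 80% power
print(sample_size_per_variant(0.050, 0.055))   # roughly 31,000 per variant
```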
Frequently Asked Questions
What does statistically significant mean?
A result is statistically significant when it is unlikely to have occurred by chance alone. The standard threshold is p < 0.05 (5%), meaning that if there were no real effect, a difference this large would arise from random variation less than 5% of the time. In A/B testing, it means you can confidently say the two variants perform differently — though you should also consider whether the difference is large enough to be practically meaningful.
What is a p-value?
A p-value is the probability of observing a result at least as extreme as the one measured, assuming the null hypothesis (no real effect) is true. A p-value of 0.03 means there is a 3% chance of seeing such a large difference if the treatments were identical. Lower p-values = stronger evidence against the null hypothesis. Common thresholds: p < 0.10 (90% confidence), p < 0.05 (95%), p < 0.01 (99%).
What is the formula for statistical significance?
For two proportions (conversion rates): z = (p1 − p2) / √[p̂(1−p̂)(1/n1 + 1/n2)], where p̂ = (x1 + x2)/(n1 + n2) is the pooled proportion. Compare the z-score to 1.96 for 95% significance (two-tailed). If |z| ≥ 1.96, the result is statistically significant at the 95% level.
How do I know if my A/B test is significant?
Enter your control and test group visitors and conversions into this calculator. If the p-value is below 0.05 and the verdict shows "Statistically Significant", you can conclude the difference is unlikely to be due to chance. Also check: (1) your sample size is sufficient using the power table, (2) you ran the test for a full business cycle, and (3) the effect size is large enough to be practically useful.
What is the difference between one-tailed and two-tailed tests?
A two-tailed test checks for a difference in either direction (B could be better or worse than A). A one-tailed test only checks for an improvement in one direction (B is better than A). One-tailed tests are more powerful (easier to reach significance) but should only be used when you are certain the effect can only go one way. In most A/B testing scenarios, use two-tailed tests to avoid inflating false positive rates.
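For a symmetric test statistic the two are related by a simple factor of two; a minimal sketch using the z-score from the worked example above:

```python
from statistics import NormalDist

norm = NormalDist()
z = 3.10  # z-score from the worked example (B better than A)

p_one_tailed = 1 - norm.cdf(z)        # a result this far in ONE direction
p_two_tailed = 2 * (1 - norm.cdf(z))  # either direction: twice the one-tailed value

print(f"one-tailed p = {p_one_tailed:.4f}, two-tailed p = {p_two_tailed:.4f}")
# The one-tailed p-value is half the two-tailed one, which is why one-tailed
# tests reach "significance" more easily.
```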
What is statistical meaningfulness?
Statistical meaningfulness (also called practical or clinical significance) asks whether the observed effect is large enough to be important in practice. A very large sample can make a tiny 0.01% lift statistically significant (p < 0.05), but that microscopic improvement may not justify implementation costs or risk. Always consider effect size alongside p-values. Useful effect size measures include Cohen's d, relative risk, and odds ratios.
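To make "effect size alongside p-values" concrete, here is a small sketch computing a few effect-size measures for the worked example's conversion rates (5.0% vs 6.0%); it uses Cohen's h, the proportion analogue of Cohen's d, and the choice of measures is illustrative.

```python
from math import asin, sqrt

p_a, p_b = 0.05, 0.06   # conversion rates from the worked example

relative_lift = (p_b - p_a) / p_a                     # +20% relative uplift
relative_risk = p_b / p_a                             # 1.20
odds_ratio = (p_b / (1 - p_b)) / (p_a / (1 - p_a))    # ~1.21
cohens_h = 2 * (asin(sqrt(p_b)) - asin(sqrt(p_a)))    # ~0.044, a "small" effect

print(f"lift: {relative_lift:.0%}, RR: {relative_risk:.2f}, "
      f"OR: {odds_ratio:.2f}, Cohen's h: {cohens_h:.3f}")
# The result can be statistically significant (p ~ 0.002 here) while the effect
# size itself is small; whether a +20% relative lift matters is a business call.
```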