
Statistical Significance Calculator

Calculate p-value and statistical significance for your A/B test using a two-proportion z-test. Enter your control and test group data to find out if your result is statistically significant at 90%, 95% or 99% confidence — with relative uplift and confidence interval.

Reviewed by Mustafa Bilgic, Statistics Specialist

A/B Test Significance Calculator

Method: Two-proportion z-test. Works best when both groups have at least 30 conversions each. For smaller samples or non-conversion metrics, a t-test or chi-squared test may be more appropriate.

Power Calculation Table — Minimum Sample Size Required

How many participants do you need per group to detect an effect of a given size? The table below gives minimum sample sizes for two-tailed tests at 80% and 90% statistical power (80% is the standard in research).

Effect Size (Cohen's d) | Description | n per group (α = 0.05, 80% power) | n per group (α = 0.05, 90% power) | n per group (α = 0.01, 80% power) | n per group (α = 0.01, 90% power)
0.10 | Very small | 1,571 | 2,102 | 2,643 | 3,327
0.20 | Small | 394 | 527 | 662 | 832
0.30 | Small-medium | 176 | 235 | 295 | 371
0.50 | Medium | 64 | 85 | 107 | 135
0.80 | Large | 26 | 34 | 43 | 54
1.00 | Very large | 17 | 22 | 28 | 35
1.20 | Huge | 12 | 16 | 20 | 25

Cohen's d = (mean1 − mean2) / pooled standard deviation. Small = 0.2, Medium = 0.5, Large = 0.8 (Cohen, 1988). Total sample size = 2 × n per group.
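As a cross-check of the table, the usual normal-approximation formula for two groups is n per group ≈ 2 × ((z_α/2 + z_β) / d)². Below is a minimal Python sketch of that formula (the function name is ours, and scipy is assumed for the normal quantiles); published tables such as the one above may differ by a few participants because of small-sample corrections and rounding.

```python
import math
from scipy.stats import norm

def n_per_group(d: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate n per group for a two-sample, two-tailed comparison of means
    with standardised effect size Cohen's d."""
    z_alpha = norm.ppf(1 - alpha / 2)   # critical value for the two-tailed test
    z_beta = norm.ppf(power)            # quantile corresponding to the desired power
    return math.ceil(2 * ((z_alpha + z_beta) / d) ** 2)

print(n_per_group(0.2))   # 393 (the table's 394 adds a small correction)
```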

What is Statistical Significance?

Statistical significance is a way of deciding whether an observed result — a difference between two groups, a treatment effect, a conversion rate improvement — is likely to be real, or whether it could easily have happened by random chance even if there were no real effect.

When we run an A/B test, clinical trial, or any experiment, we collect data from samples. Samples always vary by chance. The question is: is the difference we see large enough, relative to how much random variation we expect, that we can conclude the effect is real? That is what statistical significance tells us.

The conventional threshold is a p-value below 0.05 (5%). This means: if the null hypothesis were true (no real difference), there would be less than a 5% chance of seeing a result this extreme just by chance. A probability of 5% is considered small enough that we are willing to reject the null hypothesis and declare the result significant.

What is a p-value? Plain English Explanation

The p-value is the probability of seeing your data (or more extreme data) if there were truly no difference between the groups. It does NOT tell you the probability that your hypothesis is true — a common misconception.

  • p = 0.03: Only a 3% chance of seeing this result if there were really no effect. Strong evidence against the null hypothesis. Significant at the 95% level.
  • p = 0.08: An 8% chance of seeing this by chance. Not significant at p<0.05, but it would count as significant if your chosen threshold were p<0.10.
  • p = 0.45: A 45% chance of seeing this by chance. No meaningful evidence of a real difference.

The p-value alone does not tell you how large the effect is, how important it is in practice, or whether your study was designed well. It is one piece of evidence in a broader statistical analysis.

The Test Statistic Formula (Two-Proportion Z-Test)

For A/B testing with conversion rates (proportions), the two-proportion z-test is the standard approach. The formula for the test statistic (z-score) is:

z = (p₁ − p₂) ÷ √[p̂ · (1 − p̂) · (1/n₁ + 1/n₂)]

where p̂ = (x₁ + x₂) ÷ (n₁ + n₂) (pooled proportion)

Where:

  • p₁, p₂ = conversion rates of group A and group B
  • n₁, n₂ = sample sizes of each group
  • x₁, x₂ = number of conversions in each group
  • p̂ = pooled (combined) conversion rate under the null hypothesis

The z-score is then compared to the standard normal distribution critical values:

  • |z| ≥ 1.645: Significant at 90% confidence (two-tailed)
  • |z| ≥ 1.960: Significant at 95% confidence (two-tailed)
  • |z| ≥ 2.576: Significant at 99% confidence (two-tailed)
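As a minimal sketch, the test above can be implemented in a few lines of Python (the function name is ours; scipy is assumed for the normal CDF):

```python
import math
from scipy.stats import norm

def two_proportion_ztest(x_a: int, n_a: int, x_b: int, n_b: int):
    """Two-tailed two-proportion z-test.
    x = conversions, n = visitors for control (a) and test (b)."""
    p_a, p_b = x_a / n_a, x_b / n_b
    p_pooled = (x_a + x_b) / (n_a + n_b)              # pooled proportion under the null
    se = math.sqrt(p_pooled * (1 - p_pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se       # sign only shows direction; |z| is what matters two-tailed
    p_value = 2 * (1 - norm.cdf(abs(z)))              # two-tailed p-value
    return z, p_value
```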

Worked Example: A/B Test Calculation

Control (A): 10,000 visitors, 500 conversions (5.0%). Test (B): 10,000 visitors, 600 conversions (6.0%).

Step 1: Calculate pooled proportion

p̂ = (500 + 600) / (10000 + 10000) = 1100 / 20000 = 0.055

Step 2: Calculate standard error

SE = √[0.055 × 0.945 × (1/10000 + 1/10000)] = √[0.00001040] = 0.003225

Step 3: Calculate z-score

z = (0.06 − 0.05) / 0.003225 = 0.01 / 0.003225 = 3.10

Step 4: Find the p-value. For |z| = 3.10, the two-tailed p-value ≈ 0.0019. |z| = 3.10 > 2.576 → Significant at 99% confidence

Relative uplift: (6.0% − 5.0%) / 5.0% = +20%

95% CI for difference: 0.01 ± 1.96 × 0.003225 = [0.0037, 0.0163] = [0.37 pp, 1.63 pp]
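Plugging the worked example into the two_proportion_ztest sketch from the formula section (assuming it has been run in the same session) reproduces these figures; the interval below uses the same pooled standard error as the article, and an unpooled standard error gives virtually the same result here.

```python
z, p_value = two_proportion_ztest(500, 10_000, 600, 10_000)
print(round(z, 2), round(p_value, 4))        # 3.1 0.0019

se = 0.003225                                # pooled standard error from Step 2
diff = 0.06 - 0.05
print(round(diff - 1.96 * se, 4), round(diff + 1.96 * se, 4))   # 0.0037 0.0163
```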

95% Confidence Level Explained

A 95% confidence level means: if we ran the same experiment 100 times, about 95 of those experiments would produce a confidence interval that contains the true population value. It does NOT mean there is a 95% probability the true value is in this specific interval — it either is or is not.

The 95% confidence level is the industry standard for most A/B testing, scientific research, and medical trials. Some industries use 99% (e.g. pharmaceutical trials) for higher certainty; some use 90% (e.g. early-stage product testing) to accept more risk.
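This coverage interpretation can be checked with a small simulation: repeatedly sample from a population with a known conversion rate, build a 95% interval each time, and count how often the interval contains the true value. A minimal sketch (sample sizes and seed are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
true_rate, n, trials = 0.05, 10_000, 2_000
covered = 0
for _ in range(trials):
    x = rng.binomial(n, true_rate)                 # simulated conversions
    p_hat = x / n
    se = (p_hat * (1 - p_hat) / n) ** 0.5          # standard error of the sample proportion
    lo, hi = p_hat - 1.96 * se, p_hat + 1.96 * se  # 95% confidence interval
    covered += lo <= true_rate <= hi
print(covered / trials)   # close to 0.95
```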

Type I and Type II Errors

There are two ways a significance test can lead you astray:

Error Type | What Happened | Consequence | Controlled By
Type I Error (False Positive) | Declared significant, but no real effect | Implement a change that does not help | Significance level (α). At p < 0.05, the Type I error rate is 5%.
Type II Error (False Negative) | Not significant, but a real effect exists | Miss a winning improvement | Statistical power (1 − β). The standard 80% power implies a 20% Type II error rate.

Lowering your significance threshold (e.g. using p<0.01 instead of p<0.05) reduces Type I errors but increases Type II errors, requiring larger sample sizes to compensate. There is always a trade-off.

Statistical Significance in A/B Testing

In digital marketing and product development, A/B testing (split testing) is used to determine whether a change to a webpage, email, or product feature improves a key metric such as conversion rate, click-through rate, or revenue per user.

Best practices for statistically valid A/B tests:

  • Determine the minimum sample size before running the test using a power calculation
  • Do not stop the test early just because you see significance — this inflates false positive rates (peeking problem; see the simulation sketch after this list)
  • Run the test for at least one full business cycle (typically one or two weeks to account for weekday/weekend differences)
  • Use 95% confidence (p<0.05) as the minimum threshold; 99% for high-stakes decisions
  • Consider practical significance: a statistically significant 0.01% uplift may not justify the cost of implementation
  • Segment your results: a test may be significant overall but driven by one user segment
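The peeking problem is easy to demonstrate by simulation. The sketch below runs A/A tests (no real difference) and checks for significance after every batch of visitors, stopping at the first p < 0.05; the batch size, number of looks, and function name are illustrative choices.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

def aa_test_false_positive(peeks: int = 20, batch: int = 500, rate: float = 0.05) -> bool:
    """Simulate one A/A test, checking significance after each batch of visitors.
    Returns True if any interim look is 'significant' at p < 0.05."""
    xa = xb = na = nb = 0
    for _ in range(peeks):
        xa += rng.binomial(batch, rate); na += batch
        xb += rng.binomial(batch, rate); nb += batch
        pooled = (xa + xb) / (na + nb)
        se = (pooled * (1 - pooled) * (1 / na + 1 / nb)) ** 0.5
        z = (xa / na - xb / nb) / se
        if 2 * (1 - norm.cdf(abs(z))) < 0.05:
            return True
    return False

runs = 2_000
print(sum(aa_test_false_positive() for _ in range(runs)) / runs)
# typically around 0.20-0.25, far above the nominal 0.05, despite no real effect
```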

Statistical Significance in Medicine and Clinical Trials

In clinical research, statistical significance has life-or-death implications. Regulators such as the FDA and EMA (European Medicines Agency) generally require p < 0.05 for primary endpoints, and typically expect the result to be replicated in a second pivotal trial or supported by substantially stronger evidence when only one trial is submitted. Clinical significance (the size of the effect and its clinical meaningfulness) is given equal or greater weight.

The concept of statistical meaningfulness — also called clinical relevance or practical significance — asks: is the effect large enough to make a real difference to patients? A blood pressure drug that lowers systolic pressure by 0.5 mmHg may be statistically significant in a large trial but clinically meaningless. A 20 mmHg reduction would be both statistically and clinically significant.

Power Calculation: How Many Participants Do You Need?

Statistical power is the probability of detecting a real effect when one exists. Low-powered studies frequently fail to detect real effects (Type II error) and waste resources. The standard is 80% power, meaning 80% chance of detecting the effect if it is real.

Power depends on four factors: the effect size (how large the difference is), the sample size (more participants = more power), the significance threshold (α), and the variability of the outcome measure. The power calculation table above shows minimum sample sizes for common scenarios.

Rule of thumb for A/B tests (proportions): To detect a relative lift of 10% on a 5% base conversion rate (i.e. 5% to 5.5%), you need approximately 30,000 visitors per variant at 95% confidence and 80% power. Much smaller lifts or low base rates require very large samples.
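That rule of thumb follows from the standard normal-approximation sample-size formula for two proportions. A minimal sketch (function name ours; dedicated calculators may differ slightly because of continuity corrections and rounding):

```python
import math
from scipy.stats import norm

def n_per_variant(p1: float, p2: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate visitors needed per variant to detect a change from p1 to p2
    with a two-tailed two-proportion z-test."""
    z_a, z_b = norm.ppf(1 - alpha / 2), norm.ppf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_a + z_b) ** 2 * variance / (p2 - p1) ** 2)

print(n_per_variant(0.05, 0.055))   # about 31,000 per variant for a 5% -> 5.5% lift
```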

Frequently Asked Questions

What does statistically significant mean?

A result is statistically significant when it is unlikely to have occurred by chance alone. The standard threshold is p < 0.05 (5%), meaning that if there were no real effect, a difference this large would occur by chance less than 5% of the time. In A/B testing, it means you can confidently say the two variants perform differently — though you should also consider whether the difference is large enough to be practically meaningful.

What is a p-value?

A p-value is the probability of observing a result at least as extreme as the one measured, assuming the null hypothesis (no real effect) is true. A p-value of 0.03 means there is a 3% chance of seeing such a large difference if the treatments were identical. Lower p-values = stronger evidence against the null hypothesis. Common thresholds: p < 0.10 (90% confidence), p < 0.05 (95%), p < 0.01 (99%).

What is the formula for statistical significance?

For two proportions (conversion rates): z = (p1 − p2) / √[p̂(1−p̂)(1/n1 + 1/n2)], where p̂ = (x1 + x2)/(n1 + n2) is the pooled proportion. Compare the z-score to 1.96 for 95% significance (two-tailed). If |z| ≥ 1.96, the result is statistically significant at the 95% level.

How do I know if my A/B test is significant?

Enter your control and test group visitors and conversions into this calculator. If the p-value is below 0.05 and the verdict shows "Statistically Significant", you can conclude the difference is unlikely to be due to chance. Also check: (1) your sample size is sufficient using the power table, (2) you ran the test for a full business cycle, and (3) the effect size is large enough to be practically useful.

What is the difference between one-tailed and two-tailed tests?

A two-tailed test checks for a difference in either direction (B could be better or worse than A). A one-tailed test only checks for an improvement in one direction (B is better than A). One-tailed tests are more powerful (easier to reach significance) but should only be used when you are certain the effect can only go one way. In most A/B testing scenarios, use two-tailed tests to avoid inflating false positive rates.

What is statistical meaningfulness?

Statistical meaningfulness (also called practical or clinical significance) asks whether the observed effect is large enough to be important in practice. A very large sample can make a tiny 0.01% lift statistically significant (p < 0.05), but that microscopic improvement may not justify implementation costs or risk. Always consider effect size alongside p-values. Useful effect size measures include Cohen's d, relative risk, and odds ratios.

Common Statistical Tests: Which One to Use?

The two-proportion z-test used by this calculator is the correct choice when comparing conversion rates or proportions between two groups. However, different types of data and research questions call for different tests.

Situation | Correct Test | Example Use Case
Comparing two proportions / conversion rates | Two-proportion z-test (this calculator) | A/B test: Does version B have a higher click-through rate than A?
Comparing means of two independent groups, large sample | Two-sample z-test for means | Do men and women differ in average spending?
Comparing means of two independent groups, small sample | Independent samples t-test | Does drug A reduce blood pressure more than drug B? (n < 30)
Comparing means before and after treatment (same subjects) | Paired samples t-test | Does a training programme improve test scores?
Comparing three or more group means | ANOVA (Analysis of Variance) | Do four different pricing strategies lead to different revenue?
Association between two categorical variables | Chi-squared test | Is there a relationship between gender and product preference?
Non-normal data, comparing two groups | Mann-Whitney U test | Comparing customer satisfaction ratings (ordinal data)
Correlation between two continuous variables | Pearson or Spearman correlation | Is page load time related to bounce rate?
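Most of the tests in the table are available in scipy.stats (the two-proportion z-test itself is provided by statsmodels as proportions_ztest). A brief sketch of the corresponding calls, with placeholder data:

```python
import numpy as np
from scipy import stats

a = np.array([5.1, 4.8, 6.0, 5.5])   # placeholder samples
b = np.array([4.2, 4.9, 5.1, 4.4])

stats.ttest_ind(a, b)                         # independent samples t-test
stats.ttest_rel(a, b)                         # paired samples t-test (same subjects)
stats.f_oneway(a, b, b + 1)                   # one-way ANOVA for three or more groups
stats.chi2_contingency([[30, 70], [45, 55]])  # chi-squared test on a 2x2 contingency table
stats.mannwhitneyu(a, b)                      # Mann-Whitney U test for non-normal data
stats.pearsonr(a, b)                          # Pearson correlation
stats.spearmanr(a, b)                         # Spearman rank correlation
```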

Bayesian vs Frequentist A/B Testing

The calculator above uses the frequentist approach (null hypothesis significance testing). An alternative is Bayesian A/B testing, which calculates the probability that variant B is better than A, given the data observed.

Frequentist (this calculator):

  • Returns a p-value: probability of seeing the data if there were no effect
  • Clear threshold (p<0.05 = significant)
  • More established in science and regulation
  • Subject to peeking problems if you check results early

Bayesian approach:

  • Returns a probability that B is better than A (e.g. "96% chance B beats A")
  • Incorporates prior knowledge (prior probability)
  • More intuitive for business decisions
  • Less sensitive to the peeking problem
  • Results depend on the prior, which can be subjective

Both approaches are valid. Many A/B testing platforms (such as VWO and Optimizely, and formerly Google Optimize) offer both methods. For regulatory submissions (pharmaceutical, clinical), frequentist testing with pre-specified hypotheses and sample sizes remains the standard.
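As an illustration of the Bayesian version, conversion data is commonly modelled with Beta posteriors: under a uniform Beta(1, 1) prior, each variant's posterior is Beta(1 + conversions, 1 + non-conversions), and the probability that B beats A can be estimated by Monte Carlo sampling. A minimal sketch using the worked example's numbers (the prior and sample count are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(42)

# Posterior parameters with a uniform Beta(1, 1) prior
a_conv, a_n = 500, 10_000
b_conv, b_n = 600, 10_000
post_a = rng.beta(1 + a_conv, 1 + a_n - a_conv, size=100_000)
post_b = rng.beta(1 + b_conv, 1 + b_n - b_conv, size=100_000)

prob_b_beats_a = (post_b > post_a).mean()
print(prob_b_beats_a)   # close to 1 for this data: B is almost certainly better
```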

Multiple Testing Problem

If you run many A/B tests or check many metrics simultaneously, you will see false positives simply by chance. With a 5% significance level, 1 in 20 comparisons will be "significant" even if all null hypotheses are true. This is the multiple testing problem.

Common corrections include:

  • Bonferroni correction: Divide the significance level by the number of tests. Testing 10 metrics at once: use p<0.005 instead of p<0.05.
  • Benjamini-Hochberg (FDR control): Controls the False Discovery Rate — less conservative than Bonferroni, preferred for many simultaneous tests.
  • Sequential testing / alpha spending: Allows early stopping while controlling overall error rate. Used in adaptive clinical trials.

In practice: pre-specify one primary metric, use secondary metrics for exploration only, and apply corrections if you must test multiple primary outcomes.
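For illustration, both corrections are available in statsmodels via multipletests; the p-values below are made-up examples.

```python
from statsmodels.stats.multitest import multipletests

p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.36]

reject_bonf, p_bonf, _, _ = multipletests(p_values, alpha=0.05, method="bonferroni")
reject_bh, p_bh, _, _ = multipletests(p_values, alpha=0.05, method="fdr_bh")

print(reject_bonf.sum())   # 1 with these example p-values: only the strongest survives
print(reject_bh.sum())     # 2 here: Benjamini-Hochberg is less conservative
```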

Effect Size Measures

Alongside p-values, effect size measures tell you how large the difference is — independent of sample size. The most common measures are:

  • Cohen's d: Standardised mean difference. Small = 0.2, Medium = 0.5, Large = 0.8. Used for comparing means.
  • Relative risk (RR): Ratio of event rates. RR = 2 means the event is twice as likely in one group. Common in medicine.
  • Odds ratio (OR): Ratio of odds. OR = 1 means no effect. Widely used in logistic regression and epidemiology.
  • Relative uplift (%): (p2 − p1) / p1 × 100. Used in marketing A/B tests. A 20% relative uplift on a 5% conversion rate means the new rate is 6%.
  • Absolute risk reduction (ARR): p1 − p2. The direct difference in rates. Useful for understanding practical impact: a 5% ARR in a clinical trial means 5 fewer events per 100 treated.
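A minimal sketch computing several of these measures from the worked example's figures (500/10,000 versus 600/10,000); Cohen's d is omitted because it needs group means and standard deviations rather than conversion counts.

```python
x_a, n_a = 500, 10_000    # control conversions / visitors
x_b, n_b = 600, 10_000    # test conversions / visitors
p_a, p_b = x_a / n_a, x_b / n_b

relative_risk = p_b / p_a                           # 1.2: conversion is 1.2x as likely in B
odds_ratio = (p_b / (1 - p_b)) / (p_a / (1 - p_a))  # about 1.21
relative_uplift = (p_b - p_a) / p_a * 100           # +20%
absolute_difference = p_b - p_a                     # 0.01 = 1 percentage point

print(round(relative_risk, 2), round(odds_ratio, 2),
      round(relative_uplift, 1), round(absolute_difference, 3))
```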