Test statistical significance of two samples. Part of the DevTools Surf developer suite. Browse more tools in the Statistics collection.
Use Cases
Test whether two sample populations have significantly different means.
Determine whether an A/B test result is statistically significant before shipping a variant.
Validate that a clinical trial result is not due to random chance.
Compare conversion rates between a control group and multiple treatment groups.
Tips
Set your significance level (alpha) before seeing the data — post-hoc alpha selection is p-hacking, which inflates false positive rates.
Statistical significance is not the same as practical significance — a p-value of 0.001 on a 0.1% conversion lift is significant but probably not worth shipping.
For large samples (n > 1000), almost any difference will be statistically significant — always pair significance with effect size to assess practical importance.
Fun Facts
The 0.05 significance threshold was proposed by Ronald Fisher in his 1925 'Statistical Methods for Research Workers' as a convenient conventional cutoff — not a scientifically derived threshold. Fisher himself later cautioned against its rigid application.
A 2015 analysis of psychology journals found that 96% of papers with significant results used a p-value threshold of exactly 0.05 — evidence of publication bias and p-hacking.
The American Statistical Association issued a rare public statement in 2016 warning against misuse of p-values, stating explicitly that 'scientific conclusions should not be based only on whether a p-value passes a specific threshold.'
FAQ
What is a p-value?
The probability of observing a result at least as extreme as the measured one, assuming the null hypothesis is true. A p-value of 0.05 means a 5% chance of the observed difference occurring by chance if there is no real effect.
What's the difference between one-tailed and two-tailed tests?
A two-tailed test detects differences in either direction (A > B or A < B). A one-tailed test only detects differences in the specified direction. Use two-tailed by default unless you have a strong prior reason to test only one direction.