Calculate A/B test results and statistical significance. Part of the DevTools Surf developer suite. Browse more tools in the Statistics collection.
Use Cases
Determine whether a landing page headline change lifts conversions
Validate checkout flow redesigns before full rollout
Measure statistical significance of email subject line experiments
Evaluate feature flag impact on retention metrics
Tips
Enter raw visitor and conversion counts, not rates, for the most accurate significance calculation; the sketch after these tips shows how those counts feed the underlying test
Check the confidence level selector: 95% is the industry standard, but safety-critical tests should use 99%
Run the test until the recommended sample size is reached before reading the result; peeking early inflates false positives
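For the curious, here is a minimal sketch, in Python with SciPy, of the two-proportion z-test that calculators like this one typically run on raw counts. The function name and the example figures are illustrative assumptions, not part of the DevTools Surf tool.

```python
# Illustrative sketch of a two-sided, two-proportion z-test on raw counts.
# Names and example numbers are assumptions, not a real tool API.
from math import sqrt
from scipy.stats import norm

def two_proportion_z_test(visitors_a, conversions_a, visitors_b, conversions_b):
    """Return (z, p_value) for a two-sided test of equal conversion rates."""
    p_a = conversions_a / visitors_a
    p_b = conversions_b / visitors_b
    # Pooled rate under the null hypothesis that both variants convert equally.
    p_pool = (conversions_a + conversions_b) / (visitors_a + visitors_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / visitors_a + 1 / visitors_b))
    z = (p_b - p_a) / se
    p_value = 2 * norm.sf(abs(z))  # two-sided p-value
    return z, p_value

# Example: 10,000 visitors per variant, 500 vs 580 conversions.
z, p = two_proportion_z_test(10_000, 500, 10_000, 580)
print(f"z = {z:.2f}, p = {p:.4f}")       # roughly z = 2.50, p = 0.012
print("significant at 95%?", p < 0.05)   # True
print("significant at 99%?", p < 0.01)   # False
```

Note how the same result clears the 95% bar but not the 99% bar, which is exactly why the confidence level selector matters.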
Fun Facts
Google ran its first A/B test back in 2000, and its later experiment testing 41 shades of blue to optimize ad click-through rates is now famous in UX lore.
The statistical test underlying most A/B tools is a two-proportion z-test; for a simple A/B comparison it is equivalent to Karl Pearson's chi-squared test, published in 1900, long before the web existed.
Microsoft's experimentation platform runs more than 10,000 A/B tests a year across its products, making it one of the largest continuous testing operations in the world.
FAQ
What p-value should I use for A/B tests?
p < 0.05 (95% confidence) is the standard threshold for most product experiments. Financial or medical decisions warrant p < 0.01. Using a looser threshold increases the risk of shipping ineffective changes.
How long should I run an A/B test?
Until you reach the pre-calculated sample size for your desired statistical power (typically 80%). Stopping early when results look positive — called 'peeking' — inflates false positives significantly.
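Here is a minimal sketch of that pre-test sample-size calculation, assuming a two-sided two-proportion test; the baseline rate and minimum detectable effect below are made-up inputs you would replace with your own.

```python
# Illustrative sample-size calculation for a two-proportion A/B test.
# baseline_rate and min_detectable_effect are choices made before the test.
from math import sqrt, ceil
from scipy.stats import norm

def sample_size_per_variant(baseline_rate, min_detectable_effect,
                            alpha=0.05, power=0.80):
    """Visitors needed in EACH variant to detect the given absolute lift."""
    p1 = baseline_rate
    p2 = baseline_rate + min_detectable_effect
    p_bar = (p1 + p2) / 2
    z_alpha = norm.ppf(1 - alpha / 2)  # two-sided significance level
    z_beta = norm.ppf(power)           # desired statistical power
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / (p2 - p1) ** 2)

# Example: 5% baseline conversion rate, looking for a 1-point absolute lift.
print(sample_size_per_variant(0.05, 0.01))  # roughly 8,160 visitors per variant
```

Run the test until each variant has reached that count, then read the result once.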
What is the difference between statistical and practical significance?
Statistical significance means the result is unlikely due to chance. Practical significance means the effect size is large enough to matter to your business. A test can be statistically significant but practically irrelevant if the lift is 0.01%.
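A small sketch of how the two questions separate in practice, assuming a normal-approximation confidence interval for the lift; the 0.5-point "minimum worthwhile lift" is an illustrative business threshold, not a statistical rule.

```python
# Illustrative check: is a lift both statistically and practically significant?
from math import sqrt
from scipy.stats import norm

def lift_with_ci(visitors_a, conversions_a, visitors_b, conversions_b, conf=0.95):
    """Absolute lift of B over A with a normal-approximation confidence interval."""
    p_a = conversions_a / visitors_a
    p_b = conversions_b / visitors_b
    se = sqrt(p_a * (1 - p_a) / visitors_a + p_b * (1 - p_b) / visitors_b)
    z = norm.ppf(1 - (1 - conf) / 2)
    lift = p_b - p_a
    return lift, (lift - z * se, lift + z * se)

# A very large test can make a tiny lift significant: 1,000,000 visitors per arm.
lift, (lo, hi) = lift_with_ci(1_000_000, 50_000, 1_000_000, 51_000)
print(f"lift = {lift:.2%}, 95% CI = ({lo:.2%}, {hi:.2%})")  # CI excludes zero
MIN_WORTHWHILE_LIFT = 0.005  # assumed business threshold: 0.5 percentage points
print("practically significant?", lo >= MIN_WORTHWHILE_LIFT)  # False
```

The interval excludes zero, so the lift is statistically significant, yet a 0.1-point gain may still be too small to justify shipping the change.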