Module 11 — Conversion Rate Optimization & UX Testing
When to Stop Testing — Statistical Significance for Operators
11 min · interactive · Intermediate
You run a test for 2 weeks, see a 1.0% lift, and declare victory. But the math says you need 4 weeks of data to be 80% confident. You ship, the lift reverts to the control's level, and you have wasted 2 weeks. This lesson covers the actual math: chi-squared, sample size, and minimum detectable effect, without the jargon.
The statistical significance problem
Most operators either:
- Test forever (chasing 95% confidence on 50 orders per variation)
- Test too short (ship on 10 orders per variation)
Neither works. You need a middle ground: 80% confidence with 200-400 total orders.
The actual math (simplified)
You run a test with two variations:
- Control: 200 visitors, 5 orders (2.5% conversion)
- Test: 200 visitors, 6 orders (3.0% conversion)
Did the test win, or is the difference random noise?
Use the chi-squared test (simpler to run and interpret than a Bayesian analysis).
Chi-squared formula (the practical version)
You do not need to calculate this. Use this free tool: https://www.evanmiller.org/ab-testing/chi-squared.html
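For reference, the statistic behind calculators like this one is Pearson's chi-squared, summed over the four cells of the 2×2 table (variation × converted / not converted):

χ² = Σ (observed − expected)² / expected

where the expected counts come from pooling both variations. The bigger χ² gets, the less likely the split is pure chance.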
Input:
- Control: 200 visitors, 5 conversions (the number of orders, not the rate)
- Variation: 200 visitors, 6 conversions
- Hit "Calculate"
Output:
- P-value: ~0.76
- Confidence: ~24% (far below 80%, NOT a win)
Interpretation: At this sample size, a 0.5-point difference is indistinguishable from random noise. You are nowhere near the 80% threshold, so this is not a win. Keep the control.
Same test, more data, different result
- Control: 800 visitors, 20 conversions (2.5%)
- Test: 800 visitors, 32 conversions (4.0%)
Chi-squared output:
- P-value: ~0.09
- Confidence: ~91%
Interpretation: You can now be roughly 91% confident the variation is genuinely better. That clears the 80% threshold: this is a win. Ship the test.
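If you would rather script this than paste numbers into the web tool, here is a minimal sketch using scipy that reproduces both results above. The ab_significance helper is illustrative, not a standard API; Yates' continuity correction is disabled to match the plain chi-squared formula, so the web tool's output may differ slightly.

```python
# pip install scipy
from scipy.stats import chi2_contingency

def ab_significance(control_visitors, control_orders, test_visitors, test_orders):
    # 2x2 contingency table: [converted, did not convert] per variation
    table = [
        [control_orders, control_visitors - control_orders],
        [test_orders, test_visitors - test_orders],
    ]
    # correction=False: plain Pearson chi-squared, matching the formula above
    _, p_value, _, _ = chi2_contingency(table, correction=False)
    return p_value, 1 - p_value

for args in [(200, 5, 200, 6), (800, 20, 800, 32)]:
    p, conf = ab_significance(*args)
    print(f"p = {p:.2f}, confidence = {conf:.0%}")
# p = 0.76, confidence = 24%  -> keep the control
# p = 0.09, confidence = 91%  -> ship the test
```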
Minimum Detectable Effect (MDE)
MDE is the smallest conversion lift you care about detecting.
For 200 total orders (100 control, 100 test):
- At 80% confidence, your MDE is ~1.6% absolute lift
- At 85% confidence, your MDE is ~1.8% absolute lift (demanding more confidence means you can only detect bigger lifts)
What this means: If your test shows a 1.5% lift, you cannot confidently declare it a win with 200 orders. You need either:
- More orders (400+), OR
- A higher observed lift (2.0%+)
Sample size: how many orders do you need?
The more orders you collect, the smaller the lift you can detect.
| Total Orders | Minimum detectable lift @ 80% (absolute) | Minimum detectable lift @ 85% (absolute) |
|---|---|---|
| 100 | 2.3% | 2.5% |
| 200 | 1.6% | 1.8% |
| 400 | 1.1% | 1.2% |
| 1,000 | 0.65% | 0.75% |
| 2,000 | 0.45% | 0.5% |
Practical rule:
- < 200 orders total: your MDE is 1.6%+. Only ship if you see > 2.0% lift.
- 200-400 orders: your MDE is 1.1-1.6%. Ship if you see > 1.5% lift.
- > 400 orders: your MDE is 1.1% or lower. Ship if you see > 1.0% lift.
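If you want numbers for your own traffic rather than the table's round figures, here is a minimal sketch using the standard two-proportion formula. Note that it works from visitors per variation and a baseline conversion rate, not orders; divide your order counts by your conversion rate to translate. The mde() helper, the 2.5% baseline, and the 4,000-visitor example are illustrative assumptions.

```python
# pip install scipy
from scipy.stats import norm

def mde(visitors_per_variation: int, baseline: float,
        confidence: float = 0.80, power: float = 0.80) -> float:
    """Smallest absolute lift a two-proportion test can reliably detect."""
    z_alpha = norm.ppf(1 - (1 - confidence) / 2)   # two-sided significance
    z_beta = norm.ppf(power)                       # sensitivity (power)
    se = (2 * baseline * (1 - baseline) / visitors_per_variation) ** 0.5
    return (z_alpha + z_beta) * se

# Illustrative: 4,000 visitors per variation at a 2.5% baseline
print(f"MDE: {mde(4_000, 0.025):.2%} absolute")    # ~0.74%
```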
The decision framework
Week 1-2: Run the test
- Control: 50% traffic
- Test: 50% traffic
- Collect minimum 100 orders per variation (200 total)
Week 3: Analyze the results
- Calculate chi-squared p-value at https://www.evanmiller.org/ab-testing/chi-squared.html
- If p-value < 0.2 (> 80% confidence) AND observed lift > 1.5%, SHIP
- If p-value > 0.2 (< 80% confidence) AND observed lift < 1.5%, REVERT to control
- If borderline (p-value 0.15-0.25 AND lift 1.0-1.8%), run for one more week
Week 4: Final decision
- Collect another 50-100 orders per variation
- Re-calculate p-value
- Make final decision (ship or revert)
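The week-3 rules condense to a few lines of code. This sketch encodes this module's thresholds; giving the ship rule precedence over the borderline rule, and falling through to revert in every other case, is one conservative reading of the rules above.

```python
def week3_decision(p_value: float, lift_points: float) -> str:
    """Apply the module's week-3 rules; lift is in absolute percentage points."""
    if p_value < 0.2 and lift_points > 1.5:
        return "ship"                            # confident and big enough
    if 0.15 <= p_value <= 0.25 and 1.0 <= lift_points <= 1.8:
        return "borderline: run one more week"   # the gray zone above
    return "revert to control"                   # conservative fall-through

print(week3_decision(0.09, 2.0))   # ship
print(week3_decision(0.18, 1.2))   # borderline: run one more week
print(week3_decision(0.40, 0.5))   # revert to control
```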
Common statistical mistakes
Mistake 1: "Peeking" at results and stopping early
Wrong: Run the test for 1 week, see a 2.0% lift, declare victory.
Why wrong: Stopping as soon as the numbers look good oversamples lucky streaks (selection bias).
Right: Pre-commit to 2 weeks. Do not look at results until day 14.
Mistake 2: Running sequential tests without correction
Wrong: Test 1 ships at p = 0.18 (82% confidence). Test 2 ships at p = 0.20 (80% confidence). The family-wise confidence that neither ship is a false positive is only 0.82 × 0.80 ≈ 66%.
Why wrong: Every additional test compounds your false positive rate, so overall confidence erodes with each sequential ship.
Right: Run one test at a time. Hit your decision threshold and move on.
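The arithmetic behind that ~66% figure is plain multiplication, and it keeps eroding as tests stack up. A quick check:

```python
# Family-wise confidence after several sequential "ships": the chance that
# none of them is a false positive is the product of the per-test confidences.
def family_wise(confidences):
    result = 1.0
    for c in confidences:
        result *= c
    return result

print(f"{family_wise([0.82, 0.80]):.0%}")   # ~66% after two tests
print(f"{family_wise([0.80] * 4):.0%}")     # ~41% after four 80% tests
```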
Mistake 3: Ignoring the baseline conversion rate
Wrong: Your store converts at 0.8%, you run a test that "lifts" to 1.0% (25% relative lift). You ship it.
Why wrong: At 0.8% baseline, a 0.2% absolute lift requires 400+ orders to detect reliably.
Right: At low baselines (< 1.5%), expect to need larger sample sizes or higher observed lifts.
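How much larger? A self-contained check using the same two-proportion formula as the MDE sketch above. The 0.8% baseline and 0.2-point lift mirror the example; 80% confidence and 80% power are assumptions.

```python
from scipy.stats import norm

# Visitors per variation needed before a 0.2-point lift on a 0.8%
# baseline clears an 80%-confidence / 80%-power MDE.
z = norm.ppf(0.90) + norm.ppf(0.80)    # two-sided 80% conf + 80% power
baseline, lift = 0.008, 0.002          # 0.8% baseline, 0.2-point lift
n = (z / lift) ** 2 * 2 * baseline * (1 - baseline)
print(round(n))  # ~17,900 visitors per variation (300+ orders across both)
```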
AU-specific timing considerations
Australia has weekly traffic patterns:
- Weekdays: lower traffic, lower conversion
- Weekends: higher traffic, often higher conversion
- Thursday-Sunday: peak retail traffic
Run tests for at least 2 weeks to average out weekly variation.
If you run a test Tuesday-Thursday, you miss the weekend spike. Run Mon-Sun, then another Mon-Sun.
The stats checklist
Before shipping a test result:
- [ ] Minimum 200 total orders collected (100+ per variation)
- [ ] Test ran for at least 2 calendar weeks (includes weekend traffic)
- [ ] P-value < 0.2 (80% confidence)
- [ ] Observed lift > 1.5% absolute (or > MDE for your sample size)
- [ ] No "peeking"—analyzed results only at week 2 decision point
- [ ] Control and test had equal traffic split (50/50)
Why operators fail at testing
Failure mode 1: Impatience. "I see a 1.0% lift after 1 week, I'll ship it." Result: the lift reverts to the control after shipping. One week wasted.
Failure mode 2: Overthinking. "I need 95% confidence, which requires 4 months." Result: you ship nothing and learn nothing.
Failure mode 3: Wrong threshold. "A p-value < 0.05 is the gold standard." That is academic practice, not retail practice. Result: you never declare a winner in under 4 months of testing.
Right approach: 80% confidence, 1.5% absolute lift threshold, 2-week minimum. Ship, learn, iterate.
How to get to statistical significance fast
To minimize testing duration:
- Test high-impact variables (headline, image order, checkout steps). These move the needle by 0.5-1.5%, so you can hit your MDE within the 2-week window.
- Avoid low-impact variables (button color, font choice, comma usage). These move the needle by 0.0-0.1%, so you would need months of data to detect anything.
- Concentrate traffic. Run the test on hot products only, not the whole catalog. A product doing 400 orders/month gives you roughly 50 orders per variation per week, so you reach 200 total orders by week 2 instead of week 4 or later (see the duration sketch below).
- Test multiple variants of the same element. Instead of "headline A" vs "headline B" (binary), test headlines A, B, C, and D against the control; a more extreme winner reaches significance faster.
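A rough duration calculator makes the traffic point concrete. The order volumes below are illustrative, and a 50/50 traffic split is assumed:

```python
def weeks_to_target(total_orders_per_week: float,
                    per_variation_target: int = 100) -> float:
    """Weeks until each variation reaches its order target under a 50/50 split."""
    return per_variation_target / (total_orders_per_week / 2)

print(weeks_to_target(100))  # hot product, ~100 orders/week -> 2.0 weeks
print(weeks_to_target(25))   # slow product, ~25 orders/week -> 8.0 weeks
```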
The testing calendar
Month 1:
- Week 1: Identify hypothesis from heatmap/analytics
- Week 2-3: Run test (headline or image)
- Week 4: Decision and ship
Month 2:
- Week 1: Identify next hypothesis
- Week 2-3: Run test (video or CTA)
- Week 4: Decision and ship
Month 3 onward:
- Repeat the same rhythm: one hypothesis, one 2-week test, one decision per month.
If you follow this rhythm, you run 12 tests per year. At a 30-40% win rate (hypothesis-driven), you ship 4-5 winning tests per year. Each winner adds 0.3-0.8% to conversion. Cumulative: 1.2-4.0% annual conversion gain.
From 1.5% to 3.5%+ is achievable in 12-18 months with disciplined testing.
Statistical significance in practice: three tests from a kitchen gadget store
A kitchen gadget store (vegetable chopper, A