Module 11 — Conversion Rate Optimization & UX Testing
When to Stop Testing — Statistical Significance for Operators
11 min · interactive · Intermediate
You run a test for 2 weeks, see a 1.0% lift, and declare victory. But the math says you need 4 weeks of data to be 80% confident. You ship, the lift reverts to the control's level, and you have wasted 2 weeks. This lesson covers the actual math: chi-squared, sample size, and minimum detectable effect, without the jargon.
The statistical significance problem
Most operators either:
- Test forever (chasing 95% confidence on 50 orders per variation)
- Test too short (ship on 10 orders per variation)
Neither works. You need a middle ground: 80% confidence with 200-400 total orders.
The actual math (simplified)
You run a test with two variations:
- Control: 200 visitors, 5 orders (2.5% conversion)
- Test: 200 visitors, 6 orders (3.0% conversion)
Did the test win, or is the difference random noise?
Use the chi-squared test (simpler to run and interpret than a Bayesian analysis).
Chi-squared formula (the practical version)
You do not need to calculate this. Use this free tool: https://www.evanmiller.org/ab-testing/chi-squared.html
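For reference, the statistic behind calculators like this one is Pearson's chi-squared, summed over the four cells of the 2×2 table (variation × converted / not converted):

χ² = Σ (observed − expected)² / expected

where the expected counts come from pooling both variations. The bigger χ² gets, the less likely the split is pure chance.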
Input:
- Control: 200 visitors, 5 conversions (the number of orders, not the rate)
- Variation: 200 visitors, 6 conversions
- Hit "Calculate"
Output:
- P-value: ~0.76
- Confidence: ~24% (far below 80%, NOT a win)
Interpretation: At this sample size, a 0.5-point difference is indistinguishable from random noise. You are nowhere near the 80% threshold, so this is not a win. Keep the control.
Same test, more data, different result
- Control: 800 visitors, 20 conversions (2.5%)
- Test: 800 visitors, 32 conversions (4.0%)
Chi-squared output:
- P-value: ~0.09
- Confidence: ~91%
Interpretation: You can now be roughly 91% confident the variation is genuinely better. That clears the 80% threshold: this is a win. Ship the test.
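If you would rather script this than paste numbers into the web tool, here is a minimal sketch using scipy that reproduces both results above. The ab_significance helper is illustrative, not a standard API; Yates' continuity correction is disabled to match the plain chi-squared formula, so the web tool's output may differ slightly.

```python
# pip install scipy
from scipy.stats import chi2_contingency

def ab_significance(control_visitors, control_orders, test_visitors, test_orders):
    # 2x2 contingency table: [converted, did not convert] per variation
    table = [
        [control_orders, control_visitors - control_orders],
        [test_orders, test_visitors - test_orders],
    ]
    # correction=False: plain Pearson chi-squared, matching the formula above
    _, p_value, _, _ = chi2_contingency(table, correction=False)
    return p_value, 1 - p_value

for args in [(200, 5, 200, 6), (800, 20, 800, 32)]:
    p, conf = ab_significance(*args)
    print(f"p = {p:.2f}, confidence = {conf:.0%}")
# p = 0.76, confidence = 24%  -> keep the control
# p = 0.09, confidence = 91%  -> ship the test
```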
Minimum Detectable Effect (MDE)
MDE is the smallest conversion lift you care about detecting.
For 200 total orders (100 control, 100 test):
- At 80% confidence, your MDE is ~1.6% absolute lift
- At 85% confidence, your MDE is ~1.8% absolute lift (demanding more confidence means you can only detect bigger lifts)
What this means: If your test shows a 1.5% lift, you cannot confidently declare it a win with 200 orders. You need either:
- More orders (400+), OR
- A higher observed lift (2.0%+)
Sample size: how many orders do you need?
The more orders you collect, the smaller the lift you can detect.
| Total Orders | Minimum detectable lift @ 80% (absolute) | Minimum detectable lift @ 85% (absolute) |
|---|---|---|
| 100 | 2.3% | 2.5% |
| 200 | 1.6% | 1.8% |
| 400 | 1.1% | 1.2% |
| 1,000 | 0.65% | 0.75% |
| 2,000 | 0.45% | 0.5% |
Practical rule:
- < 200 orders total: your MDE is 1.6%+. Only ship if you see > 2.0% lift.
- 200-400 orders: your MDE is 1.1-1.6%. Ship if you see > 1.5% lift.
- > 400 orders: your MDE is 1.1% or lower. Ship if you see > 1.0% lift.
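If you want numbers for your own traffic rather than the table's round figures, here is a minimal sketch using the standard two-proportion formula. Note that it works from visitors per variation and a baseline conversion rate, not orders; divide your order counts by your conversion rate to translate. The mde() helper, the 2.5% baseline, and the 4,000-visitor example are illustrative assumptions.

```python
# pip install scipy
from scipy.stats import norm

def mde(visitors_per_variation: int, baseline: float,
        confidence: float = 0.80, power: float = 0.80) -> float:
    """Smallest absolute lift a two-proportion test can reliably detect."""
    z_alpha = norm.ppf(1 - (1 - confidence) / 2)   # two-sided significance
    z_beta = norm.ppf(power)                       # sensitivity (power)
    se = (2 * baseline * (1 - baseline) / visitors_per_variation) ** 0.5
    return (z_alpha + z_beta) * se

# Illustrative: 4,000 visitors per variation at a 2.5% baseline
print(f"MDE: {mde(4_000, 0.025):.2%} absolute")    # ~0.74%
```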
The decision framework
Week 1-2: Run the test
- Control: 50% traffic
- Test: 50% traffic
- Collect minimum 100 orders per variation (200 total)
Week 3: Analyze the results
- Calculate chi-squared p-value at https://www.evanmiller.org/ab-testing/chi-squared.html
- If p-value < 0.2 (> 80% confidence) AND observed lift > 1.5%, SHIP
- If p-value > 0.2 (< 80% confidence) AND observed lift < 1.5%, REVERT to control
- If borderline (p-value 0.15-0.25 AND lift 1.0-1.8%), run for one more week
Week 4: Final decision
- Collect another 50-100 orders per variation
- Re-calculate p-value
- Make final decision (ship or revert)
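The week-3 rules condense to a few lines of code. This sketch encodes this module's thresholds; giving the ship rule precedence over the borderline rule, and falling through to revert in every other case, is one conservative reading of the rules above.

```python
def week3_decision(p_value: float, lift_points: float) -> str:
    """Apply the module's week-3 rules; lift is in absolute percentage points."""
    if p_value < 0.2 and lift_points > 1.5:
        return "ship"                            # confident and big enough
    if 0.15 <= p_value <= 0.25 and 1.0 <= lift_points <= 1.8:
        return "borderline: run one more week"   # the gray zone above
    return "revert to control"                   # conservative fall-through

print(week3_decision(0.09, 2.0))   # ship
print(week3_decision(0.18, 1.2))   # borderline: run one more week
print(week3_decision(0.40, 0.5))   # revert to control
```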
Common statistical mistakes
Mistake 1: "Peeking" at results and stopping early
Wrong: Run the test for 1 week, see a 2.0% lift, declare victory.
Why wrong: Stopping as soon as the numbers look good oversamples lucky streaks (selection bias).
Right: Pre-commit to 2 weeks. Do not look at results until day 14.
Mistake 2: Running sequential tests without correction
Wrong: Test 1 ships at p = 0.18 (82% confidence). Test 2 ships at p = 0.20 (80% confidence). The family-wise confidence that neither ship is a false positive is only 0.82 × 0.80 ≈ 66%.
Why wrong: Every additional test compounds your false positive rate, so overall confidence erodes with each sequential ship.
Right: Run one test at a time. Hit your decision threshold and move on.
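The arithmetic behind that ~66% figure is plain multiplication, and it keeps eroding as tests stack up. A quick check:

```python
# Family-wise confidence after several sequential "ships": the chance that
# none of them is a false positive is the product of the per-test confidences.
def family_wise(confidences):
    result = 1.0
    for c in confidences:
        result *= c
    return result

print(f"{family_wise([0.82, 0.80]):.0%}")   # ~66% after two tests
print(f"{family_wise([0.80] * 4):.0%}")     # ~41% after four 80% tests
```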
Mistake 3: Ignoring the baseline conversion rate
Wrong: Your store converts at 0.8%, you run a test that "lifts" to 1.0% (25% relative lift). You ship it.
Why wrong: At 0.8% baseline, a 0.2% absolute lift requires 400+ orders to detect reliably.
Right: At low baselines (< 1.5%), expect to need larger sample sizes or higher observed lifts.
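How much larger? A self-contained check using the same two-proportion formula as the MDE sketch above. The 0.8% baseline and 0.2-point lift mirror the example; 80% confidence and 80% power are assumptions.

```python
from scipy.stats import norm

# Visitors per variation needed before a 0.2-point lift on a 0.8%
# baseline clears an 80%-confidence / 80%-power MDE.
z = norm.ppf(0.90) + norm.ppf(0.80)    # two-sided 80% conf + 80% power
baseline, lift = 0.008, 0.002          # 0.8% baseline, 0.2-point lift
n = (z / lift) ** 2 * 2 * baseline * (1 - baseline)
print(round(n))  # ~17,900 visitors per variation (300+ orders across both)
```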
AU-specific timing considerations
Australia has weekly traffic patterns:
- Weekdays: lower traffic, lower conversion
- Weekends: higher traffic, often higher conversion
- Thursday-Sunday: peak retail traffic
Run tests for at least 2 weeks to average out weekly variation.
If you run a test Tuesday-Thursday, you miss the weekend spike. Run Mon-Sun, then another Mon-Sun.
The stats checklist
Before shipping a test result:
- [ ] Minimum 200 total orders collected (100+ per variation)
- [ ] Test ran for at least 2 calendar weeks (includes weekend traffic)
- [ ] P-value < 0.2 (80% confidence)
- [ ] Observed lift > 1.5% absolute (or > MDE for your sample size)
- [ ] No "peeking"—analyzed results only at week 2 decision point
- [ ] Control and test had equal traffic split (50/50)
Why operators fail at testing
Failure mode 1: Impatience. "I see a 1.0% lift after 1 week, I'll ship it." Result: the lift reverts to the control after shipping. One week wasted.
Failure mode 2: Overthinking. "I need 95% confidence, which requires 4 months." Result: you ship nothing and learn nothing.
Failure mode 3: Wrong threshold. "A p-value < 0.05 is the gold standard." That is academic practice, not retail practice. Result: you never declare a winner in under 4 months of testing.
Right approach: 80% confidence, 1.5% absolute lift threshold, 2-week minimum. Ship, learn, iterate.
How to get to statistical significance fast
To minimize testing duration:
- Test high-impact variables (headline, image order, checkout steps). These move the needle by 0.5-1.5%, so you can hit your MDE within the 2-week window.
- Avoid low-impact variables (button color, font choice, comma usage). These move the needle by 0.0-0.1%, so you would need months of data to detect anything.
- Concentrate traffic. Run the test on hot products only, not the whole catalog. A product doing 400 orders/month gives you roughly 50 orders per variation per week, so you reach 200 total orders by week 2 instead of week 4 or later (see the duration sketch below).
- Test multiple variants of the same element. Instead of "headline A" vs "headline B" (binary), test headlines A, B, C, and D against the control; a more extreme winner reaches significance faster.
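A rough duration calculator makes the traffic point concrete. The order volumes below are illustrative, and a 50/50 traffic split is assumed:

```python
def weeks_to_target(total_orders_per_week: float,
                    per_variation_target: int = 100) -> float:
    """Weeks until each variation reaches its order target under a 50/50 split."""
    return per_variation_target / (total_orders_per_week / 2)

print(weeks_to_target(100))  # hot product, ~100 orders/week -> 2.0 weeks
print(weeks_to_target(25))   # slow product, ~25 orders/week -> 8.0 weeks
```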
The testing calendar
Month 1:
- Week 1: Identify hypothesis from heatmap/analytics
- Week 2-3: Run test (headline or image)
- Week 4: Decision and ship
Month 2:
- Week 1: Identify next hypothesis
- Week 2-3: Run test (video or CTA)
- Week 4: Decision and ship
Month 3 onward:
- Repeat the same rhythm: one hypothesis, one 2-week test, one decision per month.
If you follow this rhythm, you run 12 tests per year. At a 30-40% win rate (hypothesis-driven), you ship 4-5 winning tests per year. Each winner adds 0.3-0.8% to conversion. Cumulative: 1.2-4.0% annual conversion gain.
From 1.5% to 3.5%+ is achievable in 12-18 months with disciplined testing.
Statistical significance in practice: three tests from a kitchen gadget store
A kitchen gadget store (vegetable chopper, A