A/B Testing

A/B testing (more formally, controlled online experimentation) is the practice of showing different versions of a product to different users and measuring which produces better outcomes. The strength of the technique is that it converts product questions from arguments about opinions into measurements of behavior. The limitation is that the answer is only as good as the question, the metric, the design of the experiment, and the team's ability to interpret what came back.

Done well, A/B testing accelerates learning. Done poorly, it produces a series of statistically dressed-up wrong answers, often delivered with the false confidence that numbers tend to inspire.¹

What A/B Testing Is Good For

Some product questions are well-shaped for experimentation:

Onboarding flows. Small variations in copy, sequencing, or defaults produce measurable differences in completion. The user volume is usually large enough to detect them.
Call-to-action placement and wording. Whether users click, where, and how often. Easy to measure, easy to vary, fast to read.
Workflow simplification. Whether removing a step (or adding a confirmation, or changing a default) improves completion without increasing errors.
Messaging and content. Comparing different framings of the same idea to see which one users respond to.
Feature adoption. Whether changing how a feature is surfaced increases the number of users who try it.
Pricing and packaging. Whether different pricing structures, plan layouts, or upgrade prompts change conversion, provided the test is run with appropriate care and legal review.

In each case, the question is bounded, the metric is clear, the change is small enough to attribute, and the user volume is large enough to produce a statistically meaningful answer.

When A/B Testing Doesn't Work

A/B testing is the wrong tool for many of the questions teams want to ask of it:

Low-traffic experiences. Detecting small effects requires large samples. A B2B product with a few hundred customers may not see meaningful signal on most changes for months, if ever.
Major redesigns. A whole-product redesign cannot be cleanly tested against the previous design, because the change is too large and the effects too tangled.
Long-term effects. A/B tests measure short-term metrics. They miss the effects that unfold over months, such as customer trust, brand perception, or retention beyond the test window.
Strategic questions. "Should we enter this market?" or "is this the right product direction?" are not A/B test questions. They are bets, and bets cannot be experimented down to certainty.
Questions where the metric is unclear. If the team cannot specify what success would look like, no experiment can supply that definition. The work is to define the metric first, not to instrument harder.
Questions where the experiment changes the question. Some experiences (legal disclosures, security messaging, premium-quality positioning) lose meaning when treated as variables. The team should make some changes for reasons other than a measured outcome.
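
To make the low-traffic point concrete, the sketch below estimates the smallest absolute lift a test could reliably detect at a given sample size. It assumes a two-sided test on a conversion rate, equal-sized arms, and the normal approximation with the baseline variance used for both arms. The function name and the example numbers (150 customers per arm, 20% baseline conversion) are illustrative assumptions, not figures from any particular product.

```python
import math
from statistics import NormalDist


def minimum_detectable_effect(n_per_arm, baseline_rate, alpha=0.05, power=0.80):
    """Approximate smallest absolute lift detectable with the given sample.

    Assumes a two-sided two-proportion z-test with equal arms and uses the
    baseline variance for both arms (a common planning simplification).
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the test
    z_power = NormalDist().inv_cdf(power)          # quantile for the desired power
    std_error = math.sqrt(2 * baseline_rate * (1 - baseline_rate) / n_per_arm)
    return (z_alpha + z_power) * std_error


# Hypothetical B2B product: 300 customers split evenly, 20% baseline conversion.
mde = minimum_detectable_effect(n_per_arm=150, baseline_rate=0.20)
print(f"Smallest detectable absolute lift: {mde:.1%}")
```

Under these assumptions the detectable lift comes out at roughly 13 percentage points, which is larger than most small product changes are likely to produce.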

Common Pitfalls

The statistical and methodological pitfalls of online experimentation are well documented, and teams fall into them persistently:¹

Peeking at results. Checking repeatedly and stopping a test as soon as it looks significant can push the false-positive rate well above the nominal level (for example, the usual 5%). Tests need to run their planned duration.
Multiple comparisons. Running many tests at once, on overlapping users, with shared metrics, produces spurious "winners" by chance. Correction is required, and most teams do not apply it.
Novelty effects. Users react to anything new. A short test may show a difference that reflects a reaction to the change itself, not a genuine preference for the new version. Tests need to run long enough for novelty to wear off.
Selection effects. The users who happen to be in the test cohort may not be representative. Randomization needs to be correct, and the team needs to verify it.
Wrong primary metric. A test that optimizes click-through rate may be reducing retention. The metric that is moved is not always the metric the team cares about.
Underpowered tests. A test with insufficient sample size is unlikely to detect an effect even if it exists. Negative results from underpowered tests do not mean "no effect"; they mean "no detection."
Treating non-significant results as proof of no effect. A non-significant test is the absence of evidence, not evidence of absence.
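
The cost of peeking, the first pitfall above, can be seen in a small simulation. The sketch below runs A/A tests (both arms share the same true conversion rate, so every "significant" result is a false positive) and compares two policies: declaring a winner at the first significant interim look, and analyzing once at the planned end. The traffic numbers, the number of looks, and the function names are assumptions chosen for illustration.

```python
import math
import random


def z_stat(conv_a, n_a, conv_b, n_b):
    """Two-proportion z statistic using a pooled standard error."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_error = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_error == 0:
        return 0.0  # no variation yet, so nothing to test
    return (conv_b / n_b - conv_a / n_a) / std_error


def simulate_aa_tests(n_sims=1000, users_per_arm=1000, looks=10,
                      rate=0.20, z_crit=1.96, seed=7):
    """Return (false-positive rate with peeking, rate with one final look)."""
    rng = random.Random(seed)
    checkpoints = [users_per_arm * (i + 1) // looks for i in range(looks)]
    peeking_hits = 0
    fixed_hits = 0
    for _sim in range(n_sims):
        conv_a = conv_b = 0
        seen = 0
        ever_significant = False
        significant = False
        for n in checkpoints:
            # Add the users who arrived since the previous look.
            for _user in range(n - seen):
                conv_a += rng.random() < rate
                conv_b += rng.random() < rate
            seen = n
            significant = abs(z_stat(conv_a, n, conv_b, n)) > z_crit
            ever_significant = ever_significant or significant
        peeking_hits += ever_significant  # would have stopped at some look
        fixed_hits += significant         # result at the planned end only
    return peeking_hits / n_sims, fixed_hits / n_sims


peeking_rate, fixed_rate = simulate_aa_tests()
print(f"False positives when stopping at first significant look: {peeking_rate:.1%}")
print(f"False positives with a single planned analysis: {fixed_rate:.1%}")
```

In runs like this, the single planned analysis stays near the nominal 5%, while the stop-when-significant policy typically lands well above it. Because the final look is one of the checkpoints, the peeking rate can never be lower than the fixed-horizon rate.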

What This Looks Like in Practice

Define the metric and the success threshold before the test starts. Writing down what would count as a win, in advance, protects the team from interpreting noisy results favorably after the fact.
Estimate the required sample size before running. Power analysis is not optional. A test that cannot detect a useful effect at the team's traffic volume should not be run.
Run tests for a full business cycle. Day-of-week and seasonal effects produce variation that can swamp the experimental signal. The right test duration is usually longer than the team's first instinct.
Pre-register the analysis. What metrics will be looked at, what segments, what thresholds. Post-hoc fishing in the data is how false positives become product decisions.
Watch the secondary metrics. A test that wins on the primary metric and loses on three secondary metrics is probably not actually a win. The full picture matters.
Combine with qualitative research. Why did the variant win? Behavior tells you what; users tell you why. The pair is more useful than either alone.
Resist the temptation to test everything. Some changes are too small to be worth the experimental overhead. Some are too important to wait for an experiment. The discipline is using the right tool for the question.
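
The sample-size and duration steps above can be sketched with the standard normal-approximation formula for comparing two proportions. This is a planning estimate under stated assumptions (two-sided test, equal arms, a conversion-rate metric), and the function names, rates, and weekly traffic figure are hypothetical.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(baseline_rate, expected_rate, alpha=0.05, power=0.80):
    """Users needed in each arm to detect the given change in conversion rate."""
    effect = expected_rate - baseline_rate
    if effect == 0:
        raise ValueError("expected_rate must differ from baseline_rate")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance = (baseline_rate * (1 - baseline_rate)
                + expected_rate * (1 - expected_rate))
    return math.ceil((z_alpha + z_power) ** 2 * variance / effect ** 2)


def weeks_needed(n_per_arm, weekly_users, arms=2):
    """Round the duration up to whole weeks so each weekday is covered equally."""
    return math.ceil(n_per_arm * arms / weekly_users)


# Hypothetical plan: lift conversion from 10% to 11% with 8,000 eligible users a week.
n = sample_size_per_arm(baseline_rate=0.10, expected_rate=0.11)
weeks = weeks_needed(n, weekly_users=8000)
print(f"{n} users per arm, about {weeks} weeks")
```

With these made-up inputs the plan needs a little under 15,000 users per arm, which rounds up to four weeks at 8,000 eligible users a week. If that duration is not acceptable, the honest options are a larger target effect or no test, not a shorter run.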

Tradeoff

A/B testing requires enough traffic and careful interpretation. It is not a substitute for product judgment. A team that runs experiments without judgment is a team that gets misled with precision. A team that has judgment but no experiments is a team that mistakes confidence for evidence.

¹ Ron Kohavi, Diane Tang, and Ya Xu, Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing (Cambridge University Press, 2020). The reference text for experimentation done correctly, covering statistical foundations, design choices, common pitfalls, and the cultural and organizational prerequisites for an experimentation program to actually inform decisions.