When should I AB test?
I’ve been working on growth and user acquisition in startups for a little over three years. If I’ve learned anything, it’s that growth doesn’t happen in a vacuum. It’s heavily influenced by how the product works, how it looks and feels, and how quickly it can bring value to a user. You can’t sustain growth by throwing money at sales or marketing; you must take a more holistic approach. Because of that, I’ve spent a lot of time studying how users find the product, how they use it, what gets them to convert, and what gets them to stick around.
Startups do similar research all across the company. The company needs to learn its product-market fit. It needs to discover an effective business model. It needs to learn how to create more customers, or how to get repeat clientele. As Eric Ries states in his book The Lean Startup: “The only way to win is to learn faster than anyone else.”
Luckily, humans have been exploring and learning for a long time, and we’ve even codified a particularly effective method we call the scientific method. The basic idea is you have a question, you make a hypothesis about the answer, you test that hypothesis with an experiment, and then you find out whether you were right or wrong. It’s a methodical trial-and-error that helps you find answers to questions.
In the midst of this questioning and learning, many startups started running experiments called AB tests. While AB testing is incredibly useful, I have seen plenty of misunderstanding about how and when to run a test. I’ve seen people skip steps in the larger scientific method—failing to record a hypothesis, or not collecting the right metrics to prove or disprove the hypothesis—so once the test is over the variant that everyone “likes better” gets merged in. This is an absolute waste of time and money. I’ve also seen people run AB tests when they really didn’t need to, and this is also a waste of time and money! To spare us some of that waste, I’d like to dig into what AB testing is and when it’s appropriate to run a test, including how to trade off some of the costs.
What is an AB test?
First, let’s be clear about what an AB test is: it’s an experiment designed to control for as many variables in the user experience as possible. You split users into two or more groups and show each group a different variant of the same experience. Users must remain in the same group for the duration of the test for the results to be valid. The variants run at the same time to control for external factors related to timing, like market forces or your company’s current ad spend.
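To make the "users must remain in the same group" requirement concrete, here is a minimal sketch of one common way to do it: hash the user ID together with the experiment name, so the same user always lands in the same group without storing anything. The function name `assign_variant`, the experiment name, and the variant labels are all made up for illustration; your experimentation tool may do this differently.

```python
import hashlib


def assign_variant(user_id: str, experiment: str,
                   variants: tuple = ("control", "variant")) -> str:
    """Deterministically map a user to one variant of an experiment.

    Hypothetical helper: hashing (experiment, user_id) means the same
    user always gets the same answer, and different experiments split
    users independently of each other.
    """
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    bucket = int(digest, 16) % len(variants)
    return variants[bucket]


if __name__ == "__main__":
    # The same user gets the same group every time we ask.
    first = assign_variant("user-123", "cta_copy_test")
    second = assign_variant("user-123", "cta_copy_test")
    print(first, second, first == second)
```

Because the assignment is a pure function of the user ID and the experiment name, there is no lookup table to keep in sync, which is one less thing to clean up when the test ends.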
What do we test and why?
We often have hypotheses about how our changes to the product will affect the metrics we care about. For instance, I might guess that changing the CTA copy will drive more conversions. If I just make the change and then analyze the results, I could get the wrong impression. Imagine that I’m building a finance app, and the market dips right as I release my CTA copy change—I might not be able to tell whether it was my change or the change in the market that affected the conversion rate!
The point of an AB test, then, is to verify our hypothesis with some degree of certainty. This hypothesis checking (the scientific method we discussed earlier) happens through a process called “induction”, as opposed to “deduction”. For deductive reasoning, we start with a set of premises and work our way to conclusions through logically sound reasoning. But for induction, we don’t have a set of axioms or premises to work from. We only have an observation and our hypothesis about what that means. Induction can’t prove a hypothesis outright; we can only build confidence in it by knocking down potential counterarguments. The more counterarguments you knock down, the more confident you can be that your hypothesis is correct. (Incidentally, this is the same process we go through when testing code.) Running an AB test lets you knock down a whole swath of counterarguments in one go since you’re controlling for so many variables. That makes it a valuable tool for validating the efficacy of product changes.
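"Some degree of certainty" usually gets quantified with a statistical test once the results are in. The sketch below uses a two-proportion z-test, which is my choice for illustration and not something the rest of this post depends on; the function name and the numbers in the example are made up. It asks how surprising the observed difference in conversion rates would be if the two groups really converted at the same rate.

```python
import math


def two_proportion_z_test(conv_a: int, n_a: int, conv_b: int, n_b: int):
    """Return (z, two-sided p-value) for the difference in conversion rates.

    conv_a / n_a: conversions and users in the control group.
    conv_b / n_b: conversions and users in the variant group.
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the assumption that both groups convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


if __name__ == "__main__":
    # Made-up numbers: 200/5000 control conversions vs 260/5000 variant.
    z, p = two_proportion_z_test(200, 5000, 260, 5000)
    print(f"z = {z:.2f}, p = {p:.4f}")
```

A small p-value knocks down one specific counterargument ("this difference is just noise"). It says nothing about whether you measured the right metric, which is why the hypothesis still has to come first.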
So should I run a test?
Now that we know what an AB test is and why we’d use it, we can determine whether we should use it. You should run an AB test if you have an unproven hypothesis you want to prove or disprove, and the value of running the test outweighs the cost.
First, how do you know if you have an unproven hypothesis? Imagine someone comes to you and says “Let’s make this change to our onboarding flow because it feels like it will help”. Is there a hypothesis there? Yes, there absolutely is, even if the person making the request hasn’t explicitly laid it out. You should ask them:
- which metrics do you think this change will affect?
- how will the metrics be affected?
- have we run tests like this before?
Hypotheses should be specific and stated in terms of your metrics. This will help you make meaningful impact estimates for new projects, which helps with prioritizing work. It also makes it easy to take learnings from a test and use them in other places. Let’s say in this hypothetical scenario that someone had already tested a similar change at an earlier point in the onboarding flow and saw that it improved a key metric. In that case, you probably don’t need to run another test. The costs of additional code changes, cyclomatic complexity, and the opportunity cost of waiting for the test results to come back are likely higher than the value you’d get from the marginal learnings. But this can only be ascertained if the previous test had a clear hypothesis and therefore a clear result.
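One lightweight way to keep hypotheses specific is to write them down in a fixed shape before any code ships. The record below is only a sketch: the class name, field names, and example values are all hypothetical, and a shared doc or ticket template works just as well. The point is that the three questions above each have a slot that can't be left blank by accident.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

VALID_DIRECTIONS = {"increase", "decrease", "no_change"}


@dataclass(frozen=True)
class Hypothesis:
    """A hypothesis stated in terms of a metric (hypothetical structure)."""
    change: str                      # what we plan to ship
    metric: str                      # which metric we expect to move
    expected_direction: str          # "increase", "decrease", or "no_change"
    expected_relative_change: Optional[float] = None  # e.g. 0.05 for +5%
    prior_tests: Tuple[str, ...] = ()  # earlier tests of similar changes

    def is_specific(self) -> bool:
        """True when the hypothesis names a metric and a valid direction."""
        return bool(self.metric) and self.expected_direction in VALID_DIRECTIONS

    def has_prior_evidence(self) -> bool:
        """True when a similar change was already tested somewhere."""
        return len(self.prior_tests) > 0


if __name__ == "__main__":
    h = Hypothesis(
        change="Shorten step 2 of onboarding",
        metric="onboarding_completion_rate",
        expected_direction="increase",
        expected_relative_change=0.05,
    )
    print(h.is_specific(), h.has_prior_evidence())
```

If `has_prior_evidence()` comes back true, that is the cue to go read the earlier result before deciding whether another test is worth its cost.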
As a real-world example, my team tested adding certain components to a landing page. We hypothesized that adding the components would improve clickthrough conversion. In the test, our control group saw the page without the components, while the variant group saw the components. Lo and behold, the variant group had a higher clickthrough rate. From there we decided to make the same changes to a different landing page, but we didn’t run a test the second time since we were already confident the change was beneficial. Running another test at that point would’ve taken more time and more code changes, but likely would not have helped us learn more about our users. We already had a clear hypothesis and clear learnings, so we could extrapolate that this other landing page would benefit from the new components as well. Since we measured the percent change to our clickthrough conversion rate, we were also able to estimate the impact of making these changes to the second landing page, which helped us move the work higher in our backlog.
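The impact estimate in that example is simple arithmetic, sketched below. The numbers are invented for illustration (they are not our real results), and the function names are mine. The big assumption, stated plainly: the relative lift measured on the first page carries over to the second page unchanged.

```python
def relative_lift(control_rate: float, variant_rate: float) -> float:
    """Percent change of the variant over the control, as a fraction."""
    if control_rate <= 0:
        raise ValueError("control_rate must be positive")
    return (variant_rate - control_rate) / control_rate


def projected_extra_clicks(baseline_rate: float, visitors: int,
                           lift: float) -> float:
    """Extra clicks expected if the same relative lift applies elsewhere."""
    return visitors * baseline_rate * lift


# Illustrative numbers only: first page went from 10% to 12% clickthrough.
lift = relative_lift(control_rate=0.10, variant_rate=0.12)

# Second page: 8% baseline clickthrough and 50,000 visitors per month.
extra = projected_extra_clicks(baseline_rate=0.08, visitors=50_000, lift=lift)

print(f"relative lift: {lift:.0%}")
print(f"projected extra clicks per month: {extra:.0f}")
```

An estimate like this is rough, but it is enough to compare the work against other items in the backlog.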
We can see it’s important to make a hypothesis—both for high-quality learning and also so we don’t try to test the same hypothesis again—but it can be tricky to do well. We’re often good at making hypotheses about things that will happen if we make a change. But we’re not so good at making hypotheses about things that won’t happen. For instance, we’re good at saying “This change will add 5 new users a day”, but not so good at saying “This change won’t reduce the number of new users coming in per day”.
This is an easy trap to fall into, but it can cause real problems. Something as simple as addressing tech debt could change conditions enough to impact your metrics. For example, I once saw an innocuous backend change in the onboarding flow increase latency on one of the first pages. That additional latency, unbeknownst to us, increased dropoff on that page. We didn’t catch it because we didn’t test it, and we didn’t test it because we didn’t think to make a hypothesis about what wouldn’t happen as a result of our change (namely, that the change wouldn’t cause a dip in key metrics). Keep that in mind as you’re planning changes to your code. If there’s doubt about the impact of a change, you may want to run a test.
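A "won't happen" hypothesis can be written down just as concretely as a "will happen" one. The sketch below is a deliberately naive guardrail check: it compares point estimates only and ignores statistical noise, so treat it as a way to state the hypothesis rather than a full analysis. The function name, the metric, and the 2% tolerance are all assumptions for illustration.

```python
def guardrail_holds(control_rate: float, variant_rate: float,
                    max_relative_drop: float = 0.02) -> bool:
    """Check the hypothesis "this change won't hurt the metric".

    Returns True when the variant's rate is no more than
    `max_relative_drop` (as a fraction) below the control's rate.
    Naive: uses point estimates and does not account for noise.
    """
    floor = control_rate * (1 - max_relative_drop)
    return variant_rate >= floor


if __name__ == "__main__":
    # Made-up page completion rates for a backend-only change.
    print(guardrail_holds(control_rate=0.50, variant_rate=0.495))
    print(guardrail_holds(control_rate=0.50, variant_rate=0.45))
```

Had we written our latency change up this way, with page completion rate as the guarded metric, the extra dropoff would have had somewhere to show up.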
That being said, there are many changes where there won’t be doubt about the impact. Since we mentioned latency, let’s consider performance improvements: a decent body of literature already tells us that performance improvements increase conversion rate and decrease bounce rate. There’s no new inquiry there, so we can skip the AB test. I know there’s a temptation to run a test so we can post flashy Slack messages with results, or so we can avoid the upfront work of making a researched estimate to help with proper prioritization, but don’t let the urge to post-rationalize a decision lead you to test. Because there’s so much research already out there, you should be able to predict the impact before you do the work. If you’re AB testing to justify it after the fact, you’ve already screwed up.
The costs of an AB test
To balance what was said above, let’s talk about the costs associated with an AB test, and when those costs might outweigh the benefits of testing.
AB testing cost comes in three main forms:
- increased cyclomatic complexity, which increases the odds of introducing a bug and also makes code harder to maintain;
- additional project scope, which increases the time and energy a project takes, thereby increasing opportunity cost;
- complexity inherent in the test itself, especially if you’re running multiple AB tests on the same population, which can give you faulty results.
These risks need to be understood and traded off with the value you gain from running a test. If it turns out that the risk outweighs the reward, you might be better off forgoing the test. Let’s talk about those costs in a little more detail.
First, you increase the cyclomatic complexity of your code by adding multiple branches. This is a known contributor to bugs and it also makes code harder to read, understand, and change. This makes future work on the same code riskier. Linters measure and complain about cyclomatic complexity for a reason! If you’re only modifying one code path in one service, maybe this cost is negligible. But if you’re editing multiple places in your service, or even multiple services to get your test to run, consider that you’re adding the risk of a bug as well as additional maintenance costs in multiple places. You now need to keep track of all those locations you modified and you need to be sure to clean them all up once you’re done.
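To see the added branch in numbers, here is a rough sketch that counts decision points in a snippet of Python using the standard `ast` module. It is a simplified approximation of cyclomatic complexity (one plus the number of branching constructs), not a replacement for a real linter, and the `cta_copy` function and `in_variant` helper in the sample snippets are hypothetical.

```python
import ast


def cyclomatic_complexity(source: str) -> int:
    """Approximate cyclomatic complexity: 1 + number of decision points."""
    decisions = 0
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.If, ast.IfExp, ast.For, ast.While,
                             ast.ExceptHandler)):
            decisions += 1
        elif isinstance(node, ast.BoolOp):
            # "a and b and c" adds two extra paths, not one.
            decisions += len(node.values) - 1
    return 1 + decisions


# The snippets are only parsed, never executed.
WITHOUT_TEST = '''
def cta_copy(user):
    return "Sign up"
'''

WITH_TEST = '''
def cta_copy(user):
    if in_variant(user, "cta_copy_test"):
        return "Start your free trial"
    return "Sign up"
'''

print(cyclomatic_complexity(WITHOUT_TEST))
print(cyclomatic_complexity(WITH_TEST))
```

One extra branch in one function is nothing. The same branch repeated across several files and services is where the tracking and cleanup cost starts to bite.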
This additional work leads nicely into our second cost: added scope. Not only do you need to do the additional work of setting up your test, you also need to keep track of it and clean it up once you’re done, as mentioned. This small ongoing maintenance can add up and drain a team’s ability to do new feature work. You need to assess whether that opportunity cost is worth it.
Determining whether the cost is worth it comes down mainly to comparing the impact of your current work with the impact of the work you could be doing instead. If you could be doing something more impactful, then consider ways to reduce your test cost so you can move on. You could potentially release the code and monitor your metrics after the fact; sometimes this will still give you the info you need. For instance, if you’re looking at the ratio of clicks to page views on a page (i.e. the conversion rate), that ratio still gives you good information regardless of how many people arrive on the page. Maybe you know the total visitor metric is tied to external factors (like market performance), but you’re confident the conversion rate itself is fairly independent. You can still get good information in this case because you’re looking at a ratio, which lets you abstract away the total number of visitors to the page. You’re not perfectly confident in your results, but you’re confident enough, and more importantly, you’re freed up to work on the more impactful work that’s coming next.
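Here is the ratio argument as a tiny sketch with invented numbers. Traffic falls after the release (say, because of a market dip), but the conversion rate can still be compared before and after because it is normalized by page views. The function name and figures are illustrative, and the comparison only holds under the assumption stated above: that the rate itself is not driven by the same external factor as the traffic.

```python
def conversion_rate(clicks: int, page_views: int) -> float:
    """Clicks divided by page views for one period."""
    if page_views <= 0:
        raise ValueError("page_views must be positive")
    return clicks / page_views


# Made-up numbers: the week before and the week after the release.
before = conversion_rate(clicks=500, page_views=10_000)
after = conversion_rate(clicks=360, page_views=6_000)

print(f"before: {before:.1%}, after: {after:.1%}")
print("traffic fell, but the rate moved from "
      f"{before:.3f} to {after:.3f}")
```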
This approach (shipping and observing) becomes more feasible if you’re able to roll back easily. Let’s say you see your metrics decrease after you release. In a test situation, you just turn the test flag off and you’re safe. But if you’ve forgone the test flag, you need to actually release a code change. If you were careful to keep your changes contained to one or two commits, a simple git revert can often do the trick. When decisions are easy to change, the cost to make them is low, so sometimes just making the call to release a change can save time and still provide decent learning without the cost that comes with setting up and running a test.
The last cost of running an AB test comes when you’re running many tests at once. If you’re not careful about how your users are segmented, one test can end up affecting the results of another test running at the same time. Untangling the web of test dependencies can be costly before the test starts, and very costly if you’re trying to analyze the data afterward. In the worst case, you end up with unusable results in one or more tests, which makes your effort a waste. In this case, it’s better to skip one of the planned tests than to end up with no usable results.
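One way to keep concurrent tests from contaminating each other is to make them mutually exclusive: each user is eligible for at most one of the overlapping tests. The sketch below does that by hashing users into 100 slots and giving each experiment its own slice of slots. This is one approach among several, and the function names, layer name, experiment names, and traffic shares are all hypothetical.

```python
import hashlib
from typing import List, Optional, Tuple


def _bucket(key: str, buckets: int) -> int:
    """Stable hash of a string into the range [0, buckets)."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % buckets


def assign_exclusive(user_id: str,
                     experiments: List[Tuple[str, int]],
                     layer: str = "onboarding") -> Optional[Tuple[str, str]]:
    """Place a user in at most one experiment within a layer.

    `experiments` is a list of (name, percent_of_traffic). Returns
    (experiment_name, "control" | "variant"), or None when the user
    falls outside every experiment's slice.
    """
    if sum(share for _, share in experiments) > 100:
        raise ValueError("experiment shares must not exceed 100 percent")
    slot = _bucket(f"{layer}:{user_id}", 100)
    upper = 0
    for name, share in experiments:
        upper += share
        if slot < upper:
            # A second, independent hash picks the group inside the test.
            group = "variant" if _bucket(f"{name}:{user_id}", 2) else "control"
            return name, group
    return None


if __name__ == "__main__":
    running = [("cta_copy_test", 40), ("new_components_test", 40)]
    for uid in ("user-1", "user-2", "user-3"):
        print(uid, assign_exclusive(uid, running))
```

The trade-off is sample size: each test only sees its slice of traffic, so it takes longer to reach a result. That is still cheaper than two tests with results you can't trust.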
You should run an AB test if you have an unproven hypothesis you want to test, and the value of the learning outweighs the cost. You should not run a test if you are not verifying a hypothesis, or the cost of testing outweighs the value.
Costs will generally fall into three groups (complexity, scope, and test overlap), and the impact of each group of costs will vary from project to project. You should evaluate them up front and consider alternate strategies if the cost is ultimately higher than the value.