
Stop Trying to Quantify the Dollar Value of A/B Testing. The Only Thing That Matters is Lessons-Learned.

Eric O'Neil
I recently wrapped up an A/B test on a mobile application that brings in $6B+ in annual revenue. It was a failure. Our hypothesis was wrong, and we ended up funneling real-world consumers to a modified user experience that led them to spend less money than they otherwise would have. Whoops.

The difference was marginal -- less than a dollar per user -- but multiply that over millions of site visits, and it adds up quickly. The good news is that, when it comes to A/B testing, there are no failures. In this case, learning what consumers don't want is just as valuable as learning what they do want. Here's what my team and I learned:
  • We were able to validate our current user experience, and confirm past experiments that had led us to that point
  • We were able to confidently reject our hypothesis, allowing us to spend our resources elsewhere
  • We were able to use the above to design our next experiments
  • We got one step closer to optimizing the app's landing page
Perhaps unsurprisingly, given the current economic climate, with budgets tightening and spending under a microscope, I was asked to quantify the cost -- and/or revenue -- of our team's ongoing A/B testing. And that's fine, and relatively easy to do. Because our B variant was a money loser, the "cost" to learn the aforementioned lessons was in the six figures.
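For context, the back-of-the-envelope math is simple. The numbers below are purely hypothetical (the real figures stay internal), but they show how a sub-dollar per-user difference compounds into a six-figure cost:

```python
# Illustrative only -- these numbers are hypothetical, not the actual test results.
revenue_delta_per_user = -0.40   # variant B earned ~$0.40 less per user (assumed)
users_in_variant_b = 400_000     # visitors routed to variant B during the test (assumed)

estimated_cost = abs(revenue_delta_per_user) * users_in_variant_b
print(f"Estimated cost of the experiment: ${estimated_cost:,.0f}")  # -> $160,000
```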

But that misses the point. The main benefit of quantifying the dollar value of an experimentation campaign is being able to justify that campaign to non-technical management. The real benefit -- much harder to quantify, though it can be done -- is in the lessons learned, and the decisions made based on those lessons. Military jargon alert: A/B testing, and the decisions made from its outcomes, is your classic intel-operations cycle. Translated into pop-economic terms: incremental updates to your product, driven by A/B testing, create a flywheel effect that grows your business.

What you should be focused on is building a repository of past, present, and future (planned) experiments, tied to the lessons learned and the decisions made from each.
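What that repository looks like matters less than the discipline of keeping it. As a rough sketch, assuming one simple Python record per experiment (the field names and example entry are my own, not a standard), it might look like:

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional

@dataclass
class Experiment:
    """One entry in the experiment repository."""
    name: str
    hypothesis: str
    status: str                      # "planned", "running", or "completed"
    start_date: Optional[date] = None
    north_star_metric: str = ""
    result: str = ""                 # e.g. "hypothesis rejected at 95% confidence"
    lessons_learned: list[str] = field(default_factory=list)
    decisions_made: list[str] = field(default_factory=list)
    follow_up_experiments: list[str] = field(default_factory=list)

# A hypothetical entry, loosely modeled on the test described above.
repository: list[Experiment] = [
    Experiment(
        name="landing-page-variant-b",
        hypothesis="A simplified landing page increases revenue per user",
        status="completed",
        north_star_metric="revenue per user",
        result="hypothesis rejected at 95% confidence",
        lessons_learned=["Current layout outperforms the simplified variant"],
        decisions_made=["Keep current landing page", "Redirect resources to other tests"],
        follow_up_experiments=["checkout-flow-big-swing"],
    ),
]
```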

Because there's no finish line when it comes to optimizing your business and products, experimentation never ends. Instead, you should be designing tests along two parallel tracks:
  1. "Bite-Size" changes: Here is where you're going to create smaller, incremental tweaks to your user experience. Your goal here is to find the "optimal" experience, as defined by your North Star Metric. Because technology evolves and human beings get tired, you tend to see a decay to any lift you get from your A/B testing -- part of why you should be experimenting in the first place.
  2. "Big Swings" changes: Concurrently with your bite-sized campaign, you should be funneling about 5% of your user base to radical re-design of your user experience. This is a low-risk way to potentially uncover big lifts, big insights, and position you for when it's time for a big upgrade. Big swings are also the preferred experimentation technique for products with lower traffic volumes, where it's harder to reach the desired confidence interval (see postscript) with small tweaks. Here's a great example of how Groove took a big swing and doubled their conversion rate.
One interpretation of my "failed" test is that we accidentally siphoned consumers into an experience that cost our company money. The correct interpretation is that the company paid for insights into its users that it couldn't get anywhere else -- insights that allow us to keep innovating, and that will continue to drive engagement and revenue well into the future.

So, while it's not wrong to quantify the dollar value of your A/B testing, understand that it's only a sideshow. The real value is in driving that flywheel, fueling your decision-making with data, and then using the decisions you've made to design the next test. And how you capture that is what really matters.
Postscript: "Failure" A/B tests exist -- when they're poorly designed and lead to data that falls below the 95% confidence interval (your undergrad or MBA Statistics class coming back to haunt you). In that case, because there is greater than a 1/20 likelihood you arrived at the final results by chance, you cannot draw any conclusions from the results. Some companies are willing to accept a confidence level as low as 80%, but the industry standard is 95%, with a t-value of 1.96.