AB Testing refers to a randomized controlled experiment designed to understand how system variants affect metrics. Most major websites today run hundreds or even thousands of AB tests simultaneously, as different product groups seek to optimize for different metrics. In this article, I’ll walk you through what AB testing is and how we can use it in Machine Learning.
How does AB Testing work?
The general process for AB testing is to randomly divide the user population into two groups, A and B, and show each group a different variant of the system under test (for example, a spam classifier). Evaluating the experiment involves collecting data on the chosen metric in each group and performing a statistical test to determine whether the difference in the metric between the two groups is statistically significant.
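As a rough illustration of the evaluation step, here is a minimal sketch of a two-sided two-proportion z-test, which is one common choice when the metric is a rate. The function name and the counts in the usage example are hypothetical and are not taken from any real experiment.

```python
import math


def two_proportion_z_test(events_a, n_a, events_b, n_b):
    """Two-sided z-test for the difference between two observed rates."""
    rate_a = events_a / n_a
    rate_b = events_b / n_b
    # Pooled rate under the null hypothesis that both groups share one rate.
    pooled = (events_a + events_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        raise ValueError("no variation in the data; the test is undefined")
    z = (rate_a - rate_b) / std_err
    # Two-sided p-value under the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: spam emails that reached the inbox in each group.
z, p_value = two_proportion_z_test(120, 10_000, 160, 10_000)
print(f"z = {z:.3f}, p-value = {p_value:.4f}")
```

A small p-value (for example, below a threshold such as 0.05 chosen before the experiment) suggests that the difference between the groups is unlikely to be due to chance alone.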
One of the main challenges of AB testing is determining how much traffic to route through the new system A and how much to route through the old system B. This problem is a variant of the multi-armed bandit problem from probability theory, whose solution must strike a balance between exploration and exploitation.
We want to be able to learn as much as possible about the new system by routing more traffic to it, but we don’t want to risk an overall degradation of the metrics, as System A might perform worse than existing System B.
One algorithm that addresses this problem is Thompson sampling, which routes to each variant a share of traffic proportional to the probability that it is the best-performing variant, based on the data collected so far. Contextual multi-armed bandits take this approach a step further and also incorporate external environmental factors into the decision-making process.
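To make the idea concrete, here is a minimal sketch of Thompson sampling for a binary outcome, assuming a Beta(1, 1) prior for each variant. The function names, the simulated "true" success rates, and the definition of a success (any desirable outcome for a request) are assumptions made for illustration only.

```python
import random


def choose_variant(stats, rng):
    """Sample each variant's Beta posterior and route to the largest draw."""
    draws = {
        name: rng.betavariate(1 + s["successes"], 1 + s["failures"])
        for name, s in stats.items()
    }
    return max(draws, key=draws.get)


def simulate(true_rates, n_requests, seed=0):
    """Route simulated requests with Thompson sampling; return the counts."""
    rng = random.Random(seed)
    stats = {name: {"successes": 0, "failures": 0} for name in true_rates}
    for _ in range(n_requests):
        name = choose_variant(stats, rng)
        # Simulated outcome; a real system would observe it instead.
        if rng.random() < true_rates[name]:
            stats[name]["successes"] += 1
        else:
            stats[name]["failures"] += 1
    return stats


# Hypothetical success rates for the new system A and the old system B.
stats = simulate({"A": 0.30, "B": 0.10}, n_requests=5000)
for name, s in stats.items():
    print(name, "received", s["successes"] + s["failures"], "requests")
```

Because each request goes to the variant with the largest posterior draw, a variant receives traffic about as often as it is believed to be the best one, so the weaker variant is still explored occasionally while most traffic shifts to the stronger one.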
AB Testing in Machine Learning
In the context of machine learning systems, you should always validate and compare new generations of models with existing production models via AB testing. Every time you run such a test, there must be a well-defined metric that the test seeks to optimize.
For example, such a metric for a spam classifier AB test might be the number of spam emails that end up in users’ inboxes; you can measure this metric through user feedback or sampling and labelling.
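If the metric is measured by sampling and labelling, the result is an estimate with some uncertainty. The sketch below computes a Wilson score interval for the spam rate in a labelled sample, which is one reasonable way to express that uncertainty; the function name and the sample counts are hypothetical.

```python
import math


def wilson_interval(spam_count, sample_size, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95%)."""
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    rate = spam_count / sample_size
    denom = 1 + z**2 / sample_size
    center = (rate + z**2 / (2 * sample_size)) / denom
    half_width = (
        z
        * math.sqrt(
            rate * (1 - rate) / sample_size + z**2 / (4 * sample_size**2)
        )
        / denom
    )
    return center - half_width, center + half_width


# Hypothetical sample: 1,000 delivered emails labelled by hand, 12 were spam.
low, high = wilson_interval(12, 1000)
print(f"estimated spam rate: {12 / 1000:.3f} (interval {low:.4f} to {high:.4f})")
```

A wide interval is a sign that more emails need to be labelled before the metric can be compared reliably between the two groups.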
AB testing is essential for machine learning systems because evolutionary updates to long-running models may not give you the best results you can get. Being able to experiment with new models and empirically determine what performs best gives machine learning systems the flexibility to adapt to the changing landscape of data and algorithms.
However, you should be careful when running AB tests in adversarial environments. The statistical theory behind AB testing assumes that the underlying input distribution is the same in segments A and B. However, devoting even a small fraction of traffic to a new model may cause an adversary to change their behaviour.
In this case, the AB test assumption is violated and your statistics will be meaningless. Additionally, even if the adversary's traffic is split between the segments, the fact that some of that traffic is now being treated differently can cause the adversary to change their behaviour or even disappear, and the metric you care about may not show a statistically significant difference in the AB test even if the new model is effective.
Likewise, if you start blocking 50% of the bad traffic with the new model, the adversary might simply double their request rate, so the same volume of bad traffic gets through and your great model will not change your overall stats.
I hope you liked this article on AB Testing in Machine Learning. Feel free to ask your questions in the comments section below.