Over the past year there has been considerable buzz in the tech industry about the application and usability of AI tools and technologies. According to a survey reported by FactSet® in March 2024, approximately 30% of Fortune 500 company CEOs mentioned “AI” on their earnings calls with investors and financial analysts over the preceding year, roughly three to four times as many as the year before (Figure 1).
Because AI tools offer improvements in productivity, accuracy, and efficiency, there has also been a considerable push for their rapid adoption. Interestingly, CEOs who referenced AI in their earnings calls have generally seen better stock performance than those who did not (Figure 2). While there could certainly be some selection bias in this example (i.e., CEOs who mention AI are likely to be more tech-focused, and thus better positioned to capitalize on the AI boom, whereas those who do not may be more traditional and/or later adopters), it does show that AI is being adopted quickly across various domains.
As we might expect, AI use has been tremendously high in some technical areas, such as software development. A blog post (June 13, 2023) by Inbal Shani of GitHub, a code collaboration platform, reported that in a survey of 500 U.S. developers at companies with 1,000+ employees, approximately 92% of developers said they use AI tools regularly, either on their own and/or with their team.
One domain that has so far been slow to adopt AI is Data Science, specifically Product Data Science, which involves a great deal of product testing (also called A/B testing). In a recent survey by 365datascience.com, reported by Sophie Magnet (April 30, 2024), only about 21% of respondents listed AI skills as necessary to be a Data Scientist.
If more than 75% of Data Scientists are not using AI, can we conclude that advanced technologies have fewer useful applications in the field of data science? Not necessarily. A walk through a typical Product Data Science project will illustrate how AI can augment the skillset of a Data Scientist in each phase of a product change process, from Data Gathering through Experiment Monitoring and Monitoring the Rollout.
Before the Test: Data Gathering
Let’s first quickly understand what Data Scientists refer to as A/B testing, also known as Product Testing. A/B testing is a method of comparing two versions of a webpage, app feature, or product, to determine which one performs better in terms of a specific goal, such as click-through rates, conversion rates, or user engagement. In an A/B test, a population is randomly split into two groups: one group (the control group) is shown the original version (A), while the other group (the experimental group) is shown a modified version (B). By analyzing the performance of both versions using statistical methods, businesses can make data-driven decisions to optimize their products and user experiences, thereby improving overall effectiveness and achieving desired outcomes more efficiently.
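To make that "statistical methods" step concrete, here is a minimal sketch of one common way to compare the two versions, a two-proportion z-test using the statsmodels library; the visitor and conversion counts below are hypothetical.

```python
# A minimal sketch of the statistical comparison behind an A/B test,
# using a two-proportion z-test; the visitor and conversion counts
# below are hypothetical.
from statsmodels.stats.proportion import proportions_ztest

conversions = [480, 540]       # purchases in version A (control) and B (experimental)
visitors = [10_000, 10_000]    # users shown each version

z_stat, p_value = proportions_ztest(count=conversions, nobs=visitors)
print(f"Conversion A: {conversions[0] / visitors[0]:.2%}")
print(f"Conversion B: {conversions[1] / visitors[1]:.2%}")
print(f"z = {z_stat:.2f}, p = {p_value:.4f}")
# A small p-value (e.g., below the chosen alpha) suggests the difference
# between A and B is unlikely to be due to chance alone.
```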
The first step in setting up an A/B experiment from a Data Science perspective is understanding the metrics one is trying to improve, as well as how long it will take to improve them. For product changes, most experiments try to move some form or variation of either (or both of) the Conversion Rate, defined as the percentage of the intended audience you marketed or showed the product to who go on to purchase your service/product, or Sales/User, which is simply the total money spent per converted user. Together, these metrics capture how much impact your product changes are having on customer behavior.
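As a worked illustration of the two definitions (all numbers are hypothetical):

```python
# Worked illustration of the two headline metrics; all numbers are hypothetical.
audience_shown = 50_000    # users the product/offer was shown to
converted_users = 1_500    # users who went on to purchase
total_sales = 90_000.00    # total dollars spent by converted users

conversion_rate = converted_users / audience_shown   # purchases / intended audience
sales_per_user = total_sales / converted_users       # dollars per converted user

print(f"Conversion Rate: {conversion_rate:.1%}")   # 3.0%
print(f"Sales/User: ${sales_per_user:,.2f}")       # $60.00
```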
When planning an experiment, you must be able to see the general trend of these metrics, so you can estimate how “noisy” the data is, how consistent the trends are, and how long it might take you to run the tests. One of the ways AI can be valuable here, especially when dealing with multiple metrics, which involves lots of data pulls, is in token searches, as in this example from a tool called ThoughtSpot, which lets you simply write the required query as text and returns the data trends as charts. For example, as you enter the text “Daily Sales last 3 weeks,” the tool tokenizes each word and produces a related result (Figure 3 and Figure 4).
If the daily data is too noisy, meaning it fluctuates too much, you can adjust the search to “Weekly sales last 2 months” and immediately get the relevant chart (Figure 4).
This data, when combined with a confidence level and Minimal Detectable Effects (MDEs), can give you runtime estimates for the experiment. Figure 5 illustrates an example from an AI-assisted notebook tool called Hex, which shows runtimes at various MDEs for both Conversion and Sales/User for an A/B test at an alpha of 0.1.
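A rough sense of where such runtime estimates come from can be gained with a standard power calculation. The sketch below uses statsmodels; the baseline conversion rate, daily traffic, power level, and MDE grid are hypothetical assumptions, with alpha matching the 0.1 used in Figure 5.

```python
# Sketch of how runtime estimates at various MDEs can be derived from a
# standard power calculation; the baseline conversion rate, daily traffic,
# and power level are hypothetical assumptions.
import numpy as np
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline_cr = 0.03             # current conversion rate
daily_users_per_arm = 5_000    # traffic available to each variant per day
alpha, power = 0.10, 0.80      # alpha matches the 0.1 used in Figure 5

solver = NormalIndPower()
for mde in (0.05, 0.10, 0.20):                        # relative lifts to detect
    target_cr = baseline_cr * (1 + mde)
    effect = abs(proportion_effectsize(baseline_cr, target_cr))
    n_per_arm = solver.solve_power(effect_size=effect, alpha=alpha,
                                   power=power, alternative='two-sided')
    days = np.ceil(n_per_arm / daily_users_per_arm)
    print(f"MDE {mde:.0%}: ~{n_per_arm:,.0f} users per arm (~{days:.0f} days)")
```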
This AI use case saves hours of manual time otherwise spent on data pulls and requirements gathering, which involve writing lengthy SQL queries, and it also allows stakeholders to self-serve efficiently, since with AI no coding skills are needed to access the information.
During the Test: Experiment Monitoring
AI can also be helpful when a test is actually running. A common use case in A/B testing occurs when the experiment is still live with results that are not yet statistically significant, but the business wants to decide whether to go ahead and launch the product change without delay. In such situations, AI expedites the process with an advanced A/B testing technique called the Multi-Armed Bandit (MAB).
Imagine sitting in a casino with multiple slot machines in front of you, with no information about which machine is the best (or luckiest) for you, and your objective is to maximize your earnings. This means trading off the time you spend playing the machine that currently looks best against enough exploration of the other machines to learn their expected payoffs.
Similarly, in A/B testing, you want to find the best-performing variant and send it the maximum traffic, but you still need to spend enough time (or site traffic) on the other variants to confirm that they are not the best. Doing this manually entails monitoring all variants for a few days, allocating a higher percentage of traffic to the better-performing variants (and less to the worse ones), and repeating this exercise until you are sufficiently confident in one variant, at which point you send all traffic to it.
An AI-assisted experimentation tool can monitor in real time which variant is doing best, along with its likelihood of actually being the best. As a variant’s likelihood of being best increases, so does its traffic allocation, and as soon as that likelihood crosses a threshold, all traffic is sent to that variant. Using AI for multi-armed bandit tests can reduce test duration by approximately 40% to 50%, so it is becoming increasingly common practice for small firms that have limited traffic and need to move quickly. In Figure 6, all four variants start at 25% of traffic each, but future allocation varies according to their performance. Taking color as an indicator of performance (green best, red worst), we can see how the allocation has changed for each variant, with the maximum traffic going to the best-performing variant, thereby maximizing our return.
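Under the hood, many such tools use a Bayesian allocation strategy such as Thompson sampling. The sketch below shows the basic loop: estimate each variant’s probability of being best from the data observed so far, shift traffic toward the likely winners, and stop once one variant crosses a confidence threshold. The conversion rates, traffic volume, and 95% threshold are hypothetical assumptions, not taken from any specific product.

```python
# Minimal sketch of a Thompson-sampling bandit loop of the kind such tools
# automate; the conversion rates, traffic volume, and 95% stopping threshold
# are hypothetical assumptions.
import numpy as np

rng = np.random.default_rng(42)
true_rates = np.array([0.030, 0.032, 0.028, 0.036])   # unknown in practice
successes = np.zeros(4)
failures = np.zeros(4)

for day in range(60):
    # Sample each variant's conversion rate from its Beta posterior and
    # estimate the probability that each variant is the best one.
    draws = rng.beta(successes + 1, failures + 1, size=(10_000, 4))
    p_best = np.bincount(draws.argmax(axis=1), minlength=4) / 10_000

    # Allocate today's traffic in proportion to "probability of being best".
    traffic = (p_best * 20_000).astype(int)
    converted = rng.binomial(traffic, true_rates)       # simulated outcomes
    successes += converted
    failures += traffic - converted

    if p_best.max() > 0.95:                             # confidence threshold
        print(f"Day {day + 1}: send all traffic to variant {p_best.argmax()}")
        break
```

Note how the loop mirrors Figure 6: every variant starts at roughly 25% of traffic, and allocation then drifts toward whichever variant the posterior favors.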
Figure 7 illustrates an example of the revenue impact of MAB versus traditional A/B testing methods. Because traffic mostly goes to the best-performing variants, the revenue impact is maximized.
After the Test: Monitoring the Rollout
In the third phase, imagine that you ran a successful test and rolled out the winning variant as a product update to all your users. The work of a product data scientist isn’t finished yet, as one still has to monitor the long-term impact of the product change. It is one thing to measure changes to a metric in a controlled experiment; it is another to assess them once the change is live for all of your users. A simple, albeit less accurate, approach is to monitor the relevant metrics over time and see how they are trending; a more scientific and reliable way is to measure the causality, that is, the causal impact of the change on the metric.
AI tools can assist in this monitoring, as in most cases they have inbuilt libraries containing modelling packages, such as Prophet (developed by Meta) and CausalImpact (developed by Google), and can intelligently measure how much a variable causally contributes to a change in the Conversion Rate or Sales/User metric. The variable in Figure 8 is the rollout of a product change, which can be encoded as the percentage of the audience exposed to that change.
In Figure 8 we can see the output from an AI-assisted notebook tool called Hex, which can run such analyses easily with inbuilt Python and R libraries.
As we see in these illustrations, the first chart in Figure 8 plots a counterfactual of our metric (Sales per User in our example), showing how it would have trended had the product change not happened. The second and third charts show the individual and overall impacts of the change. This visually depicts an increase of approximately $15 in Sales per User due to our product change. Figure 9 presents the statistics, highlighting the likelihood that the change in the Sales/User metric resulted from the intervention (the product rollout). In this case we see that it is quite significant (99.9%).
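For readers who want to reproduce this kind of analysis outside of Hex, the open-source Python port of Google’s CausalImpact package supports it directly. The sketch below uses simulated data; the column names, dates, and rollout lift are hypothetical, and the exact API can differ slightly between ports of the package.

```python
# Sketch of the same kind of causal-impact analysis using the open-source
# Python port of CausalImpact (pip install pycausalimpact); the data here is
# simulated, and column names, dates, and the lift size are hypothetical.
import numpy as np
import pandas as pd
from causalimpact import CausalImpact

rng = np.random.default_rng(0)
dates = pd.date_range("2024-01-01", periods=120, freq="D")
control_metric = 60 + rng.normal(0, 2, 120)                  # unaffected covariate
sales_per_user = 0.9 * control_metric + rng.normal(0, 1, 120)
sales_per_user[90:] += 15                                    # lift after rollout

data = pd.DataFrame({"sales_per_user": sales_per_user,       # response column first
                     "control_metric": control_metric}, index=dates)

pre_period = ["2024-01-01", "2024-03-30"]    # before the product rollout
post_period = ["2024-03-31", "2024-04-29"]   # after the product rollout

ci = CausalImpact(data, pre_period, post_period)
print(ci.summary())   # estimated effect and significance (output varies by port)
ci.plot()             # counterfactual, pointwise, and cumulative charts, as in Figure 8
```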
The case for AI in Data Science
These three use case examples demonstrate how AI tools can be valuable in helping data scientists run A/B tests and roll out product changes: before the test starts, during the test, and after the test has concluded. In all the examples we see evidence of AI’s advantages: because it can understand plain English and often does not require complicated coding, it saves time, learning from repeated tasks and understanding what the user wants. Additionally, AI’s interoperability with statistics makes it well suited to most experimentation tools, so it can help in calling tests and monitoring their impact. Beyond the examples discussed here, there are many other use cases for AI in A/B testing, such as outlier detection and removal, spillover and contamination detection, lookalike modelling, and more. Data scientists will find that there are worthwhile applications for AI in the field, and that implementing innovative AI tools in daily experimentation work is a proven pathway to improving efficiency and accuracy.
About the Author
Abhishek Chaurasiya is a Senior Data Scientist with more than a decade of experience at Fortune 500 companies and successful start-ups, including eBay, Amazon, and DoorDash. With expertise in marketing science, product building, and consumer analytics, he is responsible for driving business growth with innovative product improvements, and stress testing product changes to analyze impacts, providing evidence for current and future product development.