{"id":1064,"date":"2025-01-09T07:04:04","date_gmt":"2025-01-09T07:04:04","guid":{"rendered":"https:\/\/mailitics.com\/index.php\/2025\/01\/09\/tldr-bayesian-a-b-testing-falls-short-f8646529a47a\/"},"modified":"2025-01-09T07:04:04","modified_gmt":"2025-01-09T07:04:04","slug":"tldr-bayesian-a-b-testing-falls-short-f8646529a47a","status":"publish","type":"post","link":"https:\/\/mailitics.com\/index.php\/2025\/01\/09\/tldr-bayesian-a-b-testing-falls-short-f8646529a47a\/","title":{"rendered":"Bayesian A\/B Testing Falls Short"},"content":{"rendered":"<p>    Bayesian A\/B Testing Falls Short<br \/>\n \t<BR><br \/>\n<BR><\/BR><br \/>\n    <!-- no image --><br \/>\n \t<BR><br \/>\n<BR><\/BR><\/p>\n<div>\n<h4>Why Bayesian A\/B testing can lead to misunderstandings, inflated false positive rates, introduce bias and complicate results<\/h4>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2AjztD1Hucvx1SNKGECZekXA.png?ssl=1\"><figcaption>(Image generated by the author using Midjourney)<\/figcaption><\/figure>\n<p>Over the past decade, I\u2019ve engaged in countless discussions about Bayesian A\/B testing versus Frequentist A\/B testing. In nearly every conversation, I\u2019ve maintained the same viewpoint: there\u2019s a significant disconnect between the industry\u2019s enthusiasm for Bayesian testing and its actual contribution, validity, and effectiveness. While the hype around Bayesian testing may have peaked, it remains widely\u00a0popular.<\/p>\n<p>My first exposure to Bayesian statistics was during my master\u2019s studies, where my thesis focused on Thompson Sampling. Professionally, I encountered Bayesian A\/B testing during my tenure at Wix.com, where I played a key role in transitioning from the classical method to the Bayesian method. 
My perspective, as described here, has been informed by both my academic background and my professional experience at Wix and beyond, where I\u2019ve helped many companies enhance their A\/B testing capabilities.<\/p>\n<p>When referring to \u201cBayesian A\/B testing\u201d, I\u2019m specifically talking about the methods promoted by <a href=\"https:\/\/vwo.com\/downloads\/VWO_SmartStats_technical_whitepaper.pdf\">VWO<\/a> and similar approaches used in some current experimentation platforms as alternatives to the classic (Frequentist) method. There are other implementations of Bayesian statistics in A\/B testing, such as Thompson sampling in Multi-armed-bandit experiments, which can be highly effective but are rare outside marketing platforms like Google Ads and Facebook\u00a0Ads.<\/p>\n<p>In this post, I\u2019ll explain what Bayesian tests entail, outline the most common arguments in favor of Bayesian tests, and address each argument. I\u2019ll then discuss the major drawbacks of the Bayesian method and, finally, cover when to use Bayesian methods in experiments.<\/p>\n<p>So grab a cup of coffee, and let\u2019s dive\u00a0in.<\/p>\n<p><strong>What Do Bayesian Tests\u00a0Mean?<\/strong><\/p>\n<p>Bayesian statistics and Frequentist statistics differ fundamentally. Bayesian statistics incorporates prior knowledge or beliefs, updating this prior information with new data to produce a posterior distribution. This allows for a dynamic and iterative process of probability assessment. In contrast, Frequentist statistics relies solely on the data at hand, using long-run frequency properties to make inferences without incorporating prior beliefs. 
Frequentist statistics focuses on the likelihood of observing the data given a null hypothesis and uses concepts like p-values and confidence intervals to make decisions.<\/p>\n<p>In Bayesian A\/B testing, we design the test so that after a short time, based on the data gathered so far, we can calculate the probability that the treatment variant (B) is better than the control variant (A), denoted P(B&gt;A| Data). Another metric is risk, or expected loss, which quantifies how much we stand to lose, on average, if the decision we make based on the data collected turns out to be wrong.<\/p>\n<p>Bayesian A\/B testing typically involves running a test, computing P(B&gt;A|Data) and\/or the expected loss (Risk), and making a decision based on these metrics. The decision can be arbitrary or involve a stopping rule, such\u00a0as:<\/p>\n<ol>\n<li>The probability that B is better than A is larger than X%. For example: P(B&gt;A| Data) &gt;\u00a095%<\/li>\n<li>The expected loss (Risk) is less than Y%. For example: expected loss &lt;\u00a01%<\/li>\n<\/ol>\n<p><strong>Arguments for Bayesian\u00a0Tests<\/strong><\/p>\n<p>Throughout my career, I\u2019ve encountered three common arguments in favor of Bayesian\u00a0tests:<\/p>\n<ol>\n<li>The early stopping argument\u200a\u2014\u200athe ability to stop the experiment whenever you want (or based on a stopping rule), unlike the classic t-test \/ z-test, which requires planning your sample size and analyzing the results only once the predefined sample size is reached. This is useful when the sample size is small, or when the effect is very large and you would like to stop the test based on the\u00a0results.<\/li>\n<li>The prior argument\u200a\u2014\u200athe use of prior knowledge or business knowledge to enrich the data and make better decisions.<\/li>\n<li>The language and terminology argument\u200a\u2014\u200aBayesian metrics are more intuitive and better suited to everyday business language than Frequentist metrics like the p-value. 
Thus, \u201cthe probability that B is better than A\u201d is far more intuitive and easier to understand than \u201cthe probability of obtaining test results at least as extreme as the result actually observed, under the assumption that the null hypothesis is true\u201d\u200a\u2014\u200awhich is the <a href=\"https:\/\/en.wikipedia.org\/wiki\/P-value\">p-value definition<\/a>.<\/li>\n<\/ol>\n<p>Let\u2019s tackle each argument one by\u00a0one.<\/p>\n<p><strong>You Can Stop Whenever You\u00a0Want<\/strong><\/p>\n<p>In the online industry, data is collected automatically and often displayed in real-time dashboards that include various statistical metrics. Simple classical tests, like the t-test and z-test, do not permit peeking at the results: they require a predefined sample size and allow analysis only once that sample size is\u00a0reached.<\/p>\n<p>Anyone who has ever run an A\/B test knows that this is not practical. The easy accessibility of information makes it hard to ignore, especially when a product manager notices significant results, whether positive or negative, and insists on stopping the experiment to move on to the next task. This highlights the clear need for a method that allows peeking at the data and stopping early. Thus, the argument for early stopping is perhaps the strongest for Bayesian A\/B tests\u200a\u2014\u200aif only it were\u00a0true.<\/p>\n<p>Bayesian statistics, when considered superficially as \u201csubjective understanding incorporating prior beliefs to the data,\u201d allows stopping at any time. However, if you expect guarantees like \u201ccontrolling the false positive rate\u201d (as in the Frequentist approach), this is problematic.<\/p>\n<p>Bayesian A\/B testing is not inherently immune to the pitfalls of peeking at the data. 
For those looking for a good statistical explanation, please take a look at <a href=\"https:\/\/blog.analytics-toolkit.com\/2017\/bayesian-ab-testing-not-immune-to-optional-stopping-issues\/\">Georgi Georgiev\u2019s excellent blog post<\/a>. For now, let\u2019s address Georgi\u2019s point, but from a different perspective:<\/p>\n<p>In the case of two variants, control and treatment, and when the number of users is large enough that the prior carries little weight, the one-tailed p-value is almost identical to the Bayesian probability that the control is better than the treatment, denoted P(A&gt;B| Data) = 1 - P(B&gt;A| Data). In an A\/B test, a low one-tailed p-value and a low P(A&gt;B| Data) (which is equivalent to a high P(B&gt;A| Data)) both indicate that the treatment is better than the control. Because these two measures are almost identical, early stopping based on P(B&gt;A | Data) is technically equivalent to early stopping based on the p-value, and therefore fails to maintain the type I error rate (false positive\u00a0rate).<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2AzksjGf1_1fgXG-bGj6UX4g.png?ssl=1\"><figcaption>Calculations: <a href=\"https:\/\/marketing.dynamicyield.com\/bayesian-calculator\/\">https:\/\/marketing.dynamicyield.com\/bayesian-calculator\/<\/a> and <a href=\"https:\/\/www.socscistatistics.com\/tests\/ztest\/default2.aspx\">https:\/\/www.socscistatistics.com\/tests\/ztest\/default2.aspx<\/a><\/figcaption><\/figure>\n<p>Although the Bayesian method does not commit to maintaining the false positive rate (aka type I error), practitioners would likely not want to see false \u201csignificant\u201d results frequently. 
The notion of \u201cstop whenever you want\u201d is usually interpreted by practitioners as \u201cwe\u2019re safe to draw valid conclusions at any point because we\u2019re doing Bayesian analysis\u201d rather than \u201cwe\u2019re safe to draw conclusions at any point because Bayesian A\/B testing makes no promise to control anything like a false positive rate\u201d. We now understand that Bayesian A\/B testing, in the popular way it is practiced, means the\u00a0latter.<\/p>\n<p>Sequential testing in the Frequentist approach, on the other hand, allows for peeking and early stopping while maintaining control over the false positive rate. Various frameworks, such as Group Sequential Testing (GST) and the Sequential Probability Ratio Test (SPRT), enable this and are widely implemented in experimentation platforms like Optimizely, Statsig, Eppo, and A\/B\u00a0Smartly.<\/p>\n<p>In summary, neither the Frequentist nor the Bayesian method is inherently immune to the issues of peeking, but Frequentist sequential testing frameworks allow peeking and early stopping without inflating the false positive\u00a0rate.<\/p>\n<p><strong>Use of\u00a0Prior<\/strong><\/p>\n<p>The second argument in favor of Bayesian A\/B testing is the use of prior knowledge. Across the web and in conversations with practitioners, I\u2019ve encountered claims about priors such as \u201cUsing prior allows you to incorporate existing and relevant business knowledge into the experiment and thereby improve performance\u201d. These statements sound very appealing because they play on a sound intuition: using additional data is usually better. The more, the merrier. 
But anyone with a basic understanding of how priors work in Bayesian probability will see that using priors in A\/B testing is risky at best and can lead to incorrect results.<\/p>\n<p>The basic idea in Bayesian statistics is to combine any prior knowledge we have, known as the prior, with the data to produce posterior distributions\u200a\u2014\u200aknowledge that combines our prior knowledge with the data. Seemingly, there is something here that does not exist in the classical method. We are not just using the data; we are also adding more knowledge and business information that exists in our organization!<\/p>\n<p>When comparing two proportions, the meaning of the prior is actually very simple. It simply adds a virtual number of successes and a virtual number of users to the data. Suppose we ran such a test and observed 100 conversions out of 1000 users in the control group.<\/p>\n<p>Assuming my prior is \u201c10 successes out of 100 users\u201d, my posterior knowledge is the sum of the successes and users of the prior and of the data. In our example: 110 \u201cconversions\u201d out of 1100 \u201cusers\u201d. This is not the exact statistical definition, but it captures the idea very\u00a0well.<\/p>\n<p>A prior can be weak (1 success out of 10 users) or strong (1000 successes out of 10000 users, for example). Both express the belief that the conversion rate is 10%. In any case, as we accumulate more data, the prior\u2019s weight naturally decreases.<\/p>\n<p>How should we incorporate prior knowledge in a two-proportion A\/B test? There are two\u00a0options:<\/p>\n<ol>\n<li>We incorporate, based on historical data, the general conversion rate in the population and add it to each variant. 
This is common practice.<\/li>\n<li>We incorporate, based on historical data, which variant, control or treatment, usually shows better results and give that variant an advantage based on this knowledge.<\/li>\n<\/ol>\n<p>How will the prior manifest in the first option? Let\u2019s stick to the example of 1000 users in each variant, with 100 conversions in the control variant and 120 conversions in the treatment variant.<\/p>\n<p>Suppose we know that the conversion rate (CVR) is 10%, so an appropriate prior could be to add 100 successes and 1000 users to the existing data and then perform a statistical test as if we had 2000 users in each group, 200 conversions in control and 220 conversions in treatment. What\u2019s described here is exactly what happens. It is not an approximation or an analogy; it is the technical meaning of the prior in a two-proportion Bayesian test (assuming a beta prior, for the statisticians reading this article).<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2Av93XwrHC3uiJIeNqpbem0A.png?ssl=1\"><\/figure>\n<p>A simple calculation shows that using a stronger prior in our example will increase P(A&gt;B| Data), which means weaker evidence of a difference between the variants than under the weak prior. That is what happens when you add the same number of successes and users to each variant. This practice goes against our motivation to stop as early as possible, so why on earth would we want to do such a\u00a0thing?<\/p>\n<p>A common argument is that the Bayesian method is very liberal in choosing a winner, and the priors are a restraining factor. That\u2019s true: the Bayesian method as I have presented it is very liberal, and priors are a restraining factor. 
So why not choose a more conservative approach (hmmm hmmm Frequentist) to begin\u00a0with?<\/p>\n<p>Moreover, if that is the argument, then the glorified claim that priors \u201cadd business information to the experiment\u201d is clearly misleading. If the business information is just a restraining factor, then the idea of using a strong prior does not seem appealing at\u00a0all.<\/p>\n<p>The second option for incorporating a prior, giving one version an advantage over the other version based on historical data, is even worse. Why would anyone want to do this? Why should one experiment be influenced by the successes or failures of previous experiments? Each experiment should be a clean slate, a new opportunity to try something new without bias. Adding 200 successes to one version and 100 to the other sounds absurd however you look at\u00a0it.<\/p>\n<p><strong>Language and Terminology<\/strong><\/p>\n<p>The third argument in favor of Bayesian A\/B testing is the more intuitive language and terminology. A\/B testing results are often consumed by people without strong statistical backgrounds. Frequentist metrics like p-values and confidence intervals are unintuitive and often misunderstood, and many articles have documented this, even among people with a background in statistics. I admit that it was only a considerable time after my master\u2019s degree in statistics that I understood the exact definition of a classical confidence interval (CI). There is no doubt that this is a real pain point and an important one.<\/p>\n<p>If you ask someone without a background in statistics to compare two versions with partial performance data for each version and ask them to formulate a question, they are likely to ask, \u201cWhat is the probability that this version is better than the other version?\u201d The same is true for confidence intervals. 
Most likely, when you explain the definition of a Frequentist confidence interval to someone, they will understand it in a Bayesian\u00a0way.<\/p>\n<p>This argument is actually true. I agree that Bayesian statistical metrics are much more intuitive to the common practitioner, and I agree that statistical language should be as simple and well understood as possible, since A\/B tests are mostly conducted and consumed by non-statisticians. However, I don\u2019t think it\u2019s a disaster that practitioners don\u2019t fully understand the statistical terms and results. Most of them think in terms of \u201cwinning\u201d and \u201closing\u201d, and that\u2019s\u00a0okay.<\/p>\n<p>I recall, when I was at Wix, showing our new Bayesian A\/B testing dashboard to a product manager as part of a usability test, to learn how he read it and what he understood. His approach was very simple: he searched for \u201cgreen\u201d and \u201cred\u201d KPIs and ignored the \u201cgray\u201d ones. He didn\u2019t really care whether he was looking at a p-value or the probability that B is better than A, a confidence interval or a credible interval. I bet that even if he had known, it would rarely have changed his decision about the\u00a0test.<\/p>\n<p><strong>Major Drawbacks of the Bayesian\u00a0Method<\/strong><\/p>\n<p>So far, we have discussed the alleged advantages of using the popular Bayesian method for A\/B testing and why some of them are not correct or meaningful enough. There are also very considerable disadvantages to using the Bayesian\u00a0method:<\/p>\n<ol>\n<li>The lack of a maximum sample\u00a0size<\/li>\n<li>The lack of guidelines or a framework for making a decision when the results are inconclusive.<\/li>\n<\/ol>\n<p>These drawbacks matter, especially since most experiments do not show a significant effect.<\/p>\n<p>Let\u2019s assume we run an experiment which does not affect the KPI we are interested in at all. 
In most cases, the data will be inconclusive, and we will not be sure what to do next. Should we continue the experiment and collect more data? Or go with the more probable variant even if the results are not conclusive?<\/p>\n<p>One can argue that a predefined sample size is a limiting factor, but it also provides an important framework for decision-making. We decide upon a sample size, and we know that we will be able, with high probability (known as statistical power), to detect a predefined effect size. If we are smart enough, we will use a sequential testing method that will allow us to stop before we reach the maximum predefined sample\u00a0size.<\/p>\n<p>It is true that when using one of the Bayesian stopping rules mentioned before, the test will eventually end even if there is no effect. For example, the risk will gradually, and slowly, decrease and will eventually reach the predefined threshold. The problem is that this will take a very long time when there is no difference between the variants. So long, in fact, that practitioners likely won\u2019t have the patience to wait. They will stop the experiment once they feel there is no point in continuing.<\/p>\n<p><strong>When to Use Bayesian Methods in Experiments<\/strong><\/p>\n<p>In Multi-Armed Bandit (MAB) experiments, Bayesian statistics flourish and are considered best practice. In these types of experiments, there are usually several variants (for example, several ad creatives) and we want to quickly decide which ads are performing the best. When the experiment begins, users are allocated equally to all variants, but after some data is gathered, the allocation changes and more users are allocated to the better performing variant (ad). 
Eventually, (almost) all users are allocated to the best performing variant\u00a0(ad).<\/p>\n<p>I also came across an interesting Bayesian A\/B testing framework in an article published by <a href=\"https:\/\/arxiv.org\/abs\/1602.05549\">Microsoft<\/a>, but I have never encountered an organization using the suggested methodology, and it still lacks a maximum sample size, which should be very important to practitioners.<\/p>\n<p><strong>Conclusion<\/strong><\/p>\n<p>While Bayesian A\/B testing offers a more intuitive framework and the ability to incorporate prior knowledge, it falls short in critical areas. The promises of early stopping and better decision-making are not inherently guaranteed by Bayesian methods and can lead to misunderstandings and inflated false positive rates if not carefully managed. Additionally, the use of priors can introduce bias and complicate results rather than clarify them. The Frequentist approach, with its structured methodology and sequential testing options, provides more reliable and transparent results, especially in environments where rigorous decision-making is essential.<\/p>\n<hr>\n<p><a href=\"https:\/\/towardsdatascience.com\/tldr-bayesian-a-b-testing-falls-short-f8646529a47a\">Bayesian A\/B Testing Falls Short<\/a> was originally published in <a href=\"https:\/\/towardsdatascience.com\/\">Towards Data Science<\/a> on Medium.<\/p>\n<\/div>\n<p> \t<BR><br \/>\n <BR><\/BR><br \/>\n    Allon Korem | CEO, Bell Statistics<br \/>\n \t<BR><br \/>\n<BR><\/BR><br \/>\n<a href=\"https:\/\/medium.com\/m\/global-identity-2?redirectUrl=https%3A%2F%2Ftowardsdatascience.com%2Ftldr-bayesian-a-b-testing-falls-short-f8646529a47a\">Go to original source<\/a><br \/>\n 
\t<BR><br \/>\n <BR><\/BR><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Bayesian A\/B Testing Falls Short Why Bayesian A\/B testing can lead to misunderstandings, inflated false positive rates, introduce bias and complicate results (Image generated by the author using Midjourney) Over the past decade, I\u2019ve engaged in countless discussions about Bayesian A\/B testing versus Frequentist A\/B testing. In nearly every conversation, I\u2019ve maintained the same viewpoint: [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[804,62,177,83,238,92],"tags":[557,108,1184],"class_list":["post-1064","post","type-post","status-publish","format-standard","hentry","category-ab-testing","category-aimldsaimlds","category-bayesian-statistics","category-data-science","category-statistics","category-thoughts-and-theory","tag-bayesian","tag-my","tag-testing"],"_links":{"self":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/1064"}],"collection":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/comments?post=1064"}],"version-history":[{"count":0,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/1064\/revisions"}],"wp:attachment":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/media?parent=1064"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/categories?post=1064"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/tags?post=1064"}],"curies":[{"name":"wp","href":
"https:\/\/api.w.org\/{rel}","templated":true}]}}