{"id":1196,"date":"2025-01-15T07:02:34","date_gmt":"2025-01-15T07:02:34","guid":{"rendered":"https:\/\/mailitics.com\/index.php\/2025\/01\/15\/scale-experiment-decision-making-with-programmatic-decision-rules-a13e2b392462\/"},"modified":"2025-01-15T07:02:34","modified_gmt":"2025-01-15T07:02:34","slug":"scale-experiment-decision-making-with-programmatic-decision-rules-a13e2b392462","status":"publish","type":"post","link":"https:\/\/mailitics.com\/index.php\/2025\/01\/15\/scale-experiment-decision-making-with-programmatic-decision-rules-a13e2b392462\/","title":{"rendered":"Scale Experiment Decision-Making with Programmatic Decision Rules"},"content":{"rendered":"<p>Scale Experiment Decision-Making with Programmatic Decision Rules<\/p>\n<div>\n<h4>Decide what to do with experiment results in\u00a0code<\/h4>\n<figure><img decoding=\"async\" alt=\"\" src=\"https:\/\/cdn-images-1.medium.com\/max\/1024\/0*NPrRTsC_9px8lTzN\"><figcaption>Photo by <a href=\"https:\/\/unsplash.com\/@cytonn_photography?utm_source=medium&amp;utm_medium=referral\">Cytonn Photography<\/a> on\u00a0<a href=\"https:\/\/unsplash.com\/?utm_source=medium&amp;utm_medium=referral\">Unsplash<\/a><\/figcaption><\/figure>\n<p>The experiment lifecycle is like the human lifecycle. First, a person or idea is born, then it develops, then it is tested, then its test ends, and then the Gods (or Product Managers) decide its\u00a0worth.<\/p>\n<p>But a lot of things happen during a life or an experiment. Sometimes, a person or idea is good in one way but bad in another. How are the Gods supposed to decide? They have to make some tradeoffs. There\u2019s no avoiding\u00a0it.<\/p>\n<p>The key is to make these tradeoffs before the experiment and before we see the results. 
We do not want to decide on the rules based on our pre-existing biases about which ideas deserve to go to heaven (err\u2026 launch; I think I\u2019ve stretched the metaphor far enough). We want to write our scripture (okay, one more) before the experiment starts.<\/p>\n<p>The point of this post is to propose that we write down how we will make decisions explicitly: not in English, which permits vague language (e.g., \u201cwe\u2019ll consider the effect on engagement as well, balancing against revenue\u201d and similar wishy-washy, unquantified statements), but <strong>in\u00a0code<\/strong>.<\/p>\n<p>I\u2019m proposing an \u201cAnalysis Contract,\u201d which enforces how we will make decisions.<\/p>\n<p>A contract is a function in your favorite programming language. The contract takes the \u201cbasic results\u201d of an experiment as arguments. Determining which basic results matter for decision-making is part of defining the contract. Usually, in an experiment, the basic results are treatment effects, the standard errors of treatment effects, and configuration parameters like the number of <a href=\"https:\/\/zachlog.com\/peeking-not-considered-harmful-0ed9c02aaf28\">peeks<\/a>. Given these results, the contract returns the arm, or variant, of the experiment that will launch. For example, it would return either \u2018A\u2019 or \u2018B\u2019 in a standard A\/B\u00a0test.<\/p>\n<p>It might look something like\u00a0this:<\/p>\n<pre>int<br>analysis_contract(double te1, double te1_se, ....)<br>{<br>  \/* t-statistic of the first treatment effect: estimate \/ standard error *\/<br>  if ((te1 \/ te1_se &lt; 1.96) &amp;&amp; (...conditions...))<br>    return 0; \/* for variant 0 *\/<br>  if (...conditions...)<br>    return 1; \/* for variant 1 *\/<br><br>  \/* and so on *\/<br>}<\/pre>\n<p>The Experimentation Platform would then associate the contract with the particular experiment. 
When the experiment ends, the platform evaluates the contract and ships the winning variant according to the rules specified in the contract.<\/p>\n<p>I\u2019ll add the caveat here that this is an <em>idea<\/em>. It\u2019s not a story about a technique I\u2019ve seen implemented in practice, so there may be practical issues with various details that would be ironed out in a real-world deployment. I think Analysis Contracts would mitigate the problem of ad-hoc decision-making and force us to think deeply about, and pre-register, how we will deal with the most common scenario in experimentation: the metrics we expected to move a lot show statistically insignificant effects.<\/p>\n<p>By using Analysis Contracts, we\u00a0can\u2026<\/p>\n<h3>Make decisions upfront<\/h3>\n<p>We do not want to change how we make decisions because of the particular dataset our experiment happened to generate.<\/p>\n<p>There\u2019s no (good) reason why we should wait until after the experiment to say whether we would ship in Scenario X. We should be able to say it before the experiment. If we are unwilling to, it suggests that we are relying on something else outside the data and the experiment results. That information might be useful, but information that doesn\u2019t depend on the experiment results was available before the experiment. Why didn\u2019t we commit to using it\u00a0then?<\/p>\n<p>Statistical inference is based on a model of behavior. In that model, we know exactly how we would make decisions, if only we knew certain parameters. We gather data to estimate those parameters and then decide what to do based on our estimates. 
Not specifying our decision function breaks this model, and many of the statistical properties we take for granted (such as the nominal false-positive rate of a test) no longer hold if we change how we call an experiment based on the data we\u00a0see.<\/p>\n<p>We might say: \u201cWe promise not to make decisions this way.\u201d But then, after the experiment, the results aren\u2019t very clear. A lot of things are insignificant. So, we cut the data in a million ways, find a few \u201csignificant\u201d results, and tell a story from them. It\u2019s hard to keep our promises.<\/p>\n<p>The cure isn\u2019t to make a promise we can\u2019t keep. The cure is to make a promise the system won\u2019t let us (quietly) break.<\/p>\n<h3>Be consistent, clear, and precise about how we make decisions<\/h3>\n<p>English is a vague language, and writing our guidelines in it leaves a lot of room for interpretation. Code forces us to decide explicitly what we will do and to state quantitatively, for example, how much revenue we will give up in the short run to improve our subscription product in the long\u00a0run.<\/p>\n<p>Code improves communication enormously because I don\u2019t have to interpret what you mean. I can plug in different results and see what decisions you would have made if the results had differed. This can be incredibly useful for retrospective analysis of past experiments as well. Because we have an actual function mapping results to decisions, we can run various simulations, bootstraps, etc., and re-decide the experiment based on that\u00a0data.<\/p>\n<h3>But what if I disagree with the Analysis Contract\u2019s decision?<\/h3>\n<p>One of the primary objections to Analysis Contracts is that after the experiment, we might decide we had the wrong decision function. 
Usually, the problem is that we didn\u2019t realize what the experiment would do to metric Y, and our contract ignores\u00a0it.<\/p>\n<p>Given that, there are two roads to go\u00a0down:<\/p>\n<ol>\n<li>If we have 1000 metrics and the true effect of an experiment on each metric is 0, some metrics will likely show large estimated effects purely by chance. One solution is to go with the Analysis Contract this time and remember to include the metric in the next contract. Over time, our contract will evolve to better represent our true goals. We shouldn\u2019t put too much weight on what happens to the 20th most important metric. It could just be\u00a0noise.<\/li>\n<li>If the effect is truly outsized and we can\u2019t get comfortable with ignoring it, the other solution is to override the contract, making sure to log somewhere prominent that this happened. Then, update the contract because we clearly care a lot about this metric. Over time, the number of times we override should be logged as a KPI of our experimentation system. As we get the decision-making function closer and closer to the best representation of our values, we should stop overriding. This can be a good way to monitor how much ad-hoc, nonstatistical decision-making goes on. If we frequently override the contract, then we know the contract doesn\u2019t mean much, and we are not following good statistical practices. It\u2019s built-in accountability, and it creates a cost to overriding the contract.<\/li>\n<\/ol>\n<h3>Contracts as Predicates<\/h3>\n<p>Contracts do not need to be fully flexible code (there are probably security issues with allowing arbitrary code to be submitted directly to an Experimentation Platform, even if it\u2019s conceptually nice). But we can have a system that enables experimenters to specify predicates, e.g., IF TStat(Revenue) \u2264 1.96 AND TStat(Engagement) &gt; 1.96 THEN X, etc. 
We can expose standard comparison operations alongside t-statistics and effect magnitudes and specify decisions that\u00a0way.<\/p>\n<p>Thanks for reading! Does your org use anything similar to an Analysis Contract? I think it\u2019s a great solution to a tricky human problem in experimentation, but I\u2019d love to hear anyone\u2019s real-world experience with a more automated approach to experiment decision-making.<\/p>\n<p>Zach<\/p>\n<p>Connect on LinkedIn: <a href=\"https:\/\/linkedin.com\/in\/zlflynn\">https:\/\/linkedin.com\/in\/zlflynn<\/a><\/p>\n<hr>\n<p><a href=\"https:\/\/towardsdatascience.com\/scale-experiment-decision-making-with-programmatic-decision-rules-a13e2b392462\">Scale Experiment Decision-Making with Programmatic Decision Rules<\/a> was originally published in <a href=\"https:\/\/towardsdatascience.com\/\">Towards Data Science<\/a> on Medium, where people are continuing the conversation by highlighting and responding to this story.<\/p>\n<\/div>\n<p>Zach Flynn<br \/>\n<a href=\"https:\/\/medium.com\/m\/global-identity-2?redirectUrl=https%3A%2F%2Ftowardsdatascience.com%2Fscale-experiment-decision-making-with-programmatic-decision-rules-a13e2b392462\">Go to original source<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Scale Experiment Decision-Making with Programmatic Decision Rules Decide what to do with experiment results in\u00a0code Photo by Cytonn Photography on\u00a0Unsplash The experiment lifecycle is like the human lifecycle. First, a person or idea is born, then it develops, then it is tested, then its test ends, and then the Gods (or Product Managers) decide its\u00a0worth. 
[&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[62,211,83,312,1300,238],"tags":[1301,1060,349],"class_list":["post-1196","post","type-post","status-publish","format-standard","hentry","category-aimldsaimlds","category-data-analysis","category-data-science","category-decision-making","category-experience-design","category-statistics","tag-contract","tag-experiment","tag-results"],"_links":{"self":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/1196"}],"collection":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/comments?post=1196"}],"version-history":[{"count":0,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/1196\/revisions"}],"wp:attachment":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/media?parent=1196"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/categories?post=1196"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/tags?post=1196"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}