{"id":3871,"date":"2025-05-16T07:02:27","date_gmt":"2025-05-16T07:02:27","guid":{"rendered":"https:\/\/mailitics.com\/index.php\/2025\/05\/16\/how-to-build-a-benchmark-for-your-models\/"},"modified":"2025-05-16T07:02:27","modified_gmt":"2025-05-16T07:02:27","slug":"how-to-build-a-benchmark-for-your-models","status":"publish","type":"post","link":"https:\/\/mailitics.com\/index.php\/2025\/05\/16\/how-to-build-a-benchmark-for-your-models\/","title":{"rendered":"How To Build a Benchmark for Your Models"},"content":{"rendered":"<div>\n<p class=\"wp-block-paragraph\">I\u2019ve been working as a data science consultant for the past three years, and I\u2019ve had the opportunity to work on multiple projects across various industries. Yet, I noticed one common denominator among most of the clients I worked with:<\/p>\n<p class=\"wp-block-paragraph\"><strong>They rarely have a clear idea of the project objective.<\/strong><\/p>\n<p class=\"wp-block-paragraph\">This is one of the main obstacles data scientists face, especially now that Gen AI is taking over every domain.<\/p>\n<p class=\"wp-block-paragraph\">But let\u2019s suppose that after some back and forth, the objective becomes clear. We managed to pin down a specific question to answer. For example:<\/p>\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p class=\"wp-block-paragraph\"><em><strong>I want to classify my customers into two groups according to their probability to churn: \u201chigh likelihood to churn\u201d and \u201clow likelihood to churn\u201d<\/strong><\/em><\/p>\n<\/blockquote>\n<p class=\"wp-block-paragraph\">Well, now what? 
Easy, let\u2019s start building some models!<\/p>\n<p class=\"wp-block-paragraph\"><strong>Wrong!<\/strong><\/p>\n<p class=\"wp-block-paragraph\">If having a clear objective is rare, having a reliable\u00a0<strong>benchmark<\/strong>\u00a0is even rarer. <\/p>\n<p class=\"wp-block-paragraph\">In my opinion, one of the most important steps in delivering a data science project is defining and agreeing on a\u00a0<strong>set of benchmarks<\/strong>\u00a0with the client.<\/p>\n<p class=\"wp-block-paragraph\">In this blog post, I\u2019ll explain:<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\"><strong>What a benchmark is,<\/strong><\/li>\n<li class=\"wp-block-list-item\"><strong>Why it is important to have a benchmark,<\/strong><\/li>\n<li class=\"wp-block-list-item\"><strong>How I would build one using an example scenario and<\/strong><\/li>\n<li class=\"wp-block-list-item\"><strong>Some potential drawbacks to keep in mind<\/strong><\/li>\n<\/ul>\n<hr class=\"wp-block-separator has-alpha-channel-opacity is-style-dotted\">\n<h2 class=\"wp-block-heading\">What is a benchmark?<\/h2>\n<p class=\"wp-block-paragraph\">A\u00a0<strong>benchmark<\/strong>\u00a0is a standardized way to evaluate the performance of a model. It provides a reference point against which new models can be compared.<\/p>\n<p class=\"wp-block-paragraph\">A benchmark needs two key components to be considered complete:<\/p>\n<ol class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">\n<strong>A set of metrics<\/strong>\u00a0to evaluate the performance<\/li>\n<li class=\"wp-block-list-item\">\n<strong>A set of simple models<\/strong>\u00a0to use as baselines<\/li>\n<\/ol>\n<p class=\"wp-block-paragraph\">The concept at its core is simple: every time I develop a new model I compare it against both previous versions and the baseline models. 
This ensures improvements are real and tracked.<\/p>\n<p class=\"wp-block-paragraph\">It is essential to understand that a benchmark shouldn\u2019t be model- or dataset-specific, but rather business-case-specific. It should be a general benchmark for a given business case.<\/p>\n<p class=\"wp-block-paragraph\">If I encounter a new dataset with the same business objective, this benchmark should still be a reliable reference point.<\/p>\n<hr class=\"wp-block-separator has-alpha-channel-opacity is-style-dotted\">\n<h2 class=\"wp-block-heading\">Why building a benchmark is important<\/h2>\n<p class=\"wp-block-paragraph\">Now that we\u2019ve defined what a benchmark is, let\u2019s dive into why I believe it\u2019s worth spending an extra project week on the development of a strong benchmark.<\/p>\n<ol class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">\n<strong>Without a Benchmark you\u2019re aiming for perfection<\/strong>\u00a0\u2014 Without a clear reference point, any result loses its meaning.\u00a0<em>\u201cMy model has an MAE of 30,000.\u201d<\/em>\u00a0Is that good? I don\u2019t know! Maybe simply predicting the mean would give an MAE of 25,000. By comparing your model to a\u00a0<strong>baseline<\/strong>, you can measure both\u00a0<strong>performance<\/strong>\u00a0and\u00a0<strong>improvement<\/strong>.<\/li>\n<li class=\"wp-block-list-item\">\n<strong>Improves Communication with Clients<\/strong>\u00a0\u2014 Clients and business teams might not immediately understand the standard output of a model. However, by engaging them with simple baselines from the start, it becomes easier to demonstrate improvements later. In many cases, benchmarks can come directly from the business itself, in different shapes and forms.<\/li>\n<li class=\"wp-block-list-item\">\n<strong>Helps in Model Selection<\/strong>\u00a0\u2014 A benchmark gives a\u00a0<strong>starting point<\/strong>\u00a0to compare multiple models fairly. 
Without it, you might waste time testing models that aren\u2019t worth considering.<\/li>\n<li class=\"wp-block-list-item\">\n<strong>Model Drift Detection and Monitoring<\/strong>\u00a0\u2014 Models can\u00a0<strong>degrade<\/strong>\u00a0over time. With a benchmark in place, you may be able to catch\u00a0<strong>drift early<\/strong>\u00a0by comparing new model outputs against past benchmark results and baselines.<\/li>\n<li class=\"wp-block-list-item\">\n<strong>Consistency Between Different Datasets<\/strong>\u00a0\u2014 Datasets evolve. By keeping a fixed set of metrics and baseline models, you ensure that performance comparisons remain valid over time.<\/li>\n<\/ol>\n<p class=\"wp-block-paragraph\">With a clear benchmark, every step of model development provides\u00a0<strong>immediate feedback<\/strong>, making the whole process more\u00a0<strong>intentional and data-driven<\/strong>.<\/p>\n<hr class=\"wp-block-separator has-alpha-channel-opacity is-style-dotted\">\n<h2 class=\"wp-block-heading\">How I would build a benchmark<\/h2>\n<p class=\"wp-block-paragraph\">I hope I\u2019ve convinced you of the importance of having a benchmark. 
Now, let\u2019s actually build one.<\/p>\n<p class=\"wp-block-paragraph\">Let\u2019s start from the business question we presented at the very beginning of this blog post:<\/p>\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p class=\"wp-block-paragraph\"><strong><em>I want to classify my customers into two groups according to their probability to churn: \u201chigh likelihood to churn\u201d and \u201clow likelihood to churn\u201d<\/em><\/strong><\/p>\n<\/blockquote>\n<p class=\"wp-block-paragraph\">For simplicity, I\u2019ll assume\u00a0<strong>no additional business constraints,<\/strong>\u00a0but in real-world scenarios, constraints often exist.<\/p>\n<p class=\"wp-block-paragraph\">For this example, I am using\u00a0<a rel=\"noreferrer noopener\" href=\"https:\/\/www.kaggle.com\/datasets\/shubh0799\/churn-modelling\" target=\"_blank\"><strong><em>this dataset<\/em><\/strong><\/a>\u00a0(<a rel=\"noreferrer noopener\" href=\"https:\/\/creativecommons.org\/publicdomain\/zero\/1.0\/\" target=\"_blank\">CC0: Public Domain<\/a>). The data contains some attributes from a company\u2019s customer base (e.g., age, sex, number of products, \u2026) along with their churn status.<\/p>\n<p class=\"wp-block-paragraph\">Now that we have something to work on, let\u2019s build the benchmark:<\/p>\n<h3 class=\"wp-block-heading\">1. Defining the metrics<\/h3>\n<p class=\"wp-block-paragraph\">We are dealing with a churn use case; in particular, this is a\u00a0<strong>binary classification problem<\/strong>. 
Thus the main metrics that we could use are:<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">\n<strong>Precision \u2014<\/strong>\u00a0Percentage of predicted churners who actually churn<\/li>\n<li class=\"wp-block-list-item\">\n<strong>Recall \u2014<\/strong>\u00a0Percentage of actual churners correctly identified<\/li>\n<li class=\"wp-block-list-item\">\n<strong>F1 score \u2014<\/strong>\u00a0Harmonic mean of precision and recall, balancing the two<\/li>\n<li class=\"wp-block-list-item\"><strong>True Positives, False Positives, True Negatives and False Negatives<\/strong><\/li>\n<\/ul>\n<p class=\"wp-block-paragraph\">These are some of the \u201csimple\u201d metrics that could be used to evaluate the output of a model.<\/p>\n<p class=\"wp-block-paragraph\"><strong>However<\/strong>, this is not an exhaustive list, and standard metrics aren\u2019t always enough. In many use cases, it might be useful to\u00a0<strong>build custom metrics<\/strong>.<\/p>\n<p class=\"wp-block-paragraph\">Let\u2019s assume that in our business case the\u00a0<strong>customers labeled as \u201chigh likelihood to churn\u201d are offered a discount.<\/strong>\u00a0This creates:<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">A\u00a0<strong>cost<\/strong>\u00a0($250) when offering the discount to a non-churning customer<\/li>\n<li class=\"wp-block-list-item\">A\u00a0<strong>profit<\/strong>\u00a0($1000) when retaining a churning customer<\/li>\n<\/ul>\n<p class=\"wp-block-paragraph\">Following this definition, we can build a custom metric that will be crucial in our scenario:<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">import numpy as np  \n  \n# Defining the business case-specific reference metric\ndef financial_gain(y_true, y_pred):  \n    # False positives: a $250 discount offered to a customer who would not have churned  \n    loss_from_fp = np.sum(np.logical_and(y_pred == 1, y_true == 0)) * 250  \n    # True positives: $1000 profit for each churning customer correctly flagged  \n    gain_from_tp = np.sum(np.logical_and(y_pred == 1, y_true == 1)) * 1000  \n    return gain_from_tp - loss_from_fp<\/code><\/pre>\n<p 
class=\"wp-block-paragraph\"><strong>Business-driven metrics<\/strong>\u00a0like this one are usually the most relevant. Such metrics can take any shape or form: financial goals, minimum requirements, percentage of coverage, and more.<\/p>\n<h3 class=\"wp-block-heading\">2. Defining the benchmarks<\/h3>\n<p class=\"wp-block-paragraph\">Now that we\u2019ve defined our metrics, we can define a set of baseline models to be used as a reference.<\/p>\n<p class=\"wp-block-paragraph\">In this phase, you should define a list of simple-to-implement models in their simplest possible setup. There is no reason at this stage to spend time and resources optimizing these models. My mindset is:<\/p>\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p class=\"wp-block-paragraph\"><strong>If I had 15 minutes, how would I implement this model?<\/strong><\/p>\n<\/blockquote>\n<p class=\"wp-block-paragraph\">In later phases of the project, you can add more baseline models as needed.<\/p>\n<p class=\"wp-block-paragraph\">In this case, I will use the following models:<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">\n<strong>Random Model \u2014<\/strong>\u00a0Assigns labels randomly, with probabilities matching the class frequencies in the training data<\/li>\n<li class=\"wp-block-list-item\">\n<strong>Majority Model \u2014<\/strong>\u00a0Always predicts the most frequent class<\/li>\n<li class=\"wp-block-list-item\"><strong>Simple XGB<\/strong><\/li>\n<li class=\"wp-block-list-item\"><strong>Simple KNN<\/strong><\/li>\n<\/ul>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">import numpy as np  \nimport xgboost as xgb  \nfrom sklearn.neighbors import KNeighborsClassifier  \n  \n# Random model: samples labels using the churn rate observed in the training set  \nclass BinaryMean():  \n    @staticmethod  \n    def run_benchmark(df_train, df_test):  \n        np.random.seed(21)  \n        return np.random.choice(a=[1, 0], size=len(df_test), p=[df_train['y'].mean(), 1 - df_train['y'].mean()])  \n      \nclass SimpleXbg():  
\n    @staticmethod  \n    def run_benchmark(df_train, df_test):  \n        model = xgb.XGBClassifier()  \n        model.fit(df_train.select_dtypes(include=np.number).drop(columns='y'), df_train['y'])  \n        return model.predict(df_test.select_dtypes(include=np.number).drop(columns='y'))  \n      \nclass MajorityClass():  \n    @staticmethod  \n    def run_benchmark(df_train, df_test):  \n        majority_class = df_train['y'].mode()[0]  \n        return np.full(len(df_test), majority_class)  \n  \nclass SimpleKNN():  \n    @staticmethod  \n    def run_benchmark(df_train, df_test):  \n        model = KNeighborsClassifier()  \n        model.fit(df_train.select_dtypes(include=np.number).drop(columns='y'), df_train['y'])  \n        return model.predict(df_test.select_dtypes(include=np.number).drop(columns='y'))<\/code><\/pre>\n<p class=\"wp-block-paragraph\">Again, as in the case of the metrics, we can build\u00a0<strong>custom benchmarks<\/strong>.<\/p>\n<p class=\"wp-block-paragraph\">Let\u2019s assume that in our business case\u00a0<strong>the marketing team contacts every client who is<\/strong>:<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">\n<strong>Aged 50 or over<\/strong>\u00a0and<\/li>\n<li class=\"wp-block-list-item\"><strong>No longer active<\/strong>\n<\/li>\n<\/ul>\n<p class=\"wp-block-paragraph\">Following this rule, we can build this model:<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\"># Defining the business case-specific benchmark\nclass BusinessBenchmark():  \n    @staticmethod  \n    def run_benchmark(df_train, df_test):  \n        df = df_test.copy()  \n        # Default prediction: the customer does not churn  \n        df.loc[:,'y_hat'] = 0  \n        # Flag inactive customers aged 50 or over, mirroring the marketing rule  \n        df.loc[(df['IsActiveMember'] == 0) &amp; (df['Age'] &gt;= 50), 'y_hat'] = 1  \n        return df['y_hat']<\/code><\/pre>\n<h3 class=\"wp-block-heading\">3. Running the benchmark<\/h3>\n<p class=\"wp-block-paragraph\">To run the benchmark, I will use the following class. 
The entry point is the method\u00a0<code>compare_pred_with_benchmark()<\/code>, which, given a set of predictions, runs all the baseline models and calculates all the metrics.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">class ChurnBinaryBenchmark():  \n    def __init__(self, metrics=None, benchmark_models=None):  \n        # None defaults avoid sharing one mutable list across instances  \n        self.metrics = metrics if metrics is not None else []  \n        self.benchmark_models = benchmark_models if benchmark_models is not None else []  \n  \n    def compare_pred_with_benchmark(self, df_train, df_test, my_predictions):  \n        # Score the candidate predictions first  \n        output_metrics = {  \n            'Prediction': self._calculate_metrics(df_test['y'], my_predictions)  \n        }  \n  \n        # Then run every baseline model and score it with the same metrics  \n        for model in self.benchmark_models:  \n            benchmark_preds = model.run_benchmark(df_train=df_train, df_test=df_test)  \n            output_metrics[f'Benchmark - {model.__name__}'] = self._calculate_metrics(df_test['y'], benchmark_preds)  \n  \n        return output_metrics  \n  \n    def _calculate_metrics(self, y_true, y_pred):  \n        # Map each metric function name to the value it returns  \n        return {getattr(func, '__name__', 'Unknown'): func(y_true=y_true, y_pred=y_pred) for func in self.metrics}<\/code><\/pre>\n<p class=\"wp-block-paragraph\">Now all we need is a prediction. 
For this example, I did some quick feature engineering and hyperparameter tuning.<\/p>\n<p class=\"wp-block-paragraph\">The last step is just to run the benchmark:<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">import pandas as pd  \nfrom sklearn.metrics import f1_score, precision_score, recall_score  \n  \n# tp, tn, fp and fn are confusion-matrix count helpers that are not shown in this post;  \n# like every metric here, they are assumed to accept (y_true, y_pred).  \n# preds holds the tuned model's predictions on df_test.  \nbinary_benchmark = ChurnBinaryBenchmark(  \n    metrics=[f1_score, precision_score, recall_score, tp, tn, fp, fn, financial_gain],  \n    benchmark_models=[BinaryMean, SimpleXbg, MajorityClass, SimpleKNN, BusinessBenchmark]  \n    )  \n  \nres = binary_benchmark.compare_pred_with_benchmark(  \n    df_train=df_train,  \n    df_test=df_test,  \n    my_predictions=preds,  \n)  \n  \npd.DataFrame(res)<\/code><\/pre>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" height=\"289\" width=\"1024\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/05\/image-134-1024x289.png?resize=1024%2C289&#038;ssl=1\" alt=\"\" class=\"wp-image-604206\"><figcaption class=\"wp-element-caption\">Benchmark metrics comparison | Image by Author<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">This generates a\u00a0<strong>comparison table<\/strong>\u00a0of all models across all metrics. Using this table, it is possible to draw concrete conclusions about the model\u2019s predictions and make informed decisions about the next steps of the process.<\/p>\n<hr class=\"wp-block-separator has-alpha-channel-opacity is-style-dotted\">\n<h2 class=\"wp-block-heading\">Some drawbacks<\/h2>\n<p class=\"wp-block-paragraph\">As we\u2019ve seen, there are plenty of reasons why it is useful to have a benchmark. 
However, there are also some\u00a0<strong>pitfalls<\/strong>\u00a0to watch out for:<\/p>\n<ol class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">\n<strong>Non-Informative Benchmark<\/strong>\u00a0\u2014 When the metrics or baseline models are poorly defined, the benchmark adds little value.\u00a0<em>Always define meaningful baselines.<\/em>\n<\/li>\n<li class=\"wp-block-list-item\">\n<strong>Misinterpretation by Stakeholders<\/strong>\u00a0\u2014 Communication with the client is essential: state clearly what each metric measures.\u00a0<em>The best model might not be the best on all the defined metrics.<\/em>\n<\/li>\n<li class=\"wp-block-list-item\">\n<strong>Overfitting to the Benchmark<\/strong>\u00a0\u2014 You might end up engineering features so specific that they beat the benchmark but do not generalize well to new data.\u00a0<em>Don\u2019t focus on beating the benchmark, but on creating the best solution possible to the problem.<\/em>\n<\/li>\n<li class=\"wp-block-list-item\">\n<strong>Change of Objective<\/strong>\u00a0\u2014 The agreed objectives might change due to miscommunication or a change of plans.\u00a0<em>Keep your benchmark flexible so it can adapt when needed.<\/em>\n<\/li>\n<\/ol>\n<hr class=\"wp-block-separator has-alpha-channel-opacity is-style-dotted\">\n<h2 class=\"wp-block-heading\">Final thoughts<\/h2>\n<p class=\"wp-block-paragraph\">Benchmarks provide clarity, ensure improvements are measurable, and create a\u00a0<strong>shared reference point<\/strong>\u00a0between data scientists and clients. They help avoid the trap of assuming a model is performing well without proof and ensure that every iteration brings real value.<\/p>\n<p class=\"wp-block-paragraph\">They also act as a\u00a0<strong>communication tool<\/strong>, making it easier to explain progress to clients. 
Instead of just presenting numbers, you can show clear comparisons that highlight improvements.<\/p>\n<p class=\"wp-block-paragraph\"><a rel=\"noreferrer noopener\" href=\"https:\/\/github.com\/lorenzomezzini\/MediumPosts\/blob\/main\/Benchmarking\/benchmarking.ipynb\" target=\"_blank\"><em>Here you can find a notebook with a full implementation from this blog post<\/em><\/a><em>.<\/em><\/p>\n<p>The post <a href=\"https:\/\/towardsdatascience.com\/how-to-build-a-benchmark-for-your-models\/\">How To Build a Benchmark for Your Models<\/a> appeared first on <a href=\"https:\/\/towardsdatascience.com\/\">Towards Data Science<\/a>.<\/p>\n<\/div>\n<p>Lorenzo Mezzini<\/p>\n","protected":false},"excerpt":{"rendered":"<p>I\u2019ve been working as a data science consultant for the past three years, and I\u2019ve had the opportunity to work on multiple projects across various industries. 
Yet, I noticed one common denominator among most of the clients I worked with: They rarely have a clear idea of [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[62,2697,83,70,1500],"tags":[1685,73,2698],"class_list":["post-3871","post","type-post","status-publish","format-standard","hentry","category-aimldsaimlds","category-benchmarking","category-data-science","category-machine-learning","category-model-evaluation","tag-benchmark","tag-models","tag-objective"],"_links":{"self":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/3871"}],"collection":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/comments?post=3871"}],"version-history":[{"count":0,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/3871\/revisions"}],"wp:attachment":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/media?parent=3871"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/categories?post=3871"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/tags?post=3871"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}