{"id":1112,"date":"2025-01-11T07:02:48","date_gmt":"2025-01-11T07:02:48","guid":{"rendered":"https:\/\/mailitics.com\/index.php\/2025\/01\/11\/model-calibration-explained-a-visual-guide-with-code-examples-for-beginners-55f368bafe72\/"},"modified":"2025-01-11T07:02:48","modified_gmt":"2025-01-11T07:02:48","slug":"model-calibration-explained-a-visual-guide-with-code-examples-for-beginners-55f368bafe72","status":"publish","type":"post","link":"https:\/\/mailitics.com\/index.php\/2025\/01\/11\/model-calibration-explained-a-visual-guide-with-code-examples-for-beginners-55f368bafe72\/","title":{"rendered":"Model Calibration, Explained: A Visual Guide with Code Examples for Beginners"},"content":{"rendered":"<div>\n<h4>MODEL EVALUATION &amp; OPTIMIZATION<\/h4>\n<h4>When all models have similar accuracy, now\u00a0what?<\/h4>\n<p>You\u2019ve trained several classification models, and they all seem to be performing well with high accuracy scores. Congratulations!<\/p>\n<p>But hold on\u200a\u2014\u200ais one model truly better than the others? Accuracy alone doesn\u2019t tell the whole story. What if one model consistently overestimates its confidence, while another underestimates it? This is where <strong>model calibration<\/strong> comes\u00a0in.<\/p>\n<p>Here, we\u2019ll see what model calibration is and explore how to assess the reliability of your models\u2019 predictions\u200a\u2014\u200ausing visuals and practical code examples to show you how to identify calibration issues. 
Get ready to go beyond accuracy and unlock the true potential of your machine learning\u00a0models!<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2AD8F9yFBSGWaKIzGQm8NYiQ.png?ssl=1\"><figcaption>All visuals: Author-created using Canva Pro. Optimized for mobile; may appear oversized on\u00a0desktop.<\/figcaption><\/figure>\n<h3><strong>Understanding Calibration<\/strong><\/h3>\n<p>Model calibration measures how well a model\u2019s <a href=\"https:\/\/towardsdatascience.com\/predicted-probability-explained-a-visual-guide-with-code-examples-for-beginners-7c34e8994ec2\">prediction probabilities<\/a> match its actual performance. Among all the predictions to which a model assigns a 70% probability, about 70% should turn out to be correct. This means its probability scores should reflect the true likelihood of its predictions being\u00a0correct.<\/p>\n<h4>Why Calibration Matters<\/h4>\n<p>While accuracy tells us how often a model is correct overall, calibration tells us <strong>whether we can trust its probability scores<\/strong>. Two models might both have 90% accuracy, but one might give realistic probability scores while the other gives overly confident predictions. In many real applications, having reliable probability scores is just as important as having correct predictions.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2APrRS_sYDYy1VEcUUDFLpuw.png?ssl=1\"><figcaption>Two models that are equally accurate (70% correct) show different levels of confidence in their predictions. 
Model A uses balanced probability scores (0.3 and 0.7) while Model B only uses extreme probabilities (0.0 and 1.0), showing it\u2019s either completely sure or completely unsure about each prediction.<\/figcaption><\/figure>\n<h4>Perfect Calibration vs.\u00a0Reality<\/h4>\n<p>A perfectly calibrated model would show a direct match between its prediction probabilities and actual success rates: When it predicts with 90% probability, it should be correct 90% of the time. The same applies to all probability levels.<\/p>\n<p>However, most models aren\u2019t perfectly calibrated. They can\u00a0be:<\/p>\n<ul>\n<li>Overconfident: giving probability scores that are too high for their actual performance<\/li>\n<li>Underconfident: giving probability scores that are too low for their actual performance<\/li>\n<li>Both: overconfident in some ranges and underconfident in\u00a0others<\/li>\n<\/ul>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2An8jSXAr3jlGjd3tV3dgQHQ.png?ssl=1\"><figcaption>Four models with the same accuracy (70%) showing different calibration patterns. The overconfident model makes extreme predictions (0.0 or 1.0), while the underconfident model stays close to 0.5. The over-and-under confident model switches between extremes and middle values. The well-calibrated model uses reasonable probabilities (0.3 for \u2018NO\u2019 and 0.7 for \u2018YES\u2019) that match its actual performance.<\/figcaption><\/figure>\n<p>This mismatch between predicted probabilities and actual correctness can lead to poor decision-making when using these models in real applications. 
This is why understanding and improving model calibration is necessary for building reliable machine learning\u00a0systems.<\/p>\n<h3>\ud83d\udcca Dataset\u00a0Used<\/h3>\n<p>To explore model calibration, we\u2019ll continue with <a href=\"https:\/\/medium.com\/@samybaladram\/list\/classification-algorithms-b3586f0a772c\">the same dataset used in my previous articles on Classification Algorithms<\/a>: predicting whether someone will play golf or not based on weather conditions.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2AEiN9aTP0wioYtNVMpZwAHA.png?ssl=1\"><figcaption>Columns: \u2018Outlook\u2019 (one-hot-encoded into 3 columns), \u2018Temperature\u2019 (in Fahrenheit), \u2018Humidity\u2019 (in %), \u2018Wind\u2019 (Yes\/No) and \u2018Play\u2019 (Yes\/No, target\u00a0feature)<\/figcaption><\/figure>\n<pre>import pandas as pd<br>import numpy as np<br>from sklearn.metrics import accuracy_score<br>from sklearn.model_selection import train_test_split<br><br># Create and prepare dataset<br>dataset_dict = {<br>    'Outlook': ['sunny', 'sunny', 'overcast', 'rainy', 'rainy', 'rainy', 'overcast', <br>                'sunny', 'sunny', 'rainy', 'sunny', 'overcast', 'overcast', 'rainy',<br>                'sunny', 'overcast', 'rainy', 'sunny', 'sunny', 'rainy', 'overcast',<br>                'rainy', 'sunny', 'overcast', 'sunny', 'overcast', 'rainy', 'overcast'],<br>    'Temperature': [85.0, 80.0, 83.0, 70.0, 68.0, 65.0, 64.0, 72.0, 69.0, 75.0, 75.0,<br>                   72.0, 81.0, 71.0, 81.0, 74.0, 76.0, 78.0, 82.0, 67.0, 85.0, 73.0,<br>                   88.0, 77.0, 79.0, 80.0, 66.0, 84.0],<br>    'Humidity': [85.0, 90.0, 78.0, 96.0, 80.0, 70.0, 65.0, 95.0, 70.0, 80.0, 70.0,<br>                 90.0, 75.0, 80.0, 88.0, 92.0, 85.0, 75.0, 92.0, 90.0, 85.0, 88.0,<br>                 65.0, 70.0, 60.0, 95.0, 70.0, 78.0],<br>    'Wind': [False, True, False, False, False, True, True, 
False, False, False, True,<br>             True, False, True, True, False, False, True, False, True, True, False,<br>             True, False, False, True, False, False],<br>    'Play': ['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes', 'Yes', 'Yes',<br>             'Yes', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'No', 'No', 'Yes', 'Yes',<br>             'Yes', 'Yes', 'Yes', 'Yes', 'No', 'Yes']<br>}<br># Prepare data<br>df = pd.DataFrame(dataset_dict)<\/pre>\n<p>Before training our models, we normalized numerical weather measurements through <a href=\"https:\/\/medium.com\/towards-data-science\/scaling-numerical-data-explained-a-visual-guide-with-code-examples-for-beginners-11676cdb45cb?source=your_stories_page-------------------------------------\">standard scaling<\/a> and transformed categorical features with <a href=\"https:\/\/towardsdatascience.com\/encoding-categorical-data-explained-a-visual-guide-with-code-example-for-beginners-b169ac4193ae\">one-hot encoding<\/a>. These preprocessing steps ensure all models can effectively use the data while maintaining fair comparisons between\u00a0them.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2AT-JB7XdN5wVyFqL6mkOIpg.png?ssl=1\"><\/figure>\n<pre>from sklearn.preprocessing import StandardScaler<br>df = pd.get_dummies(df, columns=['Outlook'], prefix='', prefix_sep='', dtype=int)<br>df['Wind'] = df['Wind'].astype(int)<br>df['Play'] = (df['Play'] == 'Yes').astype(int)<br><br># Rearrange columns<br>column_order = ['sunny', 'overcast', 'rainy', 'Temperature', 'Humidity', 'Wind', 'Play']<br>df = df[column_order]<br><br># Prepare features and target<br>X,y = df.drop('Play', axis=1), df['Play']<br>X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.5, shuffle=False)<br><br># Scale numerical features<br>scaler = StandardScaler()<br>X_train[['Temperature', 'Humidity']] = 
scaler.fit_transform(X_train[['Temperature', 'Humidity']])<br>X_test[['Temperature', 'Humidity']] = scaler.transform(X_test[['Temperature', 'Humidity']])<\/pre>\n<h4>Models and\u00a0Training<\/h4>\n<p>For this exploration, we trained four classification models to similar accuracy\u00a0scores:<\/p>\n<ul>\n<li>K-Nearest Neighbors (kNN)<\/li>\n<li>Bernoulli Naive\u00a0Bayes<\/li>\n<li>Logistic Regression<\/li>\n<li>Multi-Layer Perceptron (MLP)<\/li>\n<\/ul>\n<p>If you\u2019re curious about how these algorithms make predictions and compute their probabilities, you can refer to this\u00a0article:<\/p>\n<p><a href=\"https:\/\/towardsdatascience.com\/predicted-probability-explained-a-visual-guide-with-code-examples-for-beginners-7c34e8994ec2\">Predicted Probability, Explained: A Visual Guide with Code Examples for Beginners<\/a><\/p>\n<p>While these models achieved the same accuracy in this simple problem, they calculate their prediction probabilities differently.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2AfkNi_MNiU1E0BoS22pWDow.png?ssl=1\"><figcaption>Even though the four models are correct 85.7% of the time, they show different levels of confidence in their predictions. 
Here, the MLP model tends to be very sure about its answers (giving values close to 1.0), while the kNN model is more careful, giving more varied confidence scores.<\/figcaption><\/figure>\n<pre>import numpy as np<br>from sklearn.neighbors import KNeighborsClassifier<br>from sklearn.linear_model import LogisticRegression<br>from sklearn.neural_network import MLPClassifier<br>from sklearn.metrics import accuracy_score<br>from sklearn.naive_bayes import BernoulliNB<br><br># Initialize the models with the found parameters<br>knn = KNeighborsClassifier(n_neighbors=4, weights='distance')<br>bnb = BernoulliNB()<br>lr = LogisticRegression(C=1, random_state=42)<br>mlp = MLPClassifier(hidden_layer_sizes=(4, 2), random_state=42, max_iter=2000)<br><br># Train all models<br>models = {<br>    'KNN': knn,<br>    'BNB': bnb,<br>    'LR': lr,<br>    'MLP': mlp<br>}<br><br>for name, model in models.items():<br>    model.fit(X_train, y_train)<br><br># Create predictions and probabilities for each model<br>results_dict = {<br>    'True Labels': y_test<br>}<br><br>for name, model in models.items():<br>#    results_dict[f'{name} Pred'] = model.predict(X_test)<br>    results_dict[f'{name} Prob'] = model.predict_proba(X_test)[:, 1]<br><br># Create results dataframe<br>results_df = pd.DataFrame(results_dict)<br><br># Print predictions and probabilities<br>print(\"\\nPredictions and Probabilities:\")<br>print(results_df)<br><br># Print accuracies<br>print(\"\\nAccuracies:\")<br>for name, model in models.items():<br>    accuracy = accuracy_score(y_test, model.predict(X_test))<br>    print(f\"{name}: {accuracy:.3f}\")<\/pre>\n<p>Through these differences, we\u2019ll explore why we need to look beyond accuracy.<\/p>\n<h3>Measuring Calibration<\/h3>\n<p>To assess how well a model\u2019s prediction probabilities match its actual performance, we use several methods and metrics. 
These measurements help us understand whether our model\u2019s confidence levels are reliable.<\/p>\n<h4>Brier Score<\/h4>\n<p><strong>The Brier Score<\/strong> measures the <strong>mean squared difference<\/strong> between predicted probabilities and actual outcomes. It ranges from 0 to 1, where lower scores indicate better calibration. This score is particularly useful because it considers both calibration and accuracy together.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2A2X9aJEpKRCKClV8HY7xqCg.png?ssl=1\"><figcaption>The score (0.148) shows how well the model\u2019s confidence matches its actual performance. It\u2019s found by comparing the model\u2019s predicted chances with what actually happened (0 for \u2018NO\u2019, 1 for \u2018YES\u2019), where smaller differences mean better predictions.<\/figcaption><\/figure>\n<h4>Log Loss<\/h4>\n<p><strong>Log Loss<\/strong> calculates the negative log probability of correct predictions. This metric is especially sensitive to confident but wrong predictions\u200a\u2014\u200awhen a model says it\u2019s 90% sure but is wrong, it receives a much larger penalty than when it\u2019s 60% sure and wrong. Lower values indicate better calibration.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2A6nvB6FGlluSDPd0DSlKRVA.png?ssl=1\"><figcaption>For each prediction, it looks at how confident the model was in the correct answer. When the model is very confident but wrong (like in index 26), it gets a bigger penalty. 
The final score of 0.455 is the average of all these penalties, where lower numbers mean better predictions.<\/figcaption><\/figure>\n<h4>Expected Calibration Error\u00a0(ECE)<\/h4>\n<p><strong>ECE<\/strong> groups predictions into probability bins and, for each bin, takes the absolute difference between the average predicted probability and the observed frequency of positive labels. These differences are then averaged, weighted by the share of predictions that fall into each bin. This metric helps us understand if our model has systematic biases in its probability estimates.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2AmUnGPaLSmDhwgXOfQWCBNQ.png?ssl=1\"><figcaption>The predictions are grouped into 5 bins based on how confident the model was. For each group, we compare the model\u2019s average predicted probability to how often the outcome was actually \u2018YES\u2019. The final score (0.1502) tells us how well these match up, where lower numbers are\u00a0better.<\/figcaption><\/figure>\n<h4>Reliability Diagrams<\/h4>\n<p>Similar to ECE, a reliability diagram (or calibration curve) visualizes model calibration by binning predictions and comparing them to actual outcomes. While ECE gives us a single number measuring calibration error, the reliability diagram <strong>shows us the same information graphically<\/strong>. We use the same binning approach and calculate the actual frequency of positive outcomes in each bin. When plotted, these points show us exactly where our model\u2019s predictions deviate from perfect calibration, which would appear as a diagonal\u00a0line.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2A1ZE1JT_WUnn3YqN9Uy9iZA.png?ssl=1\"><figcaption>Like ECE, the predictions are grouped into 5 bins based on confidence levels. Each dot shows how often the outcome was actually \u2018YES\u2019 (up\/down) compared to the average probability the model predicted (left\/right). 
The dotted line shows perfect calibration; the model\u2019s curve shows where it is more or less confident than it should\u00a0be.<\/figcaption><\/figure>\n<h4>Comparing Calibration Metrics<\/h4>\n<p>Each of these metrics highlights a different aspect of calibration problems:<\/p>\n<ul>\n<li>A high Brier Score suggests overall poor probability estimates.<\/li>\n<li>High Log Loss points to overconfident wrong predictions.<\/li>\n<li>A high ECE indicates systematic bias in probability estimates.<\/li>\n<\/ul>\n<p>Together, these metrics give us a fuller picture of how well our model\u2019s probability scores reflect its true performance.<\/p>\n<h4>Our Models<\/h4>\n<p>For our models, let\u2019s calculate the calibration metrics and draw their calibration curves:<\/p>\n<pre>from sklearn.metrics import brier_score_loss, log_loss<br>from sklearn.calibration import calibration_curve<br>import matplotlib.pyplot as plt<br><br># Initialize models<br>models = {<br>    'k-Nearest Neighbors': KNeighborsClassifier(n_neighbors=4, weights='distance'),<br>    'Bernoulli Naive Bayes': BernoulliNB(),<br>    'Logistic Regression': LogisticRegression(C=1.5, random_state=42),<br>    'Multilayer Perceptron': MLPClassifier(hidden_layer_sizes=(4, 2), random_state=42, max_iter=2000)<br>}<br><br># Note: calculate_ece() is the helper defined in the code summaries at the end of this article<br># Get predictions and calculate metrics<br>metrics_dict = {}<br>for name, model in models.items():<br>    model.fit(X_train, y_train)<br>    y_prob = model.predict_proba(X_test)[:, 1]<br>    metrics_dict[name] = {<br>        'Brier Score': brier_score_loss(y_test, y_prob),<br>        'Log Loss': log_loss(y_test, y_prob),<br>        'ECE': calculate_ece(y_test, y_prob),<br>        'Probabilities': y_prob<br>    }<br><br># Plot calibration curves<br>fig, axes = plt.subplots(2, 2, figsize=(8, 8), dpi=300)<br>colors = ['orangered', 'slategrey', 'gold', 'mediumorchid']<br><br>for idx, (name, metrics) in enumerate(metrics_dict.items()):<br>    ax = axes.ravel()[idx]<br>    prob_true, 
prob_pred = calibration_curve(y_test, metrics['Probabilities'], <br>                                           n_bins=5, strategy='uniform')<br>    <br>    ax.plot([0, 1], [0, 1], 'k--', label='Perfectly calibrated')<br>    ax.plot(prob_pred, prob_true, color=colors[idx], marker='o', <br>            label='Calibration curve', linewidth=2, markersize=8)<br>    <br>    title = f'{name}\\nBrier: {metrics[\"Brier Score\"]:.3f} | Log Loss: {metrics[\"Log Loss\"]:.3f} | ECE: {metrics[\"ECE\"]:.3f}'<br>    ax.set_title(title, fontsize=11, pad=10)<br>    ax.grid(True, alpha=0.7)<br>    ax.set_xlim([-0.05, 1.05])<br>    ax.set_ylim([-0.05, 1.05])<br>    ax.spines[['top', 'right', 'left', 'bottom']].set_visible(False)<br>    ax.legend(fontsize=10, loc='upper left')<br><br>plt.tight_layout()<br>plt.show()<\/pre>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2AK04o6vy013s-jb4uR91V1Q.png?ssl=1\"><\/figure>\n<p>Now, let\u2019s analyze the calibration performance of each model based on those\u00a0metrics:<\/p>\n<p>The k-Nearest Neighbors (KNN) model performs well at estimating how certain it should be about its predictions. Its graph line stays close to the dotted line, which indicates good calibration. It has solid scores\u200a\u2014\u200aa Brier score of 0.148 and the best ECE score of 0.090. While it sometimes shows too much confidence in the middle range, it <strong>generally makes reliable estimates<\/strong> about its certainty.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2AUHL3W15Z1qHAdi4EjG7Qxg.png?ssl=1\"><\/figure>\n<p>The Bernoulli Naive Bayes model shows an unusual stair-step pattern in its line. This means it jumps between different levels of certainty instead of changing smoothly. 
While it has the same Brier score as KNN (0.148), its higher ECE of 0.150 shows it\u2019s less accurate at estimating its certainty. The model <strong>switches between being too confident and not confident enough<\/strong>.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2AUWbk8Q6F_CfTl-pyJOKh-Q.png?ssl=1\"><\/figure>\n<p>The Logistic Regression model shows clear issues with its predictions. Its line moves far away from the dotted line, meaning it often misjudges how certain it should be. It has the worst ECE score (0.181) and a poor Brier score (0.164). The model consistently shows <strong>too much confidence in its predictions<\/strong>, making it unreliable.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2AeDNNG7Hl1MJm6QER6kCOHQ.png?ssl=1\"><\/figure>\n<p>The Multilayer Perceptron shows a distinct problem. Despite having the best Brier score (0.129), its line reveals that <strong>it mostly makes extreme predictions\u200a<\/strong>\u2014\u200aeither very certain or very uncertain, with little in between. Its high ECE (0.167) and flat line in the middle ranges show it struggles to make balanced certainty estimates.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2AaBVwcq7ZeVUhlKVbhWugwg.png?ssl=1\"><\/figure>\n<p>After examining all four models, the <strong>k-Nearest Neighbors clearly performs best<\/strong> at estimating its prediction certainty. It maintains consistent performance across different levels of certainty and shows the most reliable pattern in its predictions. 
While other models might score well in certain measures (like the Multilayer Perceptron\u2019s Brier score), their graphs reveal they aren\u2019t as reliable when we need to trust their certainty estimates.<\/p>\n<h3>Final Remark<\/h3>\n<p>When choosing between different models, we need to consider both their accuracy and calibration quality. A model with slightly lower accuracy but better calibration might be more valuable than a highly accurate model with poor probability estimates.<\/p>\n<p>By understanding calibration and its importance, we can build more reliable machine learning systems that users can trust not just for their predictions, but also for their confidence in those predictions.<\/p>\n<h3>\ud83c\udf1f Model Calibration Code Summarized (1\u00a0Model)<\/h3>\n<pre>import pandas as pd<br>import numpy as np<br>from sklearn.preprocessing import StandardScaler<br>from sklearn.model_selection import train_test_split<br>from sklearn.naive_bayes import BernoulliNB<br>from sklearn.metrics import brier_score_loss, log_loss<br>from sklearn.calibration import calibration_curve<br>import matplotlib.pyplot as plt<br><br># Define ECE<br>def calculate_ece(y_true, y_prob, n_bins=5):<br>    bins = np.linspace(0, 1, n_bins + 1)<br>    ece = 0<br>    for bin_lower, bin_upper in zip(bins[:-1], bins[1:]):<br>        mask = (y_prob &gt;= bin_lower) &amp; (y_prob &lt; bin_upper)<br>        if np.sum(mask) &gt; 0:<br>            bin_conf = np.mean(y_prob[mask])<br>            bin_acc = np.mean(y_true[mask])<br>            ece += np.abs(bin_conf - bin_acc) * np.sum(mask)<br>    return ece \/ len(y_true)<br><br># Create dataset and prepare data<br>dataset_dict = {<br>    'Outlook': ['sunny', 'sunny', 'overcast', 'rainy', 'rainy', 'rainy', 'overcast','sunny', 'sunny', 'rainy', 'sunny', 'overcast', 'overcast', 'rainy','sunny', 'overcast', 'rainy', 'sunny', 'sunny', 'rainy', 'overcast','rainy', 'sunny', 'overcast', 'sunny', 'overcast', 'rainy', 'overcast'],<br>    'Temperature': 
[85.0, 80.0, 83.0, 70.0, 68.0, 65.0, 64.0, 72.0, 69.0, 75.0, 75.0,72.0, 81.0, 71.0, 81.0, 74.0, 76.0, 78.0, 82.0, 67.0, 85.0, 73.0,88.0, 77.0, 79.0, 80.0, 66.0, 84.0],<br>    'Humidity': [85.0, 90.0, 78.0, 96.0, 80.0, 70.0, 65.0, 95.0, 70.0, 80.0, 70.0,90.0, 75.0, 80.0, 88.0, 92.0, 85.0, 75.0, 92.0, 90.0, 85.0, 88.0,65.0, 70.0, 60.0, 95.0, 70.0, 78.0],<br>    'Wind': [False, True, False, False, False, True, True, False, False, False, True,True, False, True, True, False, False, True, False, True, True, False,True, False, False, True, False, False],<br>    'Play': ['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes', 'Yes', 'Yes','Yes', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'No', 'No', 'Yes', 'Yes','Yes', 'Yes', 'Yes', 'Yes', 'No', 'Yes']<br>}<br><br># Prepare and encode data<br>df = pd.DataFrame(dataset_dict)<br>df = pd.get_dummies(df, columns=['Outlook'], prefix='', prefix_sep='', dtype=int)<br>df['Wind'] = df['Wind'].astype(int)<br>df['Play'] = (df['Play'] == 'Yes').astype(int)<br>df = df[['sunny', 'overcast', 'rainy', 'Temperature', 'Humidity', 'Wind', 'Play']]<br><br># Split and scale data<br>X, y = df.drop('Play', axis=1), df['Play']<br>X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.5, shuffle=False)<br>scaler = StandardScaler()<br>X_train[['Temperature', 'Humidity']] = scaler.fit_transform(X_train[['Temperature', 'Humidity']])<br>X_test[['Temperature', 'Humidity']] = scaler.transform(X_test[['Temperature', 'Humidity']])<br><br># Train model and get predictions<br>model = BernoulliNB()<br>model.fit(X_train, y_train)<br>y_prob = model.predict_proba(X_test)[:, 1]<br><br># Calculate metrics<br>metrics = {<br>    'Brier Score': brier_score_loss(y_test, y_prob),<br>    'Log Loss': log_loss(y_test, y_prob),<br>    'ECE': calculate_ece(y_test, y_prob)<br>}<br><br># Plot calibration curve<br>plt.figure(figsize=(6, 6), dpi=300)<br>prob_true, prob_pred = calibration_curve(y_test, y_prob, n_bins=5, strategy='uniform')<br><br>plt.plot([0, 1], 
[0, 1], 'k--', label='Perfectly calibrated')<br>plt.plot(prob_pred, prob_true, color='slategrey', marker='o', <br>        label='Calibration curve', linewidth=2, markersize=8)<br><br>title = f'Bernoulli Naive Bayes\\nBrier: {metrics[\"Brier Score\"]:.3f} | Log Loss: {metrics[\"Log Loss\"]:.3f} | ECE: {metrics[\"ECE\"]:.3f}'<br>plt.title(title, fontsize=11, pad=10)<br>plt.grid(True, alpha=0.7)<br>plt.xlim([-0.05, 1.05])<br>plt.ylim([-0.05, 1.05])<br>plt.gca().spines[['top', 'right', 'left', 'bottom']].set_visible(False)<br>plt.legend(fontsize=10, loc='lower right')<br><br>plt.tight_layout()<br>plt.show()<\/pre>\n<h3>\ud83c\udf1f Model Calibration Code Summarized (4\u00a0Models)<\/h3>\n<pre>import pandas as pd<br>import numpy as np<br>from sklearn.preprocessing import StandardScaler<br>from sklearn.model_selection import train_test_split<br>from sklearn.neighbors import KNeighborsClassifier<br>from sklearn.naive_bayes import BernoulliNB<br>from sklearn.linear_model import LogisticRegression<br>from sklearn.neural_network import MLPClassifier<br>from sklearn.metrics import brier_score_loss, log_loss<br>from sklearn.calibration import calibration_curve<br>import matplotlib.pyplot as plt<br><br># Define ECE<br>def calculate_ece(y_true, y_prob, n_bins=5):<br>    bins = np.linspace(0, 1, n_bins + 1)<br>    ece = 0<br>    for bin_lower, bin_upper in zip(bins[:-1], bins[1:]):<br>        mask = (y_prob &gt;= bin_lower) &amp; (y_prob &lt; bin_upper)<br>        if np.sum(mask) &gt; 0:<br>            bin_conf = np.mean(y_prob[mask])<br>            bin_acc = np.mean(y_true[mask])<br>            ece += np.abs(bin_conf - bin_acc) * np.sum(mask)<br>    return ece \/ len(y_true)<br><br># Create dataset and prepare data<br>dataset_dict = {<br>    'Outlook': ['sunny', 'sunny', 'overcast', 'rainy', 'rainy', 'rainy', 'overcast','sunny', 'sunny', 'rainy', 'sunny', 'overcast', 'overcast', 'rainy','sunny', 'overcast', 'rainy', 'sunny', 'sunny', 'rainy', 'overcast','rainy', 'sunny', 
'overcast', 'sunny', 'overcast', 'rainy', 'overcast'],<br>    'Temperature': [85.0, 80.0, 83.0, 70.0, 68.0, 65.0, 64.0, 72.0, 69.0, 75.0, 75.0,72.0, 81.0, 71.0, 81.0, 74.0, 76.0, 78.0, 82.0, 67.0, 85.0, 73.0,88.0, 77.0, 79.0, 80.0, 66.0, 84.0],<br>    'Humidity': [85.0, 90.0, 78.0, 96.0, 80.0, 70.0, 65.0, 95.0, 70.0, 80.0, 70.0,90.0, 75.0, 80.0, 88.0, 92.0, 85.0, 75.0, 92.0, 90.0, 85.0, 88.0,65.0, 70.0, 60.0, 95.0, 70.0, 78.0],<br>    'Wind': [False, True, False, False, False, True, True, False, False, False, True,True, False, True, True, False, False, True, False, True, True, False,True, False, False, True, False, False],<br>    'Play': ['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes', 'Yes', 'Yes','Yes', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'No', 'No', 'Yes', 'Yes','Yes', 'Yes', 'Yes', 'Yes', 'No', 'Yes']<br>}<br><br># Prepare and encode data<br>df = pd.DataFrame(dataset_dict)<br>df = pd.get_dummies(df, columns=['Outlook'], prefix='', prefix_sep='', dtype=int)<br>df['Wind'] = df['Wind'].astype(int)<br>df['Play'] = (df['Play'] == 'Yes').astype(int)<br>df = df[['sunny', 'overcast', 'rainy', 'Temperature', 'Humidity', 'Wind', 'Play']]<br><br># Split and scale data<br>X, y = df.drop('Play', axis=1), df['Play']<br>X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.5, shuffle=False)<br>scaler = StandardScaler()<br>X_train[['Temperature', 'Humidity']] = scaler.fit_transform(X_train[['Temperature', 'Humidity']])<br>X_test[['Temperature', 'Humidity']] = scaler.transform(X_test[['Temperature', 'Humidity']])<br><br># Initialize models<br>models = {<br>    'k-Nearest Neighbors': KNeighborsClassifier(n_neighbors=4, weights='distance'),<br>    'Bernoulli Naive Bayes': BernoulliNB(),<br>    'Logistic Regression': LogisticRegression(C=1.5, random_state=42),<br>    'Multilayer Perceptron': MLPClassifier(hidden_layer_sizes=(4, 2), random_state=42, max_iter=2000)<br>}<br><br># Get predictions and calculate metrics<br>metrics_dict = {}<br>for name, 
model in models.items():<br>    model.fit(X_train, y_train)<br>    y_prob = model.predict_proba(X_test)[:, 1]<br>    metrics_dict[name] = {<br>        'Brier Score': brier_score_loss(y_test, y_prob),<br>        'Log Loss': log_loss(y_test, y_prob),<br>        'ECE': calculate_ece(y_test, y_prob),<br>        'Probabilities': y_prob<br>    }<br><br># Plot calibration curves<br>fig, axes = plt.subplots(2, 2, figsize=(8, 8), dpi=300)<br>colors = ['orangered', 'slategrey', 'gold', 'mediumorchid']<br><br>for idx, (name, metrics) in enumerate(metrics_dict.items()):<br>    ax = axes.ravel()[idx]<br>    prob_true, prob_pred = calibration_curve(y_test, metrics['Probabilities'], <br>                                           n_bins=5, strategy='uniform')<br>    <br>    ax.plot([0, 1], [0, 1], 'k--', label='Perfectly calibrated')<br>    ax.plot(prob_pred, prob_true, color=colors[idx], marker='o', <br>            label='Calibration curve', linewidth=2, markersize=8)<br>    <br>    title = f'{name}\\nBrier: {metrics[\"Brier Score\"]:.3f} | Log Loss: {metrics[\"Log Loss\"]:.3f} | ECE: {metrics[\"ECE\"]:.3f}'<br>    ax.set_title(title, fontsize=11, pad=10)<br>    ax.grid(True, alpha=0.7)<br>    ax.set_xlim([-0.05, 1.05])<br>    ax.set_ylim([-0.05, 1.05])<br>    ax.spines[['top', 'right', 'left', 'bottom']].set_visible(False)<br>    ax.legend(fontsize=10, loc='upper left')<br><br>plt.tight_layout()<br>plt.show()<\/pre>\n<h4>Technical Environment<\/h4>\n<p>This article uses Python 3.7 and scikit-learn 1.5. 
While the concepts discussed are generally applicable, specific code implementations may vary slightly with different versions.<\/p>\n<h4>About the Illustrations<\/h4>\n<p>Unless otherwise noted, all images are created by the author, incorporating licensed design elements from Canva\u00a0Pro.<\/p>\n<p><strong>See more Model Evaluation &amp; Optimization methods here:<\/strong><\/p>\n<p><a href=\"https:\/\/medium.com\/@samybaladram\/list\/331287896864\">Model Evaluation &amp; Optimization<\/a><\/p>\n<p><strong>You might also like:<\/strong><\/p>\n<ul>\n<li><a href=\"https:\/\/medium.com\/@samybaladram\/list\/673fc83cd7db\">Ensemble Learning<\/a><\/li>\n<li><a href=\"https:\/\/medium.com\/@samybaladram\/list\/b3586f0a772c\">Classification Algorithms<\/a><\/li>\n<\/ul>\n<hr>\n<p><a href=\"https:\/\/towardsdatascience.com\/model-calibration-explained-a-visual-guide-with-code-examples-for-beginners-55f368bafe72\">Model Calibration, Explained: A Visual Guide with Code Examples for Beginners<\/a> was originally published in <a href=\"https:\/\/towardsdatascience.com\/\">Towards Data Science<\/a> on Medium, where people are continuing the conversation by 
highlighting and responding to this story.<\/p>\n<\/div>\n<p>Samy Baladram<br \/>\n<a href=\"https:\/\/medium.com\/m\/global-identity-2?redirectUrl=https%3A%2F%2Ftowardsdatascience.com%2Fmodel-calibration-explained-a-visual-guide-with-code-examples-for-beginners-55f368bafe72\">Go to original source<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Model Calibration, Explained: A Visual Guide with Code Examples for Beginners MODEL EVALUATION &amp; OPTIMIZATION When all models have similar accuracy, now\u00a0what? You\u2019ve trained several classification models, and they all seem to be performing well with high accuracy scores. Congratulations! But hold on\u200a\u2014\u200ais one model truly better than the others? Accuracy alone doesn\u2019t tell the [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[62,1226,837,1224,1223,1225],"tags":[1227,103,921],"class_list":["post-1112","post","type-post","status-publish","format-standard","hentry","category-aimldsaimlds","category-brier-score","category-classification","category-classification-metrics","category-log-loss","category-model-calibration","tag-calibration","tag-model","tag-probability"],"_links":{"self":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/1112"}],"collection":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/comments?post=1112"}],"version-history":[{"count":0,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\
/posts\/1112\/revisions"}],"wp:attachment":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/media?parent=1112"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/categories?post=1112"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/tags?post=1112"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}