{"id":2998,"date":"2025-04-10T07:02:46","date_gmt":"2025-04-10T07:02:46","guid":{"rendered":"https:\/\/mailitics.com\/index.php\/2025\/04\/10\/catboost-inner-workings-and-optimizations\/"},"modified":"2025-04-10T07:02:46","modified_gmt":"2025-04-10T07:02:46","slug":"catboost-inner-workings-and-optimizations","status":"publish","type":"post","link":"https:\/\/mailitics.com\/index.php\/2025\/04\/10\/catboost-inner-workings-and-optimizations\/","title":{"rendered":"Why CatBoost Works So Well: The Engineering Behind the Magic"},"content":{"rendered":"<div>\n<p class=\"wp-block-paragraph\">Gradient boosting is a cornerstone technique for modeling tabular data due to its speed and simplicity. It delivers strong results with little fuss. There are several implementations to choose from, such as LightGBM and XGBoost. <a href=\"https:\/\/towardsdatascience.com\/tag\/catboost\/\" title=\"CatBoost\">CatBoost<\/a> is one such variant. In this post, we will take a detailed look at this model, explore its inner workings, and understand what makes it a great choice for real-world tasks.<\/p>\n<h2 class=\"wp-block-heading\">Target Statistic<\/h2>\n<figure class=\"wp-block-image aligncenter size-large is-resized\"><img data-recalc-dims=\"1\" height=\"1024\" width=\"813\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/04\/cb_target_encoding-813x1024.png?resize=813%2C1024&#038;ssl=1\" alt=\"Table illustrating target encoding for categorical values. It maps vehicle types (Car, Bike, Bus, and Cycle) to numerical target means: 3.9, 1.2, 11.7, and 0.8 respectively. 
A curved arrow at the bottom indicates the transformation from category to numeric value\" class=\"wp-image-601393\" style=\"width:433px;height:auto\"><figcaption class=\"wp-element-caption\">Target Encoding Example: each category is replaced by the average value of the target variable for that category. Image by author<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">One of the important contributions of the CatBoost paper is a new method of calculating the Target Statistic. What is a Target Statistic? If you have worked with categorical variables before, you\u2019ll know that the most rudimentary way to deal with them is one-hot encoding. From experience, you\u2019ll also know that this introduces a host of problems such as sparsity, the curse of dimensionality, and memory issues, especially for categorical variables with high cardinality.<\/p>\n<h3 class=\"wp-block-heading\">Greedy Target Statistic<\/h3>\n<p class=\"wp-block-paragraph\">To avoid one-hot encoding, we calculate the Target Statistic for the categorical variables instead. This means we calculate the mean of the target variable at each unique value of the categorical variable. So if a categorical variable takes the values <code>A<\/code>, <code>B<\/code>, and <code>C<\/code>, we calculate the average value of \\(y\\) for each of these values and replace each value with its corresponding average of \\(y\\).<\/p>\n<p class=\"wp-block-paragraph\">That sounds good, right? It does, but this approach comes with its own problem: Target Leakage. To understand this, let\u2019s take an extreme example. Extreme examples are often the easiest way to tease out issues in an approach. 
Consider the dataset below:<\/p>\n<figure class=\"wp-block-table is-style-stripes has-small-font-size\">\n<table class=\"has-fixed-layout\" style=\"border-style:none;border-width:0px\">\n<thead>\n<tr>\n<th class=\"has-text-align-center\" data-align=\"center\">Categorical Column<\/th>\n<th class=\"has-text-align-center\" data-align=\"center\">Target Column<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td class=\"has-text-align-center\" data-align=\"center\">A<\/td>\n<td class=\"has-text-align-center\" data-align=\"center\">0<\/td>\n<\/tr>\n<tr>\n<td class=\"has-text-align-center\" data-align=\"center\">B<\/td>\n<td class=\"has-text-align-center\" data-align=\"center\">1<\/td>\n<\/tr>\n<tr>\n<td class=\"has-text-align-center\" data-align=\"center\">C<\/td>\n<td class=\"has-text-align-center\" data-align=\"center\">0<\/td>\n<\/tr>\n<tr>\n<td class=\"has-text-align-center\" data-align=\"center\">D<\/td>\n<td class=\"has-text-align-center\" data-align=\"center\">1<\/td>\n<\/tr>\n<tr>\n<td class=\"has-text-align-center\" data-align=\"center\">E<\/td>\n<td class=\"has-text-align-center\" data-align=\"center\">0<\/td>\n<\/tr>\n<\/tbody>\n<\/table><figcaption class=\"wp-element-caption\">Greedy Target Statistic: Compute the mean target value for each unique category<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">Now let\u2019s write the equation for calculating the Target Statistic:<br \/>\\[\\hat{x}^i_k = \\frac{\\sum_{j=1}^{n} 1_{\\{x^i_j = x^i_k\\}} \\cdot y_j + a p}{\\sum_{j=1}^{n} 1_{\\{x^i_j = x^i_k\\}} + a}\\]<\/p>\n<p class=\"wp-block-paragraph\">Here \\(x^i_j\\) is the value of the <strong>i-th categorical feature<\/strong> for the <strong>j-th sample<\/strong>. So for the k-th sample, we iterate over all samples of \\(x^i\\), select the ones having the value \\(x^i_k\\), and take the average value of \\(y\\) over those samples. Instead of taking a direct average, we take a smoothed average, which is what the \\(a\\) and \\(p\\) terms are for. 
The \\(a\\) parameter is the smoothing parameter and \\(p\\) is the global mean of \\(y\\).<\/p>\n<p class=\"wp-block-paragraph\">If we calculate the Target Statistic using the formula above, we get:<\/p>\n<figure class=\"wp-block-table is-style-stripes has-small-font-size\">\n<table class=\"has-fixed-layout\" style=\"border-style:none;border-width:0px\">\n<thead>\n<tr>\n<th class=\"has-text-align-center\" data-align=\"center\">Categorical Column<\/th>\n<th class=\"has-text-align-center\" data-align=\"center\">Target Column<\/th>\n<th class=\"has-text-align-center\" data-align=\"center\">Target Statistic<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td class=\"has-text-align-center\" data-align=\"center\">A<\/td>\n<td class=\"has-text-align-center\" data-align=\"center\">0<\/td>\n<td class=\"has-text-align-center\" data-align=\"center\">\\(\\frac{ap}{1+a}\\)<\/td>\n<\/tr>\n<tr>\n<td class=\"has-text-align-center\" data-align=\"center\">B<\/td>\n<td class=\"has-text-align-center\" data-align=\"center\">1<\/td>\n<td class=\"has-text-align-center\" data-align=\"center\">\\(\\frac{1+ap}{1+a}\\)<\/td>\n<\/tr>\n<tr>\n<td class=\"has-text-align-center\" data-align=\"center\">C<\/td>\n<td class=\"has-text-align-center\" data-align=\"center\">0<\/td>\n<td class=\"has-text-align-center\" data-align=\"center\">\\(\\frac{ap}{1+a}\\)<\/td>\n<\/tr>\n<tr>\n<td class=\"has-text-align-center\" data-align=\"center\">D<\/td>\n<td class=\"has-text-align-center\" data-align=\"center\">1<\/td>\n<td class=\"has-text-align-center\" data-align=\"center\">\\(\\frac{1+ap}{1+a}\\)<\/td>\n<\/tr>\n<tr>\n<td class=\"has-text-align-center\" data-align=\"center\">E<\/td>\n<td class=\"has-text-align-center\" data-align=\"center\">0<\/td>\n<td class=\"has-text-align-center\" data-align=\"center\">\\(\\frac{ap}{1+a}\\)<\/td>\n<\/tr>\n<\/tbody>\n<\/table><figcaption class=\"wp-element-caption\">Calculation of Greedy Target Statistic with Smoothing<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">Now if I use this <code>Target 
Statistic<\/code> column as my training data, I will get a perfect split at \\(\\text{threshold} = \\frac{0.5+ap}{1+a}\\). Anything above this value will be classified as <code>1<\/code> and anything below will be classified as <code>0<\/code>. I have a perfect classification at this point, so I get 100% accuracy on my training data.<\/p>\n<p class=\"wp-block-paragraph\">Let\u2019s take a look at the test data. Since every value of the feature is unique, a test sample\u2019s category never appears in the training data, so both sums in the formula are zero and the Target Statistic becomes:<br \/>\\[TS = \\frac{0+ap}{0+a} = p\\]<br \/>If \\(\\text{threshold}\\) is greater than \\(p\\), all test data predictions will be \\(0\\). Conversely, if \\(\\text{threshold}\\) is less than \\(p\\), all test data predictions will be \\(1\\). Either way, every test sample gets the same prediction, leading to poor performance on the test set.<\/p>\n<p class=\"wp-block-paragraph\">Although we rarely see datasets where the values of a categorical variable are all unique, we do see cases of high cardinality. This extreme example shows the pitfalls of using the Greedy Target Statistic as an encoding approach.<\/p>\n<h3 class=\"wp-block-heading\">Leave One Out Target Statistic<\/h3>\n<p class=\"wp-block-paragraph\">So the Greedy TS didn\u2019t work out well for us. Let\u2019s try another method: the Leave One Out Target Statistic, which excludes the current row when computing its statistic. At first glance, this looks promising. But, as it turns out, this too has its problems. Let\u2019s see how with another extreme example. This time let\u2019s assume that our categorical variable \\(x^i\\) has only one unique value, i.e., all values are the same. 
Consider the data below:<\/p>\n<figure class=\"wp-block-table is-style-stripes has-small-font-size\">\n<table class=\"has-fixed-layout\" style=\"border-style:none;border-width:0px\">\n<thead>\n<tr>\n<th class=\"has-text-align-center\" data-align=\"center\">Categorical Column<\/th>\n<th class=\"has-text-align-center\" data-align=\"center\">Target Column<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td class=\"has-text-align-center\" data-align=\"center\">A<\/td>\n<td class=\"has-text-align-center\" data-align=\"center\">0<\/td>\n<\/tr>\n<tr>\n<td class=\"has-text-align-center\" data-align=\"center\">A<\/td>\n<td class=\"has-text-align-center\" data-align=\"center\">1<\/td>\n<\/tr>\n<tr>\n<td class=\"has-text-align-center\" data-align=\"center\">A<\/td>\n<td class=\"has-text-align-center\" data-align=\"center\">0<\/td>\n<\/tr>\n<tr>\n<td class=\"has-text-align-center\" data-align=\"center\">A<\/td>\n<td class=\"has-text-align-center\" data-align=\"center\">1<\/td>\n<\/tr>\n<\/tbody>\n<\/table><figcaption class=\"wp-element-caption\">Example data for an extreme case where a categorical feature has just one unique value<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">If we calculate the Leave One Out Target Statistic, we get:<\/p>\n<figure class=\"wp-block-table is-style-stripes has-small-font-size\">\n<table class=\"has-fixed-layout\" style=\"border-style:none;border-width:0px\">\n<thead>\n<tr>\n<th class=\"has-text-align-center\" data-align=\"center\">Categorical Column<\/th>\n<th class=\"has-text-align-center\" data-align=\"center\">Target Column<\/th>\n<th class=\"has-text-align-center\" data-align=\"center\">Target Statistic<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td class=\"has-text-align-center\" data-align=\"center\">A<\/td>\n<td class=\"has-text-align-center\" data-align=\"center\">0<\/td>\n<td class=\"has-text-align-center\" data-align=\"center\">\\(\\frac{n^+ - y_k + ap}{n - 1 + a}\\)<\/td>\n<\/tr>\n<tr>\n<td class=\"has-text-align-center\" 
data-align=\"center\">A<\/td>\n<td class=\"has-text-align-center\" data-align=\"center\">1<\/td>\n<td class=\"has-text-align-center\" data-align=\"center\">\\(\\frac{n^+ - y_k + ap}{n - 1 + a}\\)<\/td>\n<\/tr>\n<tr>\n<td class=\"has-text-align-center\" data-align=\"center\">A<\/td>\n<td class=\"has-text-align-center\" data-align=\"center\">0<\/td>\n<td class=\"has-text-align-center\" data-align=\"center\">\\(\\frac{n^+ - y_k + ap}{n - 1 + a}\\)<\/td>\n<\/tr>\n<tr>\n<td class=\"has-text-align-center\" data-align=\"center\">A<\/td>\n<td class=\"has-text-align-center\" data-align=\"center\">1<\/td>\n<td class=\"has-text-align-center\" data-align=\"center\">\\(\\frac{n^+ - y_k + ap}{n - 1 + a}\\)<\/td>\n<\/tr>\n<\/tbody>\n<\/table><figcaption class=\"wp-element-caption\">Calculation of Leave One Out Target Statistic with Smoothing<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">Here:<br \/>\\(n\\) is the total number of samples in the data (in our case, 4)<br \/>\\(n^+\\) is the number of positive samples in the data (in our case, 2)<br \/>\\(y_k\\) is the value of the target column in that row<br \/>The denominator uses \\(n - 1\\) because the current row is left out of the count as well.<br \/>Substituting the above, we get:<\/p>\n<figure class=\"wp-block-table is-style-stripes has-small-font-size\">\n<table class=\"has-fixed-layout\" style=\"border-style:none;border-width:0px\">\n<thead>\n<tr>\n<th class=\"has-text-align-center\" data-align=\"center\">Categorical Column<\/th>\n<th class=\"has-text-align-center\" data-align=\"center\">Target Column<\/th>\n<th class=\"has-text-align-center\" data-align=\"center\">Target Statistic<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td class=\"has-text-align-center\" data-align=\"center\">A<\/td>\n<td class=\"has-text-align-center\" data-align=\"center\">0<\/td>\n<td class=\"has-text-align-center\" data-align=\"center\">\\(\\frac{2 + ap}{3+a}\\)<\/td>\n<\/tr>\n<tr>\n<td class=\"has-text-align-center\" data-align=\"center\">A<\/td>\n<td class=\"has-text-align-center\" data-align=\"center\">1<\/td>\n<td class=\"has-text-align-center\" data-align=\"center\">\\(\\frac{1 + 
ap}{3+a}\\)<\/td>\n<\/tr>\n<tr>\n<td class=\"has-text-align-center\" data-align=\"center\">A<\/td>\n<td class=\"has-text-align-center\" data-align=\"center\">0<\/td>\n<td class=\"has-text-align-center\" data-align=\"center\">\\(\\frac{2 + ap}{3+a}\\)<\/td>\n<\/tr>\n<tr>\n<td class=\"has-text-align-center\" data-align=\"center\">A<\/td>\n<td class=\"has-text-align-center\" data-align=\"center\">1<\/td>\n<td class=\"has-text-align-center\" data-align=\"center\">\\(\\frac{1 + ap}{3+a}\\)<\/td>\n<\/tr>\n<\/tbody>\n<\/table><figcaption class=\"wp-element-caption\">Substituting values of <code>n<\/code> and <code>n<sup>+<\/sup><\/code><\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">Now, if I use this <code>Target Statistic<\/code> column as my training data, I will get a perfect split at \\(\\text{threshold} = \\frac{1.5+ap}{3+a}\\). Anything above this value will be classified as <code>0<\/code> and anything below will be classified as <code>1<\/code>. I have a perfect classification at this point, so I again get 100% accuracy on my training data.<\/p>\n<p class=\"wp-block-paragraph\">You see the problem, right? 
My categorical variable, which has only a single unique value and therefore carries no information, is producing different Target Statistic values that perform great on the training data but fail miserably on the test data.<\/p>\n<h3 class=\"wp-block-heading\">Ordered Target Statistic<\/h3>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" height=\"757\" width=\"1024\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/04\/cb_ordered_boosting-1024x757.png?resize=1024%2C757&#038;ssl=1\" alt=\"Illustration of ordered learning: CatBoost processes data in a randomly permuted order and predicts each sample using only the earlier samples (Image by Author)\" class=\"wp-image-601391\"><figcaption class=\"wp-element-caption\">Illustration of ordered learning: CatBoost processes data in a randomly permuted order and predicts each sample using only the earlier samples. Image by author<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">CatBoost introduces a technique called Ordered Target Statistic to address the issues discussed above. This is the core principle of CatBoost\u2019s handling of categorical variables.<\/p>\n<p class=\"wp-block-paragraph\">This method, inspired by online learning, uses only past data to make predictions. CatBoost generates a random permutation (random ordering) \\(\\sigma\\) of the training data. To compute the Target Statistic for the sample at row \\(k\\) of the permutation, CatBoost uses only the samples in rows \\(1\\) to \\(k-1\\). For the test data, it uses the entire training data to compute the statistic.<\/p>\n<p class=\"wp-block-paragraph\">Additionally, CatBoost generates a new permutation for each tree, rather than reusing the same permutation each time. 
This reduces the variance of the statistic for samples that land early in a permutation, where only a few preceding samples are available.<\/p>\n<h2 class=\"wp-block-heading\">Ordered Boosting<\/h2>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" height=\"513\" width=\"1024\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/04\/cb_algo-1024x513.png?resize=1024%2C513&#038;ssl=1\" alt=\"Diagram illustrating the ordered boosting mechanism in CatBoost. Data points x\u2081 through x\u1d62 are shown sequentially, with earlier samples used to compute predictions for later ones. Each x\u1d62 is associated with a model prediction M, where the prediction for x\u1d62 is computed using the model trained on previous data points. The equations show how residuals are calculated and how the model is updated: r\u1d57(x\u1d62, y\u1d62) = y\u1d62 \u2212 M\u207d\u1d57\u207b\u00b9\u207e\u1d62\u207b\u00b9(x\u1d62), and \u0394M is learned from samples with order less than or equal to i. Final model update: M\u1d62 = M\u1d62 + \u0394M.\" class=\"wp-image-601389\"><figcaption class=\"wp-element-caption\">This visualization shows how CatBoost computes residuals and updates the model: for sample x\u1d62, the model predicts using only earlier data points. <a href=\"https:\/\/doi.org\/10.1093\/fqsafe\/fyae007\">Source<\/a><\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">Another important innovation introduced by the CatBoost paper is its use of Ordered Boosting. It builds on the same principle as the Ordered Target Statistic: CatBoost randomly permutes the training data at the start of each tree and makes predictions sequentially.<\/p>\n<p class=\"wp-block-paragraph\">In traditional boosting methods, when training tree \\(t\\), the model uses predictions from the previous tree \\(t-1\\) for all training samples, including the one it is currently predicting. 
This can lead to <strong>target leakage<\/strong>, as the model may indirectly use the label of the current sample during training.<\/p>\n<p class=\"wp-block-paragraph\">To address this issue, CatBoost uses Ordered Boosting where, for a given sample, it only uses predictions from previous rows in the training data to calculate gradients and build trees. For each row \\(i\\) in the permutation, CatBoost calculates the output value of a leaf using only the samples before \\(i\\). The model uses this value to get the prediction for row \\(i\\). Thus, the model predicts each row <strong>without looking at its label<\/strong>.<\/p>\n<p class=\"wp-block-paragraph\">CatBoost trains each tree using a new random permutation to average out the variance that affects the early samples of any single permutation.<br \/>Let\u2019s say we have 5 data points: <code>A, B, C, D, E<\/code>. CatBoost creates a <strong>random permutation<\/strong> of these points. Suppose the permutation is: <code>\u03c3 = [C, A, E, B, D]<\/code><\/p>\n<figure class=\"wp-block-table is-style-stripes\">\n<table class=\"has-fixed-layout\">\n<thead>\n<tr>\n<th class=\"has-text-align-center\" data-align=\"center\">Step<\/th>\n<th>Data Used to Train<\/th>\n<th>Data Point Being Predicted<\/th>\n<th>Notes<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td class=\"has-text-align-center\" data-align=\"center\">1<\/td>\n<td>None<\/td>\n<td>C<\/td>\n<td>No previous data \u2192 use prior<\/td>\n<\/tr>\n<tr>\n<td class=\"has-text-align-center\" data-align=\"center\">2<\/td>\n<td>C<\/td>\n<td>A<\/td>\n<td>Model trained on C only<\/td>\n<\/tr>\n<tr>\n<td class=\"has-text-align-center\" data-align=\"center\">3<\/td>\n<td>C, A<\/td>\n<td>E<\/td>\n<td>Model trained on C, A<\/td>\n<\/tr>\n<tr>\n<td class=\"has-text-align-center\" data-align=\"center\">4<\/td>\n<td>C, A, E<\/td>\n<td>B<\/td>\n<td>Model trained on C, A, E<\/td>\n<\/tr>\n<tr>\n<td class=\"has-text-align-center\" data-align=\"center\">5<\/td>\n<td>C, A, E, B<\/td>\n<td>D<\/td>\n<td>Model trained on 
C, A, E, B<\/td>\n<\/tr>\n<\/tbody>\n<\/table><figcaption class=\"wp-element-caption\">Table highlighting how CatBoost uses random permutation to perform training<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">This avoids using the actual label of the current row to get the prediction thus preventing <strong>leakage<\/strong>.<\/p>\n<h2 class=\"wp-block-heading\">Building a Tree<\/h2>\n<p class=\"wp-block-paragraph\">Each time CatBoost builds a tree, it creates a random permutation of the training data. It calculates the ordered target statistic for all the categorical variables with more than two unique values. For a binary categorical variable, it maps the values to zeros and ones.<\/p>\n<p class=\"wp-block-paragraph\">CatBoost processes data as if the data is arriving sequentially. It begins with an initial prediction of zero for all instances, meaning the residuals are initially equivalent to the target values.<\/p>\n<p>As training proceeds, CatBoost updates the leaf output for each sample using the residuals of the previous samples that fall into the same leaf. By not using the current sample\u2019s label for prediction, CatBoost effectively prevents data leakage.<\/p>\n<h2 class=\"wp-block-heading\">Split Candidates<\/h2>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" height=\"916\" width=\"1024\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/04\/cb_split_candids-1024x916.png?resize=1024%2C916&#038;ssl=1\" alt=\"Histogram showing how continuous features can be divided into bins\u2014CatBoost evaluates splits using these binned values instead of raw continuous values\" class=\"wp-image-601392\"><figcaption class=\"wp-element-caption\">CatBoost bins continuous features to reduce the search space for optimal splits. Each bin edge and split point represents a potential decision threshold. 
Image by author<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">At the core of a decision tree lies the task of selecting the optimal feature and threshold for splitting a node. This involves evaluating multiple feature-threshold combinations and selecting the one that gives the best reduction in loss. CatBoost does something similar. It discretizes each continuous variable into bins to simplify the search for the optimal combination, then evaluates each of these feature-bin combinations to determine the best split.<\/p>\n<p class=\"wp-block-paragraph\">A key difference from other tree-based models is that CatBoost uses Oblivious Trees, which apply the same split to all nodes at the same depth.<\/p>\n<h2 class=\"wp-block-heading\">Oblivious Trees<\/h2>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" height=\"465\" width=\"1024\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/04\/cb_tree_comp-1024x465.png?resize=1024%2C465&#038;ssl=1\" alt=\"Comparison between Oblivious Trees and Regular Trees. The Oblivious Tree on the left applies the same split condition at each level across all nodes, resulting in a symmetric structure. The Regular Tree on the right applies different conditions at each node, leading to an asymmetric structure with varied splits at different depths\" class=\"wp-image-601394\"><figcaption class=\"wp-element-caption\">Oblivious Trees vs. Regular Trees: an Oblivious Tree applies the same split condition to every node at a given depth, while a Regular Tree can apply a different condition at each node. Image by author<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">Unlike standard decision trees, where different nodes can split on different conditions (feature-threshold), Oblivious Trees use the same condition for all nodes at the same depth of a tree. At a given depth, all samples are evaluated against the same feature-threshold combination. 
This symmetry has several implications:<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">Speed and simplicity: since the same condition is applied across all nodes at the same depth, the trees produced are simpler and faster to train<\/li>\n<li class=\"wp-block-list-item\">Regularization: since every tree is forced to apply the same condition across all nodes at the same depth, there is a regularization effect on the predictions<\/li>\n<li class=\"wp-block-list-item\">Parallelization: the uniformity of the split condition makes it easier to parallelize tree construction and to use GPUs to accelerate training<\/li>\n<\/ul>\n<h2 class=\"wp-block-heading\">Conclusion<\/h2>\n<p class=\"wp-block-paragraph\">CatBoost stands out by directly tackling a long-standing challenge: how to handle categorical variables effectively without causing target leakage. Through innovations like <strong>Ordered Target Statistics<\/strong>, <strong>Ordered Boosting<\/strong>, and the use of <strong>Oblivious Trees<\/strong>, it efficiently balances robustness and accuracy.<\/p>\n<p class=\"wp-block-paragraph\">If you found this deep dive helpful, you might enjoy another deep dive on the differences between <a href=\"https:\/\/shubhamgandhi.net\/model-deep-dives\/sgd-classifier-vs-logistic-regression\/\" target=\"_blank\" rel=\"noreferrer noopener\">Stochastic Gradient Classifier and Logistic Regression<\/a>.<\/p>\n<h2 class=\"wp-block-heading\">Further Reading<\/h2>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\"><a href=\"https:\/\/arxiv.org\/pdf\/1706.09516\" target=\"_blank\" rel=\"noreferrer noopener\">CatBoost: unbiased boosting with categorical features<\/a><\/li>\n<li class=\"wp-block-list-item\"><a href=\"https:\/\/catboost.ai\/docs\/en\/concepts\/algorithm-main-stages\">CatBoost: How training is performed<\/a><\/li>\n<li class=\"wp-block-list-item\"><a 
href=\"https:\/\/catboost.ai\/news\/catboost-enables-fast-gradient-boosting-on-decision-trees-using-gpus\">CatBoost Enables Fast Gradient Boosting on Decision Trees Using GPUs<\/a><\/li>\n<\/ul>\n<p>The post <a href=\"https:\/\/towardsdatascience.com\/catboost-inner-workings-and-optimizations\/\">Why CatBoost Works So Well: The Engineering Behind the Magic<\/a> appeared first on <a href=\"https:\/\/towardsdatascience.com\/\">Towards Data Science<\/a>.<\/p>\n<\/div>\n<p>Shubham Gandhi<br \/>\n<a href=\"https:\/\/towardsdatascience.com\/catboost-inner-workings-and-optimizations\/\">Go to original source<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Why CatBoost Works So Well: The Engineering Behind the Magic Gradient boosting is a cornerstone technique for modeling tabular data due to its speed and simplicity. It delivers great results without any fuss. When you look around you\u2019ll see multiple options like LightGBM, XGBoost, etc. Catboost is one such variant. 
In this post, we will [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[62,2339,83,1256,70,229],"tags":[1503,2340,926],"class_list":["post-2998","post","type-post","status-publish","format-standard","hentry","category-aimldsaimlds","category-catboost","category-data-science","category-gradient-boosting","category-machine-learning","category-math","tag-categorical","tag-statistic","tag-target"],"_links":{"self":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/2998"}],"collection":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/comments?post=2998"}],"version-history":[{"count":0,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/2998\/revisions"}],"wp:attachment":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/media?parent=2998"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/categories?post=2998"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/tags?post=2998"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}