{"id":2996,"date":"2025-04-10T07:02:43","date_gmt":"2025-04-10T07:02:43","guid":{"rendered":"https:\/\/mailitics.com\/index.php\/2025\/04\/10\/mining-rules-from-data\/"},"modified":"2025-04-10T07:02:43","modified_gmt":"2025-04-10T07:02:43","slug":"mining-rules-from-data","status":"publish","type":"post","link":"https:\/\/mailitics.com\/index.php\/2025\/04\/10\/mining-rules-from-data\/","title":{"rendered":"Mining Rules from\u00a0Data"},"content":{"rendered":"<div>\n<p class=\"wp-block-paragraph\"><mdspan datatext=\"el1744217464415\" class=\"mdspan-comment\">Working<\/mdspan> with products, we might face a need to introduce some \u201crules\u201d. Let me explain what I mean by \u201crules\u201d in practical examples:\u00a0<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">Imagine that we\u2019re seeing a massive wave of fraud in our product, and we want to restrict onboarding for a particular segment of customers to lower this risk. For example, we found out that the majority of fraudsters had specific user agents and IP addresses from certain countries.\u00a0<\/li>\n<li class=\"wp-block-list-item\">Another option is to send coupons to customers to use in our online shop. However, we would like to treat only customers who are likely to churn since loyal users will return to the product anyway. We might figure out that the most feasible group is customers who joined less than a year ago and decreased their spending by 30%+ last month.\u00a0<\/li>\n<li class=\"wp-block-list-item\">Transactional businesses often have a segment of customers where they are losing money. For example, a bank customer passed the verification and regularly reached out to customer support (so generated onboarding and servicing costs) while doing almost no transactions (so not generating any revenue). 
The bank might introduce a small monthly subscription fee for customers with less than 1000$ in their account since they are likely non-profitable.<\/li>\n<\/ul>\n<p class=\"wp-block-paragraph\">Of course, in all these cases, we might have used a complex Machine Learning model that would take into account all the factors and predict the probability (either of a customer being a fraudster or churning). Still, under some circumstances, we might prefer just a set of static rules for the following reasons: \u00a0<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">\n<strong>The speed and complexity of implementation. <\/strong>Deploying an ML model in production takes time and effort. If you are experiencing a fraud wave right now, it might be more feasible to go live with a set of static rules that can be implemented quickly and then work on a comprehensive solution.\u00a0<\/li>\n<li class=\"wp-block-list-item\">\n<strong>Interpretability. <\/strong>ML models are black boxes. Even though we might be able to understand at a high level how they work and what features are the most important ones, it\u2019s challenging to explain them to customers. In the example of subscription fees for non-profitable customers, it\u2019s important to share a set of transparent rules with customers so that they can understand the pricing.\u00a0<\/li>\n<li class=\"wp-block-list-item\">\n<strong>Compliance. <\/strong>Some industries, like finance or healthcare, might require auditable and rule-based decisions to meet compliance requirements.<\/li>\n<\/ul>\n<p class=\"wp-block-paragraph\">In this article, I want to show you how we can solve business problems using such rules. 
We will take a practical example and go really deep into this topic:<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">we will discuss which models we can use to mine such rules from data,<\/li>\n<li class=\"wp-block-list-item\">we will build a <a href=\"https:\/\/towardsdatascience.com\/tag\/decision-tree\/\" title=\"Decision Tree\">Decision Tree<\/a> Classifier from scratch to learn how it works,<\/li>\n<li class=\"wp-block-list-item\">we will fit the <code>sklearn<\/code> Decision Tree Classifier model to extract the rules from the data,<\/li>\n<li class=\"wp-block-list-item\">we will learn how to parse the Decision Tree structure to get the resulting segments,<\/li>\n<li class=\"wp-block-list-item\">finally, we will explore different options for category encoding, since the <code>sklearn<\/code> implementation doesn\u2019t support categorical variables.<\/li>\n<\/ul>\n<p class=\"wp-block-paragraph\">We have lots of topics to cover, so let\u2019s jump into it.<\/p>\n<h2 class=\"wp-block-heading\">Case<\/h2>\n<p class=\"wp-block-paragraph\">As usual, it\u2019s easier to learn something with a practical example. So, let\u2019s start by discussing the task we will be solving in this article.\u00a0<\/p>\n<p class=\"wp-block-paragraph\">We will work with <a href=\"https:\/\/archive.ics.uci.edu\/dataset\/222\/bank+marketing\" target=\"_blank\" rel=\"noreferrer noopener\">the Bank Marketing<\/a> dataset (<mdspan datatext=\"el1744217074911\" class=\"mdspan-comment\">CC BY 4.0 license<\/mdspan>). This dataset contains data about the direct marketing campaigns of a Portuguese banking institution. For each customer, we know a bunch of features and whether they subscribed to a term deposit (our target).\u00a0<\/p>\n<p class=\"wp-block-paragraph\">Our business goal is to maximise the number of conversions (subscriptions) with limited operational resources. 
So, we can\u2019t call the whole user base, and we want to reach the best outcome with the resources we have.<\/p>\n<p class=\"wp-block-paragraph\">The first step is to look at the data. So, let\u2019s load the data set.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">import pandas as pd\npd.set_option('display.max_colwidth', 5000)\npd.set_option('display.float_format', lambda x: '%.2f' % x)\n\ndf = pd.read_csv('bank-full.csv', sep = ';')\ndf = df.drop(['duration', 'campaign'], axis = 1)\n# removed columns related to the current marketing campaign, \n# since they introduce data leakage\n\ndf.head()<\/code><\/pre>\n<p class=\"wp-block-paragraph\">We know quite a lot about the customers, including personal data (such as job type or marital status) and their previous behaviour (such as whether they have a loan or their average yearly balance).<\/p>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" height=\"204\" width=\"1024\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/04\/Screenshot-2025-02-04-at-22.12.18-1024x204.png?resize=1024%2C204&#038;ssl=1\" alt=\"\" class=\"wp-image-601337\"><figcaption class=\"wp-element-caption\">Image by author<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">The next step is to select a machine-learning model. There are two classes of models that are usually used when we need something easily interpretable:<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">decision trees,<\/li>\n<li class=\"wp-block-list-item\">linear or logistic regression.<\/li>\n<\/ul>\n<p class=\"wp-block-paragraph\">Both options are feasible and can give us good models that can be easily implemented and interpreted. 
However, in this article, I would like to stick to the decision tree model because it produces actual rules, while logistic regression will give us probability as a weighted sum of features.<\/p>\n<h2 class=\"wp-block-heading\">Data Preprocessing\u00a0<\/h2>\n<p class=\"wp-block-paragraph\">As we\u2019ve seen in the data, there are lots of categorical variables (such as education or marital status). Unfortunately, the <code>sklearn<\/code> decision tree implementation can\u2019t handle categorical data, so we need to do some preprocessing.<\/p>\n<p class=\"wp-block-paragraph\">Let\u2019s start by transforming yes\/no flags into integers.\u00a0<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">for p in ['default', 'housing', 'loan', 'y']:\n    df[p] = df[p].map(lambda x: 1 if x == 'yes' else 0)<\/code><\/pre>\n<p class=\"wp-block-paragraph\">The next step is to transform the <code>month<\/code> variable. We can use one-hot encoding for months, introducing flags like <code>month_jan<\/code>\u00a0, <code>month_feb<\/code>\u00a0, etc. However, there might be seasonal effects, and I think it would be more reasonable to convert months into integers following their order.\u00a0<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">month_map = {\n    'jan': 1, 'feb': 2, 'mar': 3, 'apr': 4, 'may': 5, 'jun': 6, \n    'jul': 7, 'aug': 8, 'sep': 9, 'oct': 10, 'nov': 11, 'dec': 12\n}\n# I saved 5 mins by asking ChatGPT to do this mapping\n\ndf['month'] = df.month.map(lambda x: month_map[x] if x in month_map else x)<\/code><\/pre>\n<p class=\"wp-block-paragraph\">For all other categorical variables, let\u2019s use one-hot encoding. 
We will discuss different strategies for category encoding later, but for now, let\u2019s stick to the default approach.<\/p>\n<p class=\"wp-block-paragraph\">The easiest way to do one-hot encoding is to use the <code>get_dummies<\/code> <a href=\"https:\/\/pandas.pydata.org\/docs\/reference\/api\/pandas.get_dummies.html\" rel=\"noreferrer noopener\" target=\"_blank\">function<\/a> in pandas.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">fin_df = pd.get_dummies(\n  df, columns=['job', 'marital', 'education', 'poutcome', 'contact'], \n  dtype = int, # to convert to flags 0\/1\n  drop_first = False # to keep all possible values\n)<\/code><\/pre>\n<p class=\"wp-block-paragraph\">This function transforms each categorical variable into a set of 1\/0 columns, one for each possible value. We can see how it works for the <code>poutcome<\/code> column.\u00a0<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">(\n    # get_dummies keeps the original index, so we can join the raw column back\n    fin_df.join(df[['poutcome']])\n    .groupby(['poutcome', 'poutcome_unknown', 'poutcome_failure', \n      'poutcome_other', 'poutcome_success'], as_index = False).y.count()\n    .rename(columns = {'y': 'cases'})\n    .sort_values('cases', ascending = False)\n)<\/code><\/pre>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" height=\"228\" width=\"1024\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/04\/Screenshot-2025-02-04-at-22.34.41-1024x228.png?resize=1024%2C228&#038;ssl=1\" alt=\"\" class=\"wp-image-601338\"><figcaption class=\"wp-element-caption\">Image by author<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">Our data is now ready, and it\u2019s time to discuss how decision tree classifiers work.<\/p>\n<h2 class=\"wp-block-heading\">Decision Tree Classifier: Theory<\/h2>\n<p class=\"wp-block-paragraph\">In this section, we\u2019ll explore the theory behind the Decision Tree Classifier and build the algorithm from scratch. 
If you\u2019re more interested in a practical example, feel free to skip ahead to the next part.<\/p>\n<p class=\"wp-block-paragraph\">The easiest way to understand the decision tree model is to look at an example. So, let\u2019s build a simple model based on our data. We will use <a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.tree.DecisionTreeClassifier.html\" target=\"_blank\" rel=\"noreferrer noopener\">DecisionTreeClassifier<\/a> from <code>sklearn<\/code>.\u00a0<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">import sklearn.tree\n\nfeature_names = fin_df.drop(['y'], axis = 1).columns\nmodel = sklearn.tree.DecisionTreeClassifier(\n  max_depth = 2, min_samples_leaf = 1000)\nmodel.fit(fin_df[feature_names], fin_df['y'])<\/code><\/pre>\n<p class=\"wp-block-paragraph\">The next step is to visualise the tree.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">import graphviz\n\ndot_data = sklearn.tree.export_graphviz(\n    model, out_file=None, feature_names = feature_names, filled = True, \n    proportion = True, precision = 2 \n    # to show shares of classes instead of absolute numbers\n)\n\ngraph = graphviz.Source(dot_data)\ngraph<\/code><\/pre>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" height=\"368\" width=\"1024\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/04\/Screenshot-2025-02-04-at-22.49.47-1024x368.png?resize=1024%2C368&#038;ssl=1\" alt=\"\" class=\"wp-image-601339\"><figcaption class=\"wp-element-caption\">Image by author<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">So, we can see that the model is straightforward. It\u2019s a set of binary splits that we can use as heuristics.\u00a0<\/p>\n<p class=\"wp-block-paragraph\">Let\u2019s figure out how the classifier works under the hood. 
As usual, the best way to understand the model is to build the logic from scratch.\u00a0<\/p>\n<p class=\"wp-block-paragraph\">The cornerstone of any optimisation problem is the objective function. By default, in the decision tree classifier, we\u2019re optimising <a href=\"https:\/\/en.wikipedia.org\/wiki\/Gini_coefficient\" rel=\"noreferrer noopener\" target=\"_blank\">the Gini coefficient<\/a> (strictly speaking, the Gini impurity, which is what <code>sklearn<\/code> calls this criterion). Imagine drawing two random items from the sample, one after the other, with replacement. The Gini coefficient equals the probability that these items are from different classes. So, our goal will be minimising the Gini coefficient.\u00a0<\/p>\n<p class=\"wp-block-paragraph\">In the case of just two classes (like in our example, where marketing intervention was either successful or not), the Gini coefficient is defined just by one parameter <code>p<\/code>\u00a0, where <code>p<\/code> is the probability of getting an item from one of the classes. Here\u2019s the formula:<\/p>\n<p class=\"wp-block-shortcode\">\\[\\textbf{gini}(\\textsf{p}) = 1 - \\textsf{p}^2 - (1 - \\textsf{p})^2 = 2 * \\textsf{p} * (1 - \\textsf{p})\\]<\/p>\n<p class=\"wp-block-paragraph\">If our classification is ideal and we are able to separate the classes perfectly, then the Gini coefficient will be equal to 0. The worst-case scenario is when <code>p = 0.5<\/code>\u00a0, then the Gini coefficient is also equal to 0.5.<\/p>\n<p class=\"wp-block-paragraph\">With the formula above, we can calculate the Gini coefficient for each leaf of the tree. To calculate the Gini coefficient for the whole tree, we need to combine the Gini coefficients of binary splits. 
For that, we can just get a weighted sum:<\/p>\n<p class=\"wp-block-shortcode\">\\[\\textbf{gini}_{\\textsf{total}} = \\textbf{gini}_{\\textsf{left}} * \\frac{\\textbf{n}_{\\textsf{left}}}{\\textbf{n}_{\\textsf{left}} + \\textbf{n}_{\\textsf{right}}} + \\textbf{gini}_{\\textsf{right}} * \\frac{\\textbf{n}_{\\textsf{right}}}{\\textbf{n}_{\\textsf{left}} + \\textbf{n}_{\\textsf{right}}}\\]<\/p>\n<p class=\"wp-block-paragraph\">Now that we know what value we\u2019re optimising, we only need to define all possible binary splits, iterate through them and choose the best option.\u00a0<\/p>\n<p class=\"wp-block-paragraph\">Defining all possible binary splits is also quite straightforward. We can do it one by one for each parameter: sort the possible values and pick thresholds halfway between neighbouring values. For example, for months (integers from 1 to 12), the candidate thresholds are 1.5, 2.5 and so on up to 11.5.\u00a0<\/p>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" height=\"273\" width=\"1024\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/04\/Screenshot-2025-02-04-at-23.26.53-1024x273.png?resize=1024%2C273&#038;ssl=1\" alt=\"\" class=\"wp-image-601340\"><figcaption class=\"wp-element-caption\">Image by author<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">Let\u2019s try to code it and see whether we will come to the same result. 
First, we will define functions that calculate the Gini coefficient for a single dataset and for a combination of two datasets.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">def get_gini(df):\n    p = df.y.mean()\n    return 2*p*(1-p)\n\nprint(get_gini(fin_df)) \n# 0.2065\n# close to what we see at the root node of Decision Tree\n\ndef get_gini_comb(df1, df2):\n    n1 = df1.shape[0]\n    n2 = df2.shape[0]\n\n    gini1 = get_gini(df1)\n    gini2 = get_gini(df2)\n    return (gini1*n1 + gini2*n2)\/(n1 + n2)<\/code><\/pre>\n<p class=\"wp-block-paragraph\">The next step is to get all possible thresholds for one parameter and calculate their Gini coefficients.\u00a0<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">import tqdm\ndef optimise_one_parameter(df, param):\n    tmp = []\n    possible_values = list(sorted(df[param].unique()))\n    print(param)\n\n    for i in tqdm.tqdm(range(1, len(possible_values))): \n        # candidate threshold: midpoint between two neighbouring values\n        threshold = (possible_values[i-1] + possible_values[i])\/2\n        left_df = df[df[param] &lt;= threshold]\n        right_df = df[df[param] &gt; threshold]\n        tmp.append(\n            {'param': param, \n            'threshold': threshold, \n            'gini': get_gini_comb(left_df, right_df), \n            'sizes': (left_df.shape[0], right_df.shape[0])\n            }\n        )\n    return pd.DataFrame(tmp)<\/code><\/pre>\n<p class=\"wp-block-paragraph\">The final step is to iterate through all features and calculate all possible splits.\u00a0<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">tmp_dfs = []\nfor feature in feature_names:\n    tmp_dfs.append(optimise_one_parameter(fin_df, feature))\nopt_df = pd.concat(tmp_dfs)\nopt_df.sort_values('gini', ascending = True).head(5)<\/code><\/pre>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" height=\"258\" width=\"1024\" decoding=\"async\" 
src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/04\/Screenshot-2025-02-04-at-23.55.59-1024x258.png?resize=1024%2C258&#038;ssl=1\" alt=\"\" class=\"wp-image-601342\"><figcaption class=\"wp-element-caption\">Image by author<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">Wonderful, we\u2019ve got the same result as in our <code>DecisionTreeClassifier<\/code> model. The optimal split is whether <code>poutcome = success<\/code> or not. We\u2019ve reduced the Gini coefficient from 0.2065 to 0.1872.\u00a0<\/p>\n<p class=\"wp-block-paragraph\">To continue building the tree, we need to repeat the process recursively. For example, going down for the <code>poutcome_success &lt;= 0.5<\/code> branch:<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">tmp_dfs = []\nfor feature in feature_names:\n    tmp_dfs.append(optimise_one_parameter(\n      fin_df[fin_df.poutcome_success &lt;= 0.5], feature))\n\nopt_df = pd.concat(tmp_dfs)\nopt_df.sort_values('gini', ascending = True).head(5)<\/code><\/pre>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" height=\"250\" width=\"1024\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/04\/Screenshot-2025-02-04-at-23.57.56-1024x250.png?resize=1024%2C250&#038;ssl=1\" alt=\"\" class=\"wp-image-601343\"><figcaption class=\"wp-element-caption\">Image by author<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">The only question we still need to discuss is the stopping criteria. In our initial example, we\u2019ve used two conditions:<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">\n<code>max_depth = 2<\/code>\u200a\u2014\u200ait just limits the maximum depth of the tree,\u00a0<\/li>\n<li class=\"wp-block-list-item\">\n<code>min_samples_leaf = 1000<\/code> prevents us from getting leaf nodes with less than 1K samples. 
Because of this condition, we\u2019ve chosen a binary split by <code>contact_unknown<\/code> even though <code>age<\/code> led to a lower Gini coefficient.<\/li>\n<\/ul>\n<p class=\"wp-block-paragraph\">Also, I usually set <code>min_impurity_decrease<\/code>, which prevents us from splitting further if the gain is too small. By gain, we mean the decrease in the Gini coefficient.<\/p>\n<p class=\"wp-block-paragraph\">So, we\u2019ve understood how the Decision Tree Classifier works, and now it\u2019s time to use it in practice.<\/p>\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p class=\"wp-block-paragraph\"><em>If you\u2019re interested in how the Decision Tree Regressor works in detail, you can look it up in <a href=\"https:\/\/towardsdatascience.com\/interpreting-random-forests-638bca8b49ea\/\" target=\"_blank\" rel=\"noreferrer noopener\">my previous article<\/a>.<\/em><\/p>\n<\/blockquote>\n<h2 class=\"wp-block-heading\">Decision Trees:\u00a0Practice<\/h2>\n<p class=\"wp-block-paragraph\">We\u2019ve already built a simple tree model with two layers, but it\u2019s definitely not enough since it\u2019s too simple to get all the insights from the data. 
Let\u2019s train another Decision Tree, this time limiting the minimum number of samples in leaves and the minimum impurity decrease (reduction of the Gini coefficient).\u00a0<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">model = sklearn.tree.DecisionTreeClassifier(\n  min_samples_leaf = 1000, min_impurity_decrease=0.001)\nmodel.fit(fin_df[feature_names], fin_df['y'])\n\ndot_data = sklearn.tree.export_graphviz(\n    model, out_file=None, feature_names = feature_names, filled = True, \n    proportion = True, precision=2, impurity = True)\n\ngraph = graphviz.Source(dot_data)\n\n# saving graph to png file\npng_bytes = graph.pipe(format='png')\nwith open('decision_tree.png','wb') as f:\n    f.write(png_bytes)<\/code><\/pre>\n<figure class=\"wp-block-image size-full\"><img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/04\/image-36.png?ssl=1\" alt=\"\" class=\"wp-image-601344\"><figcaption class=\"wp-element-caption\">Image by author<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">That\u2019s it. We\u2019ve got our rules to split customers into groups (leaves). Now, we can iterate through groups and see which groups of customers we want to contact. Even though our model is relatively small, it\u2019s daunting to copy all conditions from the image. 
Luckily, we can parse <a href=\"https:\/\/scikit-learn.org\/stable\/auto_examples\/tree\/plot_unveil_tree_structure.html\" rel=\"noreferrer noopener\" target=\"_blank\">the tree structure<\/a> and get all the groups from the model.<\/p>\n<p class=\"wp-block-paragraph\">The Decision Tree classifier has an attribute <code>tree_<\/code> that gives us access to low-level attributes of the tree, such as <code>node_count<\/code>\u00a0.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">n_nodes = model.tree_.node_count\nprint(n_nodes)\n# 13<\/code><\/pre>\n<p class=\"wp-block-paragraph\">The <code>tree_<\/code> variable also stores the entire tree structure as parallel arrays, where the <code>i<\/code><sub>th<\/sub> element of each array stores the information about the node <code>i<\/code>. For the root, <code>i<\/code> equals 0.<\/p>\n<p class=\"wp-block-paragraph\">Here are the arrays we have to represent the tree structure:\u00a0<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">\n<code>children_left<\/code> and <code>children_right<\/code>\u200a\u2014\u200aIDs of left and right nodes, respectively; if the node is a leaf, then -1.<\/li>\n<li class=\"wp-block-list-item\">\n<code>feature<\/code>\u200a\u2014\u200aindex of the feature used to split the node <code>i<\/code>\u00a0; if the node is a leaf, then -2.<\/li>\n<li class=\"wp-block-list-item\">\n<code>threshold<\/code>\u200a\u2014\u200athreshold value used for the binary split of the node <code>i<\/code>\u00a0.<\/li>\n<li class=\"wp-block-list-item\">\n<code>n_node_samples<\/code>\u200a\u2014\u200anumber of training samples that reached the node <code>i<\/code>\u00a0.<\/li>\n<li class=\"wp-block-list-item\">\n<code>value<\/code>\u200a\u2014\u200ashares of samples from each class.<\/li>\n<\/ul>\n<p class=\"wp-block-paragraph\">Let\u2019s save all these arrays.\u00a0<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">children_left = model.tree_.children_left\n# [ 1,  2,  
3,  4,  5,  6, -1, -1, -1, -1, -1, -1, -1]\nchildren_right = model.tree_.children_right\n# [12, 11, 10,  9,  8,  7, -1, -1, -1, -1, -1, -1, -1]\nfeatures = model.tree_.feature\n# [30, 34,  0,  3,  6,  6, -2, -2, -2, -2, -2, -2, -2]\nthresholds = model.tree_.threshold\n# [ 0.5,  0.5, 59.5,  0.5,  6.5,  2.5, -2. , -2. , -2. , -2. , -2. , -2. , -2. ]\nnum_nodes = model.tree_.n_node_samples\n# [45211, 43700, 30692, 29328, 14165,  4165,  2053,  2112, 10000, \n#  15163,  1364, 13008,  1511] \nvalues = model.tree_.value\n# [[[0.8830152 , 0.1169848 ]],\n# [[0.90135011, 0.09864989]],\n# [[0.87671054, 0.12328946]],\n# [[0.88550191, 0.11449809]],\n# [[0.8530886 , 0.1469114 ]],\n# [[0.76686675, 0.23313325]],\n# [[0.87043351, 0.12956649]],\n# [[0.66619318, 0.33380682]],\n# [[0.889     , 0.111     ]],\n# [[0.91578184, 0.08421816]],\n# [[0.68768328, 0.31231672]],\n# [[0.95948647, 0.04051353]],\n# [[0.35274653, 0.64725347]]]<\/code><\/pre>\n<p class=\"wp-block-paragraph\">It will be more convenient for us to work with a hierarchical view of the tree structure, so let\u2019s iterate through all nodes and, for each node, save the parent node ID and whether it was a right or left branch.\u00a0<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">hierarchy = {}\n\nfor node_id in range(n_nodes):\n  if children_left[node_id] != -1: \n    hierarchy[children_left[node_id]] = {\n      'parent': node_id, \n      'condition': 'left'\n    }\n  \n  if children_right[node_id] != -1:\n      hierarchy[children_right[node_id]] = {\n       'parent': node_id, \n       'condition': 'right'\n  }\n\nprint(hierarchy)\n# {1: {'parent': 0, 'condition': 'left'},\n# 12: {'parent': 0, 'condition': 'right'},\n# 2: {'parent': 1, 'condition': 'left'},\n# 11: {'parent': 1, 'condition': 'right'},\n# 3: {'parent': 2, 'condition': 'left'},\n# 10: {'parent': 2, 'condition': 'right'},\n# 4: {'parent': 3, 'condition': 'left'},\n# 9: {'parent': 3, 'condition': 'right'},\n# 5: {'parent': 4, 
'condition': 'left'},\n# 8: {'parent': 4, 'condition': 'right'},\n# 6: {'parent': 5, 'condition': 'left'},\n# 7: {'parent': 5, 'condition': 'right'}}<\/code><\/pre>\n<p class=\"wp-block-paragraph\">The next step is to filter out the leaf nodes since they are terminal and the most interesting for us as they define the customer segments.\u00a0<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">leaves = []\nfor node_id in range(n_nodes):\n    if (children_left[node_id] == -1) and (children_right[node_id] == -1):\n        leaves.append(node_id)\nprint(leaves)\n# [6, 7, 8, 9, 10, 11, 12]\nleaves_df = pd.DataFrame({'node_id': leaves})<\/code><\/pre>\n<p class=\"wp-block-paragraph\">The next step is to determine all the conditions applied to each group since they will define our customer segments. The first function <code>get_condition<\/code> will give us the tuple of feature, condition type and threshold for a node.\u00a0<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">def get_condition(node_id, condition, features, thresholds, feature_names):\n    # print(node_id, condition)\n    feature = feature_names[features[node_id]]\n    threshold = thresholds[node_id]\n    cond = '&gt;' if condition == 'right'  else '&lt;='\n    return (feature, cond, threshold)\n\nprint(get_condition(0, 'left', features, thresholds, feature_names)) \n# ('poutcome_success', '&lt;=', 0.5)\n\nprint(get_condition(0, 'right', features, thresholds, feature_names))\n# ('poutcome_success', '&gt;', 0.5)<\/code><\/pre>\n<p class=\"wp-block-paragraph\">The next function will allow us to recursively go from the leaf node to the root and get all the binary splits.\u00a0<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">def get_decision_path_rec(node_id, decision_path, hierarchy):\n  if node_id == 0:\n    yield decision_path \n  else:\n    parent_id = hierarchy[node_id]['parent']\n    condition = 
hierarchy[node_id]['condition']\n    for res in get_decision_path_rec(parent_id, decision_path + [(parent_id, condition)], hierarchy):\n        yield res\n\ndecision_path = list(get_decision_path_rec(12, [], hierarchy))[0]\nprint(decision_path) \n# [(0, 'right')]\n\nfmt_decision_path = list(map(\n  lambda x: get_condition(x[0], x[1], features, thresholds, feature_names), \n  decision_path))\nprint(fmt_decision_path)\n# [('poutcome_success', '&gt;', 0.5)]<\/code><\/pre>\n<p class=\"wp-block-paragraph\">Let\u2019s save the logic of executing the recursion and formatting into a wrapper function.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">def get_decision_path(node_id, features, thresholds, hierarchy, feature_names):\n  decision_path = list(get_decision_path_rec(node_id, [], hierarchy))[0]\n  return list(map(lambda x: get_condition(x[0], x[1], features, thresholds, \n    feature_names), decision_path))<\/code><\/pre>\n<p class=\"wp-block-paragraph\">We\u2019ve learned how to get each node\u2019s binary split conditions. 
The only remaining logic is to combine the conditions.\u00a0<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">def get_decision_path_string(node_id, features, thresholds, hierarchy, \n  feature_names):\n  conditions_df = pd.DataFrame(get_decision_path(node_id, features, thresholds, hierarchy, feature_names))\n  conditions_df.columns = ['feature', 'condition', 'threshold']\n\n  left_conditions_df = conditions_df[conditions_df.condition == '&lt;=']\n  right_conditions_df = conditions_df[conditions_df.condition == '&gt;']\n\n  # deduplication: keep only the strictest threshold per feature \n  # (the lowest one for '&lt;=' and the highest one for '&gt;')\n  left_conditions_df = left_conditions_df.groupby(['feature', 'condition'], as_index = False).min()\n  right_conditions_df = right_conditions_df.groupby(['feature', 'condition'], as_index = False).max()\n  \n  # concatenation\n  fin_conditions_df = (\n      pd.concat([left_conditions_df, right_conditions_df])\n      .sort_values(['feature', 'condition'], ascending = False)\n  )\n  \n  # formatting \n  fin_conditions_df['cond_string'] = list(map(\n      lambda x, y, z: '(%s %s %.2f)' % (x, y, z),\n      fin_conditions_df.feature,\n      fin_conditions_df.condition,\n      fin_conditions_df.threshold\n  ))\n  return ' and '.join(fin_conditions_df.cond_string.values)\n\nprint(get_decision_path_string(12, features, thresholds, hierarchy, \n  feature_names))\n# (poutcome_success &gt; 0.50)<\/code><\/pre>\n<p class=\"wp-block-paragraph\">Now, we can calculate the conditions for each group.\u00a0<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">leaves_df['condition'] = leaves_df['node_id'].map(\n  lambda x: get_decision_path_string(x, features, thresholds, hierarchy, \n  feature_names)\n)<\/code><\/pre>\n<p class=\"wp-block-paragraph\">The last step is to add their size and conversion to the groups.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">leaves_df['total'] = leaves_df.node_id.map(lambda x: num_nodes[x])\nleaves_df['conversion'] = 
leaves_df['node_id'].map(lambda x: values[x][0][1])*100\nleaves_df['converted_users'] = (\n  (leaves_df.conversion * leaves_df.total)\n  .map(lambda x: int(round(x\/100)))\n)\nleaves_df['share_of_converted'] = 100*leaves_df['converted_users']\/leaves_df['converted_users'].sum()\nleaves_df['share_of_total'] = 100*leaves_df['total']\/leaves_df['total'].sum()<\/code><\/pre>\n<p class=\"wp-block-paragraph\">Now, we can use these rules to make decisions. We can sort groups by conversion (probability of successful contact) and pick the customers with the highest probability.\u00a0<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">(\n  leaves_df.sort_values('conversion', ascending = False)\n  .drop('node_id', axis = 1).set_index('condition')\n)<\/code><\/pre>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" height=\"305\" width=\"1024\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/04\/Screenshot-2025-02-05-at-23.01.03-1024x305.png?resize=1024%2C305&#038;ssl=1\" alt=\"\" class=\"wp-image-601345\"><figcaption class=\"wp-element-caption\">Image by author<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">If we have resources to contact only around 10% of our user base, we can focus on the first three groups. Even with such a limited capacity, we would expect to capture almost 40% of all conversions. It\u2019s a really good result, and we\u2019ve achieved it with just a bunch of straightforward heuristics. \u00a0<\/p>\n<p class=\"wp-block-paragraph\">In real life, it\u2019s also worth testing the model (or heuristics) before deploying it in production. 
I would split the training dataset into training and validation parts (by time to avoid leakage) and check the heuristics\u2019 performance on the validation set to get a better view of the actual model quality.<\/p>\n<h2 class=\"wp-block-heading\">Working with high cardinality categories<\/h2>\n<p class=\"wp-block-paragraph\">Another topic that is worth discussing in this context is category encoding, since we have to encode the categorical variables for the <code>sklearn<\/code> implementation. We\u2019ve used a straightforward approach with one-hot encoding, but it becomes unwieldy when a category has many distinct values.<\/p>\n<p class=\"wp-block-paragraph\">Imagine we also have a region in the data. I\u2019ve synthetically generated English cities for each row. We have 155 unique regions, so the number of features has increased to 190.\u00a0<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">model = sklearn.tree.DecisionTreeClassifier(min_samples_leaf = 100, min_impurity_decrease=0.001)\nmodel.fit(fin_df[feature_names], fin_df['y'])<\/code><\/pre>\n<p class=\"wp-block-paragraph\">So, the basic tree now has lots of conditions based on regions, and it\u2019s not convenient to work with them.<\/p>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" height=\"657\" width=\"1024\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/04\/image2-1024x657.png?resize=1024%2C657&#038;ssl=1\" alt=\"\" class=\"wp-image-601346\"><figcaption class=\"wp-element-caption\">Image by author<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">In such a case, it might not be meaningful to explode the number of features, and it\u2019s time to think about encoding. 
There\u2019s a comprehensive article, <a href=\"https:\/\/medium.com\/data-science-at-microsoft\/categorically-dont-explode-encode-dd623b565ce3\" target=\"_blank\" rel=\"noreferrer noopener\">\u201cCategorically: Don\u2019t explode\u200a\u2014\u200aencode!\u201d<\/a>, which shares a bunch of different options to handle high cardinality categorical variables. I think the most feasible ones in our case will be the following two options:<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">\n<strong>Count or Frequency Encoder<\/strong>, which shows good performance in benchmarks. This encoding assumes that categories of similar size would have similar characteristics.\u00a0<\/li>\n<li class=\"wp-block-list-item\">\n<strong>Target Encoder,<\/strong> where we can encode the category by the mean value of the target variable. It will allow us to prioritise segments with higher conversion and deprioritise segments with lower conversion. Ideally, it would be nice to use historical data to get the averages for the encoding, but we will use the existing dataset.\u00a0<\/li>\n<\/ul>\n<p class=\"wp-block-paragraph\">It will be interesting to compare both approaches, so let\u2019s split our dataset into train and test sets, holding out 10% for validation. 
For simplicity, I\u2019ve used one-hot encoding for all columns except for region (since it has the highest cardinality).<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">from sklearn.model_selection import train_test_split\nfin_df = pd.get_dummies(df, columns=['job', 'marital', 'education', \n  'poutcome', 'contact'], dtype = int, drop_first = False)\ntrain_df, test_df = train_test_split(fin_df,test_size=0.1, random_state=42)\nprint(train_df.shape[0], test_df.shape[0])\n# (40689, 4522)<\/code><\/pre>\n<p class=\"wp-block-paragraph\">For convenience, let\u2019s combine all the logic for parsing the tree into one function.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">def get_model_definition(model, feature_names):\n  n_nodes = model.tree_.node_count\n  children_left = model.tree_.children_left\n  children_right = model.tree_.children_right\n  features = model.tree_.feature\n  thresholds = model.tree_.threshold\n  num_nodes = model.tree_.n_node_samples\n  values = model.tree_.value\n\n  hierarchy = {}\n\n  for node_id in range(n_nodes):\n      if children_left[node_id] != -1: \n          hierarchy[children_left[node_id]] = {\n            'parent': node_id, \n            'condition': 'left'\n          }\n    \n      if children_right[node_id] != -1:\n            hierarchy[children_right[node_id]] = {\n             'parent': node_id, \n             'condition': 'right'\n            }\n\n  leaves = []\n  for node_id in range(n_nodes):\n      if (children_left[node_id] == -1) and (children_right[node_id] == -1):\n          leaves.append(node_id)\n  leaves_df = pd.DataFrame({'node_id': leaves})\n  leaves_df['condition'] = leaves_df['node_id'].map(\n    lambda x: get_decision_path_string(x, features, thresholds, hierarchy, feature_names)\n  )\n\n  leaves_df['total'] = leaves_df.node_id.map(lambda x: num_nodes[x])\n  leaves_df['conversion'] = leaves_df['node_id'].map(lambda x: values[x][0][1])*100\n  
leaves_df['converted_users'] = (leaves_df.conversion * leaves_df.total).map(lambda x: int(round(x\/100)))\n  leaves_df['share_of_converted'] = 100*leaves_df['converted_users']\/leaves_df['converted_users'].sum()\n  leaves_df['share_of_total'] = 100*leaves_df['total']\/leaves_df['total'].sum()\n  leaves_df = (\n      leaves_df.sort_values('conversion', ascending = False)\n      .drop('node_id', axis = 1).set_index('condition')\n  )\n  leaves_df['cum_share_of_total'] = leaves_df['share_of_total'].cumsum()\n  leaves_df['cum_share_of_converted'] = leaves_df['share_of_converted'].cumsum()\n  return leaves_df<\/code><\/pre>\n<p class=\"wp-block-paragraph\">Let\u2019s create an encoding data frame with the frequency and conversion of each region, calculated on the training set only.\u00a0<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">region_encoding_df = (\n  train_df.groupby('region', as_index = False)\n  .aggregate({'id': 'count', 'y': 'mean'})\n  .rename(columns = {'id': 'region_count', 'y': 'region_target'})\n)<\/code><\/pre>\n<p class=\"wp-block-paragraph\">Then, merge it into our training and validation sets. 
For the validation set, regions that never appear in the training data get no encoding, so we fill those missing values with the averages across regions.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">train_df = train_df.merge(region_encoding_df, on = 'region')\n\ntest_df = test_df.merge(region_encoding_df, on = 'region', how = 'left')\ntest_df['region_target'] = test_df['region_target'].fillna(\n  region_encoding_df.region_target.mean())\ntest_df['region_count'] = test_df['region_count'].fillna(\n  region_encoding_df.region_count.mean())<\/code><\/pre>\n<p class=\"wp-block-paragraph\">Now, we can fit the models and get their structures.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">count_feature_names = train_df.drop(\n  ['y', 'id', 'region_target', 'region'], axis = 1).columns\ntarget_feature_names = train_df.drop(\n  ['y', 'id', 'region_count', 'region'], axis = 1).columns\nprint(len(count_feature_names), len(target_feature_names))\n# (36, 36)\n\ncount_model = sklearn.tree.DecisionTreeClassifier(min_samples_leaf = 500, \n  min_impurity_decrease=0.001)\ncount_model.fit(train_df[count_feature_names], train_df['y'])\n\ntarget_model = sklearn.tree.DecisionTreeClassifier(min_samples_leaf = 500, \n  min_impurity_decrease=0.001)\ntarget_model.fit(train_df[target_feature_names], train_df['y'])\n\ncount_model_def_df = get_model_definition(count_model, count_feature_names)\ntarget_model_def_df = get_model_definition(target_model, target_feature_names)<\/code><\/pre>\n<p class=\"wp-block-paragraph\">Let\u2019s look at the structures and select the top groups that cover up to 10\u201315% of our target audience. 
We can also apply these conditions to our validation sets to test our approach in practice.\u00a0<\/p>\n<p class=\"wp-block-paragraph\">Let\u2019s start with Count Encoder.\u00a0<\/p>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" height=\"514\" width=\"1024\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/04\/Screenshot-2025-02-07-at-20.51.47-1024x514.png?resize=1024%2C514&#038;ssl=1\" alt=\"\" class=\"wp-image-601347\"><figcaption class=\"wp-element-caption\">Image by author<\/figcaption><\/figure>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">count_selected_df = test_df[\n    (test_df.poutcome_success &gt; 0.50) | \n    ((test_df.poutcome_success &lt;= 0.50) &amp; (test_df.age &gt; 60.50)) | \n    ((test_df.region_count &gt; 3645.50) &amp; (test_df.region_count &lt;= 8151.50) &amp; \n         (test_df.poutcome_success &lt;= 0.50) &amp; (test_df.contact_cellular &gt; 0.50) &amp; (test_df.age &lt;= 60.50))\n]\n\nprint(count_selected_df.shape[0], count_selected_df.y.sum())\n# (508, 227)<\/code><\/pre>\n<p class=\"wp-block-paragraph\">We can also see what regions have been selected, and it\u2019s only Manchester.<\/p>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" height=\"73\" width=\"1024\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/04\/Screenshot-2025-02-07-at-21.03.08-1024x73.png?resize=1024%2C73&#038;ssl=1\" alt=\"\" class=\"wp-image-601348\"><figcaption class=\"wp-element-caption\">Image by author<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">Let\u2019s continue with the Target encoding.\u00a0<\/p>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" height=\"489\" width=\"1024\" decoding=\"async\" 
src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/04\/Screenshot-2025-04-08-at-22.52.56-1024x489.png?resize=1024%2C489&#038;ssl=1\" alt=\"\" class=\"wp-image-601352\"><figcaption class=\"wp-element-caption\">Image by author<\/figcaption><\/figure>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">target_selected_df = test_df[\n    ((test_df.region_target &gt; 0.21) &amp; (test_df.poutcome_success &gt; 0.50)) | \n    ((test_df.region_target &gt; 0.21) &amp; (test_df.poutcome_success &lt;= 0.50) &amp; (test_df.month &lt;= 6.50) &amp; (test_df.housing &lt;= 0.50) &amp; (test_df.contact_unknown &lt;= 0.50)) | \n    ((test_df.region_target &gt; 0.21) &amp; (test_df.poutcome_success &lt;= 0.50) &amp; (test_df.month &gt; 8.50) &amp; (test_df.housing &lt;= 0.50) \n         &amp; (test_df.contact_unknown &lt;= 0.50)) |\n    ((test_df.region_target &lt;= 0.21) &amp; (test_df.poutcome_success &gt; 0.50)) |\n    ((test_df.region_target &gt; 0.21) &amp; (test_df.poutcome_success &lt;= 0.50) &amp; (test_df.month &gt; 6.50) &amp; (test_df.month &lt;= 8.50) \n         &amp; (test_df.housing &lt;= 0.50) &amp; (test_df.contact_unknown &lt;= 0.50))\n]\n\nprint(target_selected_df.shape[0], target_selected_df.y.sum())\n# (502, 248)<\/code><\/pre>\n<p class=\"wp-block-paragraph\">We see a slightly lower number of selected users for communication but a significantly higher number of conversions: 248 vs. 227 (+9.3%).<\/p>\n<p class=\"wp-block-paragraph\">Let\u2019s also look at the selected categories. 
We see that the model picked up all the cities with high conversions (Manchester, Liverpool, Bristol, Leicester, and Newcastle), but there are also many small regions with high conversions solely due to chance.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">(\n  region_encoding_df[region_encoding_df.region_target &gt; 0.21]\n  .sort_values('region_count', ascending = False)\n)<\/code><\/pre>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" height=\"674\" width=\"1024\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/04\/Screenshot-2025-02-07-at-21.06.51-1024x674.png?resize=1024%2C674&#038;ssl=1\" alt=\"\" class=\"wp-image-601350\"><figcaption class=\"wp-element-caption\">Image by author<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">In our case, it doesn\u2019t have much impact since the share of such small cities is low. However, if you have many more small categories, overfitting can become a significant problem. Target Encoding might be tricky at this point, so it\u2019s worth keeping an eye on the output of your model.\u00a0<\/p>\n<p class=\"wp-block-paragraph\">Luckily, there\u2019s an approach that can help you overcome this issue. Following the article <a href=\"https:\/\/towardsdatascience.com\/encoding-categorical-variables-a-deep-dive-into-target-encoding-2862217c2753\/\" rel=\"noreferrer noopener\" target=\"_blank\">\u201cEncoding Categorical Variables: A Deep Dive into Target Encoding\u201d<\/a>, we can add smoothing. The idea is to combine the group\u2019s conversion rate with the overall average: the larger the group, the more weight its data carries, while smaller segments will lean more towards the global average.<\/p>\n<p class=\"wp-block-paragraph\">First, I\u2019ve selected the parameters that make sense for our distribution, looking at a bunch of options. I chose to rely mostly on the global average for the groups under 100 people. 
This part is a bit subjective, so use common sense and your knowledge about the business domain.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">import numpy as np\nimport matplotlib.pyplot as plt\n\nglobal_mean = train_df.y.mean()\n\n# k: group size at which the group mean gets a weight of 0.5\n# f: how gradual the transition around k is\nk = 100\nf = 10\nsmooth_df = pd.DataFrame({'region_count':np.arange(1, 100001, 1) })\nsmooth_df['smoothing'] = (1 \/ (1 + np.exp(-(smooth_df.region_count - k) \/ f)))\n\nax = plt.scatter(smooth_df.region_count, smooth_df.smoothing)\nplt.xscale('log')\nplt.ylim([-.1, 1.1])\nplt.title('Smoothing')<\/code><\/pre>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" height=\"296\" width=\"1024\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/04\/Screenshot-2025-02-07-at-22.34.45-1024x296.png?resize=1024%2C296&#038;ssl=1\" alt=\"\" class=\"wp-image-601351\"><figcaption class=\"wp-element-caption\">Image by author<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">Then, we can calculate, based on the selected parameters, the smoothing coefficients and blended averages.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">region_encoding_df['smoothing'] = (1 \/ (1 + np.exp(-(region_encoding_df.region_count - k) \/ f)))\n# assumption: raw_region_target holds the unsmoothed per-region mean of y\nregion_encoding_df['region_target'] = (\n    region_encoding_df.smoothing * region_encoding_df.raw_region_target \n    + (1 - region_encoding_df.smoothing) * global_mean\n)<\/code><\/pre>\n<p class=\"wp-block-paragraph\">Then, we can fit another model with smoothed target category encoding.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">train_df = train_df.merge(region_encoding_df[['region', 'region_target']], \n  on = 'region')\ntest_df = test_df.merge(region_encoding_df[['region', 'region_target']], \n  on = 'region', how = 'left')\ntest_df['region_target'] = test_df['region_target'].fillna(\n  region_encoding_df.region_target.mean())\n\ntarget_v2_feature_names 
= train_df.drop(['y', 'id', 'region'], axis = 1).columns\n\ntarget_v2_model = sklearn.tree.DecisionTreeClassifier(min_samples_leaf = 500, \n  min_impurity_decrease=0.001)\ntarget_v2_model.fit(train_df[target_v2_feature_names], train_df['y'])\ntarget_v2_model_def_df = get_model_definition(target_v2_model, \n  target_v2_feature_names)<\/code><\/pre>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" height=\"532\" width=\"1024\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/04\/Screenshot-2025-02-07-at-22.28.50-1024x532.png?resize=1024%2C532&#038;ssl=1\" alt=\"\" class=\"wp-image-601349\"><figcaption class=\"wp-element-caption\">Image by author<\/figcaption><\/figure>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">target_v2_selected_df = test_df[\n    ((test_df.region_target &gt; 0.12) &amp; (test_df.poutcome_success &gt; 0.50)) | \n    ((test_df.region_target &gt; 0.12) &amp; (test_df.poutcome_success &lt;= 0.50) &amp; (test_df.month &lt;= 6.50) &amp; (test_df.housing &lt;= 0.50) &amp; (test_df.contact_unknown &lt;= 0.50)) | \n    ((test_df.region_target &gt; 0.12) &amp; (test_df.poutcome_success &lt;= 0.50) &amp; (test_df.month &gt; 8.50) &amp; (test_df.housing &lt;= 0.50) \n         &amp; (test_df.contact_unknown &lt;= 0.50)) | \n    ((test_df.region_target &lt;= 0.12) &amp; (test_df.poutcome_success &gt; 0.50) ) | \n    ((test_df.region_target &gt; 0.12) &amp; (test_df.poutcome_success &lt;= 0.50) &amp; (test_df.month &gt; 6.50) &amp; (test_df.month &lt;= 8.50) \n         &amp; (test_df.housing &lt;= 0.50) &amp; (test_df.contact_unknown &lt;= 0.50) )\n]\n\ntarget_v2_selected_df.shape[0], target_v2_selected_df.y.sum()\n# (500, 247)<\/code><\/pre>\n<p class=\"wp-block-paragraph\">We can see that we\u2019ve eliminated the small cities and reduced the risk of overfitting in our model while keeping roughly the same performance, capturing 247 conversions.<\/p>\n<pre 
class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">region_encoding_df[region_encoding_df.region_target &gt; 0.12]<\/code><\/pre>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" height=\"184\" width=\"1024\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/04\/Screenshot-2025-02-07-at-22.30.29-1024x184.png?resize=1024%2C184&#038;ssl=1\" alt=\"\" class=\"wp-image-601353\"><figcaption class=\"wp-element-caption\">Image by author<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">You can also use <a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.preprocessing.TargetEncoder.html\" rel=\"noreferrer noopener\" target=\"_blank\">TargetEncoder<\/a> from <code>sklearn<\/code>, which smooths the encoding by blending the category and global means depending on the segment size. However, its <code>fit_transform<\/code> uses cross fitting, so the encoded training values for the same category vary from row to row, which is not ideal for our case of heuristics.<\/p>\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p class=\"wp-block-paragraph\"><em>You can find the full code on <a href=\"https:\/\/github.com\/miptgirl\/miptgirl_medium\/blob\/main\/mining_rules\/churn_prediction.ipynb\" target=\"_blank\" rel=\"noreferrer noopener\">GitHub<\/a>.<\/em><\/p>\n<\/blockquote>\n<h2 class=\"wp-block-heading\">Summary<\/h2>\n<p class=\"wp-block-paragraph\">In this article, we explored how to extract simple \u201crules\u201d from data and use them to inform business decisions. We generated heuristics using a Decision Tree Classifier and touched on the important topic of categorical encoding, since the <code>sklearn<\/code> decision tree implementation requires categorical variables to be encoded.<\/p>\n<p class=\"wp-block-paragraph\">We saw that this rule-based approach can be surprisingly effective, helping you reach business decisions quickly. 
However, it\u2019s worth noting that this simplistic approach has its drawbacks:<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">We are trading off the model\u2019s power and accuracy for its simplicity and interpretability, so if you\u2019re optimising for accuracy, choose another approach.<\/li>\n<li class=\"wp-block-list-item\">Even though we\u2019re using a set of static heuristics, your data can still change, and the rules might become outdated, so you need to recheck them from time to time.\u00a0<\/li>\n<\/ul>\n<hr class=\"wp-block-separator has-alpha-channel-opacity\">\n<p class=\"wp-block-paragraph\"><em>Thank you for reading this article. I hope it was insightful. If you have any follow-up questions or comments, please leave them in the comments section.<\/em><\/p>\n<h2 class=\"wp-block-heading\">Reference<\/h2>\n<p class=\"wp-block-paragraph\"><strong>Dataset: <\/strong><em>Moro, S., Rita, P., &amp; Cortez, P. (2014). Bank Marketing [Dataset]. UCI Machine Learning Repository. <\/em><a href=\"https:\/\/doi.org\/10.24432\/C5K306\" rel=\"noreferrer noopener\" target=\"_blank\"><em>https:\/\/doi.org\/10.24432\/C5K306<\/em><\/a><\/p>\n<p>The post <a href=\"https:\/\/towardsdatascience.com\/mining-rules-from-data\/\">Mining Rules from\u00a0Data<\/a> appeared first on <a href=\"https:\/\/towardsdatascience.com\/\">Towards Data Science<\/a>.<\/p>\n<\/div>\n<p> \t<BR><br \/>\n <BR><\/BR><br \/>\n    Mariya Mansurova<br \/>\n \t<BR><br \/>\n<BR><\/BR><br \/>\n<a href=\"https:\/\/towardsdatascience.com\/mining-rules-from-data\/\">Go to original source<\/a><br \/>\n \t<BR><br \/>\n <BR><\/BR><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Mining Rules from\u00a0Data Working with products, we might face a need to introduce some \u201crules\u201d. 
Let me explain what I mean by \u201crules\u201d in practical examples:\u00a0 Imagine that we\u2019re seeing a massive wave of fraud in our product, and we want to restrict onboarding for a particular segment of customers to lower this risk. For [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[62,2334,83,1058,240,70,157],"tags":[214,2335,2336],"class_list":["post-2996","post","type-post","status-publish","format-standard","hentry","category-aimldsaimlds","category-data-mining","category-data-science","category-decision-tree","category-editors-pick","category-machine-learning","category-python","tag-customers","tag-might","tag-rules"],"_links":{"self":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/2996"}],"collection":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/comments?post=2996"}],"version-history":[{"count":0,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/2996\/revisions"}],"wp:attachment":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/media?parent=2996"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/categories?post=2996"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/tags?post=2996"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}