{"id":587,"date":"2024-12-16T07:04:04","date_gmt":"2024-12-16T07:04:04","guid":{"rendered":"https:\/\/mailitics.com\/index.php\/2024\/12\/16\/credit-card-fraud-detection-with-different-sampling-techniques-cece7734acc5\/"},"modified":"2024-12-16T07:04:04","modified_gmt":"2024-12-16T07:04:04","slug":"credit-card-fraud-detection-with-different-sampling-techniques-cece7734acc5","status":"publish","type":"post","link":"https:\/\/mailitics.com\/index.php\/2024\/12\/16\/credit-card-fraud-detection-with-different-sampling-techniques-cece7734acc5\/","title":{"rendered":"Credit Card Fraud Detection with Different Sampling Techniques"},"content":{"rendered":"<p>Credit Card Fraud Detection with Different Sampling Techniques<\/p>\n<div>\n<h4>How to deal with imbalanced data<\/h4>\n<figure><img decoding=\"async\" alt=\"\" src=\"https:\/\/cdn-images-1.medium.com\/max\/1024\/0*prnwY4KmL_7ov91a\"><figcaption>Photo by <a href=\"https:\/\/unsplash.com\/@bermixstudio?utm_source=medium&amp;utm_medium=referral\">Bermix Studio<\/a> on\u00a0<a href=\"https:\/\/unsplash.com\/?utm_source=medium&amp;utm_medium=referral\">Unsplash<\/a><\/figcaption><\/figure>\n<p>Credit card fraud is a plague that all financial institutions are exposed to. In general, fraud detection is very challenging because fraudsters keep coming up with new and innovative ways of committing fraud, so it is difficult to find a stable pattern that we can detect. For example, in the diagram all the icons look the same, but there is one icon that is slightly different from the rest, and we have to pick that one. 
Can you spot\u00a0it?<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/418\/1%2AGuAaF76hzU0IF7Q2p4p-7A.png?ssl=1\"><\/figure>\n<p>Here it\u00a0is:<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/421\/1%2Am8ezQ3qUiqCwx1f4vKJ1cw.png?ssl=1\"><figcaption>Image by\u00a0Author<\/figcaption><\/figure>\n<p>With this background, here is the plan for today and what you will learn in the context of our use case \u2018Credit Card Fraud Detection\u2019:<\/p>\n<p>1. What is data imbalance<\/p>\n<p>2. Possible causes of data imbalance<\/p>\n<p>3. Why is class imbalance a problem in machine\u00a0learning<\/p>\n<p>4. Quick refresher on the Random Forest algorithm<\/p>\n<p>5. Different sampling methods to deal with data imbalance<\/p>\n<p>6. Quick refresher on accuracy, recall and precision<\/p>\n<p>7. Comparison of which method works well in our context, with a practical demonstration in\u00a0Python<\/p>\n<p>8. Business insight on which model to choose and\u00a0why<\/p>\n<p>In most cases, because fraudulent transactions are rare, we have to work with data that contain far more non-fraud cases than fraud cases. In technical terms, such a dataset is called \u2018imbalanced\u2019. It is still essential to detect the fraud cases, because even a single fraudulent transaction can cause losses running into millions for banks\/financial institutions. Now, let us delve deeper into what data imbalance is.<\/p>\n<p>We will be considering the credit card fraud dataset from <a href=\"https:\/\/www.kaggle.com\/mlg-ulb\/creditcardfraud\">https:\/\/www.kaggle.com\/mlg-ulb\/creditcardfraud<\/a> (Open Data License).<\/p>\n<h3>1. Data Imbalance<\/h3>\n<p>Formally, this means that the distribution of samples across the different classes is unequal. 
In our binary classification problem, there are 2\u00a0classes:<\/p>\n<p>a) Majority class: the non-fraudulent\/genuine transactions<\/p>\n<p>b) Minority class: the fraudulent transactions<\/p>\n<p>In the dataset considered, the class distribution is as follows (Table\u00a01):<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/302\/1%2A1Pa98yA8owACfE30F1qpnQ.png?ssl=1\"><figcaption>Table 1: Class Distribution (By\u00a0Author)<\/figcaption><\/figure>\n<p><em>As we can observe, the dataset is highly imbalanced, with only 0.17% of the observations being in the fraudulent category.<\/em><\/p>\n<h3>2. Possible causes of Data Imbalance<\/h3>\n<p>There can be 2 main causes of data imbalance:<\/p>\n<p>a) Biased sampling\/measurement errors: This is due to samples being collected only from one class or from a particular region, or to samples being mis-classified. This can be resolved by improving the sampling\u00a0methods.<\/p>\n<p>b) Use case\/domain characteristic: A more pertinent cause, as in our case, is that we are predicting a rare event, which automatically skews the data towards the majority class because the minority class seldom occurs in\u00a0practice.<\/p>\n<h3>3. Why is class imbalance a problem in machine learning?<\/h3>\n<p>This is a problem because most machine learning algorithms focus on learning from the occurrences that appear frequently, i.e. the majority class. This is called frequency bias. So with an imbalanced dataset, these algorithms might not work well. Typically, a few techniques that do work well are tree-based algorithms and anomaly detection algorithms. Traditionally, business-rule-based methods are often used in fraud detection problems. Tree-based methods work well because a tree creates a rule-based hierarchy that can separate the two classes. 
A single decision tree tends to over-fit the data, and to reduce this risk we will go with an ensemble method. For our use case, we will use the Random Forest algorithm today.<\/p>\n<h3>4. A quick Refresher on Random Forest Algorithm<\/h3>\n<p>Random Forest works by building multiple decision tree predictors, and the mode of the classes predicted by these individual decision trees is the final selected class or output. It is like voting for the most popular class. For example: if 2 trees predict that Rule 1 indicates Fraud while another tree predicts that Rule 1 indicates Non-fraud, then according to the Random Forest algorithm the final prediction will be\u00a0Fraud.<\/p>\n<blockquote><p>Formal Definition: A random forest is a classifier consisting of a collection of tree-structured classifiers {h(x,\u0398k ), k=1,\u00a0\u2026} where the {\u0398k} are independent identically distributed random vectors and each tree casts a unit vote for the most popular class at input x\u00a0.\u00a0(<a href=\"https:\/\/www.stat.berkeley.edu\/~breiman\/randomforest2001.pdf\">Source<\/a>)<\/p><\/blockquote>\n<p>Each tree depends on a random vector that is sampled independently and with the same distribution for all trees. The generalization error converges to a limit as the number of trees increases. When splitting a node, Random Forest searches for the best feature among a random subset of features; we can also compute variable importance and do feature selection accordingly. The trees can be grown using the bagging technique, where observations are randomly selected with replacement (a bootstrap sample) from the training set. Another method is random split selection, where a random split is selected from the K best splits at each\u00a0node.<\/p>\n<p>You can read more about it\u00a0<a href=\"https:\/\/www.stat.berkeley.edu\/~breiman\/randomforest2001.pdf\">here<\/a>.<\/p>\n<h3>5. 
Sampling methods to deal with Data Imbalance<\/h3>\n<p>We will now illustrate 3 sampling methods that can take care of data imbalance.<\/p>\n<p>a) <strong>Random Under-sampling<\/strong>: Random draws are taken from the non-fraud observations, i.e. the majority class, until their number matches that of the fraud observations, i.e. the minority class. This means we are throwing away some information from the dataset, which might not always be\u00a0ideal.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/611\/1%2AACQmCz79mMlw0NapBbkBGA.png?ssl=1\"><figcaption>Fig 1: Random Under-sampling (Image By\u00a0Author)<\/figcaption><\/figure>\n<p>b) <strong>Random Over-sampling<\/strong>: In this case, we do the exact opposite of under-sampling, i.e. we duplicate minority class (fraud) observations at random to increase their number until we get a balanced dataset. A possible limitation is that we are creating a lot of duplicates with this\u00a0method.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/801\/1%2A0Pg--Ml5inAkc76AeJn3RA.png?ssl=1\"><figcaption>Fig 2: Random Over-sampling (Image By\u00a0Author)<\/figcaption><\/figure>\n<p>c) <strong>SMOTE<\/strong> (Synthetic Minority Over-sampling Technique): This method creates synthetic minority examples with the help of k-nearest neighbours (KNN) instead of duplicating existing ones. Each minority class example is considered along with its k-nearest minority class neighbours. Synthetic examples are then created along the line segments that join the minority class example to any\/all of those k-nearest neighbours. 
This is illustrated in Fig 3\u00a0below:<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/641\/1%2AEAVoNiKnLfhA52GgoMS6Xg.png?ssl=1\"><figcaption>Fig 3: SMOTE (Image By\u00a0Author)<\/figcaption><\/figure>\n<p>With plain over-sampling by duplication, the decision region for the minority class becomes smaller and more specific, while with SMOTE we can create larger, more general decision regions, thereby improving the chance of capturing the minority\u00a0class.<\/p>\n<p>One possible limitation is that if the minority class, i.e. the fraudulent observations, is spread throughout the data rather than forming distinct clusters, then using nearest neighbours to create more fraud cases introduces noise into the data, and this can lead to mis-classification.<\/p>\n<h3>6. Quick refresher on Accuracy, Recall, Precision<\/h3>\n<p>Some of the metrics that are useful for judging the performance of a model are listed below. These metrics show how well the model is able to predict\/classify the target variable:<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/673\/1%2AV0iizqFrxT9cSneMzrf2bQ.png?ssl=1\"><figcaption>Fig 4: Classification Matrix (Image By\u00a0Author)<\/figcaption><\/figure>\n<p>\u00b7 TP (True positive)\/TN (True negative) are the cases of correct predictions, i.e. predicting fraud cases as fraud (TP) and predicting non-fraud cases as non-fraud (TN)<\/p>\n<p>\u00b7 FP (False positive) are those cases that are actually non-fraud but that the model predicts as\u00a0fraud<\/p>\n<p>\u00b7 FN (False negative) are those cases that are actually fraud but that the model predicts as non-fraud<\/p>\n<blockquote><p>Precision = TP \/ (TP + FP): Precision measures how accurately the model is able to capture fraud, i.e. out of the total predicted fraud cases, how many actually turned out to be\u00a0fraud.<\/p><\/blockquote>\n<blockquote><p>Recall = TP \/ (TP + FN): Recall measures, out of all the actual fraud cases, 
how many the model could predict correctly as fraud. This is an important metric\u00a0here.<\/p><\/blockquote>\n<blockquote><p>Accuracy = (TP + TN) \/ (TP + FP + FN + TN): Measures the share of all observations, majority as well as minority class, that are correctly classified.<\/p><\/blockquote>\n<blockquote><p>F-score = 2 * TP \/ (2 * TP + FP + FN) = 2 * Precision * Recall \/ (Precision + Recall): This is the harmonic mean of precision and recall. Note that there is typically a trade-off between precision and recall, hence the F-score is a good measure to achieve a balance between the\u00a0two.<\/p><\/blockquote>\n<h3>7. Comparison of which method works well with a practical demonstration with\u00a0Python<\/h3>\n<p>First, we will train the random forest model with some default parameters. Please note that optimizing the model with feature selection or cross-validation has been kept out of scope here for the sake of simplicity. After that, we train the model using under-sampling, over-sampling and then SMOTE. The table below illustrates the confusion matrix along with the precision, recall and accuracy metrics for each\u00a0method.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/569\/1%2AE06Gj0rLz9L8BEE1NveFdw.png?ssl=1\"><figcaption>Table 2: Model results comparison (By\u00a0Author)<\/figcaption><\/figure>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/668\/1%2AMf4tjVYgpCZHA9qVKsi7vA.png?ssl=1\"><\/figure>\n<p>a) <strong>No sampling result interpretation:<\/strong> Without any sampling we are able to capture 76 fraudulent transactions. Though the overall accuracy is 97%, the recall is 75%. 
This means that there are quite a few fraudulent transactions that our model is not able to\u00a0capture.<\/p>\n<p>Below is the code that can be used:<\/p>\n<pre># x_train, x_test, y_train, y_test are assumed to come from a train\/test split made beforehand<br>from sklearn.ensemble import RandomForestClassifier<br>from sklearn.metrics import classification_report, confusion_matrix<br><br># Train the model<br>classifier = RandomForestClassifier(n_estimators=10, criterion='entropy', random_state=0)<br>classifier.fit(x_train, y_train)<br><br># Predict Y on the test set<br>y_pred = classifier.predict(x_test)<br><br># Obtain the results from the classification report and confusion matrix<br>print('Classification report:\\n', classification_report(y_test, y_pred))<br>conf_mat = confusion_matrix(y_true=y_test, y_pred=y_pred)<br>print('Confusion matrix:\\n', conf_mat)<\/pre>\n<p>b) <strong>Under-sampling result interpretation<\/strong>: With under-sampling, the model is able to capture 90 fraud cases with a significant improvement in recall, but the accuracy and precision fall drastically. 
This is because the false positives have increased sharply and the model is flagging a lot of genuine transactions.<\/p>\n<p>Under-sampling code\u00a0snippet:<\/p>\n<pre># Under-sampler and pipeline from imblearn<br># (RandomForestClassifier, classification_report and confusion_matrix are imported in the first snippet)<br>from imblearn.under_sampling import RandomUnderSampler<br>from imblearn.pipeline import Pipeline<br><br># Define which resampling method and which ML model to use in the pipeline<br>resampling = RandomUnderSampler()<br>model = RandomForestClassifier(n_estimators=10, criterion='entropy', random_state=0)<br><br># Define the pipeline and combine the sampling method with the RF model<br>pipeline = Pipeline([('RandomUnderSampler', resampling), ('RF', model)])<br><br>pipeline.fit(x_train, y_train)<br>predicted = pipeline.predict(x_test)<br><br># Obtain the results from the classification report and confusion matrix<br>print('Classification report:\\n', classification_report(y_test, predicted))<br>conf_mat = confusion_matrix(y_true=y_test, y_pred=predicted)<br>print('Confusion matrix:\\n', conf_mat)<\/pre>\n<p>c) <strong>Over-sampling result interpretation<\/strong>: The over-sampling method has the highest precision and accuracy, and the recall is also good at 81%. We are able to capture 6 more fraud cases than without sampling, and the number of false positives is pretty low as well. 
Overall, across all the metrics, this is a good\u00a0model.<\/p>\n<p>Oversampling code\u00a0snippet:<\/p>\n<pre># Over-sampler from imblearn (Pipeline is imported in the under-sampling snippet)<br>from imblearn.over_sampling import RandomOverSampler<br><br># Define which resampling method and which ML model to use in the pipeline<br>resampling = RandomOverSampler()<br>model = RandomForestClassifier(n_estimators=10, criterion='entropy', random_state=0)<br><br># Define the pipeline and combine the sampling method with the RF model<br>pipeline = Pipeline([('RandomOverSampler', resampling), ('RF', model)])<br><br>pipeline.fit(x_train, y_train)<br>predicted = pipeline.predict(x_test)<br><br># Obtain the results from the classification report and confusion matrix<br>print('Classification report:\\n', classification_report(y_test, predicted))<br>conf_mat = confusion_matrix(y_true=y_test, y_pred=predicted)<br>print('Confusion matrix:\\n', conf_mat)<\/pre>\n<p>d) <strong>SMOTE result interpretation<\/strong>: SMOTE further improves on the over-sampling method, with 3 more frauds caught in the net, and though false positives increase a bit, the recall is pretty healthy at\u00a084%.<\/p>\n<p>SMOTE code\u00a0snippet:<\/p>\n<pre># SMOTE from imblearn (Pipeline is imported in the under-sampling snippet)<br>from imblearn.over_sampling import SMOTE<br><br># Define which resampling method and which ML model to use in the pipeline<br>resampling = SMOTE(sampling_strategy='auto', random_state=0)<br>model = RandomForestClassifier(n_estimators=10, criterion='entropy', random_state=0)<br><br># Define the pipeline, tell it to combine SMOTE with the RF model<br>pipeline = Pipeline([('SMOTE', resampling), ('RF', model)])<br><br>pipeline.fit(x_train, y_train)<br>predicted = pipeline.predict(x_test)<br><br># Obtain the results from the classification report and confusion matrix<br>print('Classification report:\\n', classification_report(y_test, predicted))<br>conf_mat = confusion_matrix(y_true=y_test, y_pred=predicted)<br>print('Confusion matrix:\\n', 
conf_mat)<\/pre>\n<h3>Summary:<\/h3>\n<p>In our use case of fraud detection, the most important metric is recall. This is because banks\/financial institutions are most concerned about catching as many fraud cases as possible: fraud is expensive, and they might lose a lot of money over it. Hence, a few false positives, i.e. genuine customers flagged as fraud, might not be too cumbersome, because this only means blocking some transactions. However, blocking too many genuine transactions is not a feasible solution either, so depending on the risk appetite of the financial institution we can go with either the simple over-sampling method or SMOTE. We can also tune the parameters of the model using grid\u00a0search to further enhance the results.<\/p>\n<p>For details on the code refer to this link on\u00a0<a href=\"https:\/\/github.com\/Mythili7\/CC-Fraud-detection\/blob\/master\/Credit%20card%20Fraud%20Detection.ipynb\">Github<\/a>.<\/p>\n<h4>References:<\/h4>\n<p>[1] Mythili Krishnan, Madhan K. Srinivasan, <a href=\"https:\/\/www.researchgate.net\/publication\/357504636_Credit_Card_Fraud_Detection_An_Exploration_of_Different_Sampling_Methods_to_Solve_the_Class_Imbalance_Problem\">Credit Card Fraud Detection: An Exploration of Different Sampling Methods to Solve the Class Imbalance Problem<\/a> (2022), ResearchGate<\/p>\n<p>[2] Bartosz Krawczyk, <a href=\"https:\/\/link.springer.com\/article\/10.1007\/s13748-016-0094-0\">Learning from imbalanced data: open challenges and future directions<\/a> (2016),\u00a0Springer<\/p>\n<p>[3] Nitesh V. Chawla, Kevin W. Bowyer, Lawrence O. Hall and W. 
Philip Kegelmeyer, <a href=\"https:\/\/arxiv.org\/pdf\/1106.1813.pdf\">SMOTE: Synthetic Minority Over-sampling Technique<\/a> (2002), Journal of Artificial Intelligence Research<\/p>\n<p>[4] Leo Breiman, <a href=\"https:\/\/www.stat.berkeley.edu\/~breiman\/randomforest2001.pdf\">Random Forests<\/a> (2001), stat.berkeley.edu<\/p>\n<p>[5] Jeremy Jordan, <a href=\"https:\/\/www.jeremyjordan.me\/imbalanced-data\/\">Learning from imbalanced data<\/a>\u00a0(2018)<\/p>\n<p>[6] <a href=\"https:\/\/trenton3983.github.io\/files\/projects\/2019-07-19_fraud_detection_python\/2019-07-19_fraud_detection_python.html\">https:\/\/trenton3983.github.io\/files\/projects\/2019-07-19_fraud_detection_python\/2019-07-19_fraud_detection_python.html<\/a><\/p>\n<hr>\n<p><a href=\"https:\/\/towardsdatascience.com\/credit-card-fraud-detection-with-different-sampling-techniques-cece7734acc5\">Credit Card Fraud Detection with Different Sampling Techniques<\/a> was originally published in <a href=\"https:\/\/towardsdatascience.com\/\">Towards Data Science<\/a> on Medium.<\/p>\n<\/div>\n<p>Mythili Krishnan<br \/>\n<a href=\"https:\/\/medium.com\/m\/global-identity-2?redirectUrl=https%3A%2F%2Ftowardsdatascience.com%2Fcredit-card-fraud-detection-with-different-sampling-techniques-cece7734acc5\">Go to original source<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Credit Card Fraud Detection with Different Sampling Techniques How to deal with imbalanced data Photo by Bermix Studio on\u00a0Unsplash Credit card fraud detection is a plague that all financial institutions are at risk with. 
In general fraud detection is very challenging because fraudsters are coming up with new and innovative ways of detecting fraud, so [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[62,703,83,705,704,212],"tags":[84,706,707],"class_list":["post-587","post","type-post","status-publish","format-standard","hentry","category-aimldsaimlds","category-credit-cards","category-data-science","category-fraud-detection","category-imbalanced-data","category-sampling","tag-data","tag-fraud","tag-imbalance"],"_links":{"self":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/587"}],"collection":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/comments?post=587"}],"version-history":[{"count":0,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/587\/revisions"}],"wp:attachment":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/media?parent=587"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/categories?post=587"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/tags?post=587"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}