{"id":1558,"date":"2025-01-31T07:02:41","date_gmt":"2025-01-31T07:02:41","guid":{"rendered":"https:\/\/mailitics.com\/index.php\/2025\/01\/31\/data-pruning-mnist-how-i-hit-99-accuracy-using-half-the-data-9179a8fb4521\/"},"modified":"2025-01-31T07:02:41","modified_gmt":"2025-01-31T07:02:41","slug":"data-pruning-mnist-how-i-hit-99-accuracy-using-half-the-data-9179a8fb4521","status":"publish","type":"post","link":"https:\/\/mailitics.com\/index.php\/2025\/01\/31\/data-pruning-mnist-how-i-hit-99-accuracy-using-half-the-data-9179a8fb4521\/","title":{"rendered":"Data Pruning MNIST: How I Hit 99% Accuracy Using Half the Data"},"content":{"rendered":"<p>    Data Pruning MNIST: How I Hit 99% Accuracy Using Half the Data<br \/>\n \t<BR><br \/>\n<BR><\/BR><br \/>\n    <!-- no image --><br \/>\n \t<BR><br \/>\n<BR><\/BR><\/p>\n<div>\n<h4>How much data does AI really\u00a0need?<\/h4>\n<p><strong>TLDR<\/strong>: Data-centric AI can create more efficient and accurate models. I experimented with data pruning on MNIST\u00b9 to classify handwritten digits.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2AufEKI8iLqRID92ATvTwplw.png?ssl=1\"><figcaption>Best runs for \u201cfurthest-from-centroid\u201d selection compared to full dataset. 
Image by\u00a0author.<\/figcaption><\/figure>\n<p>What if I told you that using just 50% of your training data could achieve better results than using the full\u00a0dataset?<\/p>\n<p>In my recent experiments with the MNIST dataset\u00b9, that\u2019s exactly what happened.<\/p>\n<p>Even more surprisingly, <strong>using just 10% of well-selected data still achieved over 98% accuracy<\/strong>.<\/p>\n<h3>Data Pruning\u00a0Results<\/h3>\n<p>The plot above shows the model\u2019s accuracy compared to the training dataset size when using the most effective pruning method I\u00a0tested.<\/p>\n<ul>\n<li>Using 50% of the data with the \u201c<strong>furthest-from-centroid<\/strong>\u201d selection strategy achieved a <em>median<\/em> accuracy of 98.73%, slightly better than training on the full dataset (98.71%) with no data\u00a0pruning.<\/li>\n<li>Even with just 10% of the data, the best run using \u201cfurthest-from-centroid\u201d had 98.2% accuracy.<\/li>\n<li>Randomly sampling 10% of the dataset still hit 97.59% accuracy on its best run, without any selection strategy.<\/li>\n<\/ul>\n<p>Overall, I found this fascinating. 
But it really leads me to question\u200a\u2014\u200a<strong>how much data does AI really\u00a0need?<\/strong><\/p>\n<h3>What is Furthest-from-Centroid?<\/h3>\n<p>I tested several data pruning strategies.<\/p>\n<p>The best-performing strategy was surprisingly simple:<\/p>\n<ol>\n<li>Group similar images into clusters using\u00a0k-means.<\/li>\n<li>For each cluster, find its center point (centroid).<\/li>\n<li>Select the images that are furthest from their cluster\u2019s center.<\/li>\n<\/ol>\n<p>Think of each cluster as a group of similar-looking digits.<\/p>\n<p>Instead of picking the most \u201ctypical\u201d digit from each group, this method picks the unusual ones\u200a\u2014\u200athe digits that are still recognizable but written in unique\u00a0ways.<\/p>\n<p>These outliers help the model learn more robust decision boundaries.<\/p>\n<h4>Why Furthest-from-Centroid Works<\/h4>\n<ol>\n<li>\n<strong>Information Gain<\/strong>: Each selected example provides unique information about the decision boundary.<\/li>\n<li>\n<strong>Diversity<\/strong>: Captures varied writing styles and edge\u00a0cases.<\/li>\n<li>\n<strong>Reduced Redundancy<\/strong>: Eliminates nearly identical examples that don\u2019t add new information.<\/li>\n<\/ol>\n<p>When you have abundant data, the marginal value of another typical example is low. 
Instead, focusing on boundary cases helps define decision boundaries better.<\/p>\n<p>Here\u2019s what the \u201cfurthest-from-centroid\u201d samples look like compared to typical examples:<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/752\/1%2AkJnLfJklfNS-V82v-tZHmw.png?ssl=1\"><\/figure>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/752\/1%2AkvBBwR4jmORAzOmW9V_hWg.png?ssl=1\"><\/figure>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/752\/1%2A3Ei-C9DgbyHOMa_0sYf1_A.png?ssl=1\"><\/figure>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/752\/1%2AllFb8pzX0CmIlfmnpy9IiA.png?ssl=1\"><figcaption>Centroid Image and Furthest-from-centroid Image of various clusters. Images from <a href=\"https:\/\/huggingface.co\/datasets\/ylecun\/mnist\">the MNIST dataset<\/a>, reproduced by the\u00a0author.<\/figcaption><\/figure>\n<p>Notice how the selected samples capture more varied writing styles and edge\u00a0cases.<\/p>\n<p>In some clusters, such as 1, 3, and 8, the furthest point simply looks like a more varied version of the prototypical center.<\/p>\n<p>Cluster 6 is an interesting case, showing that some images are difficult even for a human to identify. 
But you can still make out how this could be in a cluster with the centroid as an\u00a08.<\/p>\n<h3>The Theory Behind Data\u00a0Pruning<\/h3>\n<p>Recent research on <a href=\"https:\/\/arxiv.org\/abs\/2206.14486\">neural scaling laws<\/a> helps to explain why data pruning using a \u201cfurthest-from-centroid\u201d approach works, especially on the MNIST\u00a0dataset.<\/p>\n<h4>Data Redundancy<\/h4>\n<p>Many training examples in large datasets are highly redundant.<\/p>\n<p>Think about MNIST: how many nearly identical \u20187\u2019s do we really need? The key to data pruning isn\u2019t having more examples\u200a\u2014\u200ait\u2019s having the right examples.<\/p>\n<h4>Selection Strategy vs Dataset\u00a0Size<\/h4>\n<p>One of the most interesting findings from the above paper is how the optimal data selection strategy changes based on your dataset\u00a0size:<\/p>\n<ul>\n<li>\n<strong>With \u201ca lot\u201d of data:<\/strong> <em>Select harder, more diverse<\/em> examples (furthest from cluster centers).<\/li>\n<li>\n<strong>With scarce data:<\/strong> <em>Select easier, more typical<\/em> examples (closest to cluster centers).<\/li>\n<\/ul>\n<p>This explains why our \u201cfurthest-from-centroid\u201d strategy worked so\u00a0well.<\/p>\n<p>With MNIST\u2019s 60,000 training examples, we were in the \u201cabundant data\u201d regime where selecting diverse, challenging examples proved most beneficial.<\/p>\n<h3>The Full Experiment<\/h3>\n<h4>Inspiration and\u00a0Goals<\/h4>\n<p>I was inspired by these two recent papers (and the fact that I\u2019m a data engineer):<\/p>\n<ul>\n<li><a href=\"https:\/\/arxiv.org\/abs\/2206.14486\">Beyond neural scaling laws: beating power law scaling via data\u00a0pruning<\/a><\/li>\n<li><a href=\"https:\/\/arxiv.org\/abs\/2304.08442\">The MiniPile Challenge for Data-Efficient Language\u00a0Models<\/a><\/li>\n<\/ul>\n<p>Both explore various ways we can use data selection strategies to train performant 
models on less\u00a0data.<\/p>\n<h4>Methodology<\/h4>\n<p>I used <a href=\"https:\/\/en.wikipedia.org\/wiki\/LeNet\">LeNet-5<\/a> as my model architecture.<\/p>\n<p>Then using one of the strategies below I pruned the training dataset of MNIST and trained a model. Testing was done against the full test\u00a0set.<\/p>\n<p>Due to time constraints, I only ran 5 tests per experiment.<\/p>\n<p><strong><em>Full code and results <\/em><\/strong><a href=\"https:\/\/github.com\/bitsofchris\/deep-learning\/tree\/main\/code\/05_data-pruning-mnist-image-classification\"><strong><em>available here on\u00a0GitHub<\/em><\/strong><\/a><strong><em>.<\/em><\/strong><\/p>\n<p><strong>Strategy #1: Baseline, Full\u00a0Dataset<\/strong><\/p>\n<ul>\n<li>Standard LeNet-5 architecture<\/li>\n<li>Trained using 100% of training\u00a0data<\/li>\n<\/ul>\n<p><strong>Strategy #2: Random\u00a0Sampling<\/strong><\/p>\n<ul>\n<li>Randomly sample individual images from the training\u00a0dataset<\/li>\n<\/ul>\n<p><strong>Strategy #3: K-means Clustering with Different Selection Strategies<\/strong><\/p>\n<p>Here\u2019s how this\u00a0worked:<\/p>\n<ol>\n<li>Preprocess the images with PCA to reduce the dimensionality. This just means each image was reduced from 784 values (28&#215;28 pixels) into only 50 values. PCA does this while retaining the most important patterns and removing redundant information.<\/li>\n<li>Cluster using k-means. The number of clusters was fixed at 50 and 500 in different tests. 
My poor CPU couldn\u2019t handle much beyond 500 given all the experiments.<\/li>\n<li>I then tested different selection methods once the data was\u00a0clustered:<\/li>\n<\/ol>\n<ul>\n<li>Closest-to-centroid\u200a\u2014\u200athese represent a \u201ctypical\u201d example of the\u00a0cluster.<\/li>\n<li>Furthest-from-centroid\u200a\u2014\u200amore representative of edge\u00a0cases.<\/li>\n<li>Random from each cluster\u200a\u2014\u200arandomly select within each\u00a0cluster.<\/li>\n<\/ul>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/544\/1%2AmPUJNBxiiC57cOstULwwaA.png?ssl=1\"><figcaption>Example of Clustering Selection. Image by\u00a0author.<\/figcaption><\/figure>\n<h3>Technical Implementation Lessons\u00a0Learned<\/h3>\n<ul>\n<li>PCA reduced noise and computation time. At first I was just flattening the images. Both the results and the compute time improved with PCA, so I kept it for the full experiment.<\/li>\n<li>I switched from standard K-means to MiniBatchKMeans clustering for better speed. The standard algorithm was too slow for my CPU given all the\u00a0tests.<\/li>\n<li>Setting up a proper test harness was key. Moving experiment configs to a YAML file, automatically saving results to a file, and having o1 write my visualization code made life much\u00a0easier.<\/li>\n<\/ul>\n<h3>Full Results<\/h3>\n<h4>Median Accuracy &amp; Run\u00a0Time<\/h4>\n<p>Here are the median results, comparing our baseline LeNet-5 trained on the full dataset with two different strategies that used 50% of the\u00a0dataset.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/492\/1%2A9VAkwaPAvZPhUOrStv-j-Q.png?ssl=1\"><figcaption>Median Results. 
Image by\u00a0author.<\/figcaption><\/figure>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2An78hMKTWOxjJt25Z8uLfZw.png?ssl=1\"><figcaption>Median Accuracies. Image by\u00a0author.<\/figcaption><\/figure>\n<h4>Accuracy vs Run Time Full\u00a0Results<\/h4>\n<p>The charts below show the results of my four pruning strategies compared to the baseline in\u00a0red.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2A-vlcZwCOQM-3Zsi-3w4hJw.png?ssl=1\"><figcaption>Median Accuracy across Data Pruning methods. Image by\u00a0author.<\/figcaption><\/figure>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2AANSq5Wj2bgPcI00-vDbi8g.png?ssl=1\"><figcaption>Median Run time across Data Pruning methods. Image by\u00a0author.<\/figcaption><\/figure>\n<p>Key findings across multiple\u00a0runs:<\/p>\n<ul>\n<li>Furthest-from-centroid consistently outperformed other\u00a0methods.<\/li>\n<li>There is a sweet spot between compute time and model accuracy, if you want to find it for your use case. More work needs to be done\u00a0here.<\/li>\n<\/ul>\n<p>I\u2019m still shocked that just randomly reducing the dataset gives acceptable results if efficiency is what you\u2019re\u00a0after.<\/p>\n<h3>Next Steps<\/h3>\n<h4>Future Plans<\/h4>\n<ol>\n<li>Test this on <a href=\"https:\/\/medium.com\/@BitsOfChris\/i-trained-a-local-llm-on-my-obsidian-heres-what-i-learned-a3e738f9bed0\">my second brain<\/a>. I want to fine-tune an LLM on my full Obsidian and test data pruning along with hierarchical summarization.<\/li>\n<li>Explore other embedding methods for clustering. 
I can try training an auto-encoder to embed the images rather than use\u00a0PCA.<\/li>\n<li>Test this on more complex and larger datasets (CIFAR-10, ImageNet).<\/li>\n<li>Experiment with how model architecture impacts the performance of data pruning strategies.<\/li>\n<\/ol>\n<h3>Conclusion<\/h3>\n<p>These findings suggest we need to rethink our approach to dataset curation:<\/p>\n<ol>\n<li>More data isn\u2019t always better\u200a\u2014\u200athere seem to be diminishing returns to bigger data and bigger\u00a0models.<\/li>\n<li>Strategic pruning can actually improve\u00a0results.<\/li>\n<li>The optimal strategy depends on your starting dataset\u00a0size.<\/li>\n<\/ol>\n<p>As people start sounding the alarm that we are running out of data, I can\u2019t help but wonder if less data is actually the key to useful, cost-effective models.<\/p>\n<p>I intend to continue exploring the space. Please reach out if you find this interesting; I\u2019m happy to connect and talk more\u00a0\ud83d\ude42<\/p>\n<p><em>I\u2019m a Staff Data Engineer working on an Applied AI Research team building foundational time series\u00a0models.<\/em><\/p>\n<p><em>If you\u2019d like to follow my work or get in touch, head to <\/em><a href=\"https:\/\/bitsofchris.com\/about\"><em>my\u00a0blog<\/em><\/a><em>.<\/em><\/p>\n<h3>References<\/h3>\n<ol>\n<li>LeCun, Y., Cortes, C., &amp; Burges, C. J. (2010). MNIST handwritten digit database. ATT Labs [Online]. 
Available: <a href=\"http:\/\/yann.lecun.com\/exdb\/mnist\">http:\/\/yann.lecun.com\/exdb\/mnist<\/a> or <a href=\"https:\/\/huggingface.co\/datasets\/ylecun\/mnist\">https:\/\/huggingface.co\/datasets\/ylecun\/mnist<\/a>\n<\/li>\n<\/ol>\n<p>The MNIST dataset is used under the Creative Commons Attribution-Share Alike 3.0\u00a0license.<\/p>\n<hr>\n<p><a href=\"https:\/\/towardsdatascience.com\/data-pruning-mnist-how-i-hit-99-accuracy-using-half-the-data-9179a8fb4521\">Data Pruning MNIST: How I Hit 99% Accuracy Using Half the Data<\/a> was originally published in <a href=\"https:\/\/towardsdatascience.com\/\">Towards Data Science<\/a> on Medium, where people are continuing the conversation by highlighting and responding to this story.<\/p>\n<\/div>\n<p> \t<BR><br \/>\n <BR><\/BR><br \/>\n    Chris Lettieri<br \/>\n \t<BR><br \/>\n<BR><\/BR><br \/>\n<a href=\"https:\/\/medium.com\/m\/global-identity-2?redirectUrl=https%3A%2F%2Ftowardsdatascience.com%2Fdata-pruning-mnist-how-i-hit-99-accuracy-using-half-the-data-9179a8fb4521\">Go to original source<\/a><br \/>\n \t<BR><br \/>\n <BR><\/BR><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Data Pruning MNIST: How I Hit 99% Accuracy Using Half the Data How much data does AI really\u00a0need? TLDR: Data-centric AI can create more efficient and accurate models. I experimented with data pruning on MNIST\u00b9 to classify handwritten digits. Best runs for \u201cfurthest-from-centroid\u201d selection compared to full dataset. Image by\u00a0author. 
What if I told you [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[151,62,401,83,70,673],"tags":[84,1556,878],"class_list":["post-1558","post","type-post","status-publish","format-standard","hentry","category-ai","category-aimldsaimlds","category-data-engineering","category-data-science","category-machine-learning","category-neural-networks","tag-data","tag-pruning","tag-using"],"_links":{"self":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/1558"}],"collection":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/comments?post=1558"}],"version-history":[{"count":0,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/1558\/revisions"}],"wp:attachment":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/media?parent=1558"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/categories?post=1558"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/tags?post=1558"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}