{"id":1849,"date":"2025-02-14T07:03:11","date_gmt":"2025-02-14T07:03:11","guid":{"rendered":"https:\/\/mailitics.com\/index.php\/2025\/02\/14\/learnings-from-a-machine-learning-engineer-part-1-the-data\/"},"modified":"2025-02-14T07:03:11","modified_gmt":"2025-02-14T07:03:11","slug":"learnings-from-a-machine-learning-engineer-part-1-the-data","status":"publish","type":"post","link":"https:\/\/mailitics.com\/index.php\/2025\/02\/14\/learnings-from-a-machine-learning-engineer-part-1-the-data\/","title":{"rendered":"Learnings from a Machine Learning Engineer \u2014 Part 1: The Data"},"content":{"rendered":"<div>\n<p class=\"wp-block-paragraph\" id=\"760c\">It is said that in order for a machine learning model to be successful, you need to have good data. While this is true (and pretty much obvious), it is extremely difficult to define, build, and sustain good data. Let me share with you the unique processes that I have learned over several years building an ever-growing image classification system and how you can apply these techniques to your own application.<\/p>\n<p class=\"wp-block-paragraph\" id=\"628a\">With persistence and diligence, you can avoid the classic \u201cgarbage in, garbage out\u201d, maximize your model accuracy, and demonstrate real business value.<\/p>\n<p class=\"wp-block-paragraph\" id=\"50ae\">In this series of articles, I will dive into the care and feeding of a multi-class, single-label image classification app and what it takes to reach the highest level of performance. I won\u2019t get into any coding or specific user interfaces, just the main concepts that you can incorporate to suit your needs with the tools at your disposal.<\/p>\n<p class=\"wp-block-paragraph\" id=\"d328\">Here is a brief description of the articles. 
You will notice that the model is last on the list since we need to focus on curating the data first and foremost:<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">Part 1 \u2014 The Data \u2014 Labelling standards, classes and sub-classes<\/li>\n<li class=\"wp-block-list-item\">\n<a href=\"https:\/\/towardsdatascience.com\/learnings-from-a-machine-learning-engineer-part-2-the-data-sets\/\">Part 2 \u2014 The Data Sets<\/a>\u00a0\u2014 Cutoffs and thresholds, benchmark sets, staged and synthetic data<\/li>\n<li class=\"wp-block-list-item\">\n<a href=\"https:\/\/towardsdatascience.com\/learnings-from-a-machine-learning-engineer-part-3-the-evaluation\/\">Part 3 \u2014 The Evaluation<\/a>\u00a0\u2014 Trained model versus deployed model evaluations<\/li>\n<li class=\"wp-block-list-item\">\n<a href=\"https:\/\/towardsdatascience.com\/learnings-from-a-machine-learning-engineer-part-4-the-model\/\">Part 4 \u2014 The Model\u00a0<\/a>\u2014 Fine-tuning, bulk identification, performance reporting<\/li>\n<\/ul>\n<h2 class=\"wp-block-heading\" id=\"bc19\"><strong>Background<\/strong><\/h2>\n<p class=\"wp-block-paragraph\" id=\"e364\">Over the past six years, I have been primarily focused on building and maintaining an image classification application for a manufacturing company. Back when I started, most of the software I needed did not exist or was too expensive, so I created my own tools from scratch. In this time, I have deployed two identifier applications; the largest handles 1,500 classes and achieves 97\u201398% accuracy.<\/p>\n<p class=\"wp-block-paragraph\" id=\"839a\">It was about eight years ago that I started online studies for <a href=\"https:\/\/towardsdatascience.com\/tag\/data-science\/\" title=\"Data Science\">Data Science<\/a> and machine learning. So, when the exciting opportunity to create an AI application presented itself, I was prepared to build the tools I needed to leverage the latest advancements. 
I jumped in with both feet!<\/p>\n<p class=\"wp-block-paragraph\" id=\"00f9\">I quickly found that building and deploying a model is probably the easiest part of the job. Feeding high-quality data into the model is the best way to improve performance, and that requires focus and patience. Attention to detail is what I do best, so this was a perfect fit.<\/p>\n<h2 class=\"wp-block-heading\" id=\"cfd5\"><strong>It all starts with the data<\/strong><\/h2>\n<p class=\"wp-block-paragraph\" id=\"1ce5\">I feel that so much attention is given to model selection (deciding which neural network is best) that the data becomes an afterthought. I have found out the hard way that even one or two pieces of bad data can significantly impact model performance, so that is where we need to focus.<\/p>\n<p class=\"wp-block-paragraph\" id=\"48fc\">For example, let\u2019s say you train the classic cat versus dog image classifier. You have 50 pictures of cats and 50 pictures of dogs, but one of the \u201ccats\u201d is clearly (objectively) a picture of a dog. The computer doesn\u2019t have the luxury of ignoring the mislabelled image, and instead adjusts the model weights to make it fit. Square peg meets round hole.<\/p>\n<p class=\"wp-block-paragraph\" id=\"59e8\">Another example would be a picture of a cat that climbed up into a tree. But when you take a holistic view of it, you would describe it as a picture of a tree (first) with a cat (second). Again, the computer doesn\u2019t know to ignore the big tree and focus on the cat \u2014 it will start to identify trees as cats, even if there is a dog. You can think of these pictures as outliers that should be removed.<\/p>\n<p class=\"wp-block-paragraph\" id=\"3e08\">It doesn\u2019t matter if you have the best neural network in the world; you can count on the model making poor predictions when it is trained on \u201cbad\u201d data. 
I\u2019ve learned that any time I see the model make mistakes, it\u2019s time to review the data.<\/p>\n<h2 class=\"wp-block-heading\" id=\"9cc0\"><strong>Example Application \u2014 Zoo animals<\/strong><\/h2>\n<p class=\"wp-block-paragraph\" id=\"2eab\">For the rest of this write-up, I will use an example of identifying zoo animals. Let\u2019s assume your goal is to create a mobile app where guests at the zoo can take pictures of the animals they see and have the app identify them. Specifically, this is a multi-class, single-label application.<\/p>\n<p class=\"wp-block-paragraph\" id=\"e8c2\">Here is your challenge:<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">\n<strong>Variety<\/strong>\u00a0\u2014 There are a lot of different animals at the zoo and many of them look very similar.<\/li>\n<li class=\"wp-block-list-item\">\n<strong>Quality<\/strong>\u00a0\u2014 Guests using the app don\u2019t always take good pictures (zoomed out, blurry, too dark), so we don\u2019t want to provide an answer if the image is poor.<\/li>\n<li class=\"wp-block-list-item\">\n<strong>Growth<\/strong>\u00a0\u2014 The zoo keeps expanding and adding new species all the time.<\/li>\n<li class=\"wp-block-list-item\">\n<strong>Out-of-scope<\/strong>\u00a0\u2014 Occasionally you might find that people take pictures of the sparrows near the food court grabbing some dropped popcorn.<\/li>\n<li class=\"wp-block-list-item\">\n<strong>Pranksters<\/strong>\u00a0\u2014 Just for fun, guests may take a picture of the bag of popcorn just to see what it comes back with.<\/li>\n<\/ul>\n<p class=\"wp-block-paragraph\" id=\"0c9d\">These are all real challenges \u2014 being able to tell the subtle differences between animals, handling out-of-scope cases, and just plain poor images.<\/p>\n<p class=\"wp-block-paragraph\" id=\"130f\">Before we get there, let\u2019s start from the beginning.<\/p>\n<h2 class=\"wp-block-heading\" id=\"a22b\"><strong>Collecting and 
Labelling<\/strong><\/h2>\n<p class=\"wp-block-paragraph\" id=\"d9bc\">There are a lot of tools these days to help you with this part of the process, but the challenge remains the same \u2014 collecting, labelling, and curating the data.<\/p>\n<p class=\"wp-block-paragraph\" id=\"45ba\">Having data to collect is challenge #1. Without images, you have nothing to train. You may need to get creative in sourcing the data, or even create synthetic data. More on that in Part 2.<\/p>\n<p class=\"wp-block-paragraph\" id=\"1620\">A quick note about image pre-processing. I convert all my images to the input size of my neural network and save them as PNG. Inside this square PNG, I preserve the aspect ratio of the original picture and fill the background black. I don\u2019t stretch the image or crop any features out. This also helps center the subject.<\/p>\n<p class=\"wp-block-paragraph\" id=\"66db\">Challenge #2 is to establish standards for data quality\u2026and ensure that these standards are followed! These standards will guide you toward that \u201cgood\u201d data. And this assumes, of course, correct labels. Having both is much easier said than done!<\/p>\n<p class=\"wp-block-paragraph\" id=\"8f6d\">I hope to show how \u201cgood\u201d and \u201ccorrect\u201d actually go hand in hand, and how important it is to apply these standards to every image.<\/p>\n<h2 class=\"wp-block-heading\" id=\"ddfb\"><strong>Good Data<\/strong><\/h2>\n<p class=\"wp-block-paragraph\" id=\"a0a0\">First, I want to point out that the image data discussed here is for the training set. What qualifies as a good image for\u00a0<strong>training<\/strong>\u00a0is a bit different than what qualifies as a good image for\u00a0<strong>evaluation<\/strong>. More on that in <a href=\"https:\/\/towardsdatascience.com\/learnings-from-a-machine-learning-engineer-part-3-the-evaluation\/\">Part 3<\/a>.<\/p>\n<p class=\"wp-block-paragraph\" id=\"7299\">So, what is \u201cgood\u201d data when talking about images? 
\u201cA picture is worth a thousand words\u201d, and if the\u00a0<strong>first words<\/strong>\u00a0you use to describe the picture do not include the subject you are trying to label, then it is not good and you need to remove it from your training set.<\/p>\n<p class=\"wp-block-paragraph\" id=\"93e4\">For example, let\u2019s say you are shown a picture of a zebra and (removing bias toward your application) you describe it as an \u201copen field with a zebra in the distance\u201d. In other words, if \u201copen field\u201d is the first thing you notice, then you likely do\u00a0<strong>not<\/strong>\u00a0want to use that image. The opposite is also true \u2014 if the picture is way too close, you would describe it as \u201czebra pattern\u201d.<\/p>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" data-dominant-color=\"9c9d59\" data-has-transparency=\"false\" style=\"--dominant-color: #9c9d59;\" loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"682\" src=\"https:\/\/i0.wp.com\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/0_WLXsqLb60OejrMR2-1024x682.png?resize=1024%2C682&#038;ssl=1\" alt=\"\" class=\"wp-image-597823 not-transparent\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/0_WLXsqLb60OejrMR2-1024x682.png 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/0_WLXsqLb60OejrMR2-300x200.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/0_WLXsqLb60OejrMR2-768x512.png 768w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/0_WLXsqLb60OejrMR2.png 1400w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\"><figcaption class=\"wp-element-caption\">Photo by <a href=\"https:\/\/unsplash.com\/@traveleroohlala\">Meg von Haartman<\/a> on <a href=\"https:\/\/unsplash.com\/\">Unsplash<\/a><\/figcaption><\/figure>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" data-dominant-color=\"3f372b\" data-has-transparency=\"false\" 
style=\"--dominant-color: #3f372b;\" loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"576\" src=\"https:\/\/i0.wp.com\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/0_aXrXN1Zukk83FPEB-1024x576.png?resize=1024%2C576&#038;ssl=1\" alt=\"\" class=\"wp-image-597824 not-transparent\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/0_aXrXN1Zukk83FPEB-1024x576.png 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/0_aXrXN1Zukk83FPEB-300x169.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/0_aXrXN1Zukk83FPEB-768x432.png 768w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/0_aXrXN1Zukk83FPEB.png 1400w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\"><figcaption class=\"wp-element-caption\">Photo by\u00a0<a href=\"https:\/\/unsplash.com\/@jdent\" target=\"_blank\" rel=\"noreferrer noopener\">Jason Dent<\/a>\u00a0on\u00a0<a href=\"https:\/\/unsplash.com\/\" target=\"_blank\" rel=\"noreferrer noopener\">Unsplash<\/a><\/figcaption><\/figure>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" data-dominant-color=\"8d9167\" data-has-transparency=\"false\" style=\"--dominant-color: #8d9167;\" loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"682\" src=\"https:\/\/i0.wp.com\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/0_3KCnjRLepSP0tGC5-1024x682.png?resize=1024%2C682&#038;ssl=1\" alt=\"\" class=\"wp-image-597825 not-transparent\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/0_3KCnjRLepSP0tGC5-1024x682.png 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/0_3KCnjRLepSP0tGC5-300x200.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/0_3KCnjRLepSP0tGC5-768x512.png 768w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/0_3KCnjRLepSP0tGC5.png 1400w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\"><figcaption 
class=\"wp-element-caption\">Photo by\u00a0<a href=\"https:\/\/unsplash.com\/@martinols3n\" target=\"_blank\" rel=\"noreferrer noopener\">Martin Olsen<\/a>\u00a0on\u00a0<a href=\"https:\/\/unsplash.com\/\" target=\"_blank\" rel=\"noreferrer noopener\">Unsplash<\/a><\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\" id=\"aa7e\">What you want is a description like \u201ca zebra, front and center\u201d. This would have your subject taking up about 80\u201390% of the total frame. Sometimes I will take the time to crop the original image so the subject is framed properly.<\/p>\n<p class=\"wp-block-paragraph\" id=\"7657\">Keep in mind the use of image augmentation at the time of training. Leaving a 10\u201320% buffer around the edges will allow \u201czoom in\u201d augmentation. And \u201czoom out\u201d augmentation will simulate smaller subjects, so don\u2019t start with a subject that fills less than 50% of the total frame, since you lose detail.<\/p>\n<p class=\"wp-block-paragraph\" id=\"3a3a\">Another aspect of a \u201cgood\u201d image relates to the label. If you can only see the back side of your zoo animal, can you really tell, for example, that it is a cheetah versus a leopard? The key identifying features need to be visible. 
If a human struggles to identify it, you can\u2019t expect the computer to learn anything.<\/p>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" data-dominant-color=\"978e47\" data-has-transparency=\"false\" style=\"--dominant-color: #978e47;\" loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"1014\" src=\"https:\/\/i0.wp.com\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/0_QZDXSNnolypIA_Ej-1024x1014.png?resize=1024%2C1014&#038;ssl=1\" alt=\"\" class=\"wp-image-597826 not-transparent\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/0_QZDXSNnolypIA_Ej-1024x1014.png 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/0_QZDXSNnolypIA_Ej-300x297.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/0_QZDXSNnolypIA_Ej-150x150.png 150w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/0_QZDXSNnolypIA_Ej-768x760.png 768w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/0_QZDXSNnolypIA_Ej.png 1400w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\"><figcaption class=\"wp-element-caption\">Photo by\u00a0<a href=\"https:\/\/unsplash.com\/@jdh84\" target=\"_blank\" rel=\"noreferrer noopener\">Jan Harder<\/a>\u00a0on\u00a0<a href=\"https:\/\/unsplash.com\/\" target=\"_blank\" rel=\"noreferrer noopener\">Unsplash<\/a><\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\" id=\"8e84\">What does a \u201cbad\u201d image look like? 
Here is what I frequently watch out for:<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">Wide-angle lens stretching<\/li>\n<li class=\"wp-block-list-item\">Back-lit or silhouette<\/li>\n<li class=\"wp-block-list-item\">High contrast or dark shadows<\/li>\n<li class=\"wp-block-list-item\">Blurry or hazy<\/li>\n<li class=\"wp-block-list-item\">Obscured features<\/li>\n<li class=\"wp-block-list-item\">Multiple subjects<\/li>\n<li class=\"wp-block-list-item\">\u201cDoctored\u201d images, drawn lines and arrows<\/li>\n<li class=\"wp-block-list-item\">\u201cUnusual\u201d angles or situations<\/li>\n<li class=\"wp-block-list-item\">Picture of a mobile device that has a picture of your subject<\/li>\n<\/ul>\n<h2 class=\"wp-block-heading\" id=\"ba7e\"><strong>Correct Labels<\/strong><\/h2>\n<p class=\"wp-block-paragraph\" id=\"d158\">If you have a team of subject matter experts (SMEs) on hand to label the images, you are in a good starting position. Animal trainers at the zoo know the various species and can spot the differences between, for example, a chimpanzee and a bonobo.<\/p>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" data-dominant-color=\"989383\" data-has-transparency=\"false\" style=\"--dominant-color: #989383;\" loading=\"lazy\" decoding=\"async\" width=\"683\" height=\"1024\" src=\"https:\/\/i0.wp.com\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/0_r3a1LkgvwcfjlavC-683x1024.png?resize=683%2C1024&#038;ssl=1\" alt=\"\" class=\"wp-image-597827 not-transparent\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/0_r3a1LkgvwcfjlavC-683x1024.png 683w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/0_r3a1LkgvwcfjlavC-200x300.png 200w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/0_r3a1LkgvwcfjlavC-768x1151.png 768w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/0_r3a1LkgvwcfjlavC-1025x1536.png 1025w, 
https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/0_r3a1LkgvwcfjlavC-1367x2048.png 1367w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/0_r3a1LkgvwcfjlavC.png 1400w\" sizes=\"auto, (max-width: 683px) 100vw, 683px\"><figcaption class=\"wp-element-caption\">Photo by\u00a0<a href=\"https:\/\/unsplash.com\/@aadelee\" target=\"_blank\" rel=\"noreferrer noopener\">Ad\u00e8le<\/a>\u00a0on\u00a0<a href=\"https:\/\/unsplash.com\/\" target=\"_blank\" rel=\"noreferrer noopener\">Unsplash<\/a><\/figcaption><\/figure>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" data-dominant-color=\"3f4d45\" data-has-transparency=\"false\" style=\"--dominant-color: #3f4d45;\" loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"1024\" src=\"https:\/\/i0.wp.com\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/0_H8Sz4hOQqrxad9uj-1024x1024.png?resize=1024%2C1024&#038;ssl=1\" alt=\"\" class=\"wp-image-597828 not-transparent\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/0_H8Sz4hOQqrxad9uj-1024x1024.png 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/0_H8Sz4hOQqrxad9uj-300x300.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/0_H8Sz4hOQqrxad9uj-150x150.png 150w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/0_H8Sz4hOQqrxad9uj-768x768.png 768w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/0_H8Sz4hOQqrxad9uj.png 1400w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\"><figcaption class=\"wp-element-caption\">Photo by\u00a0<a href=\"https:\/\/unsplash.com\/@andriusordojan\" target=\"_blank\" rel=\"noreferrer noopener\">Andrius Ordojan<\/a>\u00a0on\u00a0<a href=\"https:\/\/unsplash.com\/\" target=\"_blank\" rel=\"noreferrer noopener\">Unsplash<\/a><\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\" id=\"ce88\">To a <a href=\"https:\/\/towardsdatascience.com\/tag\/machine-learning-engineer\/\" 
title=\"Machine Learning Engineer\">Machine Learning Engineer<\/a>, it is easy to assume all labels from your SMEs are correct and move right on to training the model. However, even experts make mistakes, so if you can get a second opinion on the labels, your error rate should go down.<\/p>\n<p class=\"wp-block-paragraph\" id=\"ab03\">In reality, it can be prohibitively expensive to get one, let alone two, subject matter experts to review image labels. The SME usually has years of experience that make them more valuable to the business in other areas of work. My experience is that the machine learning engineer (that\u2019s you and me) becomes the second opinion, and often the first opinion as well.<\/p>\n<p class=\"wp-block-paragraph\" id=\"d965\">Over time, you can become pretty adept at labelling, but certainly not an SME. If you do have the luxury of access to an expert, explain to them the labelling standards and how these are required for the application to be successful. Emphasize \u201cquality over quantity\u201d.<\/p>\n<p class=\"wp-block-paragraph\" id=\"9d28\">It goes without saying that having a\u00a0<strong>correct<\/strong>\u00a0label is important, and all it takes is one or two mislabelled images to degrade performance. These can easily slip into your data set with careless or hasty labelling. So, take the time to get it right.<\/p>\n<p class=\"wp-block-paragraph\" id=\"f8b9\">Ultimately, we as ML engineers are responsible for model performance. So, if we take the approach of only working on model training and deployment, we will find ourselves wondering why performance is falling short.<\/p>\n<h2 class=\"wp-block-heading\" id=\"3f14\"><strong>Unknown Labels<\/strong><\/h2>\n<p class=\"wp-block-paragraph\" id=\"7a80\">A lot of times, you will come across a really good picture of a very interesting subject, but have no idea what it is! It would be a shame to simply dispose of it. 
What you can do is assign it a generic label, like \u201cUnknown Bird\u201d or \u201cRandom Plant\u201d, that is\u00a0<strong>not<\/strong>\u00a0included in your training set. In <a href=\"https:\/\/towardsdatascience.com\/learnings-from-a-machine-learning-engineer-part-4-the-model\/\">Part 4<\/a>, you\u2019ll see how to come back to these images when you have a better idea what they are, and you\u2019ll be glad you saved them.<\/p>\n<h2 class=\"wp-block-heading\" id=\"b79e\"><strong>Model Assistance<\/strong><\/h2>\n<p class=\"wp-block-paragraph\" id=\"ddba\">If you have done any image labelling, then you know how time-consuming and difficult it can be. But this is where having a model, even a less-than-perfect model, can help you.<\/p>\n<p class=\"wp-block-paragraph\" id=\"72ea\">Typically, you have a large collection of unlabelled images and you need to go through them one at a time to assign labels. Simply having the model offer a best guess and display the top 3 results lets you step through each image in a matter of seconds!<\/p>\n<p class=\"wp-block-paragraph\" id=\"056b\">Even if the top 3 results are wrong, this can help you narrow down your search. Over time, newer models will get better, and the labelling process can even be somewhat fun!<\/p>\n<p class=\"wp-block-paragraph\" id=\"e9eb\">In <a href=\"https:\/\/towardsdatascience.com\/learnings-from-a-machine-learning-engineer-part-4-the-model\/\">Part 4<\/a>, I will show how you can bulk identify images and take this to the next level for faster labelling.<\/p>\n<h2 class=\"wp-block-heading\" id=\"8aaa\"><strong>Classes and Sub-Classes<\/strong><\/h2>\n<p class=\"wp-block-paragraph\" id=\"da2d\">I mentioned the example above of two species that look very similar, the chimpanzee and the bonobo. When you start out building your data set, you may have very sparse coverage of one or both of these species. In machine learning terms, we call these \u201cclasses\u201d. 
One option is to roll with what you have and hope that the model picks up on the differences with only a handful of example images.<\/p>\n<p class=\"wp-block-paragraph\" id=\"f81b\">The option that I have used is to merge two or more classes into one, at least temporarily. So, in this case I would create a class called \u201cchimp-bonobo\u201d, which is composed of the limited example pictures from the chimpanzee and bonobo classes. Combined, these may give me enough to train the model on \u201cchimp-bonobo\u201d, with the trade-off that it\u2019s a more generic identification.<\/p>\n<p class=\"wp-block-paragraph\" id=\"2f78\">Sub-classes can even be normal variations. For example,\u00a0<strong>juvenile<\/strong>\u00a0pink flamingos are grey instead of pink. Or, male and female orangutans have distinct facial features. You want to have a fairly balanced number of images for these normal variations, and keeping sub-classes will allow you to accomplish this.<\/p>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" data-dominant-color=\"8b8486\" data-has-transparency=\"false\" style=\"--dominant-color: #8b8486;\" loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"870\" src=\"https:\/\/i0.wp.com\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/0_okCGCryJNYK10sxP-1024x870.png?resize=1024%2C870&#038;ssl=1\" alt=\"\" class=\"wp-image-597829 not-transparent\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/0_okCGCryJNYK10sxP-1024x870.png 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/0_okCGCryJNYK10sxP-300x255.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/0_okCGCryJNYK10sxP-768x653.png 768w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/0_okCGCryJNYK10sxP.png 1400w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\"><figcaption class=\"wp-element-caption\">Photo by\u00a0<a href=\"https:\/\/unsplash.com\/@thephotochad\" target=\"_blank\" 
rel=\"noreferrer noopener\">David Valentine<\/a>\u00a0on\u00a0<a href=\"https:\/\/unsplash.com\/\" target=\"_blank\" rel=\"noreferrer noopener\">Unsplash<\/a><\/figcaption><\/figure>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" data-dominant-color=\"54463c\" data-has-transparency=\"false\" style=\"--dominant-color: #54463c;\" loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"768\" src=\"https:\/\/i0.wp.com\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/0_MTZGrbmPRHX7IHCl-1024x768.png?resize=1024%2C768&#038;ssl=1\" alt=\"\" class=\"wp-image-597830 not-transparent\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/0_MTZGrbmPRHX7IHCl-1024x768.png 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/0_MTZGrbmPRHX7IHCl-300x225.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/0_MTZGrbmPRHX7IHCl-768x576.png 768w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/0_MTZGrbmPRHX7IHCl.png 1400w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\"><figcaption class=\"wp-element-caption\">Photo by\u00a0<a href=\"https:\/\/unsplash.com\/@hbsun2013\" target=\"_blank\" rel=\"noreferrer noopener\">Hongbin<\/a>\u00a0on\u00a0<a href=\"https:\/\/unsplash.com\/\" target=\"_blank\" rel=\"noreferrer noopener\">Unsplash<\/a><\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\" id=\"0279\">Don\u2019t be concerned that you are merging completely different looking classes \u2014 the neural network does a nice job of applying the \u201cOR\u201d operator. 
This works both ways \u2014 it can help you identify male or female variations as one species, but it can hurt you when \u201cbad\u201d outlier images sneak in, like the earlier example of an \u201copen field with a zebra in the distance\u201d.<\/p>\n<p class=\"wp-block-paragraph\" id=\"2f55\">Over time, you will (hopefully) be able to collect more images of the sub-classes and then be able to successfully split them apart (if necessary) and train the model to identify them separately. This process has worked very well for me. Just be sure to double-check all the images when you split them to ensure the labels didn\u2019t get accidentally mixed up \u2014 it will be time well spent.<\/p>\n<p class=\"wp-block-paragraph\" id=\"0d98\">All of this certainly depends on your user requirements, and you can handle this in different ways: either by creating a unique class label like \u201cchimp-bonobo\u201d, or at the front-end presentation layer, where you notify the user that you have intentionally merged these classes and provide guidance on further refining the results. Even after you decide to split the two classes, you may want to caution the user that the model could be wrong since the two classes are so similar.<\/p>\n<h2 class=\"wp-block-heading\" id=\"4a42\"><strong>Up next\u2026<\/strong><\/h2>\n<p class=\"wp-block-paragraph\" id=\"6d1c\">I realize this was a long write-up for something that on the surface seems intuitive, but these are all areas that have tripped me up in the past because I didn\u2019t give them enough attention. Once you have a solid understanding of these principles, you can go on to build a successful application.<\/p>\n<p class=\"wp-block-paragraph\" id=\"d499\">In\u00a0<a href=\"https:\/\/towardsdatascience.com\/learnings-from-a-machine-learning-engineer-part-2-the-data-sets\/\">Part 2<\/a>, we will take the curated data we collected here to create the classic data sets, with a custom benchmark set that will further enhance your data. 
Then we will see how best to evaluate our trained model using a specific \u201ctraining mindset\u201d, and switch to a \u201cproduction mindset\u201d when evaluating a deployed model.<\/p>\n<p>The post <a href=\"https:\/\/towardsdatascience.com\/learnings-from-a-machine-learning-engineer-part-1-the-data\/\">Learnings from a Machine Learning Engineer \u2014 Part 1: The Data<\/a> appeared first on <a href=\"https:\/\/towardsdatascience.com\/\">Towards Data Science<\/a>.<\/p>\n<\/div>\n<p> \t<BR><br \/>\n <BR><\/BR><br \/>\n    David Martin<br \/>\n \t<BR><br \/>\n<BR><\/BR><br \/>\n<a href=\"https:\/\/towardsdatascience.com\/learnings-from-a-machine-learning-engineer-part-1-the-data\/\">Go to original source<\/a><br \/>\n \t<BR><br \/>\n <BR><\/BR><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Learnings from a Machine Learning Engineer \u2014 Part 1: The Data It is said that in order for a machine learning model to be successful, you need to have good data. While this is true (and pretty much obvious), it is extremely difficult to define, build, and sustain good data. 
Let me share with you [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[62,83,1741,1322,70,909],"tags":[84,103,1740],"class_list":["post-1849","post","type-post","status-publish","format-standard","hentry","category-aimldsaimlds","category-data-science","category-datasets","category-image-classification","category-machine-learning","category-machine-learning-engineer","tag-data","tag-model","tag-part"],"_links":{"self":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/1849"}],"collection":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/comments?post=1849"}],"version-history":[{"count":0,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/1849\/revisions"}],"wp:attachment":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/media?parent=1849"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/categories?post=1849"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/tags?post=1849"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}