{"id":1850,"date":"2025-02-14T07:03:12","date_gmt":"2025-02-14T07:03:12","guid":{"rendered":"https:\/\/mailitics.com\/index.php\/2025\/02\/14\/learnings-from-a-machine-learning-engineer-part-3-the-evaluation\/"},"modified":"2025-02-14T07:03:12","modified_gmt":"2025-02-14T07:03:12","slug":"learnings-from-a-machine-learning-engineer-part-3-the-evaluation","status":"publish","type":"post","link":"https:\/\/mailitics.com\/index.php\/2025\/02\/14\/learnings-from-a-machine-learning-engineer-part-3-the-evaluation\/","title":{"rendered":"Learnings from a Machine Learning Engineer \u2014 Part 3: The Evaluation"},"content":{"rendered":"<p>    Learnings from a Machine Learning Engineer \u2014 Part 3: The Evaluation<br \/>\n \t<BR><br \/>\n<BR><\/BR><br \/>\n    <!-- no image --><br \/>\n \t<BR><br \/>\n<BR><\/BR><\/p>\n<div>\n<p class=\"wp-block-paragraph\" id=\"59e9\">In this third part of my series, I will explore the evaluation process which is a critical piece that will lead to a cleaner data set and elevate your model performance. We will see the difference between evaluation of a\u00a0<strong>trained<\/strong>\u00a0model (one not yet in production), and evaluation of a\u00a0<strong>deployed<\/strong>\u00a0model (one making real-world predictions).<\/p>\n<p class=\"wp-block-paragraph\" id=\"34ca\">In\u00a0<a href=\"https:\/\/towardsdatascience.com\/learnings-from-a-machine-learning-engineer-part-1-the-data\/\">Part 1<\/a>, I discussed the process of labelling your image data that you use in your <a href=\"https:\/\/towardsdatascience.com\/tag\/image-classification\/\" title=\"Image Classification\">Image Classification<\/a> project. I showed how to define \u201cgood\u201d images and create sub-classes. 
In\u00a0<a href=\"https:\/\/towardsdatascience.com\/learnings-from-a-machine-learning-engineer-part-2-the-data-sets\/\">Part 2<\/a>, I went over various data sets, beyond the usual train-validation-test sets, such as benchmark sets, plus how to handle synthetic data and duplicate images.<\/p>\n<h2 class=\"wp-block-heading\" id=\"954f\"><strong>Evaluation of the trained model<\/strong><\/h2>\n<p class=\"wp-block-paragraph\" id=\"7481\">As machine learning engineers, we look at accuracy, F1, log loss, and other metrics to decide if a model is ready to move to production. These are all important measures, but in my experience, these scores can be deceiving, especially as the number of classes grows.<\/p>\n<p class=\"wp-block-paragraph\" id=\"c6a3\">Although it can be time consuming, I find it very important to manually review the images that the model gets\u00a0<strong>wrong<\/strong>, as well as the images that the model gives a\u00a0<strong>low<\/strong>\u00a0softmax \u201cconfidence\u201d score to. This means adding a step immediately after your training run completes to calculate scores for\u00a0<strong>all<\/strong>\u00a0images \u2014 training, validation, test, and the benchmark sets. You only need to bring up for manual review the ones that the model had problems with. This should only be a small percentage of the total number of images. See the Double-check process below.<\/p>\n<p class=\"wp-block-paragraph\" id=\"291a\">During the manual evaluation, put yourself in a \u201c<strong>training mindset<\/strong>\u201d to ensure that the labelling standards you set up in <a href=\"https:\/\/towardsdatascience.com\/learnings-from-a-machine-learning-engineer-part-1-the-data\/\">Part 1<\/a> are being followed. 
Ask yourself:<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">\u201cIs this a good image?\u201d Is the subject front and center, and can you clearly see all the features?<\/li>\n<li class=\"wp-block-list-item\">\u201cIs this the correct label?\u201d Don\u2019t be surprised if you find wrong labels.<\/li>\n<\/ul>\n<p class=\"wp-block-paragraph\" id=\"f95c\">You can either remove the bad images or fix the labels if they are wrong. Otherwise you can keep them in the data set and force the model to do better next time. Other questions I ask are:<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">\u201cWhy did the model get this wrong?\u201d<\/li>\n<li class=\"wp-block-list-item\">\u201cWhy did this image get a low score?\u201d<\/li>\n<li class=\"wp-block-list-item\">\u201cWhat is it about the image that caused confusion?\u201d<\/li>\n<\/ul>\n<p class=\"wp-block-paragraph\" id=\"7537\">Sometimes the answer has nothing to do with\u00a0<strong>that<\/strong>\u00a0specific image. Frequently, it has to do with the\u00a0<strong>other<\/strong>\u00a0images, either in the ground truth class or in the predicted class. It is worth the effort to Double-check all images in both sets if you see a consistently bad guess. Again, don\u2019t be surprised if you find poor images or wrong labels.<\/p>\n<h2 class=\"wp-block-heading\" id=\"c097\"><strong>Weighted evaluation<\/strong><\/h2>\n<p class=\"wp-block-paragraph\" id=\"7d67\">When doing the evaluation of the trained model (above), we apply a lot of subjective analysis \u2014 \u201cWhy did the model get this wrong?\u201d and \u201cIs this a good image?\u201d From these, you may only get a\u00a0<strong>gut feeling<\/strong>.<\/p>\n<p class=\"wp-block-paragraph\" id=\"1250\">Frequently, I will decide to hold off moving a model forward to production based on that gut feel. But how can you justify to your manager that you want to hit the brakes? 
This is where a more\u00a0<strong>objective<\/strong>\u00a0analysis comes in: a weighted average of the softmax \u201cconfidence\u201d scores.<\/p>\n<p class=\"wp-block-paragraph\" id=\"7438\">In order to apply a weighted evaluation, we need to identify sets of classes that deserve adjustments to the score. Here is where I create a list of \u201ccommonly confused\u201d classes.<\/p>\n<h2 class=\"wp-block-heading\" id=\"6750\"><strong>Commonly confused classes<\/strong><\/h2>\n<p class=\"wp-block-paragraph\" id=\"115e\">Certain animals at our zoo can easily be mistaken for one another. For example, African elephants and Asian elephants have different ear shapes. If your model gets these two mixed up, that is not as bad as guessing a giraffe! So perhaps you give partial credit here. You and your subject matter experts (SMEs) can come up with a list of these pairs and a weighted adjustment for each.<\/p>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" loading=\"lazy\" data-dominant-color=\"8c7d70\" data-has-transparency=\"false\" style=\"--dominant-color: #8c7d70;\" decoding=\"async\" width=\"1024\" height=\"683\" src=\"https:\/\/i0.wp.com\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/0_LQbV7OZ0jn6Gto-4-1024x683.webp?resize=1024%2C683&#038;ssl=1\" alt=\"\" class=\"wp-image-597865 not-transparent\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/0_LQbV7OZ0jn6Gto-4-1024x683.webp 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/0_LQbV7OZ0jn6Gto-4-300x200.webp 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/0_LQbV7OZ0jn6Gto-4-768x512.webp 768w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/0_LQbV7OZ0jn6Gto-4.webp 1400w\" sizes=\"(max-width: 1024px) 100vw, 1024px\"><figcaption class=\"wp-element-caption\">Photo by\u00a0<a href=\"https:\/\/unsplash.com\/@mattbango\" target=\"_blank\" rel=\"noreferrer noopener\">Matt Bango<\/a>\u00a0on\u00a0<a 
href=\"https:\/\/unsplash.com\/\" target=\"_blank\" rel=\"noreferrer noopener\">Unsplash<\/a><\/figcaption><\/figure>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" loading=\"lazy\" data-dominant-color=\"775530\" data-has-transparency=\"false\" style=\"--dominant-color: #775530;\" decoding=\"async\" width=\"683\" height=\"1024\" src=\"https:\/\/i0.wp.com\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/0_zeR7paNPqGtsH3NQ-683x1024.webp?resize=683%2C1024&#038;ssl=1\" alt=\"\" class=\"wp-image-597866 not-transparent\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/0_zeR7paNPqGtsH3NQ-683x1024.webp 683w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/0_zeR7paNPqGtsH3NQ-200x300.webp 200w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/0_zeR7paNPqGtsH3NQ-768x1152.webp 768w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/0_zeR7paNPqGtsH3NQ-1024x1536.webp 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/0_zeR7paNPqGtsH3NQ-1365x2048.webp 1365w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/0_zeR7paNPqGtsH3NQ.webp 1400w\" sizes=\"(max-width: 683px) 100vw, 683px\"><figcaption class=\"wp-element-caption\">Photo by\u00a0<a href=\"https:\/\/unsplash.com\/@matkrizmanich\" target=\"_blank\" rel=\"noreferrer noopener\">Mathew Krizmanich<\/a>\u00a0on\u00a0<a href=\"https:\/\/unsplash.com\/\" target=\"_blank\" rel=\"noreferrer noopener\">Unsplash<\/a><\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\" id=\"3de2\">This weight can be factored into a modified cross-entropy loss function in the equation below. The back half of this equation will reduce the impact of being wrong for specific pairs of ground truth and prediction by using the \u201cweight\u201d function as a lookup. 
By default, the weighted adjustment would be 1 for all pairings, and the commonly confused classes would get something like 0.5.<\/p>\n<p class=\"has-text-align-left wp-block-paragraph\" id=\"67b3\">In other words, it\u2019s better to be unsure (have a\u00a0<strong>lower<\/strong>\u00a0confidence score) when you are wrong, compared to being super confident and wrong.<\/p>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" data-dominant-color=\"000000\" data-has-transparency=\"true\" style=\"--dominant-color: #000000;\" loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"95\" src=\"https:\/\/i0.wp.com\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1_Fx-AxiysOE4AL08IzUqu_Q-1024x95.webp?resize=1024%2C95&#038;ssl=1\" alt=\"\" class=\"wp-image-597867 has-transparency\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1_Fx-AxiysOE4AL08IzUqu_Q-1024x95.webp 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1_Fx-AxiysOE4AL08IzUqu_Q-300x28.webp 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1_Fx-AxiysOE4AL08IzUqu_Q-768x71.webp 768w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1_Fx-AxiysOE4AL08IzUqu_Q.webp 1400w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\"><figcaption class=\"wp-element-caption\">Modified cross-entropy loss function, image by author<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\" id=\"9e3a\">Once this weighted log loss is calculated, I can compare to previous training runs to see if the new model is ready for production.<\/p>\n<h2 class=\"wp-block-heading\" id=\"a813\"><strong>Confidence threshold report<\/strong><\/h2>\n<p class=\"wp-block-paragraph\" id=\"a781\">Another valuable measure that incorporates the confidence threshold (in my example, 95) is to report on accuracy and false positive rates. 
Recall that when we apply the confidence threshold before presenting results, we help reduce false positives from being shown to the end user.<\/p>\n<p class=\"wp-block-paragraph\" id=\"0d45\">In this table, we look at the breakdown of \u201ctrue positive above 95\u201d for each data set. We get a sense that when a \u201cgood\u201d picture comes through (like the ones from our train-validation-test set) it is very likely to surpass the threshold, thus the user is \u201chappy\u201d with the outcome. Conversely, the \u201cfalse positive above 95\u201d is extremely low for good pictures, thus only a small number of our users will be \u201csad\u201d about the results.<\/p>\n<figure class=\"wp-block-image size-full\"><img data-recalc-dims=\"1\" data-dominant-color=\"dfdfdf\" data-has-transparency=\"false\" style=\"--dominant-color: #dfdfdf;\" loading=\"lazy\" decoding=\"async\" width=\"579\" height=\"227\" src=\"https:\/\/i0.wp.com\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1_WFmtWDLncUIQe_TXZLWtow.webp?resize=579%2C227&#038;ssl=1\" alt=\"\" class=\"wp-image-597868 not-transparent\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1_WFmtWDLncUIQe_TXZLWtow.webp 579w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1_WFmtWDLncUIQe_TXZLWtow-300x118.webp 300w\" sizes=\"auto, (max-width: 579px) 100vw, 579px\"><figcaption class=\"wp-element-caption\">Example Confidence Threshold Report, image by author<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\" id=\"f46e\">We expect the train-validation-test set results to be exceptional since our data is curated. So, as long as people take \u201cgood\u201d pictures, the model should do very well. 
But to get a sense of how it does on extreme situations, let\u2019s take a look at our benchmarks.<\/p>\n<p class=\"wp-block-paragraph\" id=\"8cff\">The \u201cdifficult\u201d benchmark has more modest true positive and false positive rates, which reflects the fact that the images are more challenging. These values are much easier to compare across training runs, so that lets me set a min\/max target. So for example, if I target a minimum of 80% for true positive, and maximum of 5% for false positive on this benchmark, then I can feel confident moving this to production.<\/p>\n<p class=\"wp-block-paragraph\" id=\"06ea\">The \u201cout-of-scope\u201d benchmark has no true positive rate because\u00a0<strong>none<\/strong>\u00a0of the images belong to any class the model can identify. Remember, we picked things like a bag of popcorn, etc., that are not zoo animals, so there cannot be any true positives. But we do get a false positive rate, which means the model gave a confident score to that bag of popcorn as some animal. 
And if we set a target maximum of 10% for this benchmark, then we may not want to move it to production.<\/p>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" data-dominant-color=\"8b675f\" data-has-transparency=\"false\" style=\"--dominant-color: #8b675f;\" loading=\"lazy\" decoding=\"async\" width=\"576\" height=\"1024\" src=\"https:\/\/i0.wp.com\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/0_TAT5BkpzkdJFTkF5-576x1024.webp?resize=576%2C1024&#038;ssl=1\" alt=\"\" class=\"wp-image-597869 not-transparent\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/0_TAT5BkpzkdJFTkF5-576x1024.webp 576w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/0_TAT5BkpzkdJFTkF5-169x300.webp 169w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/0_TAT5BkpzkdJFTkF5-768x1365.webp 768w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/0_TAT5BkpzkdJFTkF5-864x1536.webp 864w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/0_TAT5BkpzkdJFTkF5-1152x2048.webp 1152w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/0_TAT5BkpzkdJFTkF5.webp 1400w\" sizes=\"auto, (max-width: 576px) 100vw, 576px\"><figcaption class=\"wp-element-caption\">Photo by\u00a0<a href=\"https:\/\/unsplash.com\/@linusmimietz\" target=\"_blank\" rel=\"noreferrer noopener\">Linus Mimietz<\/a>\u00a0on\u00a0<a href=\"https:\/\/unsplash.com\/\" target=\"_blank\" rel=\"noreferrer noopener\">Unsplash<\/a><\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\" id=\"5e65\">Right now, you may be thinking, \u201cWell, what animal did it pick for the bag of popcorn?\u201d Excellent question! 
Now you understand the importance of doing a manual review of the images that get bad results.<\/p>\n<h2 class=\"wp-block-heading\" id=\"152f\"><strong>Evaluation of the deployed model<\/strong><\/h2>\n<p class=\"wp-block-paragraph\" id=\"35d2\">The evaluation that I described above applies to a model immediately after\u00a0<strong>training<\/strong>. Now, you want to evaluate how your model is doing in the\u00a0<strong>real world<\/strong>. The process is similar, but requires you to shift to a \u201c<strong>production mindset<\/strong>\u201d and ask yourself, \u201cDid the model get this correct?\u201d, \u201cShould it have gotten this correct?\u201d, and \u201cDid we tell the user the right thing?\u201d<\/p>\n<p class=\"wp-block-paragraph\" id=\"987e\">So, imagine that you are logging in for the morning \u2014 after sipping on your\u00a0<a href=\"https:\/\/medium.com\/@dmartin0409\/cold-brew-coffee-0aabd53a1f3e\">cold brew coffee<\/a>, of course \u2014 and are presented with 500 images of different animals that your zoo guests took yesterday. Your job is to determine how satisfied the guests were with using your model to identify the zoo animals.<\/p>\n<p class=\"wp-block-paragraph\" id=\"f78f\">We apply a threshold to the softmax \u201cconfidence\u201d score of each image before presenting results. Above the threshold, we tell the guest what the model predicted. I\u2019ll call this the \u201chappy path\u201d. Below the threshold is the \u201csad path\u201d, where we ask them to try again.<\/p>\n<p class=\"wp-block-paragraph\" id=\"1097\">Your review interface will first show you all the \u201chappy path\u201d images one at a time. This is where you ask yourself, \u201cDid we get this right?\u201d Hopefully, yes!<\/p>\n<p class=\"wp-block-paragraph\" id=\"a700\">But if not, this is where things get tricky. 
So now you have to ask, \u201cWhy not?\u201d Here are some things that it could be:<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">\u201cBad\u201d picture \u2014 Poor lighting, bad angle, zoomed out, etc.; refer to your labelling standards.<\/li>\n<li class=\"wp-block-list-item\">Out-of-scope \u2014 It\u2019s a zoo animal, but unfortunately one that isn\u2019t found in\u00a0<strong>this<\/strong>\u00a0zoo. Maybe it belongs to another zoo (your guest likes to travel and try out your app). Consider adding these to your data set.<\/li>\n<li class=\"wp-block-list-item\">Out-of-scope \u2014 It\u2019s not a zoo animal. It could be an animal in your zoo, but not one typically\u00a0<em>contained<\/em>\u00a0there, like a neighborhood sparrow or mallard duck. This might be a candidate to add.<\/li>\n<li class=\"wp-block-list-item\">Out-of-scope \u2014 It\u2019s something found in the zoo. A zoo usually has interesting trees and shrubs, so people might try to identify those. Another candidate to add.<\/li>\n<li class=\"wp-block-list-item\">Prankster \u2014 Completely out-of-scope. Because people like to play with technology, there\u2019s the possibility you have a prankster who took a picture of a bag of popcorn, or a soft drink cup, or even a selfie. These are hard to prevent, but hopefully they get a low enough score (below the threshold) that the model does not identify them as a zoo animal. If you see enough of a pattern in these, consider creating a class with special handling on the front-end.<\/li>\n<\/ul>\n<p class=\"wp-block-paragraph\" id=\"4647\">After reviewing the \u201chappy path\u201d images, you move on to the \u201csad path\u201d images \u2014 the ones that got a low confidence score and the app gave a \u201csorry, try again\u201d message. This time you ask yourself, \u201c<em>Should<\/em>\u00a0the model have given this image a higher score?\u201d which would have put it in the \u201chappy path\u201d. 
If so, then you want to ensure these images are added to the training set so next time it will do better. But most of the time, the low score reflects many of the \u201cbad\u201d or out-of-scope situations mentioned above.<\/p>\n<p class=\"wp-block-paragraph\" id=\"6259\">Perhaps your model performance is suffering and it has nothing to do with your model. Maybe it is the way your users are interacting with the app. Keep an eye out for non-technical problems and share your observations with the rest of your team. For example:<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">Are your users using the application in the ways you expected?<\/li>\n<li class=\"wp-block-list-item\">Are they not following the instructions?<\/li>\n<li class=\"wp-block-list-item\">Do the instructions need to be stated more clearly?<\/li>\n<li class=\"wp-block-list-item\">Is there anything you can do to improve the experience?<\/li>\n<\/ul>\n<h2 class=\"wp-block-heading\" id=\"2eb8\"><strong>Collect statistics and new images<\/strong><\/h2>\n<p class=\"wp-block-paragraph\" id=\"c492\">Both of the manual evaluations above open a gold mine of data. 
So, be sure to collect these statistics and feed them into a dashboard \u2014 your manager and your future self will thank you!<\/p>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" data-dominant-color=\"b9c0c6\" data-has-transparency=\"false\" style=\"--dominant-color: #b9c0c6;\" loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"683\" src=\"https:\/\/i0.wp.com\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/0_ZvjYSGNOUvODS38c-1024x683.webp?resize=1024%2C683&#038;ssl=1\" alt=\"\" class=\"wp-image-597870 not-transparent\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/0_ZvjYSGNOUvODS38c-1024x683.webp 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/0_ZvjYSGNOUvODS38c-300x200.webp 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/0_ZvjYSGNOUvODS38c-768x512.webp 768w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/0_ZvjYSGNOUvODS38c.webp 1400w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\"><figcaption class=\"wp-element-caption\">Photo by\u00a0<a href=\"https:\/\/unsplash.com\/@justin_morgan\" target=\"_blank\" rel=\"noreferrer noopener\">Justin Morgan<\/a>\u00a0on\u00a0<a href=\"https:\/\/unsplash.com\/\" target=\"_blank\" rel=\"noreferrer noopener\">Unsplash<\/a><\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\" id=\"b47d\">Keep track of these stats and generate reports that you and your team can reference:<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">How often is the model being called?<\/li>\n<li class=\"wp-block-list-item\">At what times of the day and on what days of the week is it used?<\/li>\n<li class=\"wp-block-list-item\">Are your system resources able to handle the peak load?<\/li>\n<li class=\"wp-block-list-item\">What classes are the most common?<\/li>\n<li class=\"wp-block-list-item\">After evaluation, what is the accuracy for each class?<\/li>\n<li class=\"wp-block-list-item\">What is the breakdown for 
confidence scores?<\/li>\n<li class=\"wp-block-list-item\">How many scores are above and below the confidence threshold?<\/li>\n<\/ul>\n<p class=\"wp-block-paragraph\" id=\"d79f\">The single best thing you get from a deployed model is the additional real-world images! You can add these new images to improve coverage of your existing zoo animals. But more importantly, they give you insight into\u00a0<strong>other<\/strong>\u00a0classes to add. For example, let\u2019s say people enjoy taking a picture of the large walrus statue at the gate. Some of these may make sense to incorporate into your data set to provide a better user experience.<\/p>\n<p class=\"wp-block-paragraph\" id=\"a89e\">Creating a new class, like the walrus statue, is not a huge effort, and it avoids the false positive responses. It would be more embarrassing to identify a walrus statue as an elephant! As for the prankster and the bag of popcorn, you can configure your front-end to quietly handle these. You might even get creative and have fun with it like, \u201cThank you for visiting the food court.\u201d<\/p>\n<h2 class=\"wp-block-heading\" id=\"d564\"><strong>Double-check process<\/strong><\/h2>\n<p class=\"wp-block-paragraph\" id=\"73f5\">It is a good idea to double-check your image set when you suspect there may be problems with your data. I\u2019m not suggesting a top-to-bottom check, because that would be a monumental effort! Rather, check the specific classes that you suspect contain bad data that is degrading your model performance.<\/p>\n<p class=\"wp-block-paragraph\" id=\"e5fe\">Immediately after my training run completes, I have a script that will use this new model to generate predictions for my\u00a0<strong>entire<\/strong>\u00a0data set. 
When this is complete, it will take the list of incorrect identifications, as well as the low-scoring predictions, and automatically feed that list into the Double-check interface.<\/p>\n<p class=\"wp-block-paragraph\" id=\"eb5b\">This interface will show, one at a time, the image in question, alongside an example image of the ground truth and an example image of what the model predicted. I can visually compare the three, side-by-side. The first thing I do is ensure the original image is a \u201cgood\u201d picture, following my labelling standards. Then I check if the ground-truth label is indeed correct, or if there is something that made the model think it was the predicted label.<\/p>\n<p class=\"wp-block-paragraph\" id=\"d652\">At this point I can:<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">Remove the original image if the image quality is poor.<\/li>\n<li class=\"wp-block-list-item\">Relabel the image if it belongs in a different class.<\/li>\n<\/ul>\n<p class=\"wp-block-paragraph\" id=\"8a7f\">During this manual evaluation, you might notice dozens of the same wrong prediction. Ask yourself why the model made this mistake when the images seem perfectly fine. The answer may be some incorrect labels on images in the ground truth class, or even in the predicted class!<\/p>\n<p class=\"wp-block-paragraph\" id=\"6638\">Don\u2019t hesitate to add those classes and sub-classes back into the Double-check interface and step through them all. You may have 100\u2013200 pictures to review, but there is a good chance that one or two of the images will stand out as being the culprit.<\/p>\n<h2 class=\"wp-block-heading\" id=\"2634\"><strong>Up next\u2026<\/strong><\/h2>\n<p class=\"wp-block-paragraph\" id=\"7b73\">With a different mindset for a trained model versus a deployed model, we can now evaluate performance to decide which models are ready for production, and how well a production model is going to serve the public. 
This relies on a solid Double-check process and a critical eye on your data. And beyond the \u201cgut feel\u201d of your model, we can rely on the benchmark scores to support us.<\/p>\n<p class=\"wp-block-paragraph\" id=\"fed3\">In\u00a0<a href=\"https:\/\/towardsdatascience.com\/learnings-from-a-machine-learning-engineer-part-4-the-model\/\">Part 4<\/a>, we kick off the training run, but there are some subtle techniques to get the most out of the process and even ways to leverage throw-away models to expand your library image data.<a href=\"https:\/\/medium.com\/tag\/machine-learning?source=post_page-----e4a8dbb035e0---------------------------------------\"><\/a><\/p>\n<p>The post <a href=\"https:\/\/towardsdatascience.com\/learnings-from-a-machine-learning-engineer-part-3-the-evaluation\/\">Learnings from a Machine Learning Engineer \u2014 Part 3: The Evaluation<\/a> appeared first on <a href=\"https:\/\/towardsdatascience.com\/\">Towards Data Science<\/a>.<\/p>\n<\/div>\n<p> \t<BR><br \/>\n <BR><\/BR><br \/>\n    David Martin<br \/>\n \t<BR><br \/>\n<BR><\/BR><br \/>\n<a href=\"https:\/\/towardsdatascience.com\/learnings-from-a-machine-learning-engineer-part-3-the-evaluation\/\">Go to original source<\/a><br \/>\n \t<BR><br \/>\n <BR><\/BR><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Learnings from a Machine Learning Engineer \u2014 Part 3: The Evaluation In this third part of my series, I will explore the evaluation process which is a critical piece that will lead to a cleaner data set and elevate your model performance. 
We will see the difference between evaluation of a\u00a0trained\u00a0model (one not yet in [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[62,166,1322,70,909,1500,1343],"tags":[768,1005,103],"class_list":["post-1850","post","type-post","status-publish","format-standard","hentry","category-aimldsaimlds","category-hands-on-tutorials","category-image-classification","category-machine-learning","category-machine-learning-engineer","category-model-evaluation","category-process-improvement","tag-evaluation","tag-images","tag-model"],"_links":{"self":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/1850"}],"collection":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/comments?post=1850"}],"version-history":[{"count":0,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/1850\/revisions"}],"wp:attachment":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/media?parent=1850"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/categories?post=1850"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/tags?post=1850"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}