{"id":1851,"date":"2025-02-14T07:03:13","date_gmt":"2025-02-14T07:03:13","guid":{"rendered":"https:\/\/mailitics.com\/index.php\/2025\/02\/14\/learnings-from-a-machine-learning-engineer-part-5-the-training\/"},"modified":"2025-02-14T07:03:13","modified_gmt":"2025-02-14T07:03:13","slug":"learnings-from-a-machine-learning-engineer-part-5-the-training","status":"publish","type":"post","link":"https:\/\/mailitics.com\/index.php\/2025\/02\/14\/learnings-from-a-machine-learning-engineer-part-5-the-training\/","title":{"rendered":"Learnings from a Machine Learning Engineer \u2014 Part 5: The Training"},"content":{"rendered":"<div>\n<p class=\"wp-block-paragraph\" id=\"2f20\">In this fifth part of my series, I will outline the steps for creating a Docker container for training your image classification model, evaluating performance, and preparing for deployment.<\/p>\n<p class=\"wp-block-paragraph\" id=\"3852\">AI\/ML engineers would prefer to focus on model training and data engineering, but the reality is that we also need to understand the infrastructure and mechanics behind the scenes.<\/p>\n<p class=\"wp-block-paragraph\" id=\"517e\">I hope to share some tips, not only to get your training run running, but how to streamline the process in a cost efficient manner on cloud resources such as Kubernetes.<\/p>\n<p class=\"wp-block-paragraph\" id=\"20fb\">I will reference elements from my previous articles for getting the best model performance, so be sure to check out\u00a0<a href=\"https:\/\/towardsdatascience.com\/learnings-from-a-machine-learning-engineer-part-1-the-data\/\">Part 1<\/a>\u00a0and\u00a0<a href=\"https:\/\/towardsdatascience.com\/learnings-from-a-machine-learning-engineer-part-2-the-data-sets\/\">Part 2<\/a>\u00a0on the data sets, as well as\u00a0<a 
href=\"https:\/\/towardsdatascience.com\/learnings-from-a-machine-learning-engineer-part-3-the-evaluation\/\">Part 3<\/a>\u00a0and\u00a0<a href=\"https:\/\/towardsdatascience.com\/learnings-from-a-machine-learning-engineer-part-4-the-model\/\">Part 4<\/a>\u00a0on model evaluation.<\/p>\n<p class=\"wp-block-paragraph\" id=\"393f\">Here are the learnings that I will share with you, once we lay the groundwork on the infrastructure:<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">Building your Docker container<\/li>\n<li class=\"wp-block-list-item\">Executing your training run<\/li>\n<li class=\"wp-block-list-item\">Deploying your model<\/li>\n<\/ul>\n<h2 class=\"wp-block-heading\" id=\"f7d9\">Infrastructure overview<\/h2>\n<p class=\"wp-block-paragraph\" id=\"aa5f\">First, let me provide a brief description of the setup that I created, specifically around Kubernetes. Your setup may be entirely different, and that is just fine. I simply want to set the stage on the infrastructure so that the rest of the discussion makes sense.<\/p>\n<h3 class=\"wp-block-heading\" id=\"e86d\">Image management system<\/h3>\n<p class=\"wp-block-paragraph\" id=\"50ba\">This is a server you deploy that provides a user interface for your subject matter experts to label and evaluate images for the image classification application. The server can run as a pod on your Kubernetes cluster, but you may find that a dedicated server with faster disk works better.<\/p>\n<p class=\"wp-block-paragraph\" id=\"0350\">Image files are stored in a directory structure like the following, which is self-documenting and easily modified.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-\">Image_Library\/\n  - cats\/\n    - image1001.png\n  - dogs\/\n    - image2001.png<\/code><\/pre>\n<p class=\"wp-block-paragraph\" id=\"ff20\">Ideally, these files would reside on local server storage (instead of cloud or cluster storage) for better performance. 
The reason for this will become clear as we see what happens as the image library grows.<\/p>\n<h3 class=\"wp-block-heading\" id=\"b539\">Cloud storage<\/h3>\n<p class=\"wp-block-paragraph\" id=\"07ce\"><a href=\"https:\/\/towardsdatascience.com\/tag\/cloud-storage\/\" title=\"Cloud Storage\">Cloud Storage<\/a> offers a virtually limitless and convenient way to share files between systems. In this case, the image library on your management system could access the same files as your Kubernetes cluster or Docker engine.<\/p>\n<p class=\"wp-block-paragraph\" id=\"8ee6\">However, the downside of cloud storage is the latency to open a file. Your image library will have\u00a0<strong>thousands and thousands<\/strong>\u00a0of images, and the latency to read each file will have a significant impact on your training run time. Longer training runs mean more cost for using the expensive GPU processors!<\/p>\n<p class=\"wp-block-paragraph\" id=\"450a\">The way that I found to speed things up is to create a\u00a0<em>tar<\/em>\u00a0file of your image library on your management system and copy it to cloud storage. Even better would be to create multiple tar files\u00a0<strong>in parallel<\/strong>, each containing 10,000 to 20,000 images.<\/p>\n<p class=\"wp-block-paragraph\" id=\"8f19\">This way you only have network latency on a handful of files (which contain thousands of images once extracted) and you start your training run much sooner.<\/p>\n<h3 class=\"wp-block-heading\" id=\"d386\">Kubernetes or Docker engine<\/h3>\n<p class=\"wp-block-paragraph\" id=\"3b96\">A Kubernetes cluster, with proper configuration, will allow you to dynamically scale up\/down nodes, so you can perform your model training on GPU hardware as needed. 
Kubernetes is a rather heavy setup, and there are other container engines that will work.<\/p>\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p class=\"wp-block-paragraph\" id=\"a79b\">The technology options change constantly!<\/p>\n<\/blockquote>\n<p class=\"wp-block-paragraph\" id=\"3c5e\">The main idea is that you want to spin up the resources you need \u2014 for only as long as you need them \u2014 then scale down to reduce your time (and therefore cost) of running expensive GPU resources.<\/p>\n<p class=\"wp-block-paragraph\" id=\"16cb\">Once your GPU node is started and your <a href=\"https:\/\/towardsdatascience.com\/tag\/docker\/\" title=\"Docker\">Docker<\/a> container is running, you can extract the\u00a0<em>tar<\/em>\u00a0files above to\u00a0<strong>local<\/strong>\u00a0storage, such as an\u00a0<em>emptyDir<\/em>, on your node. The node typically has high-speed SSD disk, ideal for this type of workload. There is one caveat \u2014 the storage capacity on your node must be able to handle your image library.<\/p>\n<p class=\"wp-block-paragraph\" id=\"9f1c\">Assuming we are good, let\u2019s talk about building your Docker container so that you can train your model on your image library.<\/p>\n<h2 class=\"wp-block-heading\" id=\"3fd5\">Building your Docker container<\/h2>\n<p class=\"wp-block-paragraph\" id=\"6e4e\">Being able to execute a training run in a consistent manner lends itself perfectly to building a Docker container. You can \u201cpin\u201d the version of libraries so you know exactly how your scripts will run every time. You can version control your containers as well, and revert to a known good image in a pinch. 
What is really nice about Docker is you can run the container pretty much anywhere.<\/p>\n<p class=\"wp-block-paragraph\" id=\"7815\">The tradeoff when running in a container, especially with an <a href=\"https:\/\/towardsdatascience.com\/tag\/image-classification\/\" title=\"Image Classification\">Image Classification<\/a> model, is the speed of file storage. You can attach any number of volumes to your container, but they are usually\u00a0<em>network<\/em>\u00a0attached, so there is latency on each file read. This may not be a problem if you have a small number of files. But when dealing with hundreds of thousands of files like image data, that latency adds up!<\/p>\n<p class=\"wp-block-paragraph\" id=\"0209\">This is why using the\u00a0<em>tar<\/em>\u00a0file method outlined above can be beneficial.<\/p>\n<p class=\"wp-block-paragraph\" id=\"c1da\">Also, keep in mind that Docker containers could be terminated unexpectedly, so you should make sure to store important information outside the container, on cloud storage or a database. I\u2019ll show you how below.<\/p>\n<h3 class=\"wp-block-heading\" id=\"e557\">Dockerfile<\/h3>\n<p class=\"wp-block-paragraph\" id=\"e3a1\">Knowing that you will need to run on GPU hardware (here I will assume Nvidia), be sure to select the right base image for your Dockerfile, such as\u00a0<strong>nvidia\/cuda<\/strong>\u00a0with the \u201cdevel\u201d\u00a0flavor that contains the CUDA toolkit and libraries (the GPU driver itself is provided by the host node).<\/p>\n<p class=\"wp-block-paragraph\" id=\"cb91\">Next, you will add the script files to your container, along with a \u201cbatch\u201d script to coordinate the execution. 
Here is an example Dockerfile, and then I\u2019ll describe what each of the scripts will be doing.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">#####   Dockerfile   #####\nFROM nvidia\/cuda:12.8.0-devel-ubuntu24.04\n\n# Install system software\nRUN apt-get -y update &amp;&amp; apt-get -y upgrade\nRUN apt-get install -y python3-pip python3-dev\n\n# Setup python\nWORKDIR \/app\nCOPY requirements.txt .\nRUN python3 -m pip install --upgrade pip\nRUN python3 -m pip install -r requirements.txt\n\n# Python and batch scripts\nCOPY ExtractImageLibrary.py .\nCOPY Training.py .\nCOPY Evaluation.py .\nCOPY ScorePerformance.py .\nCOPY ExportModel.py .\nCOPY BulkIdentification.py .\nCOPY BatchControl.sh .\n\n# Allow for interactive shell\nCMD tail -f \/dev\/null<\/code><\/pre>\n<p class=\"wp-block-paragraph\" id=\"0cdd\">Dockerfiles are declarative, almost like a cookbook for building a small server \u2014 you know what you\u2019ll get every time. Python libraries benefit, too, from this declarative approach. Here is a sample\u00a0<em>requirements.txt<\/em>\u00a0file that loads the TensorFlow libraries with CUDA support for GPU acceleration.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">#####   requirements.txt   #####\nnumpy==1.26.3\npandas==2.1.4\nscipy==1.11.4\nkeras==2.15.0\ntensorflow[and-cuda]<\/code><\/pre>\n<h3 class=\"wp-block-heading\" id=\"3f93\">Extract Image Library script<\/h3>\n<p class=\"wp-block-paragraph\" id=\"5019\">In <a href=\"https:\/\/towardsdatascience.com\/tag\/kubernetes\/\" title=\"Kubernetes\">Kubernetes<\/a>, the Docker container can access local, high-speed storage on the physical node. This can be achieved via the\u00a0<em>emptyDir<\/em>\u00a0volume type. 
As mentioned before, this will only work if the local storage on your node can handle the size of your library.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">#####   sample 25GB emptyDir volume in Kubernetes   #####\ncontainers:\n  - name: training-container\n    volumeMounts:\n      - name: image-library\n        mountPath: \/mnt\/image-library\nvolumes:\n  - name: image-library\n    emptyDir:\n      sizeLimit: 25Gi<\/code><\/pre>\n<p class=\"wp-block-paragraph\" id=\"4189\">You would want to have another\u00a0<em>volumeMount<\/em>\u00a0to your cloud storage where you have the\u00a0<em>tar<\/em>\u00a0files. What this looks like will depend on your provider, or if you are using a persistent volume claim, so I won\u2019t go into detail here.<\/p>\n<p class=\"wp-block-paragraph\" id=\"22fa\">Now you can extract the\u00a0<em>tar<\/em>\u00a0files \u2014 ideally in parallel for an added performance boost \u2014 to the local mount point.<\/p>\n<h3 class=\"wp-block-heading\" id=\"37e5\">Training script<\/h3>\n<p class=\"wp-block-paragraph\" id=\"473b\">As AI\/ML engineers, the model training is where we want to spend most of our time.<\/p>\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p class=\"wp-block-paragraph\" id=\"7a11\">This is where the magic happens!<\/p>\n<\/blockquote>\n<p class=\"wp-block-paragraph\" id=\"8270\">With your image library now extracted, we can create our train-validation-test sets, load a pre-trained model or build a new one, fit the model, and save the results.<\/p>\n<p class=\"wp-block-paragraph\" id=\"3d63\">One key technique that has served me well is to load the most recently trained model as my base. 
I discuss this in more detail in\u00a0<a href=\"https:\/\/towardsdatascience.com\/learnings-from-a-machine-learning-engineer-part-4-the-model\/\">Part 4<\/a>\u00a0under \u201cFine tuning\u201d; it results in faster training time and significantly improved model performance.<\/p>\n<p class=\"wp-block-paragraph\" id=\"068a\">Be sure to take advantage of the local storage to checkpoint your model during training since the models are quite large and you are paying for the GPU even while it sits idle writing to disk.<\/p>\n<p class=\"wp-block-paragraph\" id=\"ab07\">This of course raises a concern about what happens if the Docker container dies part-way through the training. The risk is (hopefully) low from a cloud provider, and you may not want an incomplete training anyway. But if that does happen, you will at least want to understand\u00a0<strong>why<\/strong>, and this is where saving the main log file to cloud storage (described below) or to a package like MLflow comes in handy.<\/p>\n<h3 class=\"wp-block-heading\" id=\"96f5\">Evaluation script<\/h3>\n<p class=\"wp-block-paragraph\" id=\"db53\">After your training run has completed and you have taken proper precautions to save your work, it is time to see how well it performed.<\/p>\n<p class=\"wp-block-paragraph\" id=\"9fcd\">Normally this evaluation script will pick up on the model that just finished. But you may decide to point it at a previous model version through an interactive session. This is why I keep the script stand-alone.<\/p>\n<p class=\"wp-block-paragraph\" id=\"83c6\">Because it is a separate script, it will need to read the completed model from disk \u2014 ideally local disk for speed. 
I like having two separate scripts (training and evaluation), but you might find it better to combine these to avoid reloading the model.<\/p>\n<p class=\"wp-block-paragraph\" id=\"cc99\">Now that the model is loaded, the evaluation script should generate predictions on\u00a0<strong>every<\/strong>\u00a0image in the training, validation, test, and benchmark sets. I save the results as a\u00a0<strong>huge<\/strong>\u00a0matrix with the softmax confidence score for each class label. So, if there are 1,000 classes and 100,000 images, that\u2019s a table with 100 million scores!<\/p>\n<p class=\"wp-block-paragraph\" id=\"3a23\">I save these results in\u00a0<em>pickle<\/em>\u00a0files that are then used in the score generation next.<\/p>\n<h3 class=\"wp-block-heading\" id=\"a472\">Score generation script<\/h3>\n<p class=\"wp-block-paragraph\" id=\"a938\">Taking the matrix of scores produced by the evaluation script above, we can now create various metrics of model performance. Again, this process could be combined with the evaluation script above, but my preference is for independent scripts. For example, I might want to regenerate scores on previous training runs. 
See what works for you.<\/p>\n<p class=\"wp-block-paragraph\" id=\"f9e5\">Here are some of the\u00a0<em>sklearn<\/em>\u00a0functions that produce useful insights like F1, log loss, AUC-ROC, Matthews correlation coefficient.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">from sklearn.metrics import average_precision_score, classification_report\nfrom sklearn.metrics import log_loss, matthews_corrcoef, roc_auc_score<\/code><\/pre>\n<p class=\"wp-block-paragraph\" id=\"4c46\">Aside from these basic statistical analyses for each dataset (train, validation, test, and benchmark), it is also useful to identify:<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">Which\u00a0<strong>ground truth<\/strong>\u00a0labels get the most number of errors?<\/li>\n<li class=\"wp-block-list-item\">Which\u00a0<strong>predicted<\/strong>\u00a0labels get the most number of incorrect guesses?<\/li>\n<li class=\"wp-block-list-item\">How many\u00a0<strong>ground-truth-to-predicted<\/strong>\u00a0label pairs are there? 
In other words, which classes are easily confused?<\/li>\n<li class=\"wp-block-list-item\">What is the\u00a0<strong>accuracy<\/strong>\u00a0when applying a minimum softmax confidence score threshold?<\/li>\n<li class=\"wp-block-list-item\">What is the\u00a0<strong>error rate<\/strong>\u00a0above that softmax threshold?<\/li>\n<li class=\"wp-block-list-item\">For the \u201cdifficult\u201d benchmark sets, do you get a sufficiently\u00a0<strong>high<\/strong>\u00a0score?<\/li>\n<li class=\"wp-block-list-item\">For the \u201cout-of-scope\u201d benchmark sets, do you get a sufficiently\u00a0<strong>low<\/strong>\u00a0score?<\/li>\n<\/ul>\n<p class=\"wp-block-paragraph\" id=\"fde6\">As you can see, there are multiple calculations and it\u2019s not easy to come up with a single evaluation to decide if the trained model is good enough to be moved to production.<\/p>\n<p class=\"wp-block-paragraph\" id=\"9087\">In fact, for an image classification model, it is helpful to manually review the images that the model got wrong, as well as the ones that got a low softmax confidence score. Use the scores from this script to create a list of images to manually review, and then get a\u00a0<em>gut-feel<\/em>\u00a0for how well the model performs.<\/p>\n<p class=\"wp-block-paragraph\" id=\"cd2e\">Check out\u00a0<a href=\"https:\/\/towardsdatascience.com\/learnings-from-a-machine-learning-engineer-part-3-the-evaluation\/\">Part 3<\/a>\u00a0for more in-depth discussion on evaluation and scoring.<\/p>\n<h3 class=\"wp-block-heading\" id=\"9931\">Export script<\/h3>\n<p class=\"wp-block-paragraph\" id=\"947c\">All of the heavy lifting is done by this point. Since your Docker container will be shutdown soon, now is the time to copy the model artifacts to cloud storage and prepare them for being put to use.<\/p>\n<p class=\"wp-block-paragraph\" id=\"175d\">The example Python code snippet below is more geared to Keras and TensorFlow. 
This will take the trained model and export it as a\u00a0<em>saved_model<\/em>. Later, I will show how this is used by TensorFlow Serving in the\u00a0<strong>Deploy<\/strong>\u00a0section below.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">import os\nimport tensorflow as tf\n\n# Note: create_new_version_folder, copy_model_artifacts and keras_model_file\n# are defined elsewhere in the export script\n\n# Increment current version of model and create new directory\nnext_version_dir, version_number = create_new_version_folder()\n\n# Copy model artifacts to the new directory\ncopy_model_artifacts(next_version_dir)\n\n# Build the path of the directory for the model export\nsaved_model_dir = os.path.join(next_version_dir, str(version_number))\n\n# Save the model export for use with TensorFlow Serving\ntf.keras.backend.set_learning_phase(0)\nmodel = tf.keras.models.load_model(keras_model_file)\ntf.saved_model.save(model, export_dir=saved_model_dir)<\/code><\/pre>\n<p class=\"wp-block-paragraph\" id=\"c73b\">This script also copies the other training run artifacts such as the model evaluation results, score summaries, and log files generated from model training. Don\u2019t forget about your label map so you can give human-readable names to your classes!<\/p>\n<h3 class=\"wp-block-heading\" id=\"a3f6\">Bulk identification script<\/h3>\n<p class=\"wp-block-paragraph\" id=\"d109\">Your training run is complete, your model has been scored, and a new version is exported and ready to be served. Now is the time to use this latest model to help you identify unlabeled images.<\/p>\n<p class=\"wp-block-paragraph\" id=\"8c6d\">As I described in\u00a0<a href=\"https:\/\/towardsdatascience.com\/learnings-from-a-machine-learning-engineer-part-4-the-model\/\">Part 4<\/a>, you may have a collection of \u201cunknowns\u201d \u2014 really good pictures, but no idea what they are. Let your new model provide a best guess on these and record the results to a file or a database. Now you can create filters based on closest match and by high\/low scores. 
This allows your subject matter experts to leverage these filters to find new image classes, add to existing classes, or to remove images that have very low scores and are no good.<\/p>\n<p class=\"wp-block-paragraph\" id=\"798a\">By the way, I put this step inside the GPU container since you may have thousands of \u201cunknown\u201d images to process and the accelerated hardware will make light work of it. However, if you are not in a hurry, you could perform this step on a separate CPU node, and shutdown your GPU node sooner to save cost. This would especially make sense if your \u201cunknowns\u201d folder is on slower cloud storage.<\/p>\n<h3 class=\"wp-block-heading\" id=\"94ab\">Batch script<\/h3>\n<p class=\"wp-block-paragraph\" id=\"e316\">All of the scripts described above perform a specific task \u2014 from extracting your image library, executing model training, performing evaluation and scoring, exporting the model artifacts for deployment, and perhaps even bulk identification.<\/p>\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p class=\"wp-block-paragraph\" id=\"e4d6\">One script to rule them all<\/p>\n<\/blockquote>\n<p class=\"wp-block-paragraph\" id=\"fc6a\">To coordinate the entire show, this batch script gives you the entry point for your container and an easy way to trigger everything. Be sure to produce a log file in case you need to analyze any failures along the way. 
Also, be sure to write the log to your cloud storage in case the container dies unexpectedly.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">#!\/bin\/bash\n# Main batch control script\n\n# Redirect standard output and standard error to a log file\nexec &gt; \/cloud_storage\/batch-logfile.txt 2&gt;&amp;1\n\n# Run each step with the interpreter so the scripts do not need the executable bit\npython3 \/app\/ExtractImageLibrary.py\npython3 \/app\/Training.py\npython3 \/app\/Evaluation.py\npython3 \/app\/ScorePerformance.py\npython3 \/app\/ExportModel.py\npython3 \/app\/BulkIdentification.py<\/code><\/pre>\n<h2 class=\"wp-block-heading\" id=\"fb11\">Executing your training run<\/h2>\n<p class=\"wp-block-paragraph\" id=\"3fb5\">So, now it\u2019s time to put everything in motion\u2026<\/p>\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p class=\"wp-block-paragraph\" id=\"9d32\">Start your engines!<\/p>\n<\/blockquote>\n<p class=\"wp-block-paragraph\" id=\"d948\">Let\u2019s go through the steps to prepare your image library, fire up your Docker container to train your model, and then examine the results.<\/p>\n<h3 class=\"wp-block-heading\" id=\"6787\">Image library \u2018tar\u2019 files<\/h3>\n<p class=\"wp-block-paragraph\" id=\"dbb7\">Your image management system should now create a\u00a0<em>tar<\/em>\u00a0file backup of your data. Since\u00a0<em>tar<\/em>\u00a0is a single-threaded program, you will get a significant speed improvement by creating multiple tar files in parallel, each with a portion of your data.<\/p>\n<p class=\"wp-block-paragraph\" id=\"9c50\">Now these files can be copied to your shared cloud storage for the next step.<\/p>\n<h3 class=\"wp-block-heading\" id=\"6ae9\">Start Docker container<\/h3>\n<p class=\"wp-block-paragraph\" id=\"d455\">All the hard work you put into creating your container (described above) will be put to the test. 
If you are running Kubernetes, you can create a Job that will execute the\u00a0<em>BatchControl.sh<\/em>\u00a0script.<\/p>\n<p class=\"wp-block-paragraph\" id=\"b527\">Inside the Kubernetes Job definition, you can pass environment variables to adjust the execution of your script. For example, the batch size and number of epochs are set here and then pulled into your Python scripts, so you can alter the behavior without changing your code.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">#####   sample Job in Kubernetes   #####\ncontainers:\n  - name: training-job\n    env:\n      # Environment variable values must be strings, so quote the numbers\n      - name: BATCH_SIZE\n        value: \"50\"\n      - name: NUM_EPOCHS\n        value: \"30\"\n    command: [\"\/app\/BatchControl.sh\"]<\/code><\/pre>\n<p class=\"wp-block-paragraph\" id=\"27cd\">Once the Job is completed, be sure to verify that the GPU node properly scales back down to zero according to your scaling configuration in Kubernetes \u2014 you don\u2019t want to be saddled with a huge bill over a simple configuration error.<\/p>\n<h3 class=\"wp-block-heading\" id=\"a884\">Manually review results<\/h3>\n<p class=\"wp-block-paragraph\" id=\"0d40\">With the training run complete, you should now have model artifacts saved and can examine the performance. Look through the metrics, such as F1 and log loss, and benchmark accuracy for high softmax confidence scores.<\/p>\n<p class=\"wp-block-paragraph\" id=\"b6b5\">As mentioned earlier, the reports only tell part of the story. It is worth the time and effort to manually review the images that the model got wrong or where it produced a low confidence score.<\/p>\n<p class=\"wp-block-paragraph\" id=\"185d\">Don\u2019t forget about the bulk identification. 
Be sure to leverage these results to locate new images to fill out your data set, or to find new classes.<\/p>\n<h2 class=\"wp-block-heading\" id=\"92b8\">Deploying your model<\/h2>\n<p class=\"wp-block-paragraph\" id=\"94e1\">Once you have reviewed your model performance and are satisfied with the results, it is time to modify your TensorFlow Serving container to put the new model into production.<\/p>\n<p class=\"wp-block-paragraph\" id=\"a459\">TensorFlow Serving is available as a Docker container and provides a very quick and convenient way to serve your model. This container can listen and respond to API calls for your model.<\/p>\n<p class=\"wp-block-paragraph\" id=\"3f2f\">Let\u2019s say your new model is version 7, and your\u00a0<strong>Export<\/strong>\u00a0script (see above) has saved the model in your cloud share as\u00a0<em>\/image_application\/models\/007<\/em>. You can start the TensorFlow Serving container with that volume mount. In this example, the\u00a0<em>shareName<\/em>\u00a0points to the folder for version 007.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">#####   sample TensorFlow pod in Kubernetes   #####\ncontainers:\n  - name: tensorflow-serving\n    image: bitnami\/tensorflow-serving:2.18.0\n    ports:\n      - containerPort: 8501\n    env:\n      - name: TENSORFLOW_SERVING_MODEL_NAME\n        value: \"image_application\"\n    volumeMounts:\n      - name: models-subfolder\n        mountPath: \"\/bitnami\/model-data\"\n\nvolumes:\n  - name: models-subfolder\n    azureFile:\n      shareName: \"image_application\/models\/007\"<\/code><\/pre>\n<p class=\"wp-block-paragraph\" id=\"e446\">A subtle note here \u2014 the export script should create a sub-folder, named 007 (same as the base folder), with the saved model export. This may seem a little confusing, but TensorFlow Serving will mount this share folder as\u00a0<em>\/bitnami\/model-data<\/em>\u00a0and detect the numbered sub-folder inside it for the version to serve. 
This will allow you to query the API for the model version as well as the identification.<\/p>\n<h2 class=\"wp-block-heading\" id=\"43c5\">Conclusion<\/h2>\n<p class=\"wp-block-paragraph\" id=\"38be\">As I mentioned at the start of this article, this setup has worked for my situation. This is certainly not the only way to approach this challenge, and I invite you to customize your own solution.<\/p>\n<p class=\"wp-block-paragraph\" id=\"a893\">I wanted to share my hard-fought learnings as I embraced cloud services in Kubernetes, with the desire to keep costs under control. Of course, doing all this while maintaining a high level of model performance is an added challenge, but one that you can achieve.<\/p>\n<p class=\"wp-block-paragraph\" id=\"a086\">I hope I have provided enough information here to help you with your own endeavors. Happy learnings!<\/p>\n<p>The post <a href=\"https:\/\/towardsdatascience.com\/learnings-from-a-machine-learning-engineer-part-5-the-training\/\">Learnings from a Machine Learning Engineer \u2014 Part 5: The Training<\/a> appeared first on <a href=\"https:\/\/towardsdatascience.com\/\">Towards Data Science<\/a>.<\/p>\n<\/div>\n<p>David Martin<br \/>\n<a href=\"https:\/\/towardsdatascience.com\/learnings-from-a-machine-learning-engineer-part-5-the-training\/\">Go to original source<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Learnings from a Machine Learning Engineer \u2014 Part 5: The Training In this fifth part of my series, I will outline the steps for creating a Docker container for training your image classification model, evaluating performance, and preparing for deployment. 
AI\/ML engineers would prefer to focus on model training and data engineering, but the reality [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[62,1742,1082,1322,1743,70,909],"tags":[1740,319,163],"class_list":["post-1851","post","type-post","status-publish","format-standard","hentry","category-aimldsaimlds","category-cloud-storage","category-docker","category-image-classification","category-kubernetes","category-machine-learning","category-machine-learning-engineer","tag-part","tag-training","tag-your"],"_links":{"self":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/1851"}],"collection":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/comments?post=1851"}],"version-history":[{"count":0,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/1851\/revisions"}],"wp:attachment":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/media?parent=1851"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/categories?post=1851"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/tags?post=1851"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}