{"id":938,"date":"2025-01-03T07:02:51","date_gmt":"2025-01-03T07:02:51","guid":{"rendered":"https:\/\/mailitics.com\/index.php\/2025\/01\/03\/mastering-sensor-fusion-color-image-obstacle-detection-with-kitti-data-part-2-5118be4e92ee\/"},"modified":"2025-01-03T07:02:51","modified_gmt":"2025-01-03T07:02:51","slug":"mastering-sensor-fusion-color-image-obstacle-detection-with-kitti-data-part-2-5118be4e92ee","status":"publish","type":"post","link":"https:\/\/mailitics.com\/index.php\/2025\/01\/03\/mastering-sensor-fusion-color-image-obstacle-detection-with-kitti-data-part-2-5118be4e92ee\/","title":{"rendered":"Mastering Sensor Fusion: Color Image Obstacle Detection with KITTI Data\u200a\u2014\u200aPart 2"},"content":{"rendered":"<p>    Mastering Sensor Fusion: Color Image Obstacle Detection with KITTI Data\u200a\u2014\u200aPart 2<br \/>\n \t<BR><br \/>\n<BR><\/BR><br \/>\n    <!-- no image --><br \/>\n \t<BR><br \/>\n<BR><\/BR><\/p>\n<div>\n<h3>Mastering Sensor Fusion: Color Image Obstacle Detection with KITTI Data\u200a\u2014\u200aPart\u00a02<\/h3>\n<h4>How to use color image data for object detection in the context of obstacle detection<\/h4>\n<p>The concept of sensor fusion is a decision-making mechanism that can be applied to different problems and using different modalities. We mentioned in the previous post that in this Medium blog series, we will analyze the concept of sensor fusion for obstacle detection with both Lidar and color images. If you haven\u2019t read that post yet, which is related to obstacle detection with Lidar data, here is the link to\u00a0it:<\/p>\n<p><a href=\"https:\/\/medium.com\/@eroltak\/sensor-fusion-kitti-lidar-based-obstacle-detection-part-1-9c5f4bc8d497\">Sensor Fusion\u200a\u2014\u200aKITTI\u200a\u2014\u200a\u2018Lidar-based Obstacle Detection\u2019\u200a\u2014\u200aPart-1<\/a><\/p>\n<p>This post is a continuation, and in this section, I will get deep into the obstacle detection problem on color images. 
In the next and last post of the series <strong><em>(I hope it will be available soon!)<\/em><\/strong>, we will investigate sensor fusion using both Lidar and color\u00a0images.<\/p>\n<p>But before moving on to that step, let\u2019s continue with our single-modality study. Just as we previously performed obstacle detection using only Lidar data, here we will perform obstacle detection using only color\u00a0images.<\/p>\n<p>As in the first post, we will use the KITTI dataset again. For information about which data needs to be downloaded from KITTI [1], please check the previous post, which lists the data, labels, and calibration files required for each data\u00a0type.<\/p>\n<p>For those who do not have much time: we are analyzing the 3D Object Detection problem within the scope of the KITTI Vision Benchmark Suite, and we will work on color images obtained with the \u201cleft camera\u201d throughout this\u00a0post.<\/p>\n<p>The first topic of this post is the analysis of the images obtained with the \u201cleft camera\u201d. The next topic is 2D image-based object detectors. While these detectors have a long history and come in different types, such as two-stage detectors, single-stage detectors, and Vision-Language Models, we will analyze two popular techniques: YoloWorld [2], which is an open-vocabulary object detector, and YoloV8 [3], which is a single-stage object detector. Before comparing these detectors, I will give a hands-on example of how to fine-tune YoloV8 for the KITTI object detection problem. 
Afterward, we will compare the models and complete this post by talking about the slicing-aided object detection framework, SAHI [4], which addresses the problem of detecting small objects that we will run into\u00a0later.<\/p>\n<p>So let\u2019s start with the data analysis\u00a0part!<\/p>\n<h4>2D Colored Image Dataset Analysis of\u00a0KITTI<\/h4>\n<p>The KITTI 3D Object Detection dataset includes 7481 training and 7518 testing images. Each training image has a label file that includes the object coordinates in the image plane. These label files are in \u201c.txt\u201d format and are organized line by line: each row represents one labeled object in the relevant image and consists of a total of 16 columns (if you are interested in these columns, I highly recommend you take a look at the previous article in this series). Roughly speaking, the first column indicates the type of the object, and the values in the 5th to 8th columns indicate the location of that object in the image coordinate system. Let me share a sample image and its label file as\u00a0follows.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2A6EhgPneE0jM6CXRcJ3mtZA.png?ssl=1\"><figcaption>A sample 2D colored image (Image taken from\u00a0KITTI)<\/figcaption><\/figure>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/719\/1%2AVtQ6Yu4Rneg0cFRLGhQnrA.png?ssl=1\"><figcaption>The corresponding label file of the above image (Label file taken from\u00a0KITTI)<\/figcaption><\/figure>\n<p>As we can see, a lot of cars and three pedestrians are labeled in the image. Before getting into the deeper analysis, let me share the object types in KITTI. KITTI has 9 different classes in its label files. 
These are \u201cCar\u201d, \u201cTruck\u201d, \u201cVan\u201d, \u201cTram\u201d, \u201cPedestrian\u201d, \u201cCyclist\u201d, \u201cPerson_sitting\u201d, \u201cMisc\u201d, and \u201cDontCare\u201d.<\/p>\n<p>While some object types are obvious, \u201cMisc\u201d and \u201cDontCare\u201d may seem a little bit confusing. \u201cMisc\u201d stands for objects that do not fit into the main categories above (car, pedestrian, cyclist, etc.). They could be traffic cones, small objects, unknown vehicles, or things that resemble objects but cannot be clearly classified. On the other hand, \u201cDontCare\u201d refers to regions that we should not take into consideration.<\/p>\n<p>Now that we know the classes, let\u2019s visualize the distribution of the main\u00a0classes.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/475\/1%2AbRZS9lG0EqPOTmVAhGdCHQ.png?ssl=1\"><figcaption>The distribution of main classes in KITTI colored\u00a0images<\/figcaption><\/figure>\n<p>As can be seen from the distribution graph, the classes are imbalanced in terms of the number of examples they contain. For example, the number of examples in the \u201cCar\u201d class is much higher than the average number of examples per class, while the situation is exactly the opposite for the \u201cPerson_sitting\u201d class.<\/p>\n<p>Here I would like to open a parenthesis about these numbers, especially from a statistical learning perspective. Such imbalanced distributions among classes may cause statistical learning methods to underperform or be biased toward some classes. For readers who want to deal with this subject, here are some important keywords that should come to mind in such a situation: sub-sampling, regularization, bias-variance problem, weighted or focal loss, etc. 
(If you would like a post from me about these concepts, please let me know in the comments.)<\/p>\n<p>Another topic we will investigate in the analysis section is the size of the objects. By size here, I mean the dimensions of the relevant objects in pixels in the image coordinate system. This issue may be overlooked at first, or it may not be obvious what benefit measuring it brings. However, the average bounding box size of a certain object type may be inherently much smaller than the box size of other object classes. In this case, we either cannot detect that object type (which happens most of the time) or we classify it as a different object type (rarely). So let\u2019s analyze the size distribution of each class as\u00a0follows.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/275\/1%2AiKQ94evVHS0oCNyXh2JpJw.png?ssl=1\"><figcaption>The bounding box size of each class in KITTI\u00a0dataset<\/figcaption><\/figure>\n<p>If we keep the \u201cMisc\u201d and \u201cDontCare\u201d object types separate, there is a marked difference between the bounding box sizes of the \u201cPedestrian\u201d, \u201cPerson_sitting\u201d and \u201cCyclist\u201d types and the sizes of the other object types. This is a red flag that we may need to make a special effort when identifying these classes. In this context, I will give you some tips in the following sections under a dedicated subheading on slicing-aided object detection!<\/p>\n<h4>2D Image-based Object\u00a0Detector<\/h4>\n<p>2D image-based object detectors are computer vision models designed to identify and locate objects within images. These models can be broadly categorized into two-stage and single-stage detectors. In <strong>two-stage detectors<\/strong>, the model first generates potential object proposals through a region proposal network (RPN) or similar mechanisms. 
Then, in the second stage, these proposals are refined and classified into specific object categories. A popular example of this type is <strong>Faster R-CNN<\/strong> [5]. This approach is known for its high accuracy as it performs a detailed evaluation of potential objects, but it tends to be slower due to the two-step process, which can be a limitation for real-time applications.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/628\/1%2AnaQOJglEzzwOqDzgUCAcUQ.png?ssl=1\"><figcaption>The system architecture of Faster R-CNN (Image taken from\u00a0[5])<\/figcaption><\/figure>\n<p>In contrast, <strong>single-stage detectors<\/strong> aim to detect objects in a single pass by directly predicting both object locations and classifications for all potential bounding boxes. This approach is faster and more efficient, making it ideal for real-time detection applications. Examples include <strong>YOLO (You Only Look Once)<\/strong> [3] and <strong>SSD (Single Shot Multibox Detector)<\/strong> [6]. These models divide the image into a grid and predict bounding boxes and class probabilities for each grid cell, resulting in a more streamlined and faster detection process. Although single-stage detectors may trade off some accuracy for speed, they are widely used in applications requiring real-time performance, such as autonomous driving and video surveillance.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2Aizkv7ZrtzXf10TMuoOls5g.png?ssl=1\"><figcaption>The system architecture of YoloV8 (Image taken from\u00a0[3])<\/figcaption><\/figure>\n<p>After this introductory information, let\u2019s dive into the object detectors that we apply to our problem: the first one is YoloWorld [2] and the second one is YoloV8 [3]. Here you may wonder why we are analyzing two different Yolo models. 
The main point here is that YoloV8 is a conventional single-stage detector trained on a closed set of classes, while YoloWorld belongs to a family of detectors that has been studied a lot in recent years: open-vocabulary detectors, which are not tied to a closed-set classification model. This means that, in theory, such open-vocabulary models are capable of detecting any kind of\u00a0object!<\/p>\n<h4>YoloWorld<\/h4>\n<p>YoloWorld is one of the promising studies in the open-vocabulary object detection era. But what exactly is open-vocabulary object detection?<\/p>\n<p>To understand the concept of open vocabulary, let\u2019s take a step back and look at the core idea behind traditional object detectors. The basic building blocks of training a model can be presented as\u00a0follows.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/971\/1%2ASFTvWc1Rfsr3bygsV2C_pg.png?ssl=1\"><figcaption>A typical model training\u00a0pipeline<\/figcaption><\/figure>\n<p>In traditional machine learning, a model is trained on <em>n<\/em> different classes, and its performance is evaluated only on those <em>n<\/em> classes. For example, let&#8217;s consider a class that wasn\u2019t included during training, such as &#8220;Bird.&#8221; If we give an image of a bird to the trained model, it will not be able to detect the \u201cBird\u201d in the image. Since &#8220;Bird&#8221; is not part of the training dataset, the model cannot recognize it as a new class or generalize to understand that it\u2019s something outside its training. In short, traditional models cannot identify or handle classes they haven\u2019t seen during training.<\/p>\n<p>On the other hand, open-vocabulary object detection overcomes this limitation by enabling models to detect objects beyond the classes they were explicitly trained on. 
This is achieved by leveraging visual-text representations, where models are trained with paired image-text data, such as \u201ca photo of a cat\u201d or \u201ca person riding a bicycle.\u201d Instead of relying solely on fixed class labels, these models learn a more general understanding of objects through their semantic descriptions.<\/p>\n<p>As a result, when presented with a new object class, like \u201cBird,\u201d the model can recognize and classify it by associating the visual features of the object with the textual descriptions, even if the class was not part of its training data. This capability is particularly useful in real-world applications where the variety of objects is vast, and it\u2019s impractical to train models on every possible category.<\/p>\n<p>So how does this mechanism work? The real magic here is the use of visual and textual information together. So let\u2019s first see the system architecture of YoloWorld and then analyze the core components one by\u00a0one.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2AKmWQ8_NGaKQb6kfze6H28g.png?ssl=1\"><figcaption>The system architecture of YoloWorld (Image taken from YoloWorld [2])<\/figcaption><\/figure>\n<p>We can analyze the model from general to specific as follows. YoloWorld takes an image <em>{I}<\/em> and the corresponding texts <em>{T}<\/em> as input, then outputs predicted bounding boxes <em>{Bk}<\/em> and object embeddings <em>{ek}<\/em>.<\/p>\n<p><em>{T}<\/em> is fed into the pre-trained CLIP [7] model to be converted into vocabulary embeddings. On the other hand, the YOLO backbone, which is a visual information encoder, takes <em>{I}<\/em> and extracts multi-scale image features. At this point, the two input types have their own modality-specific embeddings, produced by different encoders. 
However, the \u201cVision-Language PAN\u201d takes both embeddings and creates a kind of multimodal embedding using a cross-modality fusion approach.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/593\/1%2APYERkSS37LnylXyW5Ip0IA.png?ssl=1\"><figcaption>Vision-Language PAN layer in YoloWorld [2]<\/figcaption><\/figure>\n<p>Let\u2019s go over this layer step-by-step. First, <em>{Cx}<\/em> are the multi-scale visual features. On the top, we have the textual embeddings <em>{Tc}<\/em>. Each visual feature <em>Cx<\/em> has shape H\u00d7W\u00d7D and the textual embeddings <em>Tc<\/em> have shape C\u00d7D. Multiplying a visual feature (after reshaping) by the transposed textual embeddings gives, for each spatial location, an attention score vector of shape\u00a01\u00d7C.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/276\/1%2AiAE5AZ_1mtJiY4dHPRBoLA.png?ssl=1\"><figcaption>The formula of text-to-image feature fusion in T-CSPLayer<\/figcaption><\/figure>\n<p>Then, by taking the maximum score over the C textual embeddings, normalizing it, and multiplying the visual feature by the resulting attention weight, we obtain the updated visual\u00a0feature.<\/p>\n<p>These updated visual features are then fed into the \u201cI-Pooling Attention\u201d layer, which applies max pooling to obtain 3&#215;3 regions on the multi-scale features, 27 patch tokens in total. These patch tokens are given to a multi-head attention mechanism, similar to the one in the Transformer architecture, to update the image-aware textual embeddings as\u00a0follows.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/517\/1%2AcfUDUky_gvmOCh68r5GV8A.png?ssl=1\"><figcaption>The formula of I-Pooling Attention layer<\/figcaption><\/figure>\n<p>After these processes, the outputs are formed by two heads. 
The first one is the \u201cText Contrastive Head\u201d and the other one is the \u201cBounding Box Head\u201d. The overall loss function used to train the model can be presented as\u00a0follows.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2AR-vjnpQGVfgaW8X28B_oXA.png?ssl=1\"><figcaption>The loss function of YoloWorld<\/figcaption><\/figure>\n<p>Now let\u2019s get into the applied section to see the results WITHOUT doing any fine-tuning. After all, we expect this model to make correct detections even if it is not trained specifically with our KITTI classes, right\u00a0\ud83d\ude0e<\/p>\n<p>As in our previous blog post, you can find the complete files, code, etc. by following the GitHub link, which I provide at the\u00a0bottom.<\/p>\n<p>The first step is initializing the model and defining the classes we are interested in for the KITTI\u00a0problem.<\/p>\n<pre>from ultralytics import YOLOWorld<br><br># Load the pre-trained YOLO-World checkpoint (no KITTI fine-tuning)<br>yoloWorld_model = YOLOWorld(\"yolov8x-worldv2.pt\")<br><br># Define class names to filter<br>target_classes = [\"car\", \"van\", \"truck\", \"pedestrian\", \"person_sitting\", \"cyclist\", \"tram\"]<br>class_map = {idx: class_name for idx, class_name in enumerate(target_classes)}<br><br># Set the classes of interest as the model's vocabulary<br>yoloWorld_model.set_classes(target_classes)<\/pre>\n<p>The next step is loading a sample image and visualizing its G.T. boxes.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/553\/1%2ASOkKJDjrk-JLSYRIdGm5_w.png?ssl=1\"><figcaption>A sample image from KITTI\u00a0dataset<\/figcaption><\/figure>\n<p>The G.T. bounding boxes for our sample are as follows. More specifically, the G.T. label includes 9 cars and 3 pedestrians! 
(such a complex\u00a0scene)<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/553\/1%2A18P2Kc4GXSJSh44N2-06rA.png?ssl=1\"><figcaption>The G.T. Bounding Boxes of the sample\u00a0image<\/figcaption><\/figure>\n<p>Before getting into the YoloWorld prediction, let me reiterate that we did not fine-tune the YoloWorld model; we took the model as is. The prediction can be done as\u00a0follows.<\/p>\n<pre>## 2. Perform detection and detection list arrangement<br>det_boxes, det_class_ids, det_scores = utils.perform_detection_and_nms(yoloWorld_model, sample_image, det_conf= 0.35, nms_thresh= 0.25)<\/pre>\n<p>The output of the prediction is as\u00a0follows.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/553\/1%2Ay2HfP6LasEA3KKBtGgpqLg.png?ssl=1\"><figcaption>The prediction of off-the-shelf YoloWorld model for the sample\u00a0image<\/figcaption><\/figure>\n<p>Regarding the prediction, we can see that 6 objects of the \u201ccar\u201d class and 1 object of the \u201cvan\u201d class were found. The evaluation of the output can be done as\u00a0follows.<\/p>\n<pre>## 4. Evaluate the predicted detections with G.T. detections<br>print(\"# predicted boxes: {}\".format(len(pred_detections)))<br>print(\"# G.T. 
boxes: {}\".format(len(gt_detections)))<br>tp, fp, fn, tp_boxes, fp_boxes, fn_boxes = utils.evaluate_detections(pred_detections, gt_detections, iou_threshold=0.40)<br>pred_precision, pred_recall = utils.calculate_precision_recall(tp, fp, fn)<br>print(f\"TP: {tp}, FP: {fp}, FN: {fn}\")<br>print(f\"Precision: {pred_precision}, Recall: {pred_recall}\")<br><\/pre>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/342\/1%2AuV5KuGea0DM5hvvk1yjXtw.png?ssl=1\"><figcaption>The evaluation metric score for the prediction with the YoloWorld model<\/figcaption><\/figure>\n<p>As we can see, 1 object is detected but misclassified (the actual class is \u201cCar\u201d but it is classified as \u201cVan\u201d). In total, 6 G.T. boxes are left unmatched. This makes our recall score 0.5 (6 of the 12 G.T. boxes) and our precision score\u00a0~0.86 (6 of the 7 predictions).<\/p>\n<p>Let me share some other predictions with you as examples.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2ApwDxuLB4Cm1cIi1cbSW19Q.png?ssl=1\"><figcaption>Some other examples for YoloWorld model<\/figcaption><\/figure>\n<p>The first row shows the predictions, and the second one shows the G.T. boxes and classes. On the left side, we can see a pedestrian who walks from left to right. YoloWorld predicted the object perfectly in terms of bounding box dimensions, but the class is predicted as \u201cPerson_sitting\u201d while the G.T. label is \u201cPedestrian\u201d. This is why precision and recall are both 0.0\u00a0:\/<\/p>\n<p>On the right side, YoloWorld predicts 2 \u201cCars\u201d while G.T. has only 1 \u201cCar\u201d. 
For this reason, the precision score is 0.5 and the recall score is\u00a01.0.<\/p>\n<p>So for now, we have seen a couple of YoloWorld predictions, and the model looks acceptable as an initial step, doesn\u2019t\u00a0it?<\/p>\n<p>We have to admit that, for such a safety-critical application area, the model definitely needs improvement. However, it should not be forgotten that we were able to achieve some adequate results even without fine-tuning here!<\/p>\n<p>That requirement leads us to our next step: the traditional model, YoloV8, and its fine-tuning. Let\u2019s\u00a0go!<\/p>\n<h4>YoloV8<\/h4>\n<p>YOLOv8 (You Only Look Once version 8) is one of the most advanced versions in the YOLO family of object detection models, designed to push the boundaries of speed, accuracy, and flexibility in computer vision tasks. Building on the success of its predecessors, YOLOv8 integrates features such as anchor-free detection mechanisms and decoupled detection heads to streamline the object detection pipeline. These enhancements reduce computational overhead while improving the detection of objects across varying scales and complex scenarios. Moreover, YOLOv8 supports multiple tasks, allowing it to perform not just object detection but also image segmentation and classification. This versatility makes it a go-to solution for diverse real-world applications, from autonomous vehicles and surveillance to medical imaging and retail analytics.<\/p>\n<p>What sets YOLOv8 apart is its focus on modern deep learning trends, such as optimized training pipelines, state-of-the-art loss functions, and model scaling strategies. The inclusion of anchor-free detection eliminates the need for predefined anchor boxes, making the model more robust to varying object shapes and reducing the chances of false negatives. 
The decoupled head design separately optimizes classification and regression tasks, improving overall detection accuracy. In addition, YOLOv8\u2019s lightweight architecture ensures faster inference times without compromising on performance, making it suitable for deployment on edge devices. Overall, YOLOv8 continues the YOLO legacy by providing a highly efficient and accurate solution for a wide range of computer vision\u00a0tasks.<\/p>\n<p>For more in-depth analysis and implementation details, refer\u00a0to:<\/p>\n<ol>\n<li>Yolov8 Medium post: <a href=\"https:\/\/abintimilsina.medium.com\/yolov8-architecture-explained-a5e90a560ce5\">https:\/\/abintimilsina.medium.com\/yolov8-architecture-explained-a5e90a560ce5<\/a>\n<\/li>\n<li>An exploration article: <a href=\"https:\/\/arxiv.org\/pdf\/2408.15857\">https:\/\/arxiv.org\/pdf\/2408.15857<\/a>\n<\/li>\n<\/ol>\n<p>But before getting into the next step, where we\u2019re going to fine-tune the Yolo model for our problem, let\u2019s visualize the output of the off-the-shelf YoloV8 model on our sample image. 
(Of course, the off-the-shelf model doesn\u2019t cover all the classes of our problem, but at least it can detect the cars and pedestrians that we need for our sample\u00a0image)<\/p>\n<pre>from ultralytics import YOLO<br><br>## Load the off-the-shelf yolo model and get the class name mapping dict<br>off_the_shelf_model = YOLO(\"yolov8m.pt\")<br>off_the_shelf_class_names = off_the_shelf_model.names<br><br>## then make a prediction as we did before<br>det_boxes, det_class_ids, det_scores = utils.perform_detection_and_nms(off_the_shelf_model, sample_image, det_conf= 0.35, nms_thresh= 0.25)<\/pre>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/553\/1%2Abz4I2wyhbH3PMF5Psp4fVQ.png?ssl=1\"><figcaption>The predicted output of the off-the-shelf YoloV8-m\u00a0model<\/figcaption><\/figure>\n<p>The off-the-shelf model predicts 8 cars, which is almost okay! 
Only 1 car and 1 pedestrian are missing, but that is also okay for\u00a0now.<\/p>\n<p>Then let\u2019s try to fine-tune that off-the-shelf model to adapt it to our\u00a0problem.<\/p>\n<h4>YoloV8 Fine-Tuning<\/h4>\n<p>In this section, we will fine-tune the off-the-shelf YoloV8-m model to fit our problem well. But before that, we need to convert the label files into the proper format. I know it\u2019s not the most fun part, but it\u2019s mandatory before we can see the progress bar in the fine-tuning stage. To do this, I prepared the following function, which is available in my GitHub repo like all the other components.<\/p>\n<pre>def convert_label_format(label_path, image_path, class_names=None):<br>    \"\"\"<br>    Converts a custom label format into YOLO label format. <br><br>    This function takes a path to a label file and the corresponding image file, processes the label information, <br>    and outputs the annotations in YOLO format. YOLO format represents bounding boxes with normalized values <br>    relative to the image dimensions and includes a class ID.<br><br>    Key Parameters:<br>    - `label_path` (str): Path to the label file in custom format.<br>    - `image_path` (str): Path to the corresponding image file.<br>    - `class_names` (list or set, optional): A collection of class names. If not provided, <br>    the function will create a set of unique class names encountered in the labels.<br><br>    Processing Details:<br>    1. Reads the image dimensions to normalize bounding box coordinates.<br>    2. Filters out labels that do not match predefined classes (e.g., car, pedestrian, etc.).<br>    3. Converts bounding box coordinates from the custom format to YOLO's normalized center-x, center-y, width, and height format.<br>    4. 
Updates or utilizes the provided `class_names` to assign a class ID for each annotation.<br><br>    Returns:<br>    - `yolo_lines` (list): List of strings, each in YOLO format (&lt;class_id&gt; &lt;x_center&gt; &lt;y_center&gt; &lt;width&gt; &lt;height&gt;).<br>    - `class_names` (set or list): Updated set or list of unique class names.<br><br>    Notes:<br>    - The function assumes specific indices (4 to 7) for bounding box coordinates in the input label file.<br>    - Normalization is based on the dimensions of the input image.<br>    - Class filtering is limited to a predefined set of relevant classes.<br>    \"\"\"<\/pre>\n<p>A sample label file after this operation will look as\u00a0follows.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/660\/1%2An_3tYInBeb5DL_4OM0QxQw.png?ssl=1\"><figcaption>A Yolo oriented label file for the sample\u00a0image<\/figcaption><\/figure>\n<p>The first &lt;int&gt; is the class id, and the following 4 &lt;float&gt; values are the normalized box coordinates (center-x, center-y, width, height). After that, we need to create a \u201c.yaml\u201d file that describes the location of the label files, the split of training and validation sets, and the corresponding images. 
As before, I prepared the required function for this\u00a0too.<\/p>\n<pre>def create_data_yaml(images_path, labels_path, base_path, train_ratio=0.8):<br>    \"\"\"<br>    Creates a dataset directory structure with train and validation splits for YOLO format.<br><br>    This function organizes image and label files into separate training and validation directories,<br>    converts label files to the YOLO format, and ensures the output structure adheres to YOLO conventions.<br><br>    Key Parameters:<br>    - `images_path` (str): Path to the directory containing the image files.<br>    - `labels_path` (str): Path to the directory containing the label files in custom format.<br>    - `base_path` (str): Base directory where the train\/val split directories will be created.<br>    - `train_ratio` (float, optional): Ratio of images to allocate for training (default is 0.8).<br><br>    Processing Details:<br>    1. **Dataset Splitting**:<br>    - Reads all image files from `images_path` and splits them into training and validation sets <br>        based on `train_ratio`.<br>    2. **Directory Creation**:<br>    - Creates the necessary directory structure for train\/val splits, including `images` and `labels` subdirectories.<br>    3. **Label Conversion**:<br>    - Uses `convert_label_format` to convert label files to YOLO format.<br>    - Updates a set of unique class names encountered in the labels.<br>    4. 
**File Organization**:<br>    - Copies image files into their respective directories (train or val).<br>    - Writes the converted YOLO labels into the appropriate `labels` subdirectory.<br><br>    Returns:<br>    - None (operates directly on the file system to organize the dataset).<br><br>    Notes:<br>    - The function assumes labels correspond to image files with the same name (except for the file extension).<br>    - Handles label conversion using a predefined set of class names, ensuring consistency.<br>    - Uses `shutil.copy` for images to avoid removing original files.<br><br>    Dependencies:<br>    - Requires `convert_label_format` to be implemented for proper label conversion.<br>    - Relies on `os`, `shutil`, `Path`, and `tqdm` libraries.<br><br>    Usage Example:<br>    ```python<br>    create_data_yaml(<br>        images_path='\/path\/to\/images',<br>        labels_path='\/path\/to\/labels',<br>        base_path='\/output\/dataset',<br>        train_ratio=0.8<br>    )<br>    \"\"\"<\/pre>\n<p>Then, it\u2019s time to fine-tune our\u00a0model!<\/p>\n<pre>def train_yolo_world(data_yaml_path, epochs=100):<br>    \"\"\"<br>    Trains a YOLOv8 model on a custom dataset.<br><br>    This function leverages the YOLOv8 framework to fine-tune a pretrained model using a specified dataset<br>    and training configuration.<br><br>    Key Parameters:<br>    - `data_yaml_path` (str): Path to the YAML file containing dataset configuration (e.g., paths to train\/val splits, class names).<br>    - `epochs` (int, optional): Number of training epochs (default is 100).<br><br>    Processing Details:<br>    1. **Model Initialization**:<br>    - Loads the YOLOv8 medium-sized model (`yolov8m.pt`) as a base model for training.<br>    2. 
**Training Configuration**:<br>    - Defines training hyperparameters including image size, batch size, device, number of workers, and early stopping (`patience`).<br>    - Results are saved to a project directory (`yolo_runs`) with a specific run name (`fine_tuning`).<br>    3. **Training Execution**:<br>    - Initiates the training process and tracks metrics such as loss and mAP.<br><br>    Returns:<br>    - `results`: Training results, including metrics for evaluation and performance tracking.<br><br>    Notes:<br>    - Assumes that the YOLOv8 framework is properly installed and accessible via `YOLO`.<br>    - The dataset YAML file must include paths to the training and validation datasets, as well as class names.<br><br>    Dependencies:<br>    - Requires the `YOLO` class from the YOLOv8 framework.<br><br>    Usage Example:<br>    ```python<br>    results = train_yolo_world(<br>        data_yaml_path='path\/to\/data.yaml',<br>        epochs=50<br>    )<br>    print(results)<br>    \"\"\"<\/pre>\n<p>At this stage, I used the default fine-tuning parameters, which are defined here: <a href=\"https:\/\/docs.ultralytics.com\/models\/yolov8\/#can-i-benchmark-yolov8-models-for-performance\">https:\/\/docs.ultralytics.com\/models\/yolov8\/#can-i-benchmark-yolov8-models-for-performance<\/a><\/p>\n<p>But I <strong>HIGHLY <\/strong>encourage you to try other hyper-parameters like learning rate, optimizer, etc. 
Since those parameters directly affect the output performance of the model, they are\u00a0crucial.<\/p>\n<p>Anyway, let\u2019s keep it simple for now and jump into the performance of our fine-tuned model on KITTI\u2019s main\u00a0classes.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2A_Qy7gsAIaIh-HdcMuhah7Q.png?ssl=1\"><figcaption>The output performance of the fine-tuned YoloV8-m model on the validation set<\/figcaption><\/figure>\n<p>As we can see, the overall mAP50 is 0.835, which is good for a first attempt. But the \u201cPerson_sitting\u201d and \u201cPedestrian\u201d classes, which are important in autonomous driving, fall short of that mark, with mAP50 scores of 0.61 and 0.75. There could be a couple of reasons behind this: their bounding boxes are relatively smaller than those of the other classes, and these classes may have fewer samples. Of course, other classes like \u201cCyclist\u201d and \u201cTram\u201d also have only a few images, so yeah, it\u2019s kind of a black box. If you want me to investigate this behavior in depth, please let me know in the comments. It would be a pleasure for\u00a0me!<\/p>\n<p>As we did in the previous sections, let me share the result on the sample image again, this time for the fine-tuned\u00a0model.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/553\/1%2AulD3Kt8t_om9fnCIWe6uYg.png?ssl=1\"><figcaption>The output of the fine-tuned model on the sample\u00a0image<\/figcaption><\/figure>\n<p>Now, the fine-tuned model detected 2 pedestrians, 1 cyclist, and 9 cars! That is nearly everything in that sample image. 
This detection yields the following\u00a0score:<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/341\/1%2AjQiemK2sy5wb5YSywxBMPw.png?ssl=1\"><figcaption>The evaluation metric score for the prediction with the fine-tuned model<\/figcaption><\/figure>\n<p>It\u2019s much better than the off-the-shelf model (even though we haven\u2019t done much hyper-parameter searching!). Now let me share another image with\u00a0you.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2AzmRH0-MsXM7XTAeBh9DEWQ.png?ssl=1\"><figcaption>Another sample image (raw version, image taken from KITTI\u00a0[1])<\/figcaption><\/figure>\n<p>Now, in that scene, there is a car on the left side. But wait! There are some other objects around it, but they are too small to see\u00a0easily.<\/p>\n<p>Let\u2019s check our fancy fine-tuned model\u00a0output!<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/553\/1%2AsrEz_7kVEQjTTZu0JMB6Eg.png?ssl=1\"><figcaption>The output of the fine-tuned model on the second sample\u00a0image<\/figcaption><\/figure>\n<p>OMG! It only detects the car and a cyclist who is right behind it. How about the others to the right of the cyclist? Yeah, this situation takes us to our next and final topic: detecting small-sized objects in 2D images. Let\u2019s\u00a0go.<\/p>\n<h4>Dealing with Small-sized Objects<\/h4>\n<p>KITTI images are roughly 1242 pixels wide and 375 pixels high. Resizing them just before feeding them to the model makes them 640 by 640. 
Here is what an image looks like right before it is fed to the\u00a0model.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2AHtdyrt-d79UBlGR79pYYtw.png?ssl=1\"><figcaption>The left one is the original raw image, the right one is the resized version of it (Images are taken from KITTI\u00a0[1])<\/figcaption><\/figure>\n<p>We can see that some objects are severely distorted. In addition, we can observe that some objects farther from the camera become even smaller. There is a method we can use to overcome both of these problems, as well as to detect objects in very high-resolution images: \u201cSAHI\u201d [4], Slicing Aided Hyper Inference. Its core concept is simple: it divides images into smaller, manageable slices, performs object detection on each slice, and merges the results seamlessly.<\/p>\n<p>As can be expected, running the object detection model repeatedly on multiple slices and combining the results requires significant computational power and time. However, SAHI is able to overcome this with its optimizations and memory usage! 
In addition, its compatibility with many different object detectors makes it suitable for practical work.<\/p>\n<p>Here are some links to understand SAHI in depth and observe its performance enhancements for different problems:<\/p>\n<p>\u2014 SAHI Paper: <a href=\"https:\/\/arxiv.org\/pdf\/2202.06934\">https:\/\/arxiv.org\/pdf\/2202.06934<\/a><\/p>\n<p>\u2014 SAHI GitHub: <a href=\"https:\/\/github.com\/obss\/sahi\">https:\/\/github.com\/obss\/sahi<\/a><\/p>\n<p>Now let\u2019s visualize our second sample image with SAHI-based inference:<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2ASjCEL-9EHqfmiOW9M3BkXQ.png?ssl=1\"><figcaption>The output of the fine-tuned model with SAHI on the second sample\u00a0image<\/figcaption><\/figure>\n<p>Wow! We can see that several cars and a cyclist are found perfectly! If you face the same kind of problem, please check the paper and the implementation!<\/p>\n<h4>Conclusion<\/h4>\n<p>Well, now we have finally come to the end. During this process, we first tried to solve Lidar-based obstacle detection with an unsupervised learning algorithm in our first article. In this article, we used different object detection algorithms: the \u201copen-vocabulary\u201d YoloWorld, the more traditional \u201cclosed-set\u201d object detection model YoloV8, and the \u201cfine-tuned\u201d version of YoloV8, which is more suitable for the KITTI problem. In addition, we obtained some results with the help of \u201cSAHI\u201d regarding the detection of small-sized objects.<\/p>\n<p>Of course, each topic we mentioned is an active research area, and many researchers are still trying to achieve better results in these areas. 
Here, we tried to produce solutions from the perspective of an applied scientist.<\/p>\n<p>If there is a topic you want me to talk about more, or if you want a completely different article about some parts, please let me know in the comments.<\/p>\n<h4>What\u2019s next?<\/h4>\n<p>For now, let\u2019s meet in the next publication, which will be the last article of the series, where we will detect obstacles using Lidar and color images at the same\u00a0time.<\/p>\n<blockquote><p><strong>Any comments, error fixes, or improvements are\u00a0welcome!<\/strong><\/p><\/blockquote>\n<blockquote><p><strong><em>Thank you all and I wish you healthy\u00a0days.<\/em><\/strong><\/p><\/blockquote>\n<hr>\n<p><strong><em>GitHub link<\/em><\/strong>: <a href=\"https:\/\/github.com\/ErolCitak\/KITTI-Sensor-Fusion\/tree\/main\/color_image_based_object_detection\">https:\/\/github.com\/ErolCitak\/KITTI-Sensor-Fusion\/tree\/main\/color_image_based_object_detection<\/a><\/p>\n<p><strong>References:<\/strong><\/p>\n<p>[1] <a href=\"https:\/\/www.cvlibs.net\/datasets\/kitti\/\">https:\/\/www.cvlibs.net\/datasets\/kitti\/<\/a><\/p>\n<p>[2] <a href=\"https:\/\/docs.ultralytics.com\/models\/yolo-world\/\">https:\/\/docs.ultralytics.com\/models\/yolo-world\/<\/a><\/p>\n<p>[3] <a href=\"https:\/\/docs.ultralytics.com\/models\/yolov8\/\">https:\/\/docs.ultralytics.com\/models\/yolov8\/<\/a><\/p>\n<p>[4] <a href=\"https:\/\/github.com\/obss\/sahi\">https:\/\/github.com\/obss\/sahi<\/a><\/p>\n<p>[5] <a href=\"https:\/\/arxiv.org\/abs\/1506.01497\">https:\/\/arxiv.org\/abs\/1506.01497<\/a><\/p>\n<p>[6] <a href=\"https:\/\/arxiv.org\/abs\/1512.02325\">https:\/\/arxiv.org\/abs\/1512.02325<\/a><\/p>\n<p>[7] <a 
href=\"https:\/\/openai.com\/index\/clip\/\">https:\/\/openai.com\/index\/clip\/<\/a><\/p>\n<h3>Disclaimer<\/h3>\n<p>The images used in this blog series are taken from the KITTI dataset for education and research purposes. If you want to use it for similar purposes, you must go to the relevant website, approve the intended use there, and use the citations defined by the benchmark creators as\u00a0follows.<\/p>\n<p>For the <strong>stereo 2012<\/strong>, <strong>flow 2012<\/strong>, <strong>odometry<\/strong>, <strong>object detection,<\/strong> or <strong>tracking benchmarks<\/strong>, please cite:<br \/>@inproceedings{<a href=\"https:\/\/www.cvlibs.net\/publications\/Geiger2012CVPR.pdf\">Geiger2012CVPR<\/a>,<br \/>author = {<a href=\"https:\/\/www.cvlibs.net\/\">Andreas Geiger<\/a> and <a href=\"http:\/\/www.mrt.kit.edu\/mitarbeiter_lenz.php\">Philip Lenz<\/a> and <a href=\"http:\/\/ttic.uchicago.edu\/~rurtasun\">Raquel Urtasun<\/a>},<br \/>title = {Are we ready for Autonomous Driving? The KITTI Vision Benchmark Suite},<br \/>booktitle = {Conference on Computer Vision and Pattern Recognition (CVPR)},<br \/>year = {2012}<br \/>}<\/p>\n<p>For the <strong>raw dataset<\/strong>, please cite:<br \/>@article{<a href=\"https:\/\/www.cvlibs.net\/publications\/Geiger2013IJRR.pdf\">Geiger2013IJRR<\/a>,<br \/>author = {<a href=\"https:\/\/www.cvlibs.net\/\">Andreas Geiger<\/a> and <a href=\"http:\/\/www.mrt.kit.edu\/mitarbeiter_lenz.php\">Philip Lenz<\/a> and <a href=\"http:\/\/www.mrt.kit.edu\/mitarbeiter_stiller.php\">Christoph Stiller<\/a> and <a href=\"http:\/\/ttic.uchicago.edu\/~rurtasun\">Raquel Urtasun<\/a>},<br \/>title = {Vision meets Robotics: The KITTI Dataset},<br \/>journal = {International Journal of Robotics Research (IJRR)},<br \/>year = {2013}<br \/>}<\/p>\n<p>For the <strong>road benchmark<\/strong>, please cite:<br \/>@inproceedings{<a href=\"https:\/\/www.cvlibs.net\/publications\/Fritsch2013ITSC.pdf\">Fritsch2013ITSC<\/a>,<br \/>author = {Jannik Fritsch 
and Tobias Kuehnl and <a href=\"https:\/\/www.cvlibs.net\/\">Andreas Geiger<\/a>},<br \/>title = {A New Performance Measure and Evaluation Benchmark for Road Detection Algorithms},<br \/>booktitle = {International Conference on Intelligent Transportation Systems (ITSC)},<br \/>year = {2013}<br \/>}<\/p>\n<p>For the <strong>stereo 2015<\/strong>, <strong>flow 2015,<\/strong> and <strong>scene flow 2015 benchmarks<\/strong>, please cite:<br \/>@inproceedings{<a href=\"https:\/\/www.cvlibs.net\/publications\/Menze2015CVPR.pdf\">Menze2015CVPR<\/a>,<br \/>author = {<a href=\"http:\/\/www.ipi.uni-hannover.de\/tmm.html\">Moritz Menze<\/a> and <a href=\"https:\/\/www.cvlibs.net\/\">Andreas Geiger<\/a>},<br \/>title = {Object Scene Flow for Autonomous Vehicles},<br \/>booktitle = {Conference on Computer Vision and Pattern Recognition (CVPR)},<br \/>year = {2015}<br \/>}<\/p>\n<hr>\n<p><a href=\"https:\/\/towardsdatascience.com\/mastering-sensor-fusion-color-image-obstacle-detection-with-kitti-data-part-2-5118be4e92ee\">Mastering Sensor Fusion: Color Image Obstacle Detection with KITTI Data\u200a\u2014\u200aPart 2<\/a> was originally published in <a href=\"https:\/\/towardsdatascience.com\/\">Towards Data Science<\/a> on Medium, where people are continuing the conversation by highlighting and responding to this story.<\/p>\n<\/div>\n<p> \t<BR><br \/>\n <BR><\/BR><br \/>\n    Erol \u00c7\u0131tak<br \/>\n \t<BR><br \/>\n<BR><\/BR><br \/>\n<a href=\"https:\/\/medium.com\/m\/global-identity-2?redirectUrl=https%3A%2F%2Ftowardsdatascience.com%2Fmastering-sensor-fusion-color-image-obstacle-detection-with-kitti-data-part-2-5118be4e92ee\">Go to original source<\/a><br \/>\n \t<BR><br \/>\n <BR><\/BR><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Mastering Sensor Fusion: Color 
Image Obstacle Detection with KITTI Data\u200a\u2014\u200aPart 2 Mastering Sensor Fusion: Color Image Obstacle Detection with KITTI Data\u200a\u2014\u200aPart\u00a02 How to use color image data for object detection in the context of obstacle detection The concept of sensor fusion is a decision-making mechanism that can be applied to different problems and using different [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[62,1076,221,88,301,1077],"tags":[1079,489,1078],"class_list":["post-938","post","type-post","status-publish","format-standard","hentry","category-aimldsaimlds","category-autonomous-cars","category-computer-vision","category-deep-learning","category-object-detection","category-sensor-fusion","tag-color","tag-detection","tag-obstacle"],"_links":{"self":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/938"}],"collection":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/comments?post=938"}],"version-history":[{"count":0,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/938\/revisions"}],"wp:attachment":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/media?parent=938"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/categories?post=938"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/tags?post=938"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}