{"id":937,"date":"2025-01-03T07:02:48","date_gmt":"2025-01-03T07:02:48","guid":{"rendered":"https:\/\/mailitics.com\/index.php\/2025\/01\/03\/sensor-fusion-kitti-lidar-based-obstacle-detection-part-1-9c5f4bc8d497\/"},"modified":"2025-01-03T07:02:48","modified_gmt":"2025-01-03T07:02:48","slug":"sensor-fusion-kitti-lidar-based-obstacle-detection-part-1-9c5f4bc8d497","status":"publish","type":"post","link":"https:\/\/mailitics.com\/index.php\/2025\/01\/03\/sensor-fusion-kitti-lidar-based-obstacle-detection-part-1-9c5f4bc8d497\/","title":{"rendered":"Sensor Fusion\u200a\u2014\u200aKITTI\u200a\u2014\u200a\u2018Lidar-based Obstacle Detection\u2019\u200a\u2014\u200aPart-1"},"content":{"rendered":"<p>    Sensor Fusion\u200a\u2014\u200aKITTI\u200a\u2014\u200a\u2018Lidar-based Obstacle Detection\u2019\u200a\u2014\u200aPart-1<br \/>\n \t<BR><br \/>\n<BR><\/BR><br \/>\n    <!-- no image --><br \/>\n \t<BR><br \/>\n<BR><\/BR><\/p>\n<div>\n<h3>Mastering Sensor Fusion: LiDAR Obstacle Detection with KITTI Data\u200a\u2014\u200aPart\u00a01<\/h3>\n<h4>How to use Lidar data for obstacle detection with unsupervised learning<\/h4>\n<p><strong>Sensor fusion, multi-modal perception, autonomous vehicles<\/strong>\u200a\u2014\u200aif these keywords pique your interest, this Medium blog is for you. Join me as I explore the fascinating world of LiDAR and color image-based environment understanding, showcasing how these technologies are combined to enhance obstacle detection and decision-making for autonomous vehicles. This blog and the following series dive into practical implementations and theoretical insights, offering an engaging read for all curious\u00a0eyes.<\/p>\n<p>In this Medium blog series, we will examine the KITTI 3D Object Detection dataset [1][3] in three distinct parts. In the first article, which is this one, we will be talking about the KITTI Velodyne Lidar sensor and single-mode obstacle detection with this sensor only. In the second article of the series, we will be working on detection studies on color images with a uni-modal approach. In the last article of the series, we will work on multi-modal object detection, which can be called sensor fusion. During that process, both the Lidar and the Color Image sensors come into play to work together.<\/p>\n<p>One last note before we get into the topic! I promise that I will provide all the theoretical information about each subtopic at a basic level throughout this series\u00a0\ud83d\ude42 However, I will also be leaving very high-quality references for each subtopic without forgetting those who want more in-depth information.<\/p>\n<h3>Introduction<\/h3>\n<p>KITTI or KITTI Vision Benchmark Suite is a project created in collaboration with Karlsruhe Institute of Technology and Toyota Research Institute. We can say that it is a platform that includes many different test scenarios, including 2D\/3D object detection, multi-object tracking, semantic segmentation, and so\u00a0forth.<\/p>\n<p>For 3D object detection, which is the subject of this article series, there are 7481 training and 7518 test data from different sensors, which are Velodyne Lidar Sensor and Stereo Color Image\u00a0Sensors.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/935\/1%2AT9ckW3mjIxbOTMtuDPP7VQ.jpeg?ssl=1\"><figcaption>A sample image for 3D Object Detection [3] (Image Taken from <a href=\"https:\/\/www.cvlibs.net\/datasets\/kitti\/eval_object.php?obj_benchmark=3d\">https:\/\/www.cvlibs.net\/datasets\/kitti\/eval_object.php?obj_benchmark=3d<\/a>)<\/figcaption><\/figure>\n<p>In this blog post, we will perform obstacle detection using Velodyne Lidar point clouds. In this context, reading point clouds, visualization, and segmentation with the help of unsupervised machine learning algorithms will be the main topics. In addition to these, we will talk a lot about camera calibration and its internal and external parameters, the RANSAC algorithm for vehicle path detection, and basic evaluation metrics to measure the performance of the outputs that we will need while performing these\u00a0steps.<\/p>\n<blockquote><p>Also, I will be using Python language throughout this series, but don\u2019t worry, I will share with you the information about the virtual environment I use. This way, you can quickly get your own environment up and running. Please check the Github repo to get the <strong>requirements.txt<\/strong> file.<\/p><\/blockquote>\n<h4><strong>Problem Definition<\/strong><\/h4>\n<p>The main goal of this blog post is to detect obstacles in the environment detected by the sensor using the unsupervised learning method on point clouds obtained with the Veloydne Lidar in the KITTI\u00a0dataset.<\/p>\n<p>Within this scope, I am sharing an example Lidar point cloud image below to visualize the problem. If we analyze the following sample point cloud, we can easily recognize some cars at the left bottom or some other objects on the\u00a0road.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2AbIr5op8vsADH8U4wtIi2AQ.png?ssl=1\"><figcaption>A sample Lidar point cloud [3] ( from KITTI\u00a0dataset)<\/figcaption><\/figure>\n<p>To make it more visible, let me draw some arrows and boxes to show them. In the following image, red arrows indicate cars, orange arrows stand for pedestrians, and red boxes are drawn for street\u00a0lambs.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"A sample Lidar point cloud ( from KITTI dataset)\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2ADJSmTkXHWC7YfgbmB8EGpg.png?ssl=1\"><figcaption>A sample Lidar point cloud [3] ( from KITTI\u00a0dataset)<\/figcaption><\/figure>\n<p>Then, you may wonder and ask this question <strong><em>\u201cWouldn\u2019t we also say that there are other objects around, perhaps walls or trees?\u201d <\/em><\/strong>The answer is YES! The proof of my answer can be obtained from the color image corresponding to this point cloud. As can be seen from the image below, there are people, a car, street lights, and trees on the\u00a0scene.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/560\/1%2A5foIyWJbRyjfBaFW_BqTZA.png?ssl=1\"><figcaption>A sample color image [3] ( from KITTI\u00a0dataset)<\/figcaption><\/figure>\n<p>After this visual analysis, we come to a subject that careful readers will immediately notice. While the Lidar point cloud provides a 360-degree view of the scene, color image only provides a limited wide perception of the scene. The following blog will be taking only this colored image into consideration for object detection and the last one will try to fuse Lidar point cloud and color image sensors to handle the problem <strong><em>(I hope they will be available soon!)<\/em><\/strong><\/p>\n<h4>Sensor Setup<\/h4>\n<p>Then let\u2019s talk about the sensors and their installations and so on. The KITTI 3D object detection dataset was collected using a specially modified Volkswagen Passat B6. Data recording was handled by an eight-core i7 computer with a RAID system, running Ubuntu Linux alongside a real-time database for efficient data management.<\/p>\n<p>The following sensors were used for data collection:<\/p>\n<ul>\n<li>\n<strong>Inertial Navigation System (GPS\/IMU):<\/strong> OXTS RT\u00a03003<\/li>\n<li>\n<strong>Lidar Sensor:<\/strong> Velodyne\u00a0HDL-64E<\/li>\n<li>\n<strong>Grayscale Cameras:<\/strong> Two Point Grey Flea 2 (FL2\u201314S3M-C), each with 1.4 Megapixels<\/li>\n<li>\n<strong>Color Cameras:<\/strong> Two Point Grey Flea 2 (FL2\u201314S3C-C), each with 1.4 Megapixels<\/li>\n<li>\n<strong>Varifocal Lenses:<\/strong> Four Edmund Optics NT59\u2013917 (4\u20138\u00a0mm)<\/li>\n<\/ul>\n<p>The visualization of the aforementioned setup is presented in the following figure.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2AJ7Mu6drn_781Qp_KyYVNvA.png?ssl=1\"><figcaption>KITTI dataset setup visualization [3] ( Image Taken from\u00a0KITTI)<\/figcaption><\/figure>\n<p>The Velodyne Lidar sensor and the Color cameras are installed on top of the car but their height from the ground and their coordinates are different than each other. No worries! As promised, we will go step by step. It means that, before getting the core of the algorithm of this blog post, we need to revisit the camera calibration topic\u00a0first!<\/p>\n<h4>Camera Calibration<\/h4>\n<p>Cameras, or sensors in a broader sense, provide perceptual outputs of the surrounding environment in different ways. In this concept, let\u2019s take an RGB camera, it could be your webcam or maybe a professional digital compact camera. It projects 3D points in the world onto a 2D image plane using two sets of parameters; the intrinsic and extrinsic parameters.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/375\/1%2ATZm3TXySEjKHd75cZ5Slcg.png?ssl=1\"><figcaption>Projection of 3D points in the world to the 2D image plane ( Image taken from: <a href=\"https:\/\/de.mathworks.com\/help\/vision\/ug\/camera-calibration.html\">https:\/\/de.mathworks.com\/help\/vision\/ug\/camera-calibration.html<\/a>)<\/figcaption><\/figure>\n<p>While the extrinsic parameters are about the location and the orientation of the camera in the world frame domain, the intrinsic parameters map the camera coordinates to the pixel coordinates in the image\u00a0frame.<\/p>\n<p>In this concept, the camera extrinsic parameters can be represented as a matrix like T = [R | t ] where R stands for the rotation matrix, which is 3&#215;3 and t stands for the translation vector, which is 3&#215;1. As a result, the T matrix is a 3&#215;4 matrix that takes a point in the world and maps it to the \u2018camera coordinate\u2019 domain.<\/p>\n<p>On the other hand, the camera&#8217;s intrinsic parameters can be represented as a 3&#215;3 matrix. The corresponding matrix, K, can be given as follows. While fx and fy represent the focal length of the camera, cx and cy stand for principal points, and s indicates the skewness of the\u00a0pixel.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/78\/1%2ATFix7qc3QKMm7NOh8vTPFA.png?ssl=1\"><figcaption>The camera\u2019s intrinsic parameters<\/figcaption><\/figure>\n<p>As a result, any 3D point can be projectable to the 2D image plane via following complete camera\u00a0matrix.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/864\/1%2ARpLdAM4IT3DSrznUMznMLQ.png?ssl=1\"><figcaption>The complete camera matrix to project a 3D world point into the image\u00a0plane<\/figcaption><\/figure>\n<p>I know that camera calibration seems a little bit complicated especially if you encounter it for the first time. But I have searched for some really good references for you. Also, I will be talking about the applied camera calibration operations for our problem in the following sections.<\/p>\n<p>References for the camera calibration topic:<\/p>\n<p><em>\u2014 Carnegie Mellon University, <\/em><a href=\"https:\/\/www.cs.cmu.edu\/~16385\/s17\/Slides\/11.1_Camera_matrix.pdf\"><em>https:\/\/www.cs.cmu.edu\/~16385\/s17\/Slides\/11.1_Camera_matrix.pdf<\/em><\/a><\/p>\n<p><em>\u2014 Columbia University, <\/em><a href=\"https:\/\/www.youtube.com\/watch?v=GUbWsXU1mac\"><em>https:\/\/www.youtube.com\/watch?v=GUbWsXU1mac<\/em><\/a><\/p>\n<p><em>\u2014 Camera Calibration Medium Post, <\/em><a href=\"https:\/\/yagmurcigdemaktas.medium.com\/visual-perception-camera-calibration-9108f8be789\"><em>https:\/\/yagmurcigdemaktas.medium.com\/visual-perception-camera-calibration-9108f8be789<\/em><\/a><\/p>\n<h4>Dataset Understanding<\/h4>\n<p>After a couple of terminologies and the required basic theory, now we are able to get into the\u00a0problem.<\/p>\n<p>First of all, I highly suggest you download the dataset from here [2] for the following ones;<\/p>\n<ul>\n<li>Left Color Images (size is\u00a012GB)<\/li>\n<li>Velodyne Point Cloud (size is\u00a029GB)<\/li>\n<li>Camera Calibration Matrices of the Object Dataset (size is negligible)<\/li>\n<li>Training Labels (size is negligible)<\/li>\n<\/ul>\n<p>The data that we are going to analyze is the ground truth (G.T.)label files. G.T. files are presented in \u2018.txt\u2019 format and each object is labeled with 15 different fields. No worries, I prepared a detailed G.T. file read function in my Github repo as\u00a0follows.<\/p>\n<pre>def parse_label_file(label_file_path):<br>    \"\"\"<br>    KITTI 3D Object Detection Label Fields:<br><br>    Each line in the label file corresponds to one object in the scene and contains 15 fields:<br><br>    1. Type (string):<br>    - The type of object (e.g., Car, Van, Truck, Pedestrian, Cyclist, etc.).<br>    - \"DontCare\" indicates regions to ignore during training.<br><br>    2. Truncated (float):<br>    - Value between 0 and 1 indicating how truncated the object is.<br>    - 0: Fully visible, 1: Completely truncated (partially outside the image).<br><br>    3. Occluded (integer):<br>    - Level of occlusion:<br>        0: Fully visible.<br>        1: Partly occluded.<br>        2: Largely occluded.<br>        3: Fully occluded (annotated based on prior knowledge).<br><br>    4. Alpha (float):<br>    - Observation angle of the object in the image plane, ranging from [-\u03c0, \u03c0].<br>    - Encodes the orientation of the object relative to the camera plane.<br><br>    5. Bounding Box (4 floats):<br>    - (xmin, ymin, xmax, ymax) in pixels.<br>    - Defines the 2D bounding box in the image plane.<br><br>    6. Dimensions (3 floats):<br>    - (height, width, length) in meters.<br>    - Dimensions of the object in the 3D world.<br><br>    7. Location (3 floats):<br>    - (x, y, z) in meters.<br>    - 3D coordinates of the object center in the camera coordinate system:<br>        - x: Right, y: Down, z: Forward.<br><br>    8. Rotation_y (float):<br>    - Rotation around the Y-axis in camera coordinates, ranging from [-\u03c0, \u03c0].<br>    - Defines the orientation of the object in 3D space.<br><br>    9. Score (float) [optional]:<br>    - Confidence score for detections (used for results, not training).<br><br>    Example Line:<br>    Car 0.00 0 -1.82 587.00 156.40 615.00 189.50 1.48 1.60 3.69 1.84 1.47 8.41 -1.56<br><br>Notes:<br>    - \"DontCare\" objects: Regions ignored during training and evaluation. Their bounding boxes can overlap with actual objects.<br>    - Camera coordinates: All 3D values are given relative to the camera coordinate system, with the camera at the origin.<br>    \"\"\"<br><\/pre>\n<p>The color images are presented as files in the folder and they can be read easily, which means without any further operations. As a result of this operation, it can be that <strong><em># of training and testing images: 7481 \/\u00a07518<\/em><\/strong><\/p>\n<p>The next data that we will be taking into consideration is the calibration files for each scene. As I did before, I prepared another function to parse calibration files as\u00a0follows.<\/p>\n<pre>def parse_calib_file(calib_file_path):<br>    \"\"\"<br>        Parses a calibration file to extract and organize key transformation matrices.<br>        <br>        The calibration file contains the following data:<br>        - P0, P1, P2, P3: 3x4 projection matrices for the respective cameras.<br>        - R0: 3x3 rectification matrix for aligning data points across sensors.<br>        - Tr_velo_to_cam: 3x4 transformation matrix from the LiDAR frame to the camera frame.<br>        - Tr_imu_to_velo: 3x4 transformation matrix from the IMU frame to the LiDAR frame.<br><br>        Parameters:<br>        calib_file_path (str): Path to the calibration file.<br><br>        Returns:<br>        dict: A dictionary where each key corresponds to a calibration parameter <br>            (e.g., 'P0', 'R0') and its value is the associated 3x4 NumPy matrix.<br>        <br>        Process:<br>        1. Reads the calibration file line by line.<br>        2. Maps each line to its corresponding key ('P0', 'P1', etc.).<br>        3. Extracts numerical elements, converts them to a NumPy 3x4 matrix, <br>        and stores them in a dictionary.<br><br>        Example:<br>        Input file line for 'P0':<br>        P0: 1.0 0.0 0.0 0.0  0.0 1.0 0.0 0.0  0.0 0.0 1.0 0.0<br>        Output dictionary:<br>        {<br>            'P0': [[1.0, 0.0, 0.0, 0.0],<br>                [0.0, 1.0, 0.0, 0.0],<br>                [0.0, 0.0, 1.0, 0.0]]<br>        }<br>    \"\"\"<\/pre>\n<p>The final data is the Velodyne point cloud and they are presented in \u2018.bin\u2019 format. In this format, each point cloud line consists of the location of x, y, and z plus the reflectivity score. As before, the corresponding parse function is as\u00a0follows.<\/p>\n<pre>def read_velodyne_bin(file_path):<br>    \"\"\"<br>    Reads a KITTI Velodyne .bin file and returns the point cloud data as a numpy array.<br><br>    :param file_path: Path to the .bin file<br>    :return: Numpy array of shape (N, 4) where N is the number of points,<br>             and each point has (x, y, z, reflectivity)<br>                 <br>    ### For KITTI's Velodyne LiDAR point cloud, the coordinate system used is forward-right-up (FRU).<br>    KITTI Coordinate System (FRU):<br>        X-axis (Forward): Points in the positive X direction move forward from the sensor.<br>        Y-axis (Right): Points in the positive Y direction move to the right of the sensor.<br>        Z-axis (Up): Points in the positive Z direction move upward from the sensor.<br><br>    <br>    ### Units: All coordinates are in meters (m). A point (10, 5, 2) means:<br><br>        It is 10 meters forward.<br>        5 meters to the right.<br>        2 meters above the sensor origin.<br>        Reflectivity: The fourth value in KITTI\u2019s .bin files represents the reflectivity or intensity of the LiDAR laser at that point. It is unrelated to the coordinate system but adds extra context for certain tasks like segmentation or object detection.<br><br>        Velodyne Sensor Placement:<br><br>        The LiDAR sensor is mounted on a vehicle at a specific height and offset relative to the car's reference frame.<br>        The point cloud captures objects relative to the sensor\u2019s position.<br><br>    \"\"\"<\/pre>\n<p>At the end of this section, all the required files will be loaded and ready to be\u00a0used.<\/p>\n<p>For the sample scene, which was presented at the top of this post in the \u2018Problem Definition\u2019 section, there are 122794 points in the point\u00a0cloud.<\/p>\n<p>But since that amount of information could be hard to analyze for some systems in terms of CPU or GPU power, we may want to reduce the number of points in the cloud. To make it possible we can use the \u201cVoxel Downsampling\u201d operation, which is similar to the \u201cPooling\u201d operation in deep neural networks. Roughly it divides the complete point cloud into a grid of equally sized voxels and chooses a single point from each\u00a0voxel.<\/p>\n<pre>print(f\"Points before downsampling: {len(sample_point_cloud.points)} \")<br>sample_point_cloud = sample_point_cloud.voxel_down_sample(voxel_size=0.2)<br>print(f\"Points after downsampling: {len(sample_point_cloud.points)}\")<\/pre>\n<p>The output of this downsampling looks like\u00a0this;<\/p>\n<p>Points before downsampling: 122794 <br \/>Points after downsampling: 33122<\/p>\n<p>But it shouldn\u2019t be forgotten that reducing the number of points may cause to loss of some information as might be expected. Also, the voxel grid size is a hyper-parameter that we can choose is another crucial thing. Smaller sizes return a high number of points or vice\u00a0versa.<\/p>\n<p>But, before getting into the road segmentation by RANSAC, let&#8217;s quickly re-visit the Voxel Downsampling operation together.<\/p>\n<h4>Voxel Downsampling<\/h4>\n<p>Voxel Downsampling is a technique to create a downsampled point cloud. It highly helps to reduce some noise and not-required points. It also reduces the required computational power in light of the selected voxel grid size hyperparameter. The visualization of this operation can be given as\u00a0follows.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/550\/1%2AAuzuTx9takmLmMpGs9uJGA.png?ssl=1\"><figcaption>The illustration of Voxel Downsampling (Image taken from <a href=\"https:\/\/www.mdpi.com\/2076-3417\/14\/8\/3160\">https:\/\/www.mdpi.com\/2076-3417\/14\/8\/3160<\/a>)<\/figcaption><\/figure>\n<p>Besides that, the steps of this algorithm can be presented as\u00a0follows.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/732\/1%2AaLebTGa4lsaVSllgu-sFag.png?ssl=1\"><\/figure>\n<p>To apply this function, we will be using the \u201copen3d\u201d library with a single\u00a0line;<\/p>\n<pre>sample_point_cloud = sample_point_cloud.voxel_down_sample(voxel_size=0.2)<\/pre>\n<p>In the above single-line code, it can be observed that the voxel size is chosen as\u00a00.2<\/p>\n<h4>RANSAC<\/h4>\n<p>The next step will be segmenting the largest plane, which is the road for our problem. RANSAC, Random Sample Consensus, is an iterative algorithm and works by randomly sampling a subset of the data points to hypothesize a model and then evaluating its fit to the entire dataset. It aims to find the model that best explains the inliers while ignoring the outliers.<\/p>\n<p>While the algorithm is highly robust to the extreme outliers, it requires to sample of <em>n<\/em> points at the beginning (n=2 for a 2D line or 3 for a 3D plane). Then evaluates the performance of the mathematical equation with respect to it. Then it\u00a0means;<\/p>\n<p>\u2014 the chosen points at the beginning are so\u00a0crucial<\/p>\n<p>\u2014 the number of iterations to find the best values is so\u00a0crucial<\/p>\n<p>\u2014 it may require some computation power, especially for large\u00a0datasets<\/p>\n<p>But it\u2019s a kind of de-facto operation for many different cases. So first let&#8217;s visualize the RANSAC to find a 2D line then let me present the key steps of this algorithm.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/640\/0%2AOg8LqFss_DcXjmKV.gif?ssl=1\"><\/figure>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/836\/1%2AIrUguJg_lwALPzmqw2q-Jg.png?ssl=1\"><figcaption>The key steps and working flow of the RANSAC algorithm<\/figcaption><\/figure>\n<p>After reviewing the concept of RANSAC, it is time to apply the algorithm on the point cloud to determine the largest plane, which is a road, for our\u00a0problem.<\/p>\n<pre># 3. RANSAC Segmentation to identify the largest plane<br>plane_model, inliers = sample_point_cloud.segment_plane(distance_threshold=0.3, ransac_n=3, num_iterations=150)<br><br>## Identify inlier points -&gt; road<br>inlier_cloud = sample_point_cloud.select_by_index(inliers)<br>inlier_cloud.paint_uniform_color([0, 1, 1]) # R, G, B format<br><br>## Identify outlier points -&gt; objects on the road<br>outlier_cloud = sample_point_cloud.select_by_index(inliers, invert=True)<br>outlier_cloud.paint_uniform_color([1, 0, 0]) # R, G, B format<\/pre>\n<p>The output of this process will show the outside of the road in red and the road will be colored in a mixture of Green and\u00a0Blue.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2AHzbfN-lUab8gVRajU8PDCQ.png?ssl=1\"><figcaption>The output of the RANSAC algorithm (Image taken from KITTI dataset\u00a0[3])<\/figcaption><\/figure>\n<h4>DBSCAN\u200a\u2014\u200aa density-based clustering non-parametric algorithm<\/h4>\n<p>At this stage, the detection of objects outside the road will be performed using the segmented version of the road with\u00a0RANSAC.<\/p>\n<p>In this context, we will be using unsupervised learning algorithms. However, the question that may come to mind here is <strong>\u201cCan\u2019t a detection be made using supervised learning algorithms?\u201d <\/strong>The answer is very short and clear: Yes! However, since we want to introduce the problem and get a quick result with this blog post, we will continue with DBSCAN, which is a segmentation algorithm in the unsupervised learning domain. If you would like to see the results with a supervised learning-based object detection algorithm on point clouds, please indicate this in the comments.<\/p>\n<p>Anyway, let\u2019s try to answer these three questions: What is DBSCAN and how does it work? What are the hyper-parameters to consider? How do we apply it to this\u00a0problem?<\/p>\n<p>DBSCAN also known as a density-based clustering non-parametric algorithm, is an unsupervised clustering algorithm. Even if there are some other unsupervised clustering algorithms, maybe one of the most popular ones is K-Means, DBSCAN is capable of clustering the objects in arbitrary shape while K-Means asumes the shape of the object is spherical. Moreover, probably the most important feature of DBSCAN is that it does not require the number of clusters to be defined\/estimated in advance, as in the K-Means algorithm. If you would like to analyze some really good visualizations for some specific problems like \u201c2Moons\u201d, you can visit here: <a href=\"https:\/\/www.kaggle.com\/code\/ahmedmohameddawoud\/dbscan-vs-k-means-visualizing-the-difference\"><em>https:\/\/www.kaggle.com\/code\/ahmedmohameddawoud\/dbscan-vs-k-means-visualizing-the-difference<\/em><\/a><\/p>\n<p>DBSCAN works like our eyes. It means it takes the densities of different groups in the data and then makes a decision for clustering. It has two different hyper-parameters: \u201cEpsilon\u201d and \u201cMinimumPoints\u201d. Initially, DBSCAN identifies <em>core points<\/em>, which are points with at least a minimum number of neighbors (<em>minPts<\/em>) within a specified radius (<em>epsilon<\/em>). Clusters are then formed by expanding from these core points, connecting all reachable points within the density criteria. Points that cannot be connected to any cluster are classified as noise. To get in-depth information about this algorithm like \u2018Core Point\u2019, \u2018Border Point\u2019 and \u2018Noise Point\u2019 please visit there: <em>Josh Starmer, <\/em><a href=\"https:\/\/www.youtube.com\/watch?v=RDZUdRSDOok&amp;t=61s\"><em>https:\/\/www.youtube.com\/watch?v=RDZUdRSDOok&amp;t=61s<\/em><\/a><\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2A5DThbuTIUcBnbvY5LpznIA.jpeg?ssl=1\"><figcaption>A sample clustering result of the DBSCAN algorithm<\/figcaption><\/figure>\n<p>For our problem, while we can use DBSCAN from the SKLearn library, let&#8217;s use the open3d as\u00a0follows.<\/p>\n<pre># 4. Clustering using DBSCAN -&gt; To further segment objects on the road<br>with o3d.utility.VerbosityContextManager(o3d.utility.VerbosityLevel.Debug) as cm:<br>    labels = np.array(outlier_cloud.cluster_dbscan(eps=0.45, min_points=10, print_progress=True))<\/pre>\n<p>As we can see, \u2018epsilon\u2019 was chosen as 0.45, and \u2018MinPts\u2019 was chosen as 10. A quick comment about these. Since they are hyper-parameters, there are no best \u201cnumbers\u201d out there. Unfortunately, it\u2019s a matter of trying and measuring success. But no worries! After you read the last chapter of this blog post, \u201cEvaluation Metrics\u201d, you will be able to measure your algorithm\u2019s performance in total. Then it means you can apply GridSearch <em>( ref: <\/em><a href=\"https:\/\/www.analyticsvidhya.com\/blog\/2021\/06\/tune-hyperparameters-with-gridsearchcv\/\"><em>https:\/\/www.analyticsvidhya.com\/blog\/2021\/06\/tune-hyperparameters-with-gridsearchcv\/<\/em><\/a><em>) <\/em>to find the best hyper-param pairs!<\/p>\n<p>Yep, then let me visualize the output of DBCAN for our point cloud then let&#8217;s move to the next\u00a0step!<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2AGpBs15b_8TZLCBjYatNnog.png?ssl=1\"><figcaption>The output of the DBSCAN clustering algorithm (Image taken from KITTI dataset\u00a0[3])<\/figcaption><\/figure>\n<p>To recall, we can see that some of the objects that I first showed and marked by hand are separate and in different colors here! This shows that these objects belong to different clusters (as it should\u00a0be).<\/p>\n<h4>G.T. Labels and Their Calibration Process<\/h4>\n<p>Now it\u2019s time to analyze G.T. labels and Calibration files of the KITTI 3D Object Detection benchmark. In the previous section, I shared some tips about them like how to read, how to parse,\u00a0etc.<\/p>\n<p>But now I want to mention the relation between the G.T. object and the Calibration matrices. First of all, let me share a figure of the G.T. file and the Calibration file side by\u00a0side.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/706\/1%2Awgr3djXfEw8AO0GbXIP38w.png?ssl=1\"><figcaption>A sample training label file in\u00a0.txt\u00a0format<\/figcaption><\/figure>\n<p>As we discussed before, the last element of the training label refers to the rotation of the object around the y-axis. The three numbers before the rotation element (1.84, 1.47, and 8.41) stand for the 3D location of the object\u2019s centroid in the camera coordinate system.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/957\/1%2Aa6u3ll7__X-ncSwDoRH2bQ.png?ssl=1\"><figcaption>A sample calibration file in\u00a0.txt\u00a0format<\/figcaption><\/figure>\n<p>On the calibration file side; <em>P0, P1, P2<\/em>, and <em>P3<\/em> are the camera projection matrices for their corresponding cameras. In this blog post, as we indicated before, we are using the \u2018Left Color Images\u2019 which is equal to <em>P2<\/em>. Also, <em>R0_rect <\/em>is a rectification matrix for aligning stereo images. As can be understood from their names, <em>Tr_velo_to_cam <\/em>and <em>Tr_imu_to_velo <\/em>are transformation matrices that will be used to provide the transition between different coordinate systems. For example, <em>Tr_velo_to_cam<\/em> is a transformation matrix converting Velodyne coordinates to the unrectified camera coordinate system.<\/p>\n<p>After this explanation, I really paid attention to which matrix or which label in the which coordinate system, now we can mention the transformation of G.T. object coordinates to the Velodyne coordinate system easily. It\u2019s a good point to both understand the use of matrices between coordinate systems and evaluate our predicted bounding boxes and G.T. object bounding\u00a0boxes.<\/p>\n<p>The first thing that we will be doing is computing the G.T. object bounding box in 3D. To do so, you can reach out to the following function in the\u00a0repo.<\/p>\n<pre>def compute_box_3d(obj, Tr_cam_to_velo):<br>    \"\"\"<br>    Compute the 8 corners of a 3D bounding box in Velodyne coordinates.<br>    Args:<br>        obj (dict): Object parameters (dimensions, location, rotation_y).<br>        Tr_cam_to_velo (np.ndarray): Camera to Velodyne transformation matrix.<br>    Returns:<br>        np.ndarray: Array of shape (8, 3) with the 3D box corners.<br>    \"\"\"<\/pre>\n<p>Given an object\u2019s dimensions (height, width, length) and position (x, y, z) in the camera coordinate system, this function first rotates the bounding box based on its orientation (rotation_y) and then computes the corners of the box in 3D\u00a0space.<\/p>\n<p>This computation is based on the transformation that uses a matrix that is capable of transferring any point from the camera coordinate system to the Velodyne coordinate system. But, wait? We don\u2019t have the camera to Velodyne matrix, do we? Yes, we need to calculate it first by taking the inverse of the <em>Tr_velo_to_cam <\/em>matrix, which is presented in the calibration files.<\/p>\n<p>No worries, all this workflow is presented by these functions.<\/p>\n<pre>def transform_points(points, transformation):<br>    \"\"\"<br>    Apply a transformation matrix to 3D points.<br>    Args:<br>        points (np.ndarray): Nx3 array of 3D points.<br>        transformation (np.ndarray): 4x4 transformation matrix.<br>    Returns:<br>        np.ndarray: Transformed Nx3 points.<br>    \"\"\"<\/pre>\n<pre>def inverse_rigid_trans(Tr):<br>    \"\"\"<br>    Inverse a rigid body transform matrix (3x4 as [R|t]) to [R'|-R't; 0|1].<br>    Args:<br>        Tr (np.ndarray): 4x4 transformation matrix.<br>    Returns:<br>        np.ndarray: Inverted 4x4 transformation matrix.<br>    \"\"\"<\/pre>\n<p>In the end, we can easily see the G.T. objects and project them into the Velodyne point cloud coordinate system. Now let&#8217;s visualize the output and then jump into the evaluation section!<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2AnnclCaITZwf3ghwa9i0GsQ.png?ssl=1\"><figcaption>The projected G.T. object bounding boxes (Image taken from KITTI dataset\u00a0[3])<\/figcaption><\/figure>\n<p>(I know the green bounding boxes can be a little hard to see, so I added arrows next to them in\u00a0black.)<\/p>\n<h4>Evaluation Metrics<\/h4>\n<p>Now we have the predicted bounding boxes by our pipeline and G.T. object boxes! Then let&#8217;s calculate some metrics to evaluate our pipeline. In order to perform the hyperparameter optimization that we talked about earlier, we must be able to continuously monitor our performance for each parameter group.<\/p>\n<p>But before getting into the evaluation metric I need to mention two things. First of all, KITTI has different evaluation criteria for different objects. For example, while a 50% match between the labels produced for pedestrians and G.T. is sufficient, it is 70% for vehicles. Another issue is that while the pipeline we created performs object detection in a 360-degree environment, the KITTI G.T. labels only include the label values \u200b\u200bof the objects in the viewing angle of the color cameras. Consequently, we can detect more bounding boxes than presented in G.T. label files. So what to do? Based on the concepts I will talk about here, you can reach the final result by carefully analyzing KITTI\u2019s evaluation criteria. But for now, I will not do a more detailed analysis in this section for the continuation posts of this Medium blog post\u00a0series.<\/p>\n<p>To evaluate the predicted bounding boxes and G.T. bounding boxes, we will be using the TP, FP, and FN\u00a0metrics.<\/p>\n<p>TP represents the predicted boxes that match with G.T. boxes, FP stands for the predicted boxes that do NOT match with any G.T. boxes, and FN is the condition that there are no corresponding predicted bounding boxes for G.T. bounding\u00a0boxes.<\/p>\n<p>In this context, of course, we need to find a tool to measure how a predicted bounding box and a G.T. bounding box match. The name of our tool is IOU, intersected over\u00a0union.<\/p>\n<p>You can easily reach out to the IOU and evaluation functions as\u00a0follows.<\/p>\n<pre>def compute_iou(box1, box2):<br>    \"\"\"<br>    Calculate the Intersection over Union (IoU) between two bounding boxes.<br>    :param box1: open3d.cpu.pybind.geometry.AxisAlignedBoundingBox object for the first box<br>    :param box2: open3d.cpu.pybind.geometry.AxisAlignedBoundingBox object for the second box<br>    :return: IoU value (float)<br>    \"\"\"<\/pre>\n<pre># Function to evaluate metrics (TP, FP, FN)<br>def evaluate_metrics(ground_truth_boxes, predicted_boxes, iou_threshold=0.5):<br>    \"\"\"<br>    Evaluate True Positives (TP), False Positives (FP), and False Negatives (FN).<br>    :param ground_truth_boxes: List of AxisAlignedBoundingBox objects for ground truth<br>    :param predicted_boxes: List of AxisAlignedBoundingBox objects for predictions<br>    :param iou_threshold: IoU threshold for a match<br>    :return: TP, FP, FN counts<br>    \"\"\"<br><\/pre>\n<p>Let me finalize this section by giving predicted bounding boxes (RED) and G.T. bounding boxes (GREEN) over the point\u00a0cloud.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2AbE9qU-mslvKufym5A6Pggw.png?ssl=1\"><figcaption>Predicted bounding boxes and G.T. bounding boxes are presented together on the point cloud (Image taken from KITTI dataset\u00a0[3])<\/figcaption><\/figure>\n<h4>Conclusion<\/h4>\n<p>Yeah, it\u2019s a little bit long, but we are about to finish it. First, we have learned a couple of things about the KITTI 3D Object Detection Benchmark and some terminology about different topics, like camera coordinate systems and unsupervised learning, etc.<\/p>\n<p>Now interested readers can extend this study by adding a grid search to find the best hyper-param elements. For example, the number of minimum points in segmentation, or maybe the # of iteration RANSAC or the voxel grid size in Voxel Downsampling operation, all could be possible improvement points.<\/p>\n<h4>What\u2019s next?<\/h4>\n<p>The next part will be investigating object detection on ONLY Left Color Camera frames. This is another fundamental step of this series cause we will be fusing the Lidar Point Cloud and Color Camera frames in the last part of this blog series. Then we will be able to make a conclusion and answer this question: <em>\u201cDoes Sensor Fusion reduce the uncertainty and improve the performance in KITTI 3D Object Detection Benchmark?\u201d<\/em><\/p>\n<blockquote><p><strong>Any comments, error fixes, or improvements are\u00a0welcome!<\/strong><\/p><\/blockquote>\n<blockquote><p><strong><em>Thank you all and I wish you healthy\u00a0days.<\/em><\/strong><\/p><\/blockquote>\n<p>********************************************************************************************************************************************************<\/p>\n<p><strong><em>Github link<\/em><\/strong>: <a href=\"https:\/\/github.com\/ErolCitak\/KITTI-Sensor-Fusion\/tree\/main\">https:\/\/github.com\/ErolCitak\/KITTI-Sensor-Fusion\/tree\/main\/lidar_based_obstacle_detection<\/a><\/p>\n<p><strong>References<\/strong><\/p>\n<p>[1]\u200a\u2014\u200a<a href=\"https:\/\/www.cvlibs.net\/datasets\/kitti\/\">https:\/\/www.cvlibs.net\/datasets\/kitti\/<\/a><\/p>\n<p>[2]\u200a\u2014\u200a<a href=\"https:\/\/www.cvlibs.net\/datasets\/kitti\/eval_object.php?obj_benchmark=3d\">https:\/\/www.cvlibs.net\/datasets\/kitti\/eval_object.php?obj_benchmark=3d<\/a><\/p>\n<p>[3]\u200a\u2014\u200aGeiger, Andreas, et al. \u201cVision meets robotics: The kitti dataset.\u201d <em>The International Journal of Robotics Research<\/em> 32.11 (2013): 1231\u20131237.<\/p>\n<h4>Disclaimer<\/h4>\n<p>The images used in this blog series are taken from the KITTI dataset for education and research purposes. If you want to use it for similar purposes, you must go to the relevant website, approve the intended use there, and use the citations defined by the benchmark creators as\u00a0follows.<\/p>\n<p>For the <strong>stereo 2012<\/strong>, <strong>flow 2012<\/strong>, <strong>odometry<\/strong>, <strong>object detection,<\/strong> or <strong>tracking benchmarks<\/strong>, please cite:<br \/>@inproceedings{<a href=\"https:\/\/www.cvlibs.net\/publications\/Geiger2012CVPR.pdf\">Geiger2012CVPR<\/a>,<br \/> author = {<a href=\"https:\/\/www.cvlibs.net\/\">Andreas Geiger<\/a> and <a href=\"http:\/\/www.mrt.kit.edu\/mitarbeiter_lenz.php\">Philip Lenz<\/a> and <a href=\"http:\/\/ttic.uchicago.edu\/~rurtasun\">Raquel Urtasun<\/a>},<br \/> title = {Are we ready for Autonomous Driving? The KITTI Vision Benchmark Suite},<br \/> booktitle = {Conference on Computer Vision and Pattern Recognition (CVPR)},<br \/> year = {2012}<br \/>}<\/p>\n<p>For the <strong>raw dataset<\/strong>, please cite:<br \/>@article{<a href=\"https:\/\/www.cvlibs.net\/publications\/Geiger2013IJRR.pdf\">Geiger2013IJRR<\/a>,<br \/> author = {<a href=\"https:\/\/www.cvlibs.net\/\">Andreas Geiger<\/a> and <a href=\"http:\/\/www.mrt.kit.edu\/mitarbeiter_lenz.php\">Philip Lenz<\/a> and <a href=\"http:\/\/www.mrt.kit.edu\/mitarbeiter_stiller.php\">Christoph Stiller<\/a> and <a href=\"http:\/\/ttic.uchicago.edu\/~rurtasun\">Raquel Urtasun<\/a>},<br \/> title = {Vision meets Robotics: The KITTI Dataset},<br \/> journal = {International Journal of Robotics Research (IJRR)},<br \/> year = {2013}<br \/>}<\/p>\n<p>For the <strong>road benchmark<\/strong>, please cite:<br \/>@inproceedings{<a href=\"https:\/\/www.cvlibs.net\/publications\/Fritsch2013ITSC.pdf\">Fritsch2013ITSC<\/a>,<br \/> author = {Jannik Fritsch and Tobias Kuehnl and <a href=\"https:\/\/www.cvlibs.net\/\">Andreas Geiger<\/a>},<br \/> title = {A New Performance Measure and Evaluation Benchmark for Road Detection Algorithms},<br \/> booktitle = {International Conference on Intelligent Transportation Systems (ITSC)},<br \/> year = {2013}<br \/>}<\/p>\n<p>For the <strong>stereo 2015<\/strong>, <strong>flow 2015,<\/strong> and <strong>scene flow 2015 benchmarks<\/strong>, please cite:<br \/>@inproceedings{<a href=\"https:\/\/www.cvlibs.net\/publications\/Menze2015CVPR.pdf\">Menze2015CVPR<\/a>,<br \/> author = {<a href=\"http:\/\/www.ipi.uni-hannover.de\/tmm.html\">Moritz Menze<\/a> and <a href=\"https:\/\/www.cvlibs.net\/\">Andreas Geiger<\/a>},<br \/> title = {Object Scene Flow for Autonomous Vehicles},<br \/> booktitle = {Conference on Computer Vision and Pattern Recognition (CVPR)},<br \/> year = {2015}<br \/>}<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/medium.com\/_\/stat?event=post.clientViewed&amp;referrerSource=full_rss&amp;postId=9c5f4bc8d497\" width=\"1\" height=\"1\" alt=\"\"><\/p>\n<hr>\n<p><a href=\"https:\/\/towardsdatascience.com\/sensor-fusion-kitti-lidar-based-obstacle-detection-part-1-9c5f4bc8d497\">Sensor Fusion\u200a\u2014\u200aKITTI\u200a\u2014\u200a\u2018Lidar-based Obstacle Detection\u2019\u200a\u2014\u200aPart-1<\/a> was originally published in <a href=\"https:\/\/towardsdatascience.com\/\">Towards Data Science<\/a> on Medium, where people are continuing the conversation by highlighting and responding to this story.<\/p>\n<\/div>\n<p> \t<BR><br \/>\n <BR><\/BR><br \/>\n    Erol \u00c7\u0131tak<br \/>\n \t<BR><br \/>\n<BR><\/BR><br \/>\n<a href=\"https:\/\/medium.com\/m\/global-identity-2?redirectUrl=https%3A%2F%2Ftowardsdatascience.com%2Fsensor-fusion-kitti-lidar-based-obstacle-detection-part-1-9c5f4bc8d497\">Go to original source<\/a><br \/>\n \t<BR><br \/>\n <BR><\/BR><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Sensor Fusion\u200a\u2014\u200aKITTI\u200a\u2014\u200a\u2018Lidar-based Obstacle Detection\u2019\u200a\u2014\u200aPart-1 Mastering Sensor Fusion: LiDAR Obstacle Detection with KITTI Data\u200a\u2014\u200aPart\u00a01 How to use Lidar data for obstacle detection with unsupervised learning Sensor fusion, multi-modal perception, autonomous vehicles\u200a\u2014\u200aif these keywords pique your interest, this Medium blog is for you. Join me as I explore the fascinating world of LiDAR and color image-based environment [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[62,221,166,1072,1073,70],"tags":[489,1075,1074],"class_list":["post-937","post","type-post","status-publish","format-standard","hentry","category-aimldsaimlds","category-computer-vision","category-hands-on-tutorials","category-image-processing","category-lidar","category-machine-learning","tag-detection","tag-kitti","tag-sensor"],"_links":{"self":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/937"}],"collection":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/comments?post=937"}],"version-history":[{"count":0,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/937\/revisions"}],"wp:attachment":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/media?parent=937"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/categories?post=937"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/tags?post=937"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}