{"id":2290,"date":"2025-03-08T07:02:23","date_gmt":"2025-03-08T07:02:23","guid":{"rendered":"https:\/\/mailitics.com\/index.php\/2025\/03\/08\/custom-training-pipeline-for-object-detection-models\/"},"modified":"2025-03-08T07:02:23","modified_gmt":"2025-03-08T07:02:23","slug":"custom-training-pipeline-for-object-detection-models","status":"publish","type":"post","link":"https:\/\/mailitics.com\/index.php\/2025\/03\/08\/custom-training-pipeline-for-object-detection-models\/","title":{"rendered":"Custom Training Pipeline for Object Detection Models"},"content":{"rendered":"<div>\n<p class=\"wp-block-paragraph\">What if you want to write the whole object detection training pipeline from scratch, so you can understand each step and customize it? That\u2019s what I set out to do. I examined several well-known object detection pipelines and designed one that best suits my needs and tasks. I studied the <a href=\"https:\/\/github.com\/ultralytics\/ultralytics\">Ultralytics<\/a>, <a href=\"https:\/\/github.com\/Megvii-BaseDetection\/YOLOX\">YOLOX<\/a>, <a href=\"https:\/\/github.com\/tinyvision\/DAMO-YOLO\">DAMO-YOLO<\/a>, <a href=\"https:\/\/github.com\/lyuwenyu\/RT-DETR\">RT-DETR<\/a> and <a href=\"https:\/\/github.com\/Peterande\/D-FINE\">D-FINE<\/a> repos to gain a deeper understanding of various design details. 
I ended up implementing the <a href=\"https:\/\/paperswithcode.com\/sota\/real-time-object-detection-on-coco?p=d-fine-redefine-regression-task-in-detrs-as\">SoTA real-time object detection model D-FINE<\/a> in my custom pipeline.<\/p>\n<h2 class=\"wp-block-heading\">Plan<\/h2>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">Dataset, augmentations and transforms:\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">Mosaic (with affine transforms)<\/li>\n<li class=\"wp-block-list-item\">Mixup and Cutout<\/li>\n<li class=\"wp-block-list-item\">Other augmentations with bounding boxes<\/li>\n<li class=\"wp-block-list-item\">Letterbox vs simple resize<\/li>\n<\/ul>\n<\/li>\n<li class=\"wp-block-list-item\">Training:\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">Optimizer<\/li>\n<li class=\"wp-block-list-item\">Scheduler<\/li>\n<li class=\"wp-block-list-item\">EMA<\/li>\n<li class=\"wp-block-list-item\">Batch accumulation<\/li>\n<li class=\"wp-block-list-item\">AMP<\/li>\n<li class=\"wp-block-list-item\">Grad clipping<\/li>\n<li class=\"wp-block-list-item\">Logging<\/li>\n<\/ul>\n<\/li>\n<li class=\"wp-block-list-item\">Metrics:\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">mAPs from TorchMetrics \/ cocotools<\/li>\n<li class=\"wp-block-list-item\">How to compute Precision, Recall, IoU?<\/li>\n<\/ul>\n<\/li>\n<li class=\"wp-block-list-item\">Pick a suitable solution for your case<\/li>\n<li class=\"wp-block-list-item\">Experiments<\/li>\n<li class=\"wp-block-list-item\">Attention to data preprocessing<\/li>\n<li class=\"wp-block-list-item\">Where to start<\/li>\n<\/ul>\n<h2 class=\"wp-block-heading\">Dataset<\/h2>\n<p class=\"wp-block-paragraph\">Dataset processing is usually the first thing you start working on. With object detection, you need to load your images and annotations. Annotations are often stored in COCO format as a JSON file or in YOLO format, with a txt file for each image. 
Let\u2019s take a look at the YOLO format. Each line is structured as <code>class_id<\/code>, <code>x_center<\/code>, <code>y_center<\/code>, <code>width<\/code>, <code>height<\/code>, where the bbox values are normalized to the range 0 to 1.<\/p>\n<p class=\"wp-block-paragraph\">When you have your images and txt files, you can write your dataset class; nothing tricky here. Load everything, transform it (augmentations included) and return it during training. I prefer splitting the data by creating a CSV file for each split and then reading it in the Dataloader class rather than physically moving files into train\/val\/test folders. This is an example of a customization that helped my use case.<\/p>\n<h2 class=\"wp-block-heading\">Augmentations<\/h2>\n<p class=\"wp-block-paragraph\">Firstly, when augmenting images for object detection, it\u2019s crucial to apply the same transformations to the bounding boxes. To do that conveniently, I use the <a href=\"https:\/\/albumentations.ai\/\">Albumentations<\/a> library. 
For example:<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">\u00a0\u00a0\u00a0\u00a0def _init_augs(self, cfg) -&gt; None:\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0if self.keep_ratio:\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0resize = [\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0A.LongestMaxSize(max_size=max(self.target_h, self.target_w)),\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0A.PadIfNeeded(\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0min_height=self.target_h,\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0min_width=self.target_w,\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0border_mode=cv2.BORDER_CONSTANT,\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0fill=(114, 114, 114),\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0),\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0]\n\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0else:\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0resize = [A.Resize(self.target_h, self.target_w)]\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0norm = [\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0A.Normalize(mean=self.norm[0], std=self.norm[1]),\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0ToTensorV2(),\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0]\n\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0if self.mode == \"train\":\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0augs = 
[\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0A.RandomBrightnessContrast(p=cfg.train.augs.brightness),\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0A.RandomGamma(p=cfg.train.augs.gamma),\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0A.Blur(p=cfg.train.augs.blur),\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0A.GaussNoise(p=cfg.train.augs.noise, std_range=(0.1, 0.2)),\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0A.ToGray(p=cfg.train.augs.to_gray),\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0A.Affine(\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0rotate=[90, 90],\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0p=cfg.train.augs.rotate_90,\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0fit_output=True,\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0),\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0A.HorizontalFlip(p=cfg.train.augs.left_right_flip),\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0A.VerticalFlip(p=cfg.train.augs.up_down_flip),\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0]\n\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0self.transform = A.Compose(\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0augs + resize + 
norm,\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0bbox_params=A.BboxParams(format=\"pascal_voc\", label_fields=[\"class_labels\"]),\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0)\n\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0elif self.mode in [\"val\", \"test\", \"bench\"]:\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0self.mosaic_prob = 0\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0self.transform = A.Compose(\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0resize + norm,\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0bbox_params=A.BboxParams(format=\"pascal_voc\", label_fields=[\"class_labels\"]),\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0)<\/code><\/pre>\n<p class=\"wp-block-paragraph\">Secondly, there are a lot of interesting, non-trivial augmentations:<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">\n<a href=\"https:\/\/arxiv.org\/pdf\/2004.10934\">Mosaic<\/a>. The idea is simple: take several images (for example 4) and stack them together in a grid (2\u00d72). Then apply some affine transforms and feed the result to the model.<\/li>\n<li class=\"wp-block-list-item\">\n<a href=\"https:\/\/arxiv.org\/pdf\/1905.04899\">MixUp<\/a>. Originally used in image classification (it\u2019s surprising that it works). The idea is to take two images and overlay them with some degree of transparency. In classification, this usually means that if one image is blended in at 80% and the other at 20%, the model should predict 80% for the first image\u2019s class and 20% for the second\u2019s. In object detection we just get more objects in one image.<\/li>\n<li class=\"wp-block-list-item\">\n<strong>Cutout<\/strong>. 
Cutout involves removing parts of the image (by replacing them with black pixels) to help the model learn more robust features.<\/li>\n<\/ul>\n<p class=\"wp-block-paragraph\">I often see mosaic applied with probability 1.0 for the first ~90% of epochs. Then it\u2019s usually turned off, and lighter augmentations are used. The same idea applies to mixup, but I see it used a lot less: in the most popular detection framework, Ultralytics, it\u2019s turned off by default, and in another one I see P=0.15. Cutout seems to be used less frequently.<\/p>\n<p class=\"wp-block-paragraph\">You can read more about those augmentations in these two articles: <a href=\"https:\/\/arxiv.org\/pdf\/2004.10934\">1<\/a>, <a href=\"https:\/\/arxiv.org\/pdf\/1905.04899\">2<\/a>.<\/p>\n<p class=\"wp-block-paragraph\">Results from just turning on mosaic are pretty good: on a real dataset, the run without mosaic (the darker curve) reached mAP 0.89, versus 0.92 with it.<\/p>\n<figure class=\"wp-block-image size-full\"><img data-recalc-dims=\"1\" data-dominant-color=\"fbf9f9\" data-has-transparency=\"true\" style=\"--dominant-color: #fbf9f9;\" fetchpriority=\"high\" decoding=\"async\" width=\"796\" height=\"620\" src=\"https:\/\/i0.wp.com\/towardsdatascience.com\/wp-content\/uploads\/2025\/03\/mosaic.png?resize=796%2C620&#038;ssl=1\" alt=\"\" class=\"wp-image-599086 has-transparency\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/03\/mosaic.png 796w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/03\/mosaic-300x234.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/03\/mosaic-768x598.png 768w\" sizes=\"(max-width: 796px) 100vw, 796px\"><figcaption class=\"wp-element-caption\">Author\u2019s metrics on a custom dataset, logged in Wandb<\/figcaption><\/figure>\n<h2 class=\"wp-block-heading\">Letterbox or simple resize?<\/h2>\n<p class=\"wp-block-paragraph\">During training, you 
usually resize the input image to a square. Models often use 640\u00d7640 and are benchmarked on the COCO dataset. There are two main ways to get there:<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">Simple resize to a target size.<\/li>\n<li class=\"wp-block-list-item\">Letterbox: Resize the longest side to the target size (e.g., 640), preserving the aspect ratio, and pad the shorter side to reach the target dimensions.<\/li>\n<\/ul>\n<figure class=\"wp-block-image size-full\"><img data-recalc-dims=\"1\" loading=\"lazy\" data-dominant-color=\"5f6663\" data-has-transparency=\"false\" style=\"--dominant-color: #5f6663;\" decoding=\"async\" width=\"640\" height=\"640\" src=\"https:\/\/i0.wp.com\/towardsdatascience.com\/wp-content\/uploads\/2025\/03\/simple_resize.jpg?resize=640%2C640&#038;ssl=1\" alt=\"\" class=\"wp-image-599087 not-transparent\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/03\/simple_resize.jpg 640w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/03\/simple_resize-300x300.jpg 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/03\/simple_resize-150x150.jpg 150w\" sizes=\"(max-width: 640px) 100vw, 640px\"><figcaption class=\"wp-element-caption\">Sample from <a href=\"https:\/\/github.com\/VisDrone\/VisDrone-Dataset\">VisDrone<\/a> dataset with ground truth bounding boxes, preprocessed with a simple resize function<\/figcaption><\/figure>\n<figure class=\"wp-block-image size-full\"><img data-recalc-dims=\"1\" loading=\"lazy\" data-dominant-color=\"666b69\" data-has-transparency=\"false\" style=\"--dominant-color: #666b69;\" decoding=\"async\" width=\"640\" height=\"640\" src=\"https:\/\/i0.wp.com\/towardsdatascience.com\/wp-content\/uploads\/2025\/03\/letterbox.jpg?resize=640%2C640&#038;ssl=1\" alt=\"\" class=\"wp-image-599088 not-transparent\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/03\/letterbox.jpg 640w, 
https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/03\/letterbox-300x300.jpg 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/03\/letterbox-150x150.jpg 150w\" sizes=\"(max-width: 640px) 100vw, 640px\"><figcaption class=\"wp-element-caption\">Sample from <a href=\"https:\/\/github.com\/VisDrone\/VisDrone-Dataset\">VisDrone<\/a> dataset with ground truth bounding boxes, preprocessed with a letterbox<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">Both approaches have advantages and disadvantages. Let\u2019s discuss them first, and then I will share the results of numerous experiments I ran comparing these approaches.<\/p>\n<p class=\"wp-block-paragraph\">Simple resize:<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">Compute goes to the whole image, with no useless padding.<\/li>\n<li class=\"wp-block-list-item\">\u201cDynamic\u201d aspect ratio may act as a form of regularization.<\/li>\n<li class=\"wp-block-list-item\">Inference preprocessing perfectly matches training preprocessing (augmentations excluded).<\/li>\n<li class=\"wp-block-list-item\">Distorts the real geometry. Resize distortion could affect the spatial relationships in the image, although it might be a human bias to think that a fixed aspect ratio is important.<\/li>\n<\/ul>\n<p class=\"wp-block-paragraph\">Letterbox:<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">Preserves the real aspect ratio.<\/li>\n<li class=\"wp-block-list-item\">During inference, you can cut the padding and run on a non-square image, as long as you don\u2019t lose accuracy (some models can degrade).<\/li>\n<li class=\"wp-block-list-item\">You can train on a bigger image size, then run inference with the padding cut to get the same latency as with simple resize. For example 640\u00d7640 vs 832\u00d7480. 
The second one preserves the aspect ratio, and objects appear roughly the same size.<\/li>\n<\/ul>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">Part of the compute is wasted on gray padding.<\/li>\n<li class=\"wp-block-list-item\">Objects get smaller.<\/li>\n<\/ul>\n<p class=\"wp-block-paragraph\">How do you test it and decide which one to use?<\/p>\n<p class=\"wp-block-paragraph\">Train from scratch with these settings:<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">Simple resize, 640\u00d7640<\/li>\n<li class=\"wp-block-list-item\">Keep aspect ratio, max side 640, and add padding (as a baseline)<\/li>\n<li class=\"wp-block-list-item\">Keep aspect ratio, larger image size (for example max side 832), and add padding<\/li>\n<\/ul>\n<p class=\"wp-block-paragraph\">Then run inference with all 3 models. Where the aspect ratio is preserved, cut the padding during inference. Compare latency and metrics.<\/p>\n<p class=\"wp-block-paragraph\">Example of the same image from above with cut padding (640\u200a\u00d7\u200a384):<\/p>\n<figure class=\"wp-block-image size-full\"><img data-recalc-dims=\"1\" data-dominant-color=\"626565\" data-has-transparency=\"false\" style=\"--dominant-color: #626565;\" loading=\"lazy\" decoding=\"async\" width=\"640\" height=\"384\" src=\"https:\/\/i0.wp.com\/towardsdatascience.com\/wp-content\/uploads\/2025\/03\/cut_paddings.jpg?resize=640%2C384&#038;ssl=1\" alt=\"\" class=\"wp-image-599089 not-transparent\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/03\/cut_paddings.jpg 640w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/03\/cut_paddings-300x180.jpg 300w\" sizes=\"auto, (max-width: 640px) 100vw, 640px\"><figcaption class=\"wp-element-caption\">Sample from <a href=\"https:\/\/github.com\/VisDrone\/VisDrone-Dataset\">VisDrone<\/a> dataset<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">Here is what happens when you preserve the aspect ratio and run inference with the gray padding cut:<\/p>\n<pre 
class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">params             | F1 score | latency (ms) |\n-------------------+----------+--------------|\nratio kept, 832    |  0.633   |     33.5     |\nno ratio, 640x640  |  0.617   |     33.4     |<\/code><\/pre>\n<p class=\"wp-block-paragraph\">As shown, training with preserved aspect ratio at a larger size (832) achieved a higher F1 score (0.633) compared to a simple 640\u00d7640 resize (F1 score of 0.617), while the latency remained similar. Note that some models may degrade if the padding is removed during inference, which defeats the purpose of this trick and probably of the letterbox too.<\/p>\n<p class=\"wp-block-paragraph\">What does this mean?<\/p>\n<p class=\"wp-block-paragraph\">Training from scratch:<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">With the same image size, simple resize gets better accuracy than letterbox.<\/li>\n<li class=\"wp-block-list-item\">For letterbox, if you cut the padding during inference <strong>and your model doesn\u2019t lose accuracy<\/strong>, you can train and run inference with a bigger image size to match the latency and get slightly higher metrics (as in the example above).<\/li>\n<\/ul>\n<p class=\"wp-block-paragraph\">Training with pre-trained weights initialized:<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">If you finetune, use the same resizing strategy the pre-trained model was trained with. It should give you the best results if the datasets are not too different.<\/li>\n<\/ul>\n<p class=\"wp-block-paragraph\">For D-FINE I see lower metrics when cutting padding during inference. Also the model was pre-trained on a simple resize. 
For YOLO, a letterbox is typically a good choice.<\/p>\n<h2 class=\"wp-block-heading\">Training<\/h2>\n<p class=\"wp-block-paragraph\">Every ML engineer should know how to implement a training loop. Although PyTorch does much of the heavy lifting, you might still feel overwhelmed by the number of design choices available. Here are some key components to consider:<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">\n<a href=\"https:\/\/pytorch.org\/docs\/stable\/optim.html\">Optimizer<\/a> \u2013 start with Adam\/AdamW\/SGD.<\/li>\n<li class=\"wp-block-list-item\">\n<a href=\"https:\/\/pytorch.org\/docs\/stable\/optim.html#how-to-adjust-learning-rate\">Scheduler<\/a> \u2013 a fixed LR can be OK for Adam-family optimizers, but take a look at StepLR, CosineAnnealingLR or OneCycleLR.<\/li>\n<li class=\"wp-block-list-item\">\n<a href=\"https:\/\/arxiv.org\/abs\/1806.04498\">EMA<\/a>. This is a nice technique that makes training smoother and sometimes achieves higher metrics. After each batch, you update a secondary model (often called the EMA model) by computing an exponential moving average of the primary model\u2019s weights.<\/li>\n<li class=\"wp-block-list-item\">\n<a href=\"https:\/\/lightning.ai\/docs\/pytorch\/stable\/advanced\/training_tricks.html#accumulate-gradients\">Batch accumulation<\/a> is nice when your vRAM is very limited. When training a transformer-based object detection model, in some cases you can only fit 4 images into vRAM, even with a mid-sized model. By accumulating gradients over several batches before performing an optimizer step, you effectively simulate a larger batch size without exceeding your memory constraints. Another use case: when you have a lot of negatives (images without target objects) in your dataset and a small batch size, training can become unstable. 
Batch accumulation can also help here.<\/li>\n<li class=\"wp-block-list-item\">\n<a href=\"https:\/\/pytorch.org\/docs\/stable\/notes\/amp_examples.html\">AMP<\/a> uses half precision automatically where applicable. It reduces vRAM usage and makes training faster (if you have a GPU that supports it). I see 40% less vRAM usage and at least a 15% training speed increase.<\/li>\n<li class=\"wp-block-list-item\">\n<a href=\"https:\/\/pytorch.org\/docs\/stable\/generated\/torch.nn.utils.clip_grad_norm_.html\">Grad clipping<\/a>. Often, when you use AMP, training can become less stable. This can also happen with higher LRs. When your gradients are too big, training can fail. Gradient clipping rescales the gradients so that their overall norm never exceeds a chosen threshold.<\/li>\n<li class=\"wp-block-list-item\">Logging. Try <a href=\"https:\/\/hydra.cc\/docs\/intro\/\">Hydra<\/a> for configs and something like <a href=\"https:\/\/wandb.ai\/site\/\">Weights and Biases<\/a> or <a href=\"https:\/\/clear.ml\/\">ClearML<\/a> for experiment tracking. Also, log everything locally. 
Save your best weights and metrics so that, after numerous experiments, you can always find all the info you need on a model.<\/li>\n<\/ul>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">\u00a0\u00a0\u00a0\u00a0def train(self) -&gt; None:\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0best_metric = 0\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0cur_iter = 0\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0ema_iter = 0\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0one_epoch_time = None\n\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0def optimizer_step(step_scheduler: bool):\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\"\"\"\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0Clip grads, optimizer step, scheduler step, zero grad, EMA model update\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\"\"\"\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0nonlocal ema_iter\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0if self.amp_enabled:\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0if self.clip_max_norm:\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0# Unscale first so clipping is applied to the true gradient values\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0self.scaler.unscale_(self.optimizer)\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0torch.nn.utils.clip_grad_norm_(self.model.parameters(), self.clip_max_norm)\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0self.scaler.step(self.optimizer)\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0self.scaler.update()\n\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0else:\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0if self.clip_max_norm:\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0torch.nn.utils.clip_grad_norm_(self.model.parameters(), 
self.clip_max_norm)\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0self.optimizer.step()\n\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0if step_scheduler:\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0self.scheduler.step()\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0self.optimizer.zero_grad()\n\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0if self.ema_model:\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0ema_iter += 1\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0self.ema_model.update(ema_iter, self.model)\n\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0for epoch in range(1, self.epochs + 1):\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0epoch_start_time = time.time()\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0self.model.train()\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0self.loss_fn.train()\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0losses = []\n\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0with tqdm(self.train_loader, unit=\"batch\") as tepoch:\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0for batch_idx, (inputs, targets, _) in enumerate(tepoch):\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0tepoch.set_description(f\"Epoch {epoch}\/{self.epochs}\")\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0if inputs is 
None:\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0continue\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0cur_iter += 1\n\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0inputs = inputs.to(self.device)\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0targets = [\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0{\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0k: (v.to(self.device) if (v is not None and hasattr(v, \"to\")) else v)\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0for k, v in t.items()\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0}\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0for t in targets\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0]\n\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0lr = self.optimizer.param_groups[0][\"lr\"]\n\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0if 
self.amp_enabled:\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0with autocast(self.device, cache_enabled=True):\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0output = self.model(inputs, targets=targets)\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0with autocast(self.device, enabled=False):\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0loss_dict = self.loss_fn(output, targets)\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0loss = sum(loss_dict.values()) \/ self.b_accum_steps\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0self.scaler.scale(loss).backward()\n\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0else:\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0output = self.model(inputs, targets=targets)\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0loss_dict = self.loss_fn(output, targets)\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0loss = sum(loss_dict.values()) \/ 
self.b_accum_steps\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0loss.backward()\n\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0if (batch_idx + 1) % self.b_accum_steps == 0:\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0optimizer_step(step_scheduler=True)\n\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0losses.append(loss.item())\n\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0tepoch.set_postfix(\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0loss=np.mean(losses) * self.b_accum_steps,\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0eta=calculate_remaining_time(\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0one_epoch_time,\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0epoch_start_time,\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0epoch,\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0self.epochs,\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a
0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0cur_iter,\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0len(self.train_loader),\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0),\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0vram=f\"{get_vram_usage()}%\",\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0)\n\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0# Final update for any leftover gradients from an incomplete accumulation step\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0if (batch_idx + 1) % self.b_accum_steps != 0:\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0optimizer_step(step_scheduler=False)\n\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0wandb.log({\"lr\": lr, \"epoch\": epoch})\n\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0metrics = 
self.evaluate(\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0val_loader=self.val_loader,\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0conf_thresh=self.conf_thresh,\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0iou_thresh=self.iou_thresh,\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0path_to_save=None,\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0)\n\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0best_metric = self.save_model(metrics, best_metric)\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0save_metrics(\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0{}, metrics, np.mean(losses) * self.b_accum_steps, epoch, path_to_save=None\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0)\n\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0if (\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0epoch &gt;= self.epochs - self.no_mosaic_epochs\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0and self.train_loader.dataset.mosaic_prob\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0):\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0self.train_loader.dataset.close_mosaic()\n\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0if epoch == self.ignore_background_epochs:\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0self.train_loader.dataset.ignore_background = 
False\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0logger.info(\"Including background images\")\n\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0one_epoch_time = time.time() - epoch_start_time<\/code><\/pre>\n<h2 class=\"wp-block-heading\">Metrics<\/h2>\n<p class=\"wp-block-paragraph\">For object detection, mAP is the standard metric, and the way it is measured is already standardized. Use <a href=\"https:\/\/github.com\/cocodataset\/cocoapi\/tree\/master\/PythonAPI\/pycocotools\">pycocotools<\/a>, <a href=\"https:\/\/github.com\/MiXaiLL76\/faster_coco_eval\">faster-coco-eval<\/a> or <a href=\"https:\/\/lightning.ai\/docs\/torchmetrics\/stable\/\">TorchMetrics<\/a> to compute it. But mAP measures how good the model is overall, across all confidence levels. mAP0.5 means that the IoU threshold is 0.5 (a prediction with a lower IoU is counted as wrong). I personally don\u2019t fully like this metric, because in production we always use a single confidence threshold. So why not set the threshold first and then compute metrics? That\u2019s why I also always calculate confusion matrices and, based on them, Precision, Recall, F1-score, and IoU.<\/p>\n<p class=\"wp-block-paragraph\">The matching logic can also be tricky. Here is what I use:<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">1 GT (ground truth) object is matched to 1 predicted object, and it\u2019s a TP if IoU &gt; threshold. If there is no prediction for a GT object, it\u2019s an FN. If there is no GT for a prediction, it\u2019s an FP.<\/li>\n<li class=\"wp-block-list-item\">1 GT should be matched by a prediction only 1 time. If there are 2 predictions for 1 GT, then I count 1 TP and 1 FP.<\/li>\n<li class=\"wp-block-list-item\">Class ids should also match. 
If the model predicts class_0 but GT is class_1, that counts as FP += 1 and FN += 1.<\/li>\n<\/ul>\n<p class=\"wp-block-paragraph\">During training, I select the best model based on the metrics that are relevant to the task. I typically use the average of mAP50 and F1-score.<\/p>\n<h2 class=\"wp-block-heading\">Model and loss<\/h2>\n<p class=\"wp-block-paragraph\">I haven\u2019t discussed model architecture and loss function here. They usually go together, and you can choose any model you like and integrate it into your pipeline with everything described above. I did that with DAMO-YOLO and D-FINE, and the results were great.<\/p>\n<h3 class=\"wp-block-heading\">Pick a suitable solution for your case<\/h3>\n<p class=\"wp-block-paragraph\">Many people use Ultralytics; however, it is licensed under AGPL-3.0, so you can\u2019t use it in commercial projects unless your code is open source. So people often look into Apache 2.0 and MIT licensed models. Check out <a href=\"https:\/\/github.com\/Peterande\/D-FINE\">D-FINE<\/a>, <a href=\"https:\/\/github.com\/lyuwenyu\/RT-DETR\">RT-DETRv2<\/a> or some YOLO models like <a href=\"https:\/\/github.com\/MultimediaTechLab\/YOLO\">YOLOv9<\/a>.<\/p>\n<p class=\"wp-block-paragraph\">What if you want to customize something in the pipeline? When you build everything from scratch, you have full control. Otherwise, try choosing a project with a smaller codebase, as a large one can make it difficult to isolate and modify individual components.<\/p>\n<p class=\"wp-block-paragraph\">If you don\u2019t need anything custom and your usage is allowed by the Ultralytics license, it\u2019s a great repo to use: it supports multiple tasks (classification, detection, instance segmentation, keypoints, oriented bounding boxes), and its models are efficient and achieve good scores. 
To reiterate once more, you probably don\u2019t need a custom training pipeline if you are not doing very specific things.<\/p>\n<h2 class=\"wp-block-heading\">Experiments<\/h2>\n<p class=\"wp-block-paragraph\">Let me share some results I got with the D-FINE model in my custom training pipeline and compare them to the Ultralytics YOLO11 model on the <a href=\"https:\/\/paperswithcode.com\/dataset\/visdrone\">VisDrone-DET2019 dataset<\/a>.<\/p>\n<p class=\"wp-block-paragraph\">Trained from scratch:<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">model               | mAP 0.50 | F1-score | Latency (ms) |\n--------------------+----------+----------+--------------|\nYOLO11m TRT         |  0.417   |  0.568   |     15.6     |\nYOLO11m TRT dynamic |    -     |  0.568   |     13.3     |\nYOLO11m OV          |    -     |  0.568   |    122.4     |\nD-FINEm TRT         |  0.457   |  0.622   |     16.6     |\nD-FINEm OV          |  0.457   |  0.622   |    115.3     |<\/code><\/pre>\n<p class=\"wp-block-paragraph\">From COCO pre-trained:<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">model\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 |\u00a0   mAP 0.50   |\u00a0  F1-score\u00a0 |\n------------------+------------|-------------|\nYOLO11m \u00a0    \u00a0 | \u00a0 \u00a0 0.456 
\u00a0 \u00a0 |\u00a0 \u00a0 0.600 \u00a0  |\nD-FINEm \u00a0 \u00a0  \u00a0 | \u00a0 \u00a0 0.506 \u00a0 \u00a0 |\u00a0 \u00a0 0.649 \u00a0 \u00a0|<\/code><\/pre>\n<p class=\"wp-block-paragraph\">Latency was measured on an RTX 3060 with TensorRT (TRT) at a static image size of 640\u00d7640, including the time for\u00a0<code>cv2.imread<\/code>. OpenVINO (OV) latency was measured on an i5 14000f (no iGPU). Dynamic means that the gray padding is cut off during inference to speed it up. It worked with the YOLO11 TensorRT version. More details about cutting gray padding are given above (<em>Letterbox or simple resize<\/em> section).<\/p>\n<p class=\"wp-block-paragraph\">One disappointing result is the latency on an Intel N100 CPU with iGPU ($150 mini PC):<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">model   | Latency (ms) |\n--------+--------------|\nYOLO11m |     188      |\nD-FINEm |     272      |\nD-FINEs |      11      |<\/code><\/pre>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" data-dominant-color=\"54545e\" data-has-transparency=\"true\" style=\"--dominant-color: #54545e;\" loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"171\" src=\"https:\/\/i0.wp.com\/towardsdatascience.com\/wp-content\/uploads\/2025\/03\/iGPU-1024x171.png?resize=1024%2C171&#038;ssl=1\" alt=\"\" class=\"wp-image-599090 has-transparency\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/03\/iGPU-1024x171.png 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/03\/iGPU-300x50.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/03\/iGPU-768x128.png 768w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/03\/iGPU-1536x256.png 1536w, 
https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/03\/iGPU.png 2016w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\"><figcaption class=\"wp-element-caption\">Author\u2019s screenshot of iGPU usage from n100 machine during model inference<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">Here, traditional convolutional neural networks are noticeably faster, maybe because of optimizations in OpenVINO for GPUs.<\/p>\n<p class=\"wp-block-paragraph\">Overall, I conducted over 30 experiments with different datasets (including real-world datasets), models, and parameters and I can say that D-FINE gets better metrics. And it makes sense, as on COCO, it is also higher than all YOLO models.\u00a0<\/p>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" data-dominant-color=\"f6f4f4\" data-has-transparency=\"true\" style=\"--dominant-color: #f6f4f4;\" loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"848\" src=\"https:\/\/i0.wp.com\/towardsdatascience.com\/wp-content\/uploads\/2025\/03\/d_fine-1024x848.png?resize=1024%2C848&#038;ssl=1\" alt=\"\" class=\"wp-image-599091 has-transparency\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/03\/d_fine-1024x848.png 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/03\/d_fine-300x249.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/03\/d_fine-768x636.png 768w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/03\/d_fine.png 1050w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\"><figcaption class=\"wp-element-caption\"><a href=\"https:\/\/arxiv.org\/pdf\/2410.13842\">D-FINE paper<\/a> comparison to other object detection models<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">VisDrone experiments:\u00a0<\/p>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" data-dominant-color=\"fbfcfb\" data-has-transparency=\"true\" style=\"--dominant-color: #fbfcfb;\" loading=\"lazy\" 
decoding=\"async\" width=\"1024\" height=\"527\" src=\"https:\/\/i0.wp.com\/towardsdatascience.com\/wp-content\/uploads\/2025\/03\/D-FINE_train-1024x527.png?resize=1024%2C527&#038;ssl=1\" alt=\"\" class=\"wp-image-599092 has-transparency\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/03\/D-FINE_train-1024x527.png 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/03\/D-FINE_train-300x154.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/03\/D-FINE_train-768x395.png 768w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/03\/D-FINE_train-1536x791.png 1536w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/03\/D-FINE_train-2048x1055.png 2048w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\"><figcaption class=\"wp-element-caption\">Author\u2019s metrics logged in WandB, D-FINE model<\/figcaption><\/figure>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" data-dominant-color=\"fbfafa\" data-has-transparency=\"true\" style=\"--dominant-color: #fbfafa;\" loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"549\" src=\"https:\/\/i0.wp.com\/towardsdatascience.com\/wp-content\/uploads\/2025\/03\/Ultr_train-1024x549.png?resize=1024%2C549&#038;ssl=1\" alt=\"\" class=\"wp-image-599093 has-transparency\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/03\/Ultr_train-1024x549.png 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/03\/Ultr_train-300x161.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/03\/Ultr_train-768x411.png 768w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/03\/Ultr_train-1536x823.png 1536w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/03\/Ultr_train-2048x1097.png 2048w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\"><figcaption class=\"wp-element-caption\">Author\u2019s metrics logged in WandB, YOLO11 model<\/figcaption><\/figure>\n<p 
class=\"wp-block-paragraph\">Example of D-FINE model predictions (green \u2013 GT, blue \u2013 pred):\u00a0<\/p>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" data-dominant-color=\"5a6866\" data-has-transparency=\"false\" style=\"--dominant-color: #5a6866;\" loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"576\" src=\"https:\/\/i0.wp.com\/towardsdatascience.com\/wp-content\/uploads\/2025\/03\/trt_infer-1024x576.jpg?resize=1024%2C576&#038;ssl=1\" alt=\"\" class=\"wp-image-599094 not-transparent\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/03\/trt_infer-1024x576.jpg 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/03\/trt_infer-300x169.jpg 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/03\/trt_infer-768x432.jpg 768w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/03\/trt_infer-1536x864.jpg 1536w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/03\/trt_infer.jpg 1920w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\"><figcaption class=\"wp-element-caption\">Sample from <a href=\"https:\/\/github.com\/VisDrone\/VisDrone-Dataset\">VisDrone<\/a> dataset<\/figcaption><\/figure>\n<h2 class=\"wp-block-heading\">Final results<\/h2>\n<p class=\"wp-block-paragraph\">Knowing all the details, let\u2019s see a final comparison with the best settings for both models on i12400F and RTX 3060 with the VisDrone dataset:<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">model \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0  \u00a0 | \u00a0 F1-score\u00a0   |   Latency (ms)    |\n-----------------------------------+---------------+-------------------|\nYOLO11m TRT dynamic \u00a0              |\u00a0 \u00a0 \u00a0 0.600\u00a0 \u00a0 |\u00a0 \u00a0 \u00a0 \u00a0 13.3 \u00a0  \u00a0  |\nYOLO11m OV \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0       \u00a0 \u00a0 |\u00a0 \u00a0 \u00a0 
0.600\u00a0 \u00a0 | \u00a0 \u00a0 \u00a0 122.4\u00a0  \u00a0 \u00a0 |\nD-FINEs TRT\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0       \u00a0 |\u00a0 \u00a0 \u00a0 0.629\u00a0 \u00a0 |\u00a0 \u00a0 \u00a0 \u00a0 12.3 \u00a0   \u00a0 |\nD-FINEs OV\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0   \u00a0  \u00a0 |\u00a0 \u00a0 \u00a0 0.629\u00a0 \u00a0 |\u00a0 \u00a0 \u00a0 \u00a0 57.4 \u00a0 \u00a0 \u00a0 |<\/code><\/pre>\n<p class=\"wp-block-paragraph\">As shown above, I was able to use a smaller D-FINE model and achieve both faster inference and higher accuracy than YOLO11. Beating Ultralytics, the most widely used real-time object detection model, in both speed and accuracy is quite an accomplishment, isn\u2019t it? The same pattern is observed across several other real-world datasets.<\/p>\n<p class=\"wp-block-paragraph\">I also tried out YOLOv12, which came out while I was writing this article. It performed similarly to YOLO11 and even achieved slightly lower metrics (mAP 0.452 vs 0.456 for YOLO11). It appears that YOLO models have been hitting a wall for the last couple of years. D-FINE was a great update for object detection models.<\/p>\n<p class=\"wp-block-paragraph\">Finally, let\u2019s look at the visual difference between YOLO11m and D-FINEs. 
YOLO11m, conf 0.25, nms iou 0.5, latency 13.3ms:\u00a0<\/p>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" data-dominant-color=\"61605d\" data-has-transparency=\"false\" style=\"--dominant-color: #61605d;\" loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"576\" src=\"https:\/\/i0.wp.com\/towardsdatascience.com\/wp-content\/uploads\/2025\/03\/infer_high_yolo11m-1-1024x576.jpg?resize=1024%2C576&#038;ssl=1\" alt=\"\" class=\"wp-image-599096 not-transparent\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/03\/infer_high_yolo11m-1-1024x576.jpg 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/03\/infer_high_yolo11m-1-300x169.jpg 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/03\/infer_high_yolo11m-1-768x432.jpg 768w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/03\/infer_high_yolo11m-1.jpg 1400w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\"><figcaption class=\"wp-element-caption\">Sample from <a href=\"https:\/\/github.com\/VisDrone\/VisDrone-Dataset\">VisDrone<\/a> dataset<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">D-FINEs, conf 0.5, no nms, latency 12.3ms:\u00a0<\/p>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" data-dominant-color=\"61605d\" data-has-transparency=\"false\" style=\"--dominant-color: #61605d;\" loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"576\" src=\"https:\/\/i0.wp.com\/towardsdatascience.com\/wp-content\/uploads\/2025\/03\/infer_high_d_fine_s-1024x576.jpg?resize=1024%2C576&#038;ssl=1\" alt=\"\" class=\"wp-image-599097 not-transparent\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/03\/infer_high_d_fine_s-1024x576.jpg 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/03\/infer_high_d_fine_s-300x169.jpg 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/03\/infer_high_d_fine_s-768x432.jpg 768w, 
https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/03\/infer_high_d_fine_s.jpg 1400w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\"><figcaption class=\"wp-element-caption\">Sample from <a href=\"https:\/\/github.com\/VisDrone\/VisDrone-Dataset\">VisDrone<\/a> dataset<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">Both Precision and Recall are higher with the D-FINE model. And it\u2019s also faster. Here is also the \u201cm\u201d version of D-FINE:\u00a0<\/p>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" data-dominant-color=\"61605d\" data-has-transparency=\"false\" style=\"--dominant-color: #61605d;\" loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"576\" src=\"https:\/\/i0.wp.com\/towardsdatascience.com\/wp-content\/uploads\/2025\/03\/infer_high_d_fine_m-1-1024x576.jpg?resize=1024%2C576&#038;ssl=1\" alt=\"\" class=\"wp-image-599095 not-transparent\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/03\/infer_high_d_fine_m-1-1024x576.jpg 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/03\/infer_high_d_fine_m-1-300x169.jpg 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/03\/infer_high_d_fine_m-1-768x432.jpg 768w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/03\/infer_high_d_fine_m-1.jpg 1400w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\"><figcaption class=\"wp-element-caption\">Sample from <a href=\"https:\/\/github.com\/VisDrone\/VisDrone-Dataset\">VisDrone<\/a> dataset<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">Isn\u2019t it crazy that even that one car on the left was detected?<\/p>\n<h2 class=\"wp-block-heading\">Attention to data preprocessing<\/h2>\n<p class=\"wp-block-paragraph\">This part goes a little beyond the scope of the article, but I want to at least mention it briefly, as some parts can be automated and used in the pipeline. 
What I definitely see as a <a href=\"https:\/\/towardsdatascience.com\/tag\/computer-vision\/\" title=\"Computer Vision\">Computer Vision<\/a> engineer is that when engineers don\u2019t spend time working with the data, they don\u2019t get good models. You can have all the SoTA models and do everything else right, but garbage in, garbage out. So I always pay a ton of attention to how to approach the task and how to gather, filter, validate, and annotate the data. Don\u2019t assume that the annotation team will do everything right. Get your hands dirty and manually check a portion of the dataset to be sure that the annotations are good and the collected images are representative.<\/p>\n<p class=\"wp-block-paragraph\">Several quick ideas to look into:<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">Remove duplicates and near-duplicates from the val\/test sets. The model should not be validated on the same sample twice, and you definitely don\u2019t want a data leak caused by two identical images ending up in the training and validation sets.<\/li>\n<li class=\"wp-block-list-item\">Check how small your objects can be. Anything not visible to your eye should not be annotated. Also, remember that augmentations can make objects appear even smaller (for example, mosaic or zoom out). Configure these augmentations accordingly so you don\u2019t end up with unusably small objects in the image.<\/li>\n<li class=\"wp-block-list-item\">When you already have a model for a certain task and need more data, try using your model to pre-annotate new images. Check the cases where the model fails and gather more similar cases.<\/li>\n<\/ul>\n<h2 class=\"wp-block-heading\">Where to start<\/h2>\n<p class=\"wp-block-paragraph\">I worked a lot on this pipeline, and I am ready to share it with everyone who wants to try it out. 
It uses the SoTA D-FINE model under the hood and adds some features that were absent in the original repo (mosaic augmentations, batch accumulation, scheduler, more metrics, visualization of preprocessed images and eval predictions, exporting and inference code, better logging, unified and simplified configuration file).<\/p>\n<p class=\"wp-block-paragraph\">Here is the link to <a href=\"https:\/\/github.com\/ArgoHA\/custom_d_fine\">my repo<\/a>. Here is the <a href=\"https:\/\/github.com\/Peterande\/D-FINE\">original D-FINE repo<\/a>, to which I also contribute. If you need any help, please contact me on <a href=\"https:\/\/linkedin.com\/in\/argo-saakyan\/\">LinkedIn<\/a>. Thank you for your time!<\/p>\n<h2 class=\"wp-block-heading\">Citations and acknowledgments<\/h2>\n<p class=\"wp-block-paragraph\"><a href=\"https:\/\/github.com\/VisDrone\/VisDrone-Dataset\">VisDrone<\/a><\/p>\n<div class=\"wp-block-group has-global-padding is-layout-constrained wp-block-group-is-layout-constrained\">\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-latex\">@article{zhu2021detection,\n\u00a0\u00a0title={Detection and tracking meet drones challenge},\n\u00a0\u00a0author={Zhu, Pengfei and Wen, Longyin and Du, Dawei and Bian, Xiao and Fan, Heng and Hu, Qinghua and Ling, Haibin},\n\u00a0\u00a0journal={IEEE Transactions on Pattern Analysis and Machine Intelligence},\n\u00a0\u00a0volume={44},\n\u00a0\u00a0number={11},\n\u00a0\u00a0pages={7380--7399},\n\u00a0\u00a0year={2021},\n\u00a0\u00a0publisher={IEEE}\n}<\/code><\/pre>\n<\/div>\n<p class=\"wp-block-paragraph\"><a href=\"https:\/\/arxiv.org\/abs\/2410.13842\">D-FINE<\/a><\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-latex\">@misc{peng2024dfine,\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0title={D-FINE: Redefine Regression Task in DETRs as Fine-grained Distribution Refinement},\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0author={Yansong Peng and Hebei Li and Peixi Wu and Yueyi Zhang and Xiaoyan Sun and Feng 
Wu},\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0year={2024},\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0eprint={2410.13842},\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0archivePrefix={arXiv},\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0primaryClass={cs.CV}\n}<\/code><\/pre>\n<p>The post <a href=\"https:\/\/towardsdatascience.com\/custom-training-pipeline-for-object-detection-models\/\">Custom Training Pipeline for Object Detection Models<\/a> appeared first on <a href=\"https:\/\/towardsdatascience.com\/\">Towards Data Science<\/a>.<\/p>\n<\/div>\n<p> \t<BR><br \/>\n <BR><\/BR><br \/>\n    Argo Saakyan<br \/>\n \t<BR><br \/>\n<BR><\/BR><br \/>\n<a href=\"https:\/\/towardsdatascience.com\/custom-training-pipeline-for-object-detection-models\/\">Go to original source<\/a><br \/>\n \t<BR><br \/>\n <BR><\/BR><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Custom Training Pipeline for Object Detection Models What if you want to write the whole object detection training pipeline from scratch, so you can understand each step and be able to customize it? That\u2019s what I set out to do. 
I examined several well-known object detection pipelines and designed one that best suits my needs [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[62,221,81,166,70,1498,301],"tags":[489,1972,508],"class_list":["post-2290","post","type-post","status-publish","format-standard","hentry","category-aimldsaimlds","category-computer-vision","category-data-preprocessing","category-hands-on-tutorials","category-machine-learning","category-model-training","category-object-detection","tag-detection","tag-object","tag-self"],"_links":{"self":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/2290"}],"collection":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/comments?post=2290"}],"version-history":[{"count":0,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/2290\/revisions"}],"wp:attachment":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/media?parent=2290"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/categories?post=2290"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/tags?post=2290"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}