Category: computer-vision

YOLOv3 Paper Walkthrough: Even Better, But Not That Much

A PyTorch implementation of the YOLOv3 architecture from scratch. First appeared on Towards Data Science, by Muhammad Ardi.

March 3, 2026
The Proximity of the Inception Score as an Evaluation Criterion

The neighborhood of synthetic data. First appeared on Towards Data Science, by Giuseppe Pio Cannata.

February 4, 2026
I Ditched My Mouse: How I Control My Computer With Hand Gestures (In 60 Lines of Python)

A step-by-step guide to building a “Minority Report”-style interface using OpenCV and MediaPipe. First appeared on Towards Data Science.…

January 29, 2026
SAM 3 vs. Specialist Models — A Performance Benchmark

Why specialized models still hold the 30x speed advantage in production environments. First appeared on Towards Data Science, by Pushpak Bhoge.

January 26, 2026
From RGB to Lab: Addressing Color Artifacts in AI Image Compositing

A multi-tier approach to segmentation, color correction, and domain-specific enhancement. First appeared on Towards Data Science, by Eric Chung.

January 17, 2026
Glitches in the Attention Matrix

A history of Transformer artifacts and the latest research on how to fix them. First appeared on Towards Data Science, by Jonathan Williford.

January 15, 2026
Automatic Prompt Optimization for Multimodal Vision Agents: A Self-Driving Car Example

A walkthrough using open-source prompt optimization algorithms in Python to improve the accuracy of an autonomous vehicle safety agent running on OpenAI’s GPT 5.2. First appeared on Towards Data Science, by Vincent Koc.

January 12, 2026
How to Improve the Performance of Visual Anomaly Detection Models

Apply the best methods from academia to get the most out of practical applications. First appeared on Towards Data Science, by Aimira Baitieva.

January 9, 2026
Feature Detection, Part 3: Harris Corner Detection

Finding the most informative points in images. First appeared on Towards Data Science, by Vyacheslav Efimov.

January 6, 2026
How Deep Feature Embeddings and Euclidean Similarity Power Automatic Plant Leaf Recognition

Automatic plant leaf detection is a remarkable innovation in computer vision and machine learning, enabling the identification of plant species by examining a photograph of the leaves. Deep learning is applied to extract meaningful features from an image of leaves and convert…

November 19, 2025
Feature Detection, Part 2: Laplace & Gaussian Operators

Laplace meets Gaussian: the story of two operators in edge detection. First appeared on Towards Data Science, by Vyacheslav Efimov.

November 13, 2025
RF-DETR Under the Hood: The Insights of a Real-Time Transformer Detection

From rigid grids to adaptive attention, this is the evolutionary path that made detection transformers fast, flexible, and formidable. First appeared on Towards Data Science, by David Redó Nieto.

November 1, 2025
Feature Detection, Part 1: Image Derivatives, Gradients, and Sobel Operator

Applying calculus fundamentals to computer vision for edge detection. First appeared on Towards Data Science, by Vyacheslav Efimov.

October 17, 2025
Classical Computer Vision and Perspective Transformation for Sudoku Extraction

Why you shouldn’t overcomplicate solutions to simple problems. First appeared on Towards Data Science, by Florian Trautweiler.

October 6, 2025
MobileNetV2 Paper Walkthrough: The Smarter Tiny Giant

Understanding and implementing MobileNetV2 with PyTorch, the next generation of MobileNetV1. First appeared on Towards Data Science, by Muhammad Ardi.

October 4, 2025
The SyncNet Research Paper, Clearly Explained

A Deep Dive into “Out of Time: Automated Lip Sync in the Wild”. First appeared on Towards Data Science, by Aman Agrawal.

September 21, 2025
An Interactive Guide to 4 Fundamental Computer Vision Tasks Using Transformers

An overview of four fundamental computer vision tasks (image classification, image segmentation, image captioning, and visual question answering) with transformer models. Compare ViT, DETR, BLIP, and ViLT performance interactively through a practical Streamlit app implementation guide.

September 20, 2025
The Hungarian Algorithm and Its Applications in Computer Vision

Multi-object tracking (MOT) is a task in which an algorithm must detect and track multiple objects in a video. Most known algorithms are based on using simple detectors (e.g. YOLO) designed for processing individual images. The overall method involves separately using a detector on consecutive video…

September 10, 2025
A Refined Training Recipe for Fine-Grained Visual Classification

How FGVC aims to recognize images belonging to multiple subordinate categories of a super-category. First appeared on Towards Data Science, by Ahmed Belgacem.

August 13, 2025
The Channel-Wise Attention | Squeeze and Excitation

Applying the Squeeze and Excitation module on ResNeXt using PyTorch. First appeared on Towards Data Science, by Muhammad Ardi.

August 8, 2025
FastSAM for Image Segmentation Tasks — Explained Simply

Image segmentation is a popular task in computer vision, with the goal of partitioning an input image into multiple regions, where each region represents a separate object. Several classic approaches from the past involved taking a model backbone (e.g., U-Net) and fine-tuning it on specialized datasets. While…

August 1, 2025
How Do Grayscale Images Affect Visual Anomaly Detection?

A practical exploration focusing on performance and speed. First appeared on Towards Data Science, by Aimira Baitieva.

July 25, 2025
From Rules to Relationships: How Machines Are Learning to Understand Each Other

Using knowledge graphs to handle the unexpected in semantic communication. First appeared on Towards Data Science, by Shireesh Kumar Singh.

July 23, 2025
Gain a Better Understanding of Computer Vision: Dynamic SOLO (SOLOv2) with TensorFlow

A practical approach to instance segmentation using SOLOv2 and TensorFlow. First appeared on Towards Data Science, by Pavel Timonin.

July 19, 2025
Scene Understanding in Action: Real-World Validation of Multimodal AI Integration

A deep dive into real-world case studies: from indoor spaces and urban streets to world-famous landmarks. First appeared on Towards Data Science, by Eric Chung.

July 11, 2025
Interactive Data Exploration for Computer Vision Projects with Rerun

Analyse dynamic signals in a computer vision pipeline in Python using OpenCV and Rerun. First appeared on Towards Data Science, by Florian Trautweiler.

July 3, 2025
Computer Vision’s Annotation Bottleneck Is Finally Breaking

A Technical Deep Dive into Auto-Labeling. First appeared on Towards Data Science, by TDS Brand Studio.

June 19, 2025
Grad-CAM from Scratch with PyTorch Hooks

A hands-on look at an explainable AI (XAI) technique that helps reveal why a convolutional neural network (CNN) made a particular decision. First appeared on Towards Data Science, by Conor O’Sullivan.

June 17, 2025
Pairwise Cross-Variance Classification

Multi-class zero-shot embedding classification and error checking. First appeared on Towards Data Science, by Doster Esh.

June 4, 2025
Vision Transformer on a Budget

The vanilla ViT is problematic. If you take a look at the original ViT paper [1], you’ll notice that although this deep learning model proved to work extremely well, it requires hundreds of millions of labeled training images to achieve this. Well, that’s a lot. This requirement of an enormous…

June 3, 2025
From RGB to HSV — and Back Again

A fundamental concept in Computer Vision is understanding how images are stored and represented. On disk, image files are encoded in various ways, from lossy, compressed JPEG files to lossless PNG files. Once you load an image into a program and decode it from the respective…

May 8, 2025
The CNN That Challenges ViT

The invention of ViT (Vision Transformer) causes us to think that CNNs are obsolete. But is this really true? It is widely believed that the impressive performance of ViT comes primarily from its transformer-based architecture. However, researchers from Meta argued that it’s not entirely true. If we take a closer…

May 6, 2025
Diffusion Models, Explained Simply

Generative AI is one of the most popular terms we hear today. Recently, there has been a surge in generative AI applications involving text, image, audio, and video generation. When it comes to image creation, Diffusion models have emerged as a state-of-the-art technique for content generation. Although they were first introduced…

May 6, 2025
Modern GUI Applications for Computer Vision in Python

I’m a huge fan of interactive visualizations. As a computer vision engineer, I deal almost daily with image processing related tasks and more often than not I am iterating on a problem where I need visual feedback to make decisions. Let’s think of a very simple image…

May 1, 2025
The Basis of Cognitive Complexity: Teaching CNNs to See Connections

“Liberating education consists in acts of cognition, not transferrals of information.” (Paulo Freire) One of the most heated discussions around artificial intelligence is: What aspects of human learning is it capable of capturing? Many authors suggest that artificial intelligence models do not possess the same…

April 11, 2025
The Art of Noise

In my last several articles I talked about generative deep learning algorithms, which are mostly related to text generation tasks. So, I think it would be interesting to switch to generative algorithms for image generation now. We know that nowadays there are plenty of deep learning models specialized for…

April 3, 2025
The Art of Hybrid Architectures

In my previous article, I discussed how morphological feature extractors mimic the way biological experts visually assess images. This time, I want to go a step further and explore a new question: Can different architectures complement each other to build an AI that “sees” like an expert? Introduction: Rethinking Model Architecture…

March 29, 2025
Testing the Power of Multimodal AI Systems in Reading and Interpreting Photographs, Maps, Charts and More

It’s no news that artificial intelligence has made huge strides in recent years, particularly with the advent of multimodal models that can process and create both text and images, and some very new ones that also process and produce…

March 26, 2025
From Fuzzy to Precise: How a Morphological Feature Extractor Enhances AI’s Recognition Capabilities

Introduction: Can AI really distinguish dog breeds like human experts? One day while taking a walk, I saw a fluffy white puppy and wondered, Is that a Bichon Frise or a Maltese? No matter how closely I looked, they seemed almost identical.…

March 25, 2025
From Fuzzy to Precise: How a Morphological Feature Extractor Enhances AI’s Recognition Capabilities

Introduction: Can AI really distinguish dog breeds like human experts? One day while taking a walk, I saw a fluffy white puppy and wondered, Is that a Bichon Frise or a Maltese? No matter how closely I looked, they seemed almost identical.…

March 11, 2025
Custom Training Pipeline for Object Detection Models

What if you want to write the whole object detection training pipeline from scratch, so you can understand each step and be able to customize it? That’s what I set out to do. I examined several well-known object detection pipelines and designed one that best suits my needs…

March 8, 2025
On-Device Machine Learning in Spatial Computing

The landscape of computing is undergoing a profound transformation with the emergence of spatial computing platforms (VR and AR). As we step into this new era, the intersection of virtual reality, augmented reality, and on-device machine learning presents unprecedented opportunities for developers to create experiences that seamlessly blend digital content…

February 18, 2025
Roadmap to Becoming a Data Scientist, Part 4: Advanced Machine Learning

Data science is undoubtedly one of the most fascinating fields today. Following significant breakthroughs in machine learning about a decade ago, data science has surged in popularity within the tech community. Each year, we witness increasingly powerful tools that once seemed unimaginable. Innovations such as the Transformer…

February 15, 2025
Show and Tell

Natural Language Processing and Computer Vision used to be two completely different fields. Well, at least back when I started to learn machine learning and deep learning, I felt like there were multiple paths to follow, and each of them, including NLP and Computer Vision,…

February 4, 2025
Zero-Shot Player Tracking in Tennis with Kalman Filtering

Automated tennis tracking without labels: GroundingDINO, Kalman filtering, and court homography. With the recent surge in sports tracking projects, many inspired by Skalski’s popular soccer tracking project, there’s been a notable shift towards using automated player tracking for sport hobbyists. Most of these approaches follow a…

January 20, 2025
A 12-step visual guide to understanding NeRF (Representing Scenes as Neural Radiance Fields)

A basic understanding of NeRF’s workings through visual representations. Who should read this article? This article aims to provide a basic beginner level…

January 16, 2025
Predicting a Ball Trajectory

Polynomial Fit in Python with NumPy. Published on Towards Data Science, by Florian Trautweiler.

January 6, 2025
Mastering Sensor Fusion: Color Image Obstacle Detection with KITTI Data — Part 2

How to use color image data for object detection in the context of obstacle detection. The concept of sensor fusion is a decision-making mechanism that can be applied to different problems and using different…

January 3, 2025
Sensor Fusion — KITTI — ‘Lidar-based Obstacle Detection’ — Part-1

Mastering Sensor Fusion: LiDAR Obstacle Detection with KITTI Data, Part 1. How to use LiDAR data for obstacle detection with unsupervised learning. Sensor fusion, multi-modal perception, autonomous vehicles: if these keywords pique your interest, this Medium blog is for you. Join me as I explore the fascinating world of LiDAR and color image-based environment…

January 3, 2025
Segmenting Water in Satellite Images Using Paligemma

Some insights on using Google’s latest Vision Language Model. Multimodal models are architectures that simultaneously integrate and process different data types, such as text, images,…

December 30, 2024
Track Computer Vision Experiments with MLflow

Discover how to set up an efficient MLflow environment to track your experiments, compare and choose the best model for deployment. Published on Towards Data Science, by Yağmur Çiğdem Aktaş.

December 27, 2024
Conditional Variational Autoencoders for Text to Image Generation

Investigating an early generative architecture and applying it to image generation from text input. Recently I was tasked with text-to-image synthesis using a conditional variational autoencoder (CVAE). Being one of the earlier generative structures, it has its limitations but is easily implementable. This article will cover CVAEs at…

December 22, 2024
100 Years of (eXplainable) AI

Reflecting on advances and challenges in deep learning and explainability in the ever-evolving era of LLMs and AI governance. Imagine you are navigating a self-driving car, relying entirely on its onboard computer to make split-second decisions. It detects objects, identifies pedestrians, and can even anticipate behavior of…

December 19, 2024
CV VideoPlayer — Once and For All

A Python video player package made for computer vision research. When developing computer vision algorithms, the journey from concept to working implementation often involves countless iterations of watching, analyzing, and debugging video frames. As I dove deeper into computer vision projects, I found myself repeatedly…

December 13, 2024
Chat with Your Images using Multimodal LLMs

Learn how to build Llama 3.2-Vision locally in a chat-like mode, and explore its multimodal skills in a Colab notebook. The integration of vision capabilities with Large Language Models (LLMs) is revolutionizing…

December 6, 2024
Complete MLOPS Cycle for a Computer Vision Project

These days, we encounter (and maybe produce on our own) many computer vision projects, where AI is the hottest topic for new technologies… Published on Towards Data Science, by Yağmur Çiğdem Aktaş.

November 29, 2024