Category: pytorch
-
AI in Multiple GPUs: ZeRO & FSDP
Learn how Zero Redundancy Optimizer works, how to implement it from scratch, and how to use it in PyTorch. Posted on Towards Data Science by Lorenzo Cesconetto.
-
YOLOv3 Paper Walkthrough: Even Better, But Not That Much
A PyTorch implementation of the YOLOv3 architecture from scratch. Posted on Towards Data Science by Muhammad Ardi.
-
Optimizing Token Generation in PyTorch Decoder Models
Hiding host-device synchronization via CUDA stream interleaving. Posted on Towards Data Science by Chaim Rand.
-
AI in Multiple GPUs: Gradient Accumulation & Data Parallelism
Learn and implement gradient accumulation and data parallelism from scratch in PyTorch. Posted on Towards Data Science by Lorenzo Cesconetto.
-
AI in Multiple GPUs: Point-to-Point and Collective Operations
Learn PyTorch distributed operations for multi-GPU AI workloads. Posted on Towards Data Science by Lorenzo Cesconetto.
-
AI in Multiple GPUs: Understanding the Host and Device Paradigm
Learn how CPUs and GPUs interact in the host-device paradigm. Posted on Towards Data Science by Lorenzo Cesconetto.
-
YOLOv2 & YOLO9000 Paper Walkthrough: Better, Faster, Stronger
From YOLOv1 to YOLOv2: prior box, k-means, Darknet-19, passthrough layer, and more. Posted on Towards Data Science by Muhammad Ardi.
-
Optimizing Data Transfer in Distributed AI/ML Training Workloads
A deep dive on data transfer bottlenecks, their identification, and their resolution with the help of NVIDIA Nsight™ Systems – part 3. Posted on Towards Data Science by Chaim Rand.
-
Optimizing Data Transfer in Batched AI/ML Inference Workloads
A deep dive on data transfer bottlenecks, their identification, and their resolution with the help of NVIDIA Nsight™ Systems – part 2. Posted on Towards Data Science by Chaim Rand.
-
Optimizing Data Transfer in AI/ML Workloads
A deep dive on data transfer bottlenecks, their identification, and their resolution with the help of NVIDIA Nsight™ Systems. Posted on Towards Data Science by Chaim Rand.
-
Optimizing PyTorch Model Inference on AWS Graviton
Tips for accelerating AI/ML on CPU, Part 2. Posted on Towards Data Science by Chaim Rand.
-
Optimizing PyTorch Model Inference on CPU
Flyin’ Like a Lion on Intel Xeon. Posted on Towards Data Science by Chaim Rand.
-
On the Challenge of Converting TensorFlow Models to PyTorch
How to upgrade and optimize legacy AI/ML models. Posted on Towards Data Science by Chaim Rand.
-
Overcoming the Hidden Performance Traps of Variable-Shaped Tensors: Efficient Data Sampling in PyTorch
PyTorch Model Performance Analysis and Optimization, Part 11. Posted on Towards Data Science by Chaim Rand.
-
PyTorch Tutorial for Beginners: Build a Multiple Regression Model from Scratch
Hands-on PyTorch: Building a 3-layer neural network for multiple regression. Posted on Towards Data Science by Gustavo Santos.
-
MobileNetV3 Paper Walkthrough: The Tiny Giant Getting Even Smarter
MobileNetV3 with PyTorch, now featuring SE blocks and hard activation functions. Posted on Towards Data Science by Muhammad Ardi.
-
How to Classify Lung Cancer Subtype from DNA Copy Numbers Using PyTorch
A step-by-step introduction to understanding cancer from the perspective of a data scientist. Posted on Towards Data Science by Adam Streck.
-
How to Improve the Efficiency of Your PyTorch Training Loop
Learn how to diagnose and resolve bottlenecks in PyTorch using the num_workers and pin_memory parameters and the profiler to maximize training performance. Posted on Towards Data Science by Andrea D’Agostino.
-
Learning Triton One Kernel At a Time: Vector Addition
The basics of GPU programming, optimisation, and your first Triton kernel. Posted on Towards Data Science by Ryan Pégoud.
-
PyTorch Explained: From Automatic Differentiation to Training Custom Neural Networks
Deep learning is shaping our world as we speak. In fact, it has been slowly revolutionizing software since the early 2010s. In 2025, PyTorch is at the forefront of this revolution, emerging as one of the most important libraries to train neural networks. Whether you…
-
MobileNetV1 Paper Walkthrough: The Tiny Giant
Understanding and implementing MobileNetV1 from scratch with PyTorch. Posted on Towards Data Science by Muhammad Ardi.
-
Capturing and Deploying PyTorch Models with torch.export
A demonstration of PyTorch’s exciting new export feature on a HuggingFace model. Posted on Towards Data Science by Chaim Rand.
-
Maximizing AI/ML Model Performance with PyTorch Compilation
Since its inception in PyTorch 2.0 in March 2023, the evolution of torch.compile has been one of the most exciting things to follow. Given that PyTorch’s popularity was due to its “Pythonic” nature, its ease of use, and its line-by-line (a.k.a., eager) execution, the success of a just-in-time (JIT) graph…
-
The Channel-Wise Attention | Squeeze and Excitation
Applying the Squeeze and Excitation module on ResNeXt using PyTorch. Posted on Towards Data Science by Muhammad Ardi.
-
Torchvista: Building an Interactive Pytorch Visualization Package for Notebooks
Building a tool to interactively visualize the forward pass of any PyTorch model from within notebooks. Posted on Towards Data Science by Sachin Hosmani.
-
The Crucial Role of NUMA Awareness in High-Performance Deep Learning
PyTorch model performance analysis and optimization, Part 10. Posted on Towards Data Science by Chaim Rand.
-
How to Fine-Tune Small Language Models to Think with Reinforcement Learning
A visual tour and from-scratch guide to train GRPO reasoning models in PyTorch. Posted on Towards Data Science by Avishek Biswas.
-
A Caching Strategy for Identifying Bottlenecks on the Data Input Pipeline
PyTorch model performance analysis and optimization, Part 8. Posted on Towards Data Science by Chaim Rand.
-
Pipelining AI/ML Training Workloads with CUDA Streams
PyTorch Model Performance Analysis and Optimization, Part 9. Posted on Towards Data Science by Chaim Rand.
-
What PyTorch Really Means by a Leaf Tensor and Its Grad
The secret life of leaves, gradients, and the mighty requires_grad flag. Posted on Towards Data Science by Maciej J. Mikulski.
-
Use PyTorch to Easily Access Your GPU
Let’s say you are lucky enough to have access to a system with an Nvidia Graphics Processing Unit (GPU). Did you know there is an absurdly easy method to use your GPU’s capabilities using a Python library intended and predominantly used for machine learning (ML) applications? Don’t worry…
-
The Art of Noise
In my last several articles I talked about generative deep learning algorithms, which are mostly related to text generation tasks. So, I think it would be interesting to switch to generative algorithms for image generation now. We know that nowadays there are plenty of deep learning models specialized for…
-
The Case for Centralized AI Model Inference Serving
As AI models continue to increase in scope and accuracy, even tasks once dominated by traditional algorithms are gradually being replaced by Deep Learning models. Algorithmic pipelines (workflows that take an input, process it through a series of algorithms, and produce an output) increasingly rely…
-
Breaking the Bottleneck: GPU-Optimised Video Processing for Deep Learning
Deep Learning (DL) applications often require processing video data for tasks such as object detection, classification, and segmentation. However, conventional video processing pipelines are typically inefficient for deep learning inference, leading to performance bottlenecks. This post leverages PyTorch and FFmpeg with NVIDIA hardware acceleration…
-
Efficient Metric Collection in PyTorch: Avoiding the Performance Pitfalls of TorchMetrics
Metric collection is an essential part of every machine learning project, enabling us to track model performance and monitor training progress. Ideally, metrics should be collected and computed without introducing any additional overhead to the training process. However, just like other components of the…
-
Decoding the Hack behind Accurate Weather Forecasting: Variational Data Assimilation
Learn how to implement variational data assimilation, with the mathematical details and an efficient PyTorch implementation. Posted on Towards Data Science by Wencong Yang, PhD.
-
Efficient Large Dimensional Self-Organising Maps with PyTorch
Because it’s fun to self-organise. Posted on Towards Data Science by Mathieu d’Aquin.
-
Optimizing Transformer Models for Variable-Length Input Sequences
How PyTorch NestedTensors, FlashAttention2, and xFormers can Boost Performance and Reduce AI Costs. As generative AI (genAI) models grow in both popularity and scale, so do the computational demands and costs associated with their training and deployment. Optimizing these models is crucial for enhancing…