Category: Triton

Cutting LLM Memory by 84%: A Deep Dive into Fused Kernels

Cutting LLM Memory by 84%: A Deep Dive into Fused Kernels Why your final LLM layer is OOMing and how to fix it with a custom Triton kernel. The post Cutting LLM Memory by 84%: A Deep Dive into Fused Kernels appeared first on Towards Data Science. Ryan Pégoud Go to original source

January 17, 2026
Breaking the Hardware Barrier: Software FP8 for Older GPUs

Breaking the Hardware Barrier: Software FP8 for Older GPUs Deep learning workloads are increasingly memory-bound, with GPU cores sitting idle while waiting for data transfers. FP8 precision solves this on newer hardware, but what about the millions of RTX 30 and 20 series GPUs already deployed? Feather demonstrates that software-based FP8 emulation through bitwise packing…

December 29, 2025
Learning Triton One Kernel at a Time: Softmax

Learning Triton One Kernel at a Time: Softmax All you need to know about a fast, readable and PyTorch-ready softmax kernel The post Learning Triton One Kernel at a Time: Softmax appeared first on Towards Data Science. Ryan Pégoud Go to original source

November 24, 2025
Learning Triton One Kernel at a Time: Matrix Multiplication

Learning Triton One Kernel at a Time: Matrix Multiplication Tiled GEMM, GPU memory, coalescing, and much more! The post Learning Triton One Kernel at a Time: Matrix Multiplication appeared first on Towards Data Science. Ryan Pégoud Go to original source

October 15, 2025
Learning Triton One Kernel At a Time: Vector Addition

Learning Triton One Kernel At a Time: Vector Addition The basics of GPU programming, optimisation, and your first Triton kernel The post Learning Triton One Kernel At a Time: Vector Addition appeared first on Towards Data Science. Ryan Pégoud Go to original source

September 28, 2025