Tag: attention

Glitches in the Attention Matrix

A history of Transformer artifacts and the latest research on how to fix them. By Jonathan Williford, Towards Data Science.

January 15, 2026
Gradient Dynamics of Attention: How Cross-Entropy Sculpts Bayesian Manifolds

arXiv:2512.22473v1. Abstract: Transformers empirically perform precise probabilistic reasoning in carefully constructed “Bayesian wind tunnels” and in large-scale language models, yet the mechanisms by which gradient-based learning creates the required internal geometry remain opaque. We provide a complete first-order analysis of how cross-entropy training…

December 30, 2025
We Didn’t Invent Attention — We Just Rediscovered It

How selective amplification emerged across evolution, chemistry, and AI through convergent mathematical solutions. By Javier Marin, Towards Data Science.

November 6, 2025
Dimension-Free Minimax Rates for Learning Pairwise Interactions in Attention-Style Models

arXiv:2510.11789v1. Abstract: We study the convergence rate of learning pairwise interactions in single-layer attention-style models, where tokens interact through a weight matrix and a non-linear activation function. We prove that the minimax rate is $M^{-\frac{2\beta}{2\beta+1}}$ with $M$ being the sample size, depending…

October 15, 2025
Gaussian Equivalence for Self-Attention: Asymptotic Spectral Analysis of Attention Matrix

arXiv:2510.06685v1. Abstract: Self-attention layers have become fundamental building blocks of modern deep neural networks, yet their theoretical understanding remains limited, particularly from the perspective of random matrix theory. In this work, we provide a rigorous analysis of the singular value spectrum of…

October 9, 2025
A Framework for Non-Linear Attention via Modern Hopfield Networks

arXiv:2506.11043v1. Abstract: In this work we propose an energy functional along the lines of Modern Hopfield Networks (MHN), the stationary points of which correspond to the attention due to Vaswani et al. [12], thus unifying both frameworks. The minima of this landscape form…

June 16, 2025
Hands-On Attention Mechanism for Time Series Classification, with Python

This is how to use the attention mechanism in a time series classification framework. By Piero Paialunga, Towards Data Science.

May 31, 2025
How Private is Your Attention? Bridging Privacy with In-Context Learning

arXiv:2504.16000v1. Abstract: In-context learning (ICL), the ability of transformer-based models to perform new tasks from examples provided at inference time, has emerged as a hallmark of modern language models. While recent works have investigated the mechanisms underlying ICL, its feasibility under formal privacy constraints…

April 23, 2025
Kernel Case Study: Flash Attention

The attention mechanism is at the core of modern-day transformers. Scaling the context window of these transformers was a major challenge, and it still is, even in the era of context windows of a million tokens or more (Qwen 2.5 [1]). There are both considerable compute and memory…

April 4, 2025
A Simple Implementation of the Attention Mechanism from Scratch

The attention mechanism is often associated with the transformer architecture, but it was already used in RNNs. In Machine Translation or MT (e.g., English-Italian) tasks, when you want to predict the next Italian word, you need your model to focus, or pay attention, on the…

April 1, 2025
LNUCB-TA: Linear-nonlinear Hybrid Bandit Learning with Temporal Attention

arXiv:2503.00387v1. Abstract: Existing contextual multi-armed bandit (MAB) algorithms fail to effectively capture both long-term trends and local patterns across all arms, leading to suboptimal performance in environments with rapidly changing reward structures. They also rely on static exploration rates, which do not dynamically adjust…

March 4, 2025
Exponential Family Attention

arXiv:2501.16790v1. Abstract: The self-attention mechanism is the backbone of the transformer neural network underlying most large language models. It can capture complex word patterns and long-range dependencies in natural language. This paper introduces exponential family attention (EFA), a probabilistic generative model that extends self-attention to handle high-dimensional sequence, spatial,…

January 29, 2025
Static and Dynamic Attention: Implications for Graph Neural Networks

Examining the expressive capacity of Graph Attention Networks. In graph representation learning, neighborhood aggregation is one of the most thoroughly studied areas, and attention-based methods largely remain state-of-the-art within it. Leveraging learnable attention scores for weighted aggregations, graph attention networks exhibit higher expressivity…

January 15, 2025
Linearizing Llama

Speeding up Llama: A hybrid approach to attention mechanisms. In this article, we will see how to replace softmax self-attention in Llama-3.2-1B with hybrid attention combining softmax sliding window and linear attention. This implementation will help us better understand the growing interest in linear attention…

January 11, 2025
Linearizing Attention

Breaking the quadratic barrier: modern alternatives to softmax attention. Large language models are great, but they have a drawback: they use softmax attention, which can be computationally intensive. In this article, we will explore whether we can replace the softmax to achieve linear time complexity.

December 27, 2024