Tag: attention
-
Glitches in the Attention Matrix
A history of Transformer artifacts and the latest research on how to fix them. By Jonathan Williford, first published on Towards Data Science.
-
Gradient Dynamics of Attention: How Cross-Entropy Sculpts Bayesian Manifolds
arXiv:2512.22473v1. Abstract: Transformers empirically perform precise probabilistic reasoning in carefully constructed “Bayesian wind tunnels” and in large-scale language models, yet the mechanisms by which gradient-based learning creates the required internal geometry remain opaque. We provide a complete first-order analysis of how cross-entropy training…
-
We Didn’t Invent Attention — We Just Rediscovered It
How selective amplification emerged across evolution, chemistry, and AI through convergent mathematical solutions. By Javier Marin, first published on Towards Data Science.
-
Dimension-Free Minimax Rates for Learning Pairwise Interactions in Attention-Style Models
arXiv:2510.11789v1. Abstract: We study the convergence rate of learning pairwise interactions in single-layer attention-style models, where tokens interact through a weight matrix and a non-linear activation function. We prove that the minimax rate is $M^{-\frac{2\beta}{2\beta+1}}$ with $M$ being the sample size, depending…
-
Gaussian Equivalence for Self-Attention: Asymptotic Spectral Analysis of Attention Matrix
arXiv:2510.06685v1. Abstract: Self-attention layers have become fundamental building blocks of modern deep neural networks, yet their theoretical understanding remains limited, particularly from the perspective of random matrix theory. In this work, we provide a rigorous analysis of the singular value spectrum of…
-
A Framework for Non-Linear Attention via Modern Hopfield Networks
arXiv:2506.11043v1. Abstract: In this work we propose an energy functional along the lines of Modern Hopfield Networks (MNH), the stationary points of which correspond to the attention due to Vaswani et al. [12], thus unifying both frameworks. The minima of this landscape form…
-
Hands-On Attention Mechanism for Time Series Classification, with Python
This is how to use the attention mechanism in a time series classification framework. By Piero Paialunga, first published on Towards Data Science.
-
How Private is Your Attention? Bridging Privacy with In-Context Learning
arXiv:2504.16000v1. Abstract: In-context learning (ICL), the ability of transformer-based models to perform new tasks from examples provided at inference time, has emerged as a hallmark of modern language models. While recent works have investigated the mechanisms underlying ICL, its feasibility under formal privacy constraints…
-
Kernel Case Study: Flash Attention
The attention mechanism is at the core of modern-day transformers. Scaling the context window of these transformers was a major challenge, and it still is, even in the era of context windows of a million tokens or more (Qwen 2.5 [1]). There are both considerable compute and memory…
-
A Simple Implementation of the Attention Mechanism from Scratch
The Attention Mechanism is often associated with the transformer architecture, but it was already used in RNNs. In Machine Translation (MT) tasks, such as English-Italian, when you want to predict the next Italian word, you need your model to focus, or pay attention, on the…
-
LNUCB-TA: Linear-nonlinear Hybrid Bandit Learning with Temporal Attention
arXiv:2503.00387v1. Abstract: Existing contextual multi-armed bandit (MAB) algorithms fail to effectively capture both long-term trends and local patterns across all arms, leading to suboptimal performance in environments with rapidly changing reward structures. They also rely on static exploration rates, which do not dynamically adjust…
-
Exponential Family Attention
arXiv:2501.16790v1. Abstract: The self-attention mechanism is the backbone of the transformer neural network underlying most large language models. It can capture complex word patterns and long-range dependencies in natural language. This paper introduces exponential family attention (EFA), a probabilistic generative model that extends self-attention to handle high-dimensional sequence, spatial,…
-
Static and Dynamic Attention: Implications for Graph Neural Networks
Examining the expressive capacity of Graph Attention Networks. In graph representation learning, neighborhood aggregation is one of the most well-studied areas, and attention-based methods largely remain state-of-the-art within it. Leveraging learnable attention scores for weighted aggregations, graph attention networks exhibit higher expressivity…
-
Linearizing Llama
Speeding up Llama: A hybrid approach to attention mechanisms. In this article, we will see how to replace softmax self-attention in Llama-3.2-1B with hybrid attention that combines softmax sliding-window attention and linear attention. This implementation will help us better understand the growing interest in linear attention…
-
Linearizing Attention
Breaking the quadratic barrier: modern alternatives to softmax attention. Large Language Models are great, but they have a drawback: they use softmax attention, which can be computationally intensive. In this article we will explore whether the softmax can be replaced to achieve linear time complexity…