{"id":9412,"date":"2025-12-30T07:02:39","date_gmt":"2025-12-30T07:02:39","guid":{"rendered":"https:\/\/mailitics.com\/index.php\/2025\/12\/30\/2512-22473\/"},"modified":"2025-12-30T07:02:39","modified_gmt":"2025-12-30T07:02:39","slug":"2512-22473","status":"publish","type":"post","link":"https:\/\/mailitics.com\/index.php\/2025\/12\/30\/2512-22473\/","title":{"rendered":"Gradient Dynamics of Attention: How Cross-Entropy Sculpts Bayesian Manifolds"},"content":{"rendered":"<p>    Gradient Dynamics of Attention: How Cross-Entropy Sculpts Bayesian Manifolds<br \/>\n \t<BR><br \/>\n<BR><\/BR><br \/>\n    <!-- no image --><br \/>\n \t<BR><br \/>\n<BR><\/BR><\/p>\n<div>arXiv:2512.22473v1 Announce Type: new<br \/>\nAbstract: Transformers empirically perform precise probabilistic reasoning in carefully constructed &#8220;Bayesian wind tunnels&#8221; and in large-scale language models, yet the mechanisms by which gradient-based learning creates the required internal geometry remain opaque. We provide a complete first-order analysis of how cross-entropy training reshapes attention scores and value vectors in a transformer attention head. Our core result is an <em>advantage-based routing law<\/em> for attention scores, \\[ \\frac{\\partial L}{\\partial s_{ij}} = \\alpha_{ij}\\bigl(b_{ij}-\\mathbb{E}_{\\alpha_i}[b]\\bigr), \\qquad b_{ij} := u_i^\\top v_j, \\] coupled with a <em>responsibility-weighted update<\/em> for values, \\[ \\Delta v_j = -\\eta\\sum_i \\alpha_{ij} u_i, \\] where $u_i$ is the upstream gradient at position $i$ and $\\alpha_{ij}$ are attention weights. These equations induce a positive feedback loop in which routing and content specialize together: queries route more strongly to values that are above-average for their error signal, and those values are pulled toward the queries that use them. 
We show that this coupled specialization behaves like a two-timescale EM procedure: attention weights implement an E-step (soft responsibilities), while values implement an M-step (responsibility-weighted prototype updates), with queries and keys adjusting the hypothesis frame. Through controlled simulations, including a sticky Markov-chain task where we compare a closed-form EM-style update to standard SGD, we demonstrate that the same gradient dynamics that minimize cross-entropy also sculpt the low-dimensional manifolds identified in our companion work as implementing Bayesian inference. This yields a unified picture in which optimization (gradient flow) gives rise to geometry (Bayesian manifolds), which in turn supports function (in-context probabilistic reasoning).<\/div>\n<p> \t<BR><br \/>\n <BR><\/BR><br \/>\n    Naman Aggarwal, Siddhartha R. Dalal, Vishal Misra<br \/>\n \t<BR><br \/>\n<BR><\/BR><br \/>\n<a href=\"https:\/\/arxiv.org\/abs\/2512.22473\">Go to original source<\/a><br \/>\n \t<BR><br \/>\n <BR><\/BR><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Gradient Dynamics of Attention: How Cross-Entropy Sculpts Bayesian Manifolds arXiv:2512.22473v1 Announce Type: new Abstract: Transformers empirically perform precise probabilistic reasoning in carefully constructed &#8220;Bayesian wind tunnels&#8221; and in large-scale language models, yet the mechanisms by which gradient-based learning creates the required internal geometry remain opaque. 
We provide a complete first-order analysis of how cross-entropy training [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[62,187,113,112],"tags":[960,379,4521],"class_list":["post-9412","post","type-post","status-publish","format-standard","hentry","category-aimldsaimlds","category-cs-ai","category-cs-lg","category-stat-ml","tag-attention","tag-gradient","tag-ij"],"_links":{"self":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/9412"}],"collection":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/comments?post=9412"}],"version-history":[{"count":0,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/9412\/revisions"}],"wp:attachment":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/media?parent=9412"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/categories?post=9412"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/tags?post=9412"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}