LLMs are Bayesian, in Expectation, not in Realization










arXiv:2507.11768v1
Abstract: Large language models demonstrate remarkable in-context learning capabilities, adapting to new tasks without parameter updates. While this phenomenon has been successfully modeled as implicit Bayesian inference, recent empirical findings reveal a fundamental contradiction: transformers systematically violate the martingale property, a cornerstone requirement of Bayesian updating on exchangeable data. This violation challenges the theoretical foundations underlying uncertainty quantification in critical applications.
Our theoretical analysis establishes four key results: (1) positional encodings induce martingale violations of order $\Theta(\log n / n)$; (2) transformers achieve information-theoretic optimality with excess risk $O(n^{-1/2})$ in expectation over orderings; (3) the implicit posterior representation converges to the true Bayesian posterior in the space of sufficient statistics; and (4) we derive the optimal chain-of-thought length as $k^* = \Theta(\sqrt{n}\log(1/\varepsilon))$ with explicit constants, providing a principled approach to reduce inference costs while maintaining performance. Empirical validation on GPT-3 confirms predictions (1)-(3), with transformers reaching 99% of theoretical entropy limits within 20 examples. Our framework provides practical methods for extracting calibrated uncertainty estimates from position-aware architectures and optimizing computational efficiency in deployment.
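
As a rough illustration of the martingale property the abstract refers to, the sketch below checks it for an exact Bayesian learner. The Beta-Bernoulli model and the function names (`predictive`, `martingale_gap`) are assumptions chosen for this example and do not come from the paper. For an exact Bayesian predictive on exchangeable data, the expected next-step predictive equals the current one, so the gap is zero; the abstract's claim is that transformers with positional encodings deviate from this by an amount of order $\Theta(\log n / n)$.

```python
from fractions import Fraction


def predictive(alpha, beta, ones, n):
    """Posterior predictive P(x_{n+1} = 1 | data) under a Beta(alpha, beta) prior.

    `ones` is the number of 1s among the n observations seen so far.
    Integer arguments keep the arithmetic exact.
    """
    return Fraction(alpha + ones, alpha + beta + n)


def martingale_gap(alpha, beta, ones, n):
    """E[predictive after n+1 observations | data] minus predictive after n.

    The expectation is taken over the next observation, drawn from the
    current predictive. An exact Bayesian learner gives a gap of zero.
    """
    p = predictive(alpha, beta, ones, n)
    expected_next = (
        p * predictive(alpha, beta, ones + 1, n + 1)
        + (1 - p) * predictive(alpha, beta, ones, n + 1)
    )
    return expected_next - p


if __name__ == "__main__":
    # Uniform prior, 3 ones observed in 5 draws (hypothetical data).
    print(predictive(1, 1, 3, 5))
    print(martingale_gap(1, 1, 3, 5))
```

A model whose predictions depend on the order of the context examples would generally yield a nonzero gap under the same check, which is the kind of violation the paper quantifies.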






Leon Chlon, Sarah Rashidi, Zein Khamis, MarcAntonio M. Awada




