{"id":10090,"date":"2026-01-29T07:02:44","date_gmt":"2026-01-29T07:02:44","guid":{"rendered":"https:\/\/mailitics.com\/index.php\/2026\/01\/29\/2601-20251\/"},"modified":"2026-01-29T07:02:44","modified_gmt":"2026-01-29T07:02:44","slug":"2601-20251","status":"publish","type":"post","link":"https:\/\/mailitics.com\/index.php\/2026\/01\/29\/2601-20251\/","title":{"rendered":"Efficient Evaluation of LLM Performance with Statistical Guarantees"},"content":{"rendered":"<p>    Efficient Evaluation of LLM Performance with Statistical Guarantees<br \/>\n \t<BR><br \/>\n<BR><\/BR><br \/>\n    <!-- no image --><br \/>\n \t<BR><br \/>\n<BR><\/BR><\/p>\n<div>arXiv:2601.20251v1 Announce Type: new<br \/>\nAbstract: Exhaustively evaluating many large language models (LLMs) on a large suite of benchmarks is expensive. We cast benchmarking as finite-population inference and, under a fixed query budget, seek tight confidence intervals (CIs) for model accuracy with valid frequentist coverage. We propose Factorized Active Querying (FAQ), which (a) leverages historical information through a Bayesian factor model; (b) adaptively selects questions using a hybrid variance-reduction\/active-learning sampling policy; and (c) maintains validity through Proactive Active Inference &#8212; a finite-population extension of active inference (Zrnic &amp; Candes, 2024) that enables direct question selection while preserving coverage. With negligible overhead cost, FAQ delivers up to $5\\times$ effective sample size gains over strong baselines on two benchmark suites, across varying historical-data missingness levels: this means that it matches the CI width of uniform sampling while using up to $5\\times$ fewer queries. We release our source code and our curated datasets to support reproducible evaluation and future research.<\/div>\n<p> \t<BR><br \/>\n <BR><\/BR><br \/>\n    Skyler Wu, Yash Nair, Emmanuel J. 
Cand&#232;s<br \/>\n \t<BR><br \/>\n<BR><\/BR><br \/>\n<a href=\"https:\/\/arxiv.org\/abs\/2601.20251\">Go to original source<\/a><br \/>\n \t<BR><br \/>\n <BR><\/BR><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Efficient Evaluation of LLM Performance with Statistical Guarantees arXiv:2601.20251v1 Announce Type: new Abstract: Exhaustively evaluating many large language models (LLMs) on a large suite of benchmarks is expensive. We cast benchmarking as finite-population inference and, under a fixed query budget, seek tight confidence intervals (CIs) for model accuracy with valid frequentist coverage. We propose Factorized [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[62,113,112],"tags":[1095,768,193],"class_list":["post-10090","post","type-post","status-publish","format-standard","hentry","category-aimldsaimlds","category-cs-lg","category-stat-ml","tag-active","tag-evaluation","tag-inference"],"_links":{"self":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/10090"}],"collection":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/comments?post=10090"}],"version-history":[{"count":0,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/10090\/revisions"}],"wp:attachment":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/media?parent=10090"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/categories?post=10090"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-
json\/wp\/v2\/tags?post=10090"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}