{"id":7237,"date":"2025-09-30T07:02:29","date_gmt":"2025-09-30T07:02:29","guid":{"rendered":"https:\/\/mailitics.com\/index.php\/2025\/09\/30\/2509-22751\/"},"modified":"2025-09-30T07:02:29","modified_gmt":"2025-09-30T07:02:29","slug":"2509-22751","status":"publish","type":"post","link":"https:\/\/mailitics.com\/index.php\/2025\/09\/30\/2509-22751\/","title":{"rendered":"Variance-Bounded Evaluation without Ground Truth: VB-Score"},"content":{"rendered":"<p>    Variance-Bounded Evaluation without Ground Truth: VB-Score<br \/>\n \t<BR><br \/>\n<BR><\/BR><br \/>\n    <!-- no image --><br \/>\n \t<BR><br \/>\n<BR><\/BR><\/p>\n<div>arXiv:2509.22751v1 Announce Type: new<br \/>\nAbstract: Reliable evaluation is a central challenge in machine learning when tasks lack ground truth labels or involve ambiguity and noise. Conventional frameworks, rooted in the Cranfield paradigm and label-based metrics, fail in such cases because they cannot assess how robustly a system performs under uncertain interpretations. We introduce VB-Score, a variance-bounded evaluation framework that measures both effectiveness and robustness without requiring ground truth. Given a query or input, VB-Score enumerates plausible interpretations, assigns probabilities, and evaluates output by expected success penalized by variance, rewarding consistent performance across intents. We provide a formal analysis of VB-Score, establishing range, monotonicity, and stability properties, and relate it to risk-sensitive measures such as mean-variance utility. Experiments on ambiguous queries and entity-centric retrieval tasks show that VB-Score surfaces robustness differences hidden by conventional metrics. By enabling reproducible, label-free evaluation, VB-Score offers a principled foundation for benchmarking machine learning systems in ambiguous or label-scarce domains.<\/div>\n<p> \t<BR><br \/>\n <BR><\/BR><br \/>\n    Kaihua Ding<br \/>\n \t<BR><br \/>\n<BR><\/BR><br \/>\n<a href=\"https:\/\/arxiv.org\/abs\/2509.22751\">Go to original source<\/a><br \/>\n \t<BR><br \/>\n <BR><\/BR><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Variance-Bounded Evaluation without Ground Truth: VB-Score arXiv:2509.22751v1 Announce Type: new Abstract: Reliable evaluation is a central challenge in machine learning when tasks lack ground truth labels or involve ambiguity and noise. Conventional frameworks, rooted in the Cranfield paradigm and label-based metrics, fail in such cases because they cannot assess how robustly a system performs under [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[62,187,113,112],"tags":[856,659,3923],"class_list":["post-7237","post","type-post","status-publish","format-standard","hentry","category-aimldsaimlds","category-cs-ai","category-cs-lg","category-stat-ml","tag-score","tag-variance","tag-vb"],"_links":{"self":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/7237"}],"collection":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/comments?post=7237"}],"version-history":[{"count":0,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/7237\/revisions"}],"wp:attachment":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/media?parent=7237"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/categories?post=7237"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/tags?post=7237"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}