Tag: evaluation

Efficient Evaluation of LLM Performance with Statistical Guarantees

arXiv:2601.20251v1. Exhaustively evaluating many large language models (LLMs) on a large suite of benchmarks is expensive. We cast benchmarking as finite-population inference and, under a fixed query budget, seek tight confidence intervals (CIs) for model accuracy with valid frequentist coverage. We propose Factorized…

January 29, 2026
Fitted Q Evaluation Without Bellman Completeness via Stationary Weighting

arXiv:2512.23805v1. Fitted Q-evaluation (FQE) is a central method for off-policy evaluation in reinforcement learning, but it generally requires Bellman completeness: that the hypothesis class is closed under the evaluation Bellman operator. This requirement is challenging because enlarging the hypothesis class can worsen…

January 1, 2026
Bayesian Evaluation of Large Language Model Behavior

arXiv:2511.10661v1. It is increasingly important to evaluate how text generation systems based on large language models (LLMs) behave, such as their tendency to produce harmful output or their sensitivity to adversarial inputs. Such evaluations often rely on a curated benchmark set of input prompts…

November 17, 2025
A Relative Error-Based Evaluation Framework of Heterogeneous Treatment Effect Estimators

arXiv:2510.16419v1. While significant progress has been made in heterogeneous treatment effect (HTE) estimation, the evaluation of HTE estimators remains underdeveloped. In this article, we propose a robust evaluation framework based on relative error, which quantifies performance differences between two HTE estimators.…

October 21, 2025
Notes on LLM Evaluation

A practical, step-by-step guide to building an evaluation pipeline for a real-world AI application. By Felipe Adachi, first published on Towards Data Science.

September 26, 2025
Evaluation-Driven Development for LLM-Powered Products: Lessons from Building in Healthcare

How metrics and monitoring combine with human expertise to build trustworthy AI in healthcare. By Robert Martin-Short, first published on Towards Data Science.

July 11, 2025
A Principled Path to Fitted Distributional Evaluation

arXiv:2506.20048v1. In reinforcement learning, distributional off-policy evaluation (OPE) focuses on estimating the return distribution of a target policy using offline data collected under a different policy. This work focuses on extending the widely used fitted-Q evaluation (developed for expectation-based reinforcement learning) to…

June 26, 2025
Cer-Eval: Certifiable and Cost-Efficient Evaluation Framework for LLMs

arXiv:2505.03814v1. As foundation models continue to scale, the size of trained models grows exponentially, presenting significant challenges for their evaluation. Current evaluation practices involve curating increasingly large datasets to assess the performance of large language models (LLMs). However, there is a lack of…

May 8, 2025
LLM Evaluations: from Prototype to Production

Evaluation is the cornerstone of any machine learning product. Investing in quality measurement delivers significant returns. Let’s explore the potential business benefits. As management consultant and writer Peter Drucker once said, “If you can’t measure it, you can’t improve it.” Building a robust evaluation system helps you identify areas…

April 26, 2025
Improving the evaluation of samplers on multi-modal targets

arXiv:2504.08916v1. Addressing multi-modality constitutes one of the major challenges of sampling. In this reflection paper, we advocate for a more systematic evaluation of samplers towards two sources of difficulty that are mode separation and dimension. For this, we propose a synthetic experimental setting…

April 15, 2025
Learnings from a Machine Learning Engineer — Part 3: The Evaluation

In this third part of my series, I will explore the evaluation process, a critical piece that leads to a cleaner data set and better model performance. We will see the difference between evaluation of a trained model (one not yet in…

February 14, 2025
Synthetic Data Generation with LLMs

Popularity of RAG: Over the past two years while working with financial firms, I’ve observed firsthand how they identify and prioritize Generative AI use cases, balancing complexity with potential value. Retrieval-Augmented Generation (RAG) often stands out as a foundational capability across many LLM-driven solutions, striking a balance between ease of implementation…

February 8, 2025
Evaluation-Driven Development for agentic applications using PydanticAI

An open-source, model-agnostic agentic framework that supports dependency injection. Ideally, you can evaluate agentic applications even as you are developing them, instead of evaluation being an afterthought. For this to work, though, you need to be able to mock both internal and external dependencies of the agent you…

December 22, 2024
How to Choose a Threshold for an Evaluation Metric for Large Language Models

arXiv:2412.12148v1. To ensure and monitor large language models (LLMs) reliably, various evaluation metrics have been proposed in the literature. However, there is little research on prescribing a methodology to identify a robust threshold on these metrics even though…

December 18, 2024