Tag: evaluation
-
Efficient Evaluation of LLM Performance with Statistical Guarantees
arXiv:2601.20251v1 (Announce Type: new). Abstract: Exhaustively evaluating many large language models (LLMs) on a large suite of benchmarks is expensive. We cast benchmarking as finite-population inference and, under a fixed query budget, seek tight confidence intervals (CIs) for model accuracy with valid frequentist coverage. We propose Factorized…
-
Fitted Q Evaluation Without Bellman Completeness via Stationary Weighting
arXiv:2512.23805v1 (Announce Type: new). Abstract: Fitted Q-evaluation (FQE) is a central method for off-policy evaluation in reinforcement learning, but it generally requires Bellman completeness: that the hypothesis class is closed under the evaluation Bellman operator. This requirement is challenging because enlarging the hypothesis class can worsen…
-
Bayesian Evaluation of Large Language Model Behavior
arXiv:2511.10661v1 (Announce Type: cross). Abstract: It is increasingly important to evaluate how text generation systems based on large language models (LLMs) behave, such as their tendency to produce harmful output or their sensitivity to adversarial inputs. Such evaluations often rely on a curated benchmark set of input prompts…
-
A Relative Error-Based Evaluation Framework of Heterogeneous Treatment Effect Estimators
arXiv:2510.16419v1 (Announce Type: new). Abstract: While significant progress has been made in heterogeneous treatment effect (HTE) estimation, the evaluation of HTE estimators remains underdeveloped. In this article, we propose a robust evaluation framework based on relative error, which quantifies performance differences between two HTE estimators.…
-
Notes on LLM Evaluation
A practical, step-by-step guide to building an evaluation pipeline for a real-world AI application. The post Notes on LLM Evaluation appeared first on Towards Data Science. By Felipe Adachi.
-
Evaluation-Driven Development for LLM-Powered Products: Lessons from Building in Healthcare
How metrics and monitoring combine with human expertise to build trustworthy AI in healthcare. The post Evaluation-Driven Development for LLM-Powered Products: Lessons from Building in Healthcare appeared first on Towards Data Science. By Robert Martin-Short.
-
A Principled Path to Fitted Distributional Evaluation
arXiv:2506.20048v1 (Announce Type: new). Abstract: In reinforcement learning, distributional off-policy evaluation (OPE) focuses on estimating the return distribution of a target policy using offline data collected under a different policy. This work focuses on extending the widely used fitted-Q evaluation, developed for expectation-based reinforcement learning, to…
-
Cer-Eval: Certifiable and Cost-Efficient Evaluation Framework for LLMs
arXiv:2505.03814v1 (Announce Type: new). Abstract: As foundation models continue to scale, the size of trained models grows exponentially, presenting significant challenges for their evaluation. Current evaluation practices involve curating increasingly large datasets to assess the performance of large language models (LLMs). However, there is a lack of…
-
LLM Evaluations: from Prototype to Production
Evaluation is the cornerstone of any machine learning product. Investing in quality measurement delivers significant returns. Let’s explore the potential business benefits. As management consultant and writer Peter Drucker once said, “If you can’t measure it, you can’t improve it.” Building a robust evaluation system helps you identify areas…
-
Improving the evaluation of samplers on multi-modal targets
arXiv:2504.08916v1 (Announce Type: new). Abstract: Addressing multi-modality constitutes one of the major challenges of sampling. In this reflection paper, we advocate for a more systematic evaluation of samplers towards two sources of difficulty that are mode separation and dimension. For this, we propose a synthetic experimental setting…
-
Learnings from a Machine Learning Engineer — Part 3: The Evaluation
In this third part of my series, I will explore the evaluation process, a critical piece that leads to a cleaner data set and elevates your model performance. We will see the difference between evaluation of a trained model (one not yet in…
-
Synthetic Data Generation with LLMs
Popularity of RAG: Over the past two years while working with financial firms, I’ve observed firsthand how they identify and prioritize Generative AI use cases, balancing complexity with potential value. Retrieval-Augmented Generation (RAG) often stands out as a foundational capability across many LLM-driven solutions, striking a balance between ease of implementation…
-
Evaluation-Driven Development for agentic applications using PydanticAI
An open-source, model-agnostic agentic framework that supports dependency injection. Ideally, you can evaluate agentic applications even as you are developing them, instead of evaluation being an afterthought. For this to work, though, you need to be able to mock both internal and external dependencies of the agent you…
-
How to Choose a Threshold for an Evaluation Metric for Large Language Models
arXiv:2412.12148v1 (Announce Type: new). Abstract: To ensure and monitor large language models (LLMs) reliably, various evaluation metrics have been proposed in the literature. However, there is little research on prescribing a methodology to identify a robust threshold on these metrics even though…