Category: llm-evaluation

Evaluating Multi-Step LLM-Generated Content: Why Customer Journeys Require Structural Metrics

How to evaluate goal-oriented content designed to build engagement and deliver business results, and why structure matters. By Diana Schneider. Originally published on Towards Data Science.

January 23, 2026
When Does Adding Fancy RAG Features Work?

Looking at the performance of different pipelines. By Ida Silfverskiöld. Originally published on Towards Data Science.

January 13, 2026
Measuring What Matters with NeMo Agent Toolkit

A practical guide to observability, evaluations, and model comparisons. By Mariya Mansurova. Originally published on Towards Data Science.

January 7, 2026
How to Do Evals on a Bloated RAG Pipeline

Comparing metrics across datasets and models. By Ida Silfverskiöld. Originally published on Towards Data Science.

December 22, 2025
Why AI Alignment Starts With Better Evaluation

You can’t align what you don’t evaluate. By Hailey Quach. Originally published on Towards Data Science.

December 2, 2025
LLM-as-a-Judge: What It Is, Why It Works, and How to Use It to Evaluate AI Models

A step-by-step guide to building AI quality control using large language models. By Piero Paialunga. Originally published on Towards Data Science.

November 25, 2025
How to Evaluate Retrieval Quality in RAG Pipelines (Part 3): DCG@k and NDCG@k

The third and final part on evaluating the retrieval quality of your RAG pipeline with graded measures. By Maria Mouschoutzi. Originally published on Towards Data Science.

November 13, 2025
How to Evaluate Retrieval Quality in RAG Pipelines (part 2): Mean Reciprocal Rank (MRR) and Average Precision (AP)

Evaluating the retrieval quality of your RAG pipeline with binary, order-aware measures. Originally published on Towards Data…

November 6, 2025
How to Evaluate Retrieval Quality in RAG Pipelines: Precision@k, Recall@k, and F1@k

In my previous posts, I have walked you through putting together a very basic RAG pipeline in Python, as well as chunking large text documents. We’ve also looked into how documents are transformed into embeddings, allowing us to quickly search for similar documents…

October 17, 2025
Notes on LLM Evaluation

A practical, step-by-step guide to building an evaluation pipeline for a real-world AI application. By Felipe Adachi. Originally published on Towards Data Science.

September 26, 2025
5 Techniques to Prevent Hallucinations in Your RAG Question Answering

Learn how to reduce the number of hallucinations and the impact they have. By Eivind Kjosbakken. Originally published on Towards Data Science.

September 24, 2025
Evaluating Your RAG Solution

A guide to building and evaluating RAG solutions by leveraging LLM-as-a-Judge capabilities. By Alex Davis. Originally published on Towards Data Science.

September 18, 2025
Why Task-Based Evaluations Matter

This article is adapted from a lecture series I gave at Deeplearn 2025: From Prototype to Production: Evaluation Strategies for Agentic Applications. Task-based evaluations, which measure an AI system’s performance in use-case-specific, real-world settings, are underadopted and understudied. There is still an outsized focus in AI literature on foundation model benchmarks.…

September 11, 2025
AI Operations Under the Hood: Challenges and Best Practices

Building robust, reproducible, and reliable GenAI applications requires a framework of continuous improvement, rigorous evaluation, and systematic validation. By Erika G. Gonçalves. Originally published on Towards Data Science.

September 6, 2025
How to Perform Comprehensive Large Scale LLM Validation

Learn how to validate large-scale LLM applications. By Eivind Kjosbakken. Originally published on Towards Data Science.

August 22, 2025
How to Use LLMs for Powerful Automatic Evaluations

A beginner-friendly introduction to LLM-as-a-Judge. By Eivind Kjosbakken. Originally published on Towards Data Science.

August 14, 2025
Agentic AI: On Evaluations

Metrics to track for RAG and agents, plus the frameworks that help. By Ida Silfverskiöld. Originally published on Towards Data Science.

August 8, 2025
Evaluation-Driven Development for LLM-Powered Products: Lessons from Building in Healthcare

How metrics and monitoring combine with human expertise to build trustworthy AI in healthcare. By Robert Martin-Short. Originally published on Towards Data Science.

July 11, 2025
LLM-as-a-Judge: A Practical Guide

How to scale LLM evaluations beyond manual review. By Shuai Guo. Originally published on Towards Data Science.

June 20, 2025
Evaluating LLMs for Inference, or Lessons from Teaching for Machine Learning

It’s like grading papers, but your student is an LLM. By Stephanie Kirmer. Originally published on Towards Data Science.

June 3, 2025
LLM Evaluations: from Prototype to Production

Evaluation is the cornerstone of any machine learning product. Investing in quality measurement delivers significant returns. Let’s explore the potential business benefits. As management consultant and writer Peter Drucker once said, “If you can’t measure it, you can’t improve it.” Building a robust evaluation system helps you identify areas…

April 26, 2025
Why Generative-AI Apps’ Quality Often Sucks and What to Do About It

How to get from PoCs to tested high-quality applications in production. [Image: licensed from elements.envato.com, edit by Marcel Müller, 2025] The generative AI hype has rolled through the business world in the past two years. This technology can make business process executions more efficient,…

January 21, 2025
Evaluation-Driven Development for agentic applications using PydanticAI

An open-source, model-agnostic agentic framework that supports dependency injection. Ideally, you can evaluate agentic applications even as you are developing them, instead of evaluation being an afterthought. For this to work, though, you need to be able to mock both internal and external dependencies of the agent you…

December 22, 2024
How to Use Structured Generation for LLM-as-a-Judge Evaluations

Structured generation is fundamental to building complex, multi-step reasoning agents in LLM evaluations, especially for open source models. [Image source: generated with SDXL 1.0] Disclosure: I am a maintainer of Opik, one of the open source projects used later in this article. For the past few months, I’ve been working on LLM-based…

December 11, 2024