Tag: evaluations

  • Why Task-Based Evaluations Matter

    This article is adapted from a lecture series I gave at Deeplearn 2025: From Prototype to Production: Evaluation Strategies for Agentic Applications. Task-based evaluations, which measure an AI system’s performance in use-case-specific, real-world settings, are underadopted and understudied. There is still an outsized focus in AI literature on foundation model benchmarks.…

  • Agentic AI: On Evaluations

    Metrics to track for RAG and agents, plus the frameworks that help. By Ida Silfverskiöld. Originally published on Towards Data Science.

  • Fast Bayesian Optimization of Function Networks with Partial Evaluations

    arXiv:2506.11456v1. Abstract: Bayesian optimization of function networks (BOFN) is a framework for optimizing expensive-to-evaluate objective functions structured as networks, where some nodes’ outputs serve as inputs for others. Many real-world applications, such as manufacturing and drug discovery, involve function networks with additional properties…

  • Normalizing Flow Regression for Bayesian Inference with Offline Likelihood Evaluations

    arXiv:2504.11554v1. Abstract: Bayesian inference with computationally expensive likelihood evaluations remains a significant challenge in many scientific domains. We propose normalizing flow regression (NFR), a novel offline inference method for approximating posterior distributions. Unlike traditional surrogate approaches that require additional sampling or inference…

  • How to Use Structured Generation for LLM-as-a-Judge Evaluations

    Structured generation is fundamental to building complex, multi-step reasoning agents in LLM evaluations, especially for open source models. Disclosure: I am a maintainer of Opik, one of the open source projects used later in this article. For the past few months, I’ve been working on LLM-based…