{"id":3419,"date":"2025-04-29T07:02:23","date_gmt":"2025-04-29T07:02:23","guid":{"rendered":"https:\/\/mailitics.com\/index.php\/2025\/04\/29\/how-to-ensure-your-ai-solution-does-what-you-expect-it-to-do\/"},"modified":"2025-04-29T07:02:23","modified_gmt":"2025-04-29T07:02:23","slug":"how-to-ensure-your-ai-solution-does-what-you-expect-it-to-do","status":"publish","type":"post","link":"https:\/\/mailitics.com\/index.php\/2025\/04\/29\/how-to-ensure-your-ai-solution-does-what-you-expect-it-to-do\/","title":{"rendered":"How to Ensure Your AI Solution Does What You Expect It to Do"},"content":{"rendered":"<p>How to Ensure Your AI Solution Does What You Expect It to Do<\/p>\n<div>\n<p class=\"wp-block-paragraph\">Generative AI (GenAI) is evolving fast \u2014 and it\u2019s no longer just about fun chatbots or impressive image generation. 2025 is the year where the focus is on turning the AI hype into real value. Companies everywhere are looking into ways to integrate and leverage GenAI in their products and processes \u2014 to better serve users, boost efficiency, stay competitive, and drive growth. And thanks to APIs and pre-trained models from major providers, integrating GenAI feels easier than ever before. But here\u2019s the catch: <strong>just because integration is easy doesn\u2019t mean AI solutions will work as intended once deployed.<\/strong><\/p>\n<p class=\"wp-block-paragraph\">Predictive models aren\u2019t really new: as humans we have been predicting things for years, starting formally with statistics. 
However, <strong>GenAI has revolutionized the predictive field for many reasons<\/strong>:\u00a0<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">No need to train your own model or to be a Data Scientist to build AI solutions<\/li>\n<li class=\"wp-block-list-item\">AI is now easy to use through chat interfaces and to integrate through APIs<\/li>\n<li class=\"wp-block-list-item\">It unlocks many things that couldn\u2019t be done, or were really hard to do, before<\/li>\n<\/ul>\n<p class=\"wp-block-paragraph\">All these things make <strong>GenAI very exciting, but also risky<\/strong>. Unlike traditional software \u2014 or even classical machine learning \u2014 GenAI introduces a new level of unpredictability. You\u2019re not implementing deterministic logic; you\u2019re using a model trained on vast amounts of data, hoping it will respond as needed. So how do we know if an AI system is doing what we intend it to do? How do we know if it\u2019s ready to go live? The answer is <a href=\"https:\/\/towardsdatascience.com\/tag\/evaluations\/\" title=\"Evaluations\">Evaluations<\/a> (evals), the concept that we\u2019ll be exploring in this post:<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">Why <a href=\"https:\/\/towardsdatascience.com\/tag\/genai\/\" title=\"Genai\">GenAI<\/a> systems can\u2019t be tested the same way as traditional software or even classical Machine Learning (ML)<\/li>\n<li class=\"wp-block-list-item\">Why evaluations are key to understanding the quality of your AI system and aren\u2019t optional (unless you like surprises)<\/li>\n<li class=\"wp-block-list-item\">Different types of evaluations and techniques to apply them in practice<\/li>\n<\/ul>\n<p class=\"wp-block-paragraph\">Whether you\u2019re a Product Manager, Engineer, or anyone working with or interested in AI, I hope this post will help you understand how to think critically about AI system quality (and why evals are key to achieving that 
quality!).<\/p>\n<h2 class=\"wp-block-heading\">GenAI Can\u2019t Be Tested Like Traditional Software \u2014 Or Even Classical ML<\/h2>\n<p class=\"wp-block-paragraph\"><strong>In traditional software development<\/strong>, systems follow deterministic logic: <strong>if X happens, then Y will happen<\/strong> \u2014 always. Unless something breaks in your platform or you introduce an error in the code\u2026 which is the reason you add tests, monitoring and alerts. Unit tests are used to validate small blocks of code, integration tests to ensure components work well together, and monitoring to detect if something breaks in production. Testing traditional software is like checking if a calculator works. You input 2 + 2, and you expect 4. Clear and deterministic, it\u2019s either right or wrong.\u00a0<\/p>\n<p class=\"wp-block-paragraph\">However, ML and AI introduce non-determinism and probabilities. Instead of defining behavior explicitly through rules, we train models to learn patterns from data. <strong>In AI, if X happens, the output is no longer a hard-coded Y, but a prediction with a certain degree of probability, based on what the model learned during training<\/strong>. This can be very powerful, but also introduces uncertainty: identical inputs might have different outputs over time, plausible outputs might actually be incorrect, unexpected behavior for rare scenarios might arise\u2026\u00a0<\/p>\n<p class=\"wp-block-paragraph\">This makes traditional testing approaches insufficient, and at times not even feasible. Instead of checking a calculator, evaluating an AI system is closer to grading a student\u2019s performance on an open-ended exam. For each question, with its many possible answers, is the answer provided correct? Is it at or above the level of knowledge the student should have? Did the student make everything up but sound very convincing? 
Just like answers in an exam, <strong>AI systems can be evaluated, but the evaluation needs to be general and flexible enough to adapt to different inputs, contexts and use cases <\/strong>(or types of exams).<\/p>\n<p class=\"wp-block-paragraph\"><strong>In traditional <a href=\"https:\/\/towardsdatascience.com\/tag\/machine-learning\/\" title=\"Machine Learning\">Machine Learning<\/a> (ML), evaluations are already a well-established part of the project lifecycle<\/strong>. Training a model on a narrow task like loan approval or disease detection always includes an evaluation step \u2013 using metrics like accuracy, precision, RMSE, MAE\u2026 This is used to measure how well the model performs, to compare between different model options, and to decide if the model is good enough to move forward to deployment. In GenAI this usually changes: teams use models that are already trained and have already passed general-purpose evaluations both internally on the model provider side and on public benchmarks. These models are so good at general tasks \u2013 like answering questions or drafting emails \u2013 that there\u2019s a risk of overtrusting them for our specific use case. 
However, it is still important to ask \u201c<em>is this amazing model good enough for my use case?<\/em>\u201d. That\u2019s where evaluation comes in <strong>\u2013 <\/strong>to assess whether predictions or generations are good for your specific use case, context, inputs and users.<\/p>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" height=\"394\" width=\"1024\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/04\/Captura-de-pantalla-2025-04-26-a-las-21.41.19-1024x394.png?resize=1024%2C394&#038;ssl=1\" alt=\"\" class=\"wp-image-602409\"><figcaption class=\"wp-element-caption\">Training and evals \u2013 traditional ML vs GenAI, image by author<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">There is another big difference between ML and GenAI: the variety and complexity of the model outputs. We are no longer returning classes and probabilities (like the probability that a client will repay a loan), or numbers (like a predicted house price based on its characteristics). GenAI systems can return many types of output, of different lengths, tone, content, and format. Similarly, these models no longer require structured and rigidly defined input, but can usually take nearly any type of input \u2014 text, images, even audio or video. 
Evaluating therefore becomes much harder.<\/p>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" height=\"365\" width=\"1024\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/04\/Captura-de-pantalla-2025-04-26-a-las-21.42.32-1024x365.png?resize=1024%2C365&#038;ssl=1\" alt=\"\" class=\"wp-image-602410\"><figcaption class=\"wp-element-caption\">Input \/ output relationship \u2013 statistics &amp; traditional ML vs GenAI, image by author<\/figcaption><\/figure>\n<h2 class=\"wp-block-heading\">Why Evals Aren\u2019t Optional (Unless You Like Surprises)<\/h2>\n<p class=\"wp-block-paragraph\">Evals help you measure whether your AI system is actually working the way you <em>want<\/em> it to, whether the system is ready to go live, and whether it keeps performing as expected once live. Here is why evals are essential:<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">\n<strong>Quality Assessment:<\/strong> Evals provide a structured way to understand the quality of your AI\u2019s predictions or outputs and how they will integrate into the overall system and use case. Are responses accurate? Helpful? Coherent? Relevant?<\/li>\n<li class=\"wp-block-list-item\">\n<strong>Error Quantification:<\/strong> Evaluations help quantify the percentage, types, and magnitudes of errors. How often do things go wrong? What kinds of errors occur more frequently (e.g. false positives, hallucinations, formatting mistakes)?<\/li>\n<li class=\"wp-block-list-item\">\n<strong>Risk Mitigation:<\/strong> Evals help you spot and prevent harmful or biased behavior before it reaches users \u2014 protecting your company from reputational risk, ethical issues, and potential regulatory problems.<\/li>\n<\/ul>\n<p class=\"wp-block-paragraph\">Generative AI, with its free-form input-output relationships and long text generation, makes evaluations even more critical and complex. 
When things go wrong, they can go very wrong. We\u2019ve all seen headlines about chatbots giving dangerous advice, models generating biased content, or AI tools hallucinating false facts.<\/p>\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p class=\"wp-block-paragraph\">\u201c<em>AI will never be perfect, but with evals you can reduce the risk of embarrassment \u2013 which can cost you money, credibility, or a viral moment on Twitter.<\/em>\u201d<\/p>\n<\/blockquote>\n<h2 class=\"wp-block-heading\">How Do You Define an Evaluation Strategy?<\/h2>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" height=\"627\" width=\"1024\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/04\/Captura-de-pantalla-2025-04-26-a-las-21.44.14-1-1024x627.png?resize=1024%2C627&#038;ssl=1\" alt=\"\" class=\"wp-image-602412\"><figcaption class=\"wp-element-caption\">Image by <a href=\"https:\/\/unsplash.com\/es\/@akshayspaceship\">akshayspaceship<\/a> on <a href=\"https:\/\/unsplash.com\/\">Unsplash<\/a><\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">So how do we define our evaluations? Evals aren\u2019t one-size-fits-all. They are use-case dependent and should align with the specific goals of your AI application. If you\u2019re building a search engine, you might care about result relevance. If it\u2019s a chatbot, you might care about helpfulness and safety. If it\u2019s a classifier, you probably care about accuracy and precision. For systems with multiple steps (like an AI system that performs search, prioritizes results and then generates an answer) it\u2019s often necessary to evaluate each step. 
The idea here is to measure whether each step helps reach the overall success metric (and through this understand where to focus iterations and improvements).\u00a0<\/p>\n<p class=\"wp-block-paragraph\">Common evaluation areas include:\u00a0<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">\n<strong>Correctness &amp; Hallucinations:<\/strong> Are the outputs factually accurate? Are they making things up?<\/li>\n<li class=\"wp-block-list-item\">\n<strong>Relevance:<\/strong> Is the content aligned with the user\u2019s query or the provided context?<\/li>\n<li class=\"wp-block-list-item\">\n<strong>Format: <\/strong>Are outputs in the expected format (e.g., JSON, valid function call)?<\/li>\n<li class=\"wp-block-list-item\">\n<strong>Safety, Bias &amp; Toxicity:<\/strong> Is the system generating harmful, biased, or toxic content?<\/li>\n<li class=\"wp-block-list-item\">\n<strong>Task-Specific Metrics: <\/strong>For example, accuracy and precision in classification tasks, ROUGE or BLEU in summarization tasks, and regex checks or error-free execution in code generation tasks.<\/li>\n<\/ul>\n<h2 class=\"wp-block-heading\">How Do You Actually Compute Evals?<\/h2>\n<p class=\"wp-block-paragraph\">Once you know what you want to measure, the next step is designing your test cases. 
This will be a set of examples (the more examples the better, but always balancing value and costs) where you have:\u00a0<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">\n<strong>Input Example:<\/strong> A realistic input your system would receive once in production.<\/li>\n<li class=\"wp-block-list-item\">\n<strong>Expected Output<\/strong> (if applicable): Ground truth or example of desirable results.<\/li>\n<li class=\"wp-block-list-item\">\n<strong>Evaluation Method:<\/strong> A scoring mechanism to assess the result.<\/li>\n<li class=\"wp-block-list-item\">\n<strong>Score or Pass\/Fail:<\/strong> The computed metric that evaluates your test case.<\/li>\n<\/ul>\n<p class=\"wp-block-paragraph\">Depending on your needs, time, and budget, there are several techniques you can use as evaluation methods:\u00a0<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">\n<strong>Statistical Scorers:<\/strong> BLEU, ROUGE, METEOR, or cosine similarity between embeddings \u2014 good for comparing generated text to reference outputs.<\/li>\n<li class=\"wp-block-list-item\">\n<strong>Traditional ML Metrics:<\/strong> Accuracy, precision, recall, and AUC \u2014 best for classification with labeled data.<\/li>\n<li class=\"wp-block-list-item\">\n<strong>LLM-as-a-Judge:<\/strong> Use a large language model to rate outputs (e.g., \u201c<em>Is this answer correct and helpful?<\/em>\u201d). Especially useful when labeled data isn\u2019t available or when evaluating open-ended generation.<\/li>\n<li class=\"wp-block-list-item\">\n<strong>Code-Based Evals:<\/strong> Use regex, logic rules, or test case execution to validate formats.<\/li>\n<\/ul>\n<h2 class=\"wp-block-heading\">Wrapping It Up<\/h2>\n<p class=\"wp-block-paragraph\">Let\u2019s bring everything together with a concrete example. 
Imagine you\u2019re building a sentiment analysis system to help your customer support team prioritize incoming emails.\u00a0<\/p>\n<p class=\"wp-block-paragraph\">The goal is to make sure the most urgent or negative messages get faster responses \u2014 ideally reducing frustration, improving satisfaction, and decreasing churn. This is a relatively simple use case, but even in a system like this, with limited outputs, quality matters: bad predictions could lead to prioritizing emails essentially at random, meaning your team wastes time with a system that costs money.\u00a0<\/p>\n<p class=\"wp-block-paragraph\">So how do you know your solution is working with the needed quality? You evaluate. Here are some examples of things that might be relevant to assess in this specific use case:\u00a0<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">\n<strong>Format Validation: <\/strong>Are the outputs of the LLM call that predicts the sentiment of the email returned in the expected JSON format? This can be evaluated via code-based checks: regex, schema validation, etc.<\/li>\n<li class=\"wp-block-list-item\">\n<strong>Sentiment Classification Accuracy: <\/strong>Is the system correctly classifying sentiments across a range of texts \u2014 short, long, multilingual? This can be evaluated with labeled data using traditional ML metrics \u2014 or, if labels aren\u2019t available, using LLM-as-a-judge.<\/li>\n<\/ul>\n<p class=\"wp-block-paragraph\">Once the solution is live, you would also want to include metrics that are more closely related to the final impact of your solution:<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">\n<strong>Prioritization Effectiveness: <\/strong>Are support agents actually being guided toward the most critical emails? 
Is the prioritization aligned with the desired business impact?<\/li>\n<li class=\"wp-block-list-item\">\n<strong>Final Business Impact: <\/strong>Over time, is this system reducing response times, lowering customer churn, and improving satisfaction scores?<\/li>\n<\/ul>\n<p class=\"wp-block-paragraph\"><strong>Evals are key to ensuring we build useful, safe, valuable, and user-ready AI systems in production. <\/strong>So, whether you\u2019re working with a simple classifier or an open-ended chatbot, take the time to define what \u201cgood enough\u201d means (Minimum Viable Quality) \u2014 and build the evals around it to measure it!<\/p>\n<h2 class=\"wp-block-heading\">References<\/h2>\n<p class=\"wp-block-paragraph\">[1] <a href=\"https:\/\/hamel.dev\/blog\/posts\/evals\/\">Your AI Product Needs Evals<\/a>, Hamel Husain<\/p>\n<p class=\"wp-block-paragraph\">[2] <a href=\"https:\/\/www.confident-ai.com\/blog\/llm-evaluation-metrics-everything-you-need-for-llm-evaluation\">LLM Evaluation Metrics: The Ultimate LLM Evaluation Guide<\/a>, Confident AI<\/p>\n<p class=\"wp-block-paragraph\">[3] <a href=\"https:\/\/www.deeplearning.ai\/short-courses\/evaluating-ai-agents\/\">Evaluating AI Agents<\/a>, deeplearning.ai + Arize<\/p>\n<p>The post <a href=\"https:\/\/towardsdatascience.com\/how-to-ensure-your-ai-solution-does-what-you-expect-it-to-do\/\">How to Ensure Your AI Solution Does What You Expect It to Do<\/a> appeared first on <a href=\"https:\/\/towardsdatascience.com\/\">Towards Data Science<\/a>.<\/p>\n<\/div>\n<p>Anna Via<br \/>\n<a href=\"https:\/\/towardsdatascience.com\/how-to-ensure-your-ai-solution-does-what-you-expect-it-to-do\/\">Go to original source<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>How to Ensure Your AI Solution Does What You Expect It to Do Generative AI (GenAI) is evolving fast 
\u2014 and it\u2019s no longer just about fun chatbots or impressive image generation. 2025 is the year where the focus is on turning the AI hype into real value. Companies everywhere are looking into ways to [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[62,69,83,240,2505,77,70],"tags":[98,385,1515],"class_list":["post-3419","post","type-post","status-publish","format-standard","hentry","category-aimldsaimlds","category-artificial-intelligence","category-data-science","category-editors-pick","category-evaluations","category-genai","category-machine-learning","tag-ai","tag-do","tag-genai"],"_links":{"self":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/3419"}],"collection":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/comments?post=3419"}],"version-history":[{"count":0,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/3419\/revisions"}],"wp:attachment":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/media?parent=3419"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/categories?post=3419"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/tags?post=3419"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}