{"id":702,"date":"2024-12-20T07:02:34","date_gmt":"2024-12-20T07:02:34","guid":{"rendered":"https:\/\/mailitics.com\/index.php\/2024\/12\/20\/from-prototype-to-production-enhancing-llm-accuracy-791d79b0af9b\/"},"modified":"2024-12-20T07:02:34","modified_gmt":"2024-12-20T07:02:34","slug":"from-prototype-to-production-enhancing-llm-accuracy-791d79b0af9b","status":"publish","type":"post","link":"https:\/\/mailitics.com\/index.php\/2024\/12\/20\/from-prototype-to-production-enhancing-llm-accuracy-791d79b0af9b\/","title":{"rendered":"From Prototype to Production: Enhancing LLM Accuracy"},"content":{"rendered":"<p>    From Prototype to Production: Enhancing LLM Accuracy<br \/>\n \t<BR><br \/>\n<BR><\/BR><br \/>\n    <!-- no image --><br \/>\n \t<BR><br \/>\n<BR><\/BR><\/p>\n<div>\n<h4>Implementing evaluation frameworks to optimize accuracy in real-world applications<\/h4>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2Aal-9CqM7KGRk_ZE7oCMGwg.jpeg?ssl=1\"><figcaption>Image created by DALL-E\u00a03<\/figcaption><\/figure>\n<p>Building a prototype for an LLM application is surprisingly straightforward. You can often create a functional first version within just a few hours. This initial prototype will likely provide results that look legitimate and be a good tool to demonstrate your approach. However, this is usually not enough for production use.<\/p>\n<p>LLMs are probabilistic by nature, as they generate tokens based on the distribution of likely continuations. This means that in many cases, we get the answer close to the \u201ccorrect\u201d one from the distribution. Sometimes, this is acceptable\u200a\u2014\u200afor example, it doesn\u2019t matter whether the app says \u201cHello, John!\u201d or \u201cHi, John!\u201d. In other cases, the difference is critical, such as between \u201cThe revenue in 2024 was 20M USD\u201d and \u201cThe revenue in 2024 was 20M\u00a0GBP\u201d.<\/p>\n<p>In many real-world business scenarios, precision is crucial, and \u201calmost right\u201d isn\u2019t good enough. For example, when your LLM application needs to execute API calls, or you\u2019re doing a summary of financial reports. From my experience, ensuring the accuracy and consistency of results is far more complex and time-consuming than building the initial prototype.<\/p>\n<p>In this article, I will discuss how to approach measuring and improving accuracy. We\u2019ll build an SQL Agent where precision is vital for ensuring that queries are executable. Starting with a basic prototype, we\u2019ll explore methods to measure accuracy and test various techniques to enhance it, such as self-reflection and retrieval-augmented generation (RAG).<\/p>\n<h3>Setup<\/h3>\n<p>As usual, let\u2019s begin with the setup. The core components of our SQL agent solution are the LLM model, which generates queries, and the SQL database, which executes\u00a0them.<\/p>\n<h4>LLM model\u200a\u2014\u200aLlama<\/h4>\n<p>For this project, we will use an open-source Llama model released by Meta. I\u2019ve chosen <a href=\"https:\/\/ollama.com\/library\/llama3.1:8b\">Llama 3.1 8B<\/a> because it is lightweight enough to run on my laptop while still being quite powerful (refer to the <a href=\"https:\/\/ai.meta.com\/blog\/meta-llama-3-1\/\">documentation<\/a> for details).<\/p>\n<p>If you haven\u2019t installed it yet, you can find guides <a href=\"https:\/\/www.llama.com\/docs\/llama-everywhere\">here<\/a>. I use it locally on MacOS via <a href=\"https:\/\/ollama.com\/\">Ollama<\/a>. Using the following command, we can download the\u00a0model.<\/p>\n<pre>ollama pull llama3.1:8b<\/pre>\n<p>We will use Ollama with <a href=\"https:\/\/python.langchain.com\/docs\/how_to\/local_llms\/\">LangChain<\/a>, so let\u2019s start by installing the required\u00a0package.<\/p>\n<pre>pip install -qU langchain_ollama <\/pre>\n<p>Now, we can run the Llama model and see the first\u00a0results.<\/p>\n<pre>from langchain_ollama import OllamaLLM<br><br>llm = OllamaLLM(model=\"llama3.1:8b\")<br>llm.invoke(\"How are you?\")<br># I'm just a computer program, so I don't have feelings or emotions <br># like humans do. I'm functioning properly and ready to help with <br># any questions or tasks you may have! How can I assist you today?<\/pre>\n<p>We would like to pass a system message alongside customer questions. So, following <a href=\"https:\/\/www.llama.com\/docs\/model-cards-and-prompt-formats\/llama3_1\">the Llama 3.1 model documentation<\/a>, let\u2019s put together a helper function to construct a prompt and test this function.<\/p>\n<pre>def get_llama_prompt(user_message, system_message=\"\"):<br>  system_prompt = \"\"<br>  if system_message != \"\":<br>    system_prompt = (<br>      f\"&lt;|start_header_id|&gt;system&lt;|end_header_id|&gt;nn{system_message}\"<br>      f\"&lt;|eot_id|&gt;\"<br>    )<br>  prompt = (f\"&lt;|begin_of_text|&gt;{system_prompt}\"<br>            f\"&lt;|start_header_id|&gt;user&lt;|end_header_id|&gt;nn\"<br>            f\"{user_message}\"<br>            f\"&lt;|eot_id|&gt;\"<br>            f\"&lt;|start_header_id|&gt;assistant&lt;|end_header_id|&gt;nn\"<br>           )<br>  return prompt   <br><br><br>system_prompt = '''<br>You are Rudolph, the spirited reindeer with a glowing red nose, <br>bursting with excitement as you prepare to lead Santa's sleigh <br>through snowy skies. Your joy shines as brightly as your nose, <br>eager to spread Christmas cheer to the world!<br>Please, answer questions concisely in 1-2 sentences.<br>'''<br>prompt = get_llama_prompt('How are you?', system_prompt)<br>llm.invoke(prompt)<br><br># I'm feeling jolly and bright, ready for a magical night! <br># My shiny red nose is glowing brighter than ever, just perfect <br># for navigating through the starry skies. <\/pre>\n<p>The new system prompt has changed the answer significantly, so it works. With this, our local LLM setup is ready to\u00a0go.<\/p>\n<h4>Database\u200a\u2014\u200aClickHouse<\/h4>\n<p>I will use an open-source database <a href=\"https:\/\/clickhouse.com\/\">ClickHouse<\/a>. I\u2019ve chosen ClickHouse because it has a specific SQL dialect. LLMs have likely encountered fewer examples of this dialect during training, making the task a bit more challenging. However, you can choose any other database.<\/p>\n<blockquote><p>Installing ClickHouse is pretty straightforward\u200a\u2014\u200ajust follow the instructions provided in <a href=\"https:\/\/clickhouse.com\/docs\/en\/getting-started\/quick-start\">the documentation<\/a>.<\/p><\/blockquote>\n<p>We will be working with two tables: ecommerce.users and ecommerce.sessions. These tables contain fictional data, including customer personal information and their session activity on the e-commerce website.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2AVhBb9MuLpUdPsVAhZjQRHA.png?ssl=1\"><\/figure>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2ApOK6nZngqZ0NIQk0ZrMycA.png?ssl=1\"><\/figure>\n<blockquote><p>You can find the code for generating synthetic data and uploading it on\u00a0<a href=\"https:\/\/github.com\/miptgirl\/miptgirl_medium\/blob\/main\/analyst_agent\/generate_synthetic_data_for_sql.ipynb\">GitHub<\/a>.<\/p><\/blockquote>\n<p>With that, the setup is complete, and we\u2019re ready to move on to building the basic prototype.<\/p>\n<h3>The first prototype<\/h3>\n<p>As discussed, our goal is to build an SQL Agent\u200a\u2014\u200aan application that generates SQL queries to answer customer questions. In the future, we can add another layer to this system: executing the SQL query, passing both the initial question and the database results back to the LLM, and asking it to generate a human-friendly answer. However, for this article, we\u2019ll focus on the first\u00a0step.<\/p>\n<p>The best practice with LLM applications (similar to any other complex tasks) is to start simple and then iterate. The most straightforward implementation is to do one LLM call and share all the necessary information (such as schema description) in the system prompt. So, the first step is to put together the\u00a0prompt.<\/p>\n<pre>generate_query_system_prompt = '''<br>You are a senior data analyst with more than 10 years of experience writing complex SQL queries. <br>There are two tables in the database with the following schemas. <br><br>Table: ecommerce.users <br>Description: customers of the online shop<br>Fields: <br>- user_id (integer) - unique identifier of customer, for example, 1000004 or 3000004<br>- country (string) - country of residence, for example, \"Netherlands\" or \"United Kingdom\"<br>- is_active (integer) - 1 if customer is still active and 0 otherwise<br>- age (integer) - customer age in full years, for example, 31 or 72<br><br>Table: ecommerce.sessions <br>Description: sessions of usage the online shop<br>Fields: <br>- user_id (integer) - unique identifier of customer, for example, 1000004 or 3000004<br>- session_id (integer) - unique identifier of session, for example, 106 or 1023<br>- action_date (date) - session start date, for example, \"2021-01-03\" or \"2024-12-02\"<br>- session_duration (integer) - duration of session in seconds, for example, 125 or 49<br>- os (string) - operation system that customer used, for example, \"Windows\" or \"Android\"<br>- browser (string) - browser that customer used, for example, \"Chrome\" or \"Safari\"<br>- is_fraud (integer) - 1 if session is marked as fraud and 0 otherwise<br>- revenue (float) - income in USD (the sum of purchased items), for example, 0.0 or 1506.7<br><br><br>Write a query in ClickHouse SQL to answer the following question. <br>Add \"format TabSeparatedWithNames\" at the end of the query to get data from ClickHouse database in the right format. <br>'''<\/pre>\n<p>I\u2019ve included the example values for each field in the prompt to ensure that LLM understands the data\u00a0format.<\/p>\n<p>And that\u2019s it! With this, we have our first functional prototype for the SQL Agent. Now, it\u2019s time to put it to the\u00a0test.<\/p>\n<pre>prompt = get_llama_prompt('How many customers made purchase in December 2024?', <br>  generate_query_system_prompt)<br>llm.invoke(prompt)<br><br># To find out how many customers made a purchase in December 2024, <br># we need to join the `sessions` table with the `users` table on the `user_id`, <br># and then filter by `action_date` for sessions that took place in December 2024. <br><br># Here is the query:<br># <br># ```sql<br># SELECT COUNT(DISTINCT u.user_id)<br># FROM ecommerce.sessions s<br># JOIN ecommerce.users u ON s.user_id = u.user_id<br># WHERE <br>#     EXTRACT(YEAR FROM s.action_date) = 2024<br>#     AND EXTRACT(MONTH FROM s.action_date) = 12<br>#     AND revenue &gt; 0;<br># ```<br><br># This query first joins the `sessions` table with the `users` table <br># on the `user_id`. Then it filters out sessions where no purchase <br># was made (revenue is greater than 0). Finally, it selects the distinct <br># user IDs of these customers.<br><br># When you run this query in ClickHouse SQL, you can add \"format TabSeparatedWithNames\" <br># at the end to get the result in the right format:<br><br># ```sql<br># SELECT COUNT(DISTINCT u.user_id)<br># FROM ecommerce.sessions s<br># JOIN ecommerce.users u ON s.user_id = u.user_id<br># WHERE <br>#     EXTRACT(YEAR FROM s.action_date) = 2024<br>#     AND EXTRACT(MONTH FROM s.action_date) = 12<br>#     AND revenue &gt; 0;<br># format TabSeparatedWithNames;<br># ```<\/pre>\n<p>The agent produced a fairly decent result, but there\u2019s one issue\u200a\u2014\u200athe LLM returned not only the SQL query but also some commentary. Since we plan to execute SQL queries later, this format is not suitable for our task. Let\u2019s work on fixing\u00a0it.<\/p>\n<p>Fortunately, this problem has already been solved, and we don\u2019t need to parse the SQL queries from the text manually. We can use the chat model <a href=\"https:\/\/python.langchain.com\/docs\/integrations\/chat\/ollama\/\">ChatOllama<\/a>. Unfortunately, it doesn\u2019t support structured output, but we can leverage tool calling to achieve the same\u00a0result.<\/p>\n<p>To do this, we will define a dummy tool to execute the query and instruct the model in the system prompt always to call this tool. I\u2019ve kept the comments in the output to give the model some space for reasoning, following the chain-of-thought pattern.<\/p>\n<pre>from langchain_ollama import ChatOllama<br>from langchain_core.tools import tool<br><br>@tool<br>def execute_query(comments: str, query: str) -&gt; str:<br>  \"\"\"Excutes SQL query.<br><br>  Args:<br>      comments (str): 1-2 sentences describing the result SQL query <br>          and what it does to answer the question,<br>      query (str): SQL query<br>  \"\"\"<br>  pass <br><br>chat_llm = ChatOllama(model=\"llama3.1:8b\").bind_tools([execute_query])<br>result = chat_llm.invoke(prompt)<br>print(result.tool_calls)<br><br># [{'name': 'execute_query',<br>#   'args': {'comments': 'SQL query returns number of customers who made a purchase in December 2024. The query joins the sessions and users tables based on user ID to filter out inactive customers and find those with non-zero revenue in December 2024.',<br>#   'query': 'SELECT COUNT(DISTINCT T2.user_id) FROM ecommerce.sessions AS T1 INNER JOIN ecommerce.users AS T2 ON T1.user_id = T2.user_id WHERE YEAR(T1.action_date) = 2024 AND MONTH(T1.action_date) = 12 AND T2.is_active = 1 AND T1.revenue &gt; 0'},<br>#   'type': 'tool_call'}]<\/pre>\n<p>With the tool calling, we can now get the SQL query directly from the model. That\u2019s an excellent result. However, the generated query is not entirely accurate:<\/p>\n<ul>\n<li>It includes a filter for is_active = 1, even though we didn\u2019t specify the need to filter out inactive customers.<\/li>\n<li>The LLM missed specifying the format despite our explicit request in the system\u00a0prompt.<\/li>\n<\/ul>\n<p>Clearly, we need to focus on improving the model\u2019s accuracy. But as Peter Drucker famously said, <em>\u201cYou can\u2019t improve what you don\u2019t measure.\u201d <\/em>So, the next logical step is to build a system for evaluating the model\u2019s quality. This system will be a cornerstone for performance improvement iterations. Without it, we\u2019d essentially be navigating in the\u00a0dark.<\/p>\n<h3>Evaluating the\u00a0accuracy<\/h3>\n<h4>Evaluation basics<\/h4>\n<p>To ensure we\u2019re improving, we need a robust way to measure accuracy. The most common approach is to create a \u201cgolden\u201d evaluation set with questions and correct answers. Then, we can compare the model\u2019s output with these \u201cgolden\u201d answers and calculate the share of correct ones. While this approach sounds simple, there are a few nuances worth discussing.<\/p>\n<p>First, you might feel overwhelmed at the thought of creating a comprehensive set of questions and answers. Building such a dataset can seem like a daunting task, potentially requiring weeks or months. However, we can start small by creating an initial set of 20\u201350 examples and iterating on\u00a0it.<\/p>\n<p>As always, quality is more important than quantity. Our goal is to create a representative and diverse dataset. Ideally, this should\u00a0include:<\/p>\n<ul>\n<li>\n<strong>Common questions. <\/strong>In most real-life cases, we can take the history of actual questions and use it as our initial evaluation set.<\/li>\n<li>\n<strong>Challenging edge cases.<\/strong> It\u2019s worth adding examples where the model tends to hallucinate. You can find such cases either while experimenting yourself or by gathering feedback from the first prototype.<\/li>\n<\/ul>\n<p>Once the dataset is ready, the next challenge is how to score the generated results. We can consider several approaches:<\/p>\n<ul>\n<li>\n<strong>Comparing SQL queries.<\/strong> The first idea is to compare the generated SQL query with the one in the evaluation set. However, it might be tricky. Similarly-looking queries can yield completely different results. At the same time, queries that look different can lead to the same conclusions. Additionally, simply comparing SQL queries doesn\u2019t verify whether the generated query is actually executable. Given these challenges, I wouldn\u2019t consider this approach the most reliable solution for our\u00a0case.<\/li>\n<li>\n<strong>Exact matches. <\/strong>We can use old-school exact matching when answers in our evaluation set are deterministic. For example, if the question is, \u201cHow many customers are there?\u201d and the answer is \u201c592800\u201d, the model\u2019s response must match precisely. However, this approach has its limitations. Consider the example above, and the model responds, <em>\u201cThere are 592,800 customers\u201d<\/em>. While the answer is absolutely correct, an exact match approach would flag it as\u00a0invalid.<\/li>\n<li>\n<strong>Using LLMs for scoring.<\/strong> A more robust and flexible approach is to leverage LLMs for evaluation. Instead of focusing on query structure, we can ask the LLM to compare the results of SQL executions. This method is particularly effective in cases where the query might differ but still yields correct\u00a0outputs.<\/li>\n<\/ul>\n<p>It\u2019s worth keeping in mind that evaluation isn\u2019t a one-time task; it\u2019s a continuous process. To push our model\u2019s performance further, we need to expand the dataset with examples causing the model\u2019s hallucinations. In production mode, we can create a feedback loop. By gathering input from users, we can identify cases where the model fails and include them in our evaluation set.<\/p>\n<p>In our example, we will be assessing only whether the result of execution is valid (SQL query can be executed) and correct. Still, you can look at other parameters as well. For example, if you care about efficiency, you can compare the execution times of generated queries against those in the golden\u00a0set.<\/p>\n<h4>Evaluation set and validation<\/h4>\n<p>Now that we\u2019ve covered the basics, we\u2019re ready to put them into practice. I spent about 20 minutes putting together a set of 10 examples. While small, this set is sufficient for our toy task. It consists of a list of questions paired with their corresponding SQL queries, like\u00a0this:<\/p>\n<pre>[<br>  {<br>    \"question\": \"How many customers made purchase in December 2024?\",<br>    \"sql_query\": \"select uniqExact(user_id) as customers from ecommerce.sessions where (toStartOfMonth(action_date) = '2024-12-01') and (revenue &gt; 0) format TabSeparatedWithNames\"<br>  },<br>  {<br>    \"question\": \"What was the fraud rate in 2023, expressed as a percentage?\",<br>    \"sql_query\": \"select 100*uniqExactIf(user_id, is_fraud = 1)\/uniqExact(user_id) as fraud_rate from ecommerce.sessions where (toStartOfYear(action_date) = '2023-01-01') format TabSeparatedWithNames\"<br>  },<br>  ...<br>]<\/pre>\n<blockquote><p>You can find the full list on GitHub\u200a\u2014\u200a<a href=\"https:\/\/github.com\/miptgirl\/miptgirl_medium\/blob\/main\/sql_agent_accuracy\/golden_set.json\">link<\/a>.<\/p><\/blockquote>\n<p>We can load the dataset into a DataFrame, making it ready for use in the\u00a0code.<\/p>\n<pre>import json<br>with open('golden_set.json', 'r') as f:<br>  golden_set = json.loads(f.read())<br><br>golden_df = pd.DataFrame(golden_set) <br>golden_df['id'] = list(range(golden_df.shape[0]))<\/pre>\n<p>First, let\u2019s generate the SQL queries for each question in the evaluation set.<\/p>\n<pre>def generate_query(question):<br>  prompt = get_llama_prompt(question, generate_query_system_prompt)<br>  result = chat_llm.invoke(prompt)<br>  try:<br>    generated_query = result.tool_calls[0]['args']['query']<br>  except:<br>    generated_query = ''<br>  return generated_query<br><br>import tqdm<br><br>tmp = []<br>for rec in tqdm.tqdm(golden_df.to_dict('records')):<br>  generated_query = generate_query(rec['question'])<br>  tmp.append(<br>    {<br>      'id': rec['id'],<br>      'generated_query': generated_query<br>    }<br>  )<br><br>eval_df = golden_df.merge(pd.DataFrame(tmp))<\/pre>\n<p>Before moving on to the LLM-based scoring of query outputs, it\u2019s important to first ensure that the SQL query is valid. To do this, we need to execute the queries and examine the database\u00a0output.<\/p>\n<p>I\u2019ve created a function that runs a query in ClickHouse. It also ensures that the output format is correctly specified, as this may be critical in business applications.<\/p>\n<pre>CH_HOST = 'http:\/\/localhost:8123' # default address <br>import requests<br>import io<br><br>def get_clickhouse_data(query, host = CH_HOST, connection_timeout = 1500):<br>  # pushing model to return data in the format that we want<br>  if not 'format tabseparatedwithnames' in query.lower():<br>    return \"Database returned the following error:n Please, specify the output format.\"<br>    <br>  r = requests.post(host, params = {'query': query}, <br>    timeout = connection_timeout)<br>  if r.status_code == 200:<br>    return r.text<br>  else: <br>    return 'Database returned the following error:n' + r.text<br>    # giving feedback to LLM instead of raising exception<\/pre>\n<p>The next step is to execute both the generated and golden queries and then save their\u00a0outputs.<\/p>\n<pre>tmp = []<br><br>for rec in tqdm.tqdm(eval_df.to_dict('records')):<br>  golden_output = get_clickhouse_data(rec['sql_query'])<br>  generated_output = get_clickhouse_data(rec['generated_query'])<br><br>  tmp.append(<br>    {<br>      'id': rec['id'],<br>      'golden_output': golden_output,<br>      'generated_output': generated_output<br>    }<br>  )<br><br>eval_df = eval_df.merge(pd.DataFrame(tmp))<\/pre>\n<p>Next, let\u2019s check the output to see whether the SQL query is valid or\u00a0not.<\/p>\n<pre>def is_valid_output(s):<br>  if s.startswith('Database returned the following error:'):<br>    return 'error'<br>  if len(s.strip().split('n')) &gt;= 1000:<br>    return 'too many rows'<br>  return 'ok'<br><br>eval_df['golden_output_valid'] = eval_df.golden_output.map(is_valid_output)<br>eval_df['generated_output_valid'] = eval_df.generated_output.map(is_valid_output)<\/pre>\n<p>Then, we can evaluate the SQL validity for both the golden and generated sets.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2AiRw8I6N5aqfEsl2lt1x3Ew.png?ssl=1\"><\/figure>\n<p>The initial results are not very promising; the LLM was unable to generate even a single valid query. Looking at the errors, it\u2019s clear that the model failed to specify the right format despite it being explicitly defined in the system prompt. So, we definitely need to work more on the accuracy.<\/p>\n<h4>Checking the correctness<\/h4>\n<p>However, validity alone is not enough. It\u2019s crucial that we not only generate valid SQL queries but also produce the correct results. Although we already know that all our queries are invalid, let\u2019s now incorporate output evaluation into our\u00a0process.<\/p>\n<p>As discussed, we will use LLMs to compare the outputs of the SQL queries. I typically prefer using more powerful model for evaluation, following the day-to-day logic where a senior team member reviews the work. For this task, I\u2019ve chosen <a href=\"https:\/\/python.langchain.com\/docs\/integrations\/chat\/openai\/\">OpenAI GPT\u00a04o-mini<\/a>.<\/p>\n<p>Similar to our generation flow, I\u2019ve set up all the building blocks necessary for accuracy assessment.<\/p>\n<pre>from langchain_openai import ChatOpenAI<br><br>accuracy_system_prompt = '''<br>You are a senior and very diligent QA specialist and your task is to compare data in datasets. <br>They are similar if they are almost identical, or if they convey the same information. <br>Disregard if column names specified in the first row have different names or in a different order.<br>Focus on comparing the actual information (numbers). If values in datasets are different, then it means that they are not identical.<br>Always execute tool to provide results.<br>'''<br><br>@tool<br>def compare_datasets(comments: str, score: int) -&gt; str:<br>  \"\"\"Stores info about datasets.<br>  Args:<br>      comments (str): 1-2 sentences about the comparison of datasets,<br>      score (int): 0 if dataset provides different values and 1 if it shows identical information<br>  \"\"\"<br>  pass<br><br>accuracy_chat_llm = ChatOpenAI(model=\"gpt-4o-mini\", temperature = 0.0)<br>  .bind_tools([compare_datasets])<br><br>accuracy_question_tmp = '''<br>Here are the two datasets to compare delimited by ####<br>Dataset #1: <br>####<br>{dataset1}<br>####<br>Dataset #2: <br>####<br>{dataset2}<br>####<br>'''<br><br>def get_openai_prompt(question, system):<br>  messages = [<br>    (\"system\", system),<br>    (\"human\", question)<br>  ]<br>  return messages<\/pre>\n<p>Now, it\u2019s time to test the accuracy assessment process.<\/p>\n<pre>prompt = get_openai_prompt(accuracy_question_tmp.format(<br>  dataset1 = 'customersn114032n', dataset2 = 'customersn114031n'),<br>  accuracy_system_prompt)<br><br>accuracy_result = accuracy_chat_llm.invoke(prompt)<br>accuracy_result.tool_calls[0]['args']<br># {'comments': 'The datasets contain different customer counts: 114032 in Dataset #1 and 114031 in Dataset #2.',<br>#  'score': 0}<br><br>prompt = get_openai_prompt(accuracy_question_tmp.format(<br>  dataset1 = 'usersn114032n', dataset2 = 'customersn114032n'),<br>  accuracy_system_prompt)<br>accuracy_result = accuracy_chat_llm.invoke(prompt)<br>accuracy_result.tool_calls[0]['args']<br># {'comments': 'The datasets contain the same numerical value (114032) despite different column names, indicating they convey identical information.',<br>#  'score': 1}<\/pre>\n<p>Fantastic! It looks like everything is working as expected. Let\u2019s now encapsulate this into a function.<\/p>\n<pre>def is_answer_accurate(output1, output2):<br>  prompt = get_openai_prompt(<br>    accuracy_question_tmp.format(dataset1 = output1, dataset2 = output2),<br>    accuracy_system_prompt<br>  )<br>  <br>  accuracy_result = accuracy_chat_llm.invoke(prompt)<br>  <br>  try:<br>    return accuracy_result.tool_calls[0]['args']['score']<br>  except:<br>    return None<\/pre>\n<h4>Putting the evaluation approach\u00a0together<\/h4>\n<p>As we discussed, building an LLM application is an iterative process, so we\u2019ll need to run our accuracy assessment multiple times. It will be helpful to have all this logic encapsulated in a single function.<\/p>\n<p>The function will take two arguments as\u00a0input:<\/p>\n<ul>\n<li>generate_query_func: a function that generates an SQL query for a given question.<\/li>\n<li>golden_df: an evaluation dataset with questions and correct answers in the form of a pandas DataFrame.<\/li>\n<\/ul>\n<p>As output, the function will return a DataFrame with all evaluation results and a couple of charts displaying the main\u00a0KPIs.<\/p>\n<pre><br>def evaluate_sql_agent(generate_query_func, golden_df):<br>  <br>  # generating SQL<br>  tmp = []<br>  for rec in tqdm.tqdm(golden_df.to_dict('records')):<br>    generated_query = generate_query_func(rec['question'])<br>    tmp.append(<br>      {<br>          'id': rec['id'],<br>          'generated_query': generated_query<br>      }<br>    )<br><br>  eval_df = golden_df.merge(pd.DataFrame(tmp))<br><br>  # executing SQL queries<br>  tmp = []<br>  for rec in tqdm.tqdm(eval_df.to_dict('records')):<br>    golden_output = get_clickhouse_data(rec['sql_query'])<br>    generated_output = get_clickhouse_data(rec['generated_query'])<br><br>    tmp.append(<br>      {<br>        'id': rec['id'],<br>        'golden_output': golden_output,<br>        'generated_output': generated_output<br>      }<br>    )<br><br>  eval_df = eval_df.merge(pd.DataFrame(tmp))<br><br>  # checking accuracy<br>  eval_df['golden_output_valid'] = eval_df.golden_output.map(is_valid_output)<br>  eval_df['generated_output_valid'] = eval_df.generated_output.map(is_valid_output)<br>  <br>  eval_df['correct_output'] = list(map(<br>    is_answer_accurate,<br>    eval_df['golden_output'],<br>    eval_df['generated_output']<br>  ))<br><br>  eval_df['accuracy'] = list(map(<br>    lambda x, y: 'invalid: ' + x if x != 'ok' else ('correct' if y == 1 else 'incorrect'),<br>    eval_df.generated_output_valid,<br>    eval_df.correct_output<br>  ))<br><br>  valid_stats_df = (eval_df.groupby('golden_output_valid')[['id']].count().rename(columns = {'id': 'golden set'}).join(<br>    eval_df.groupby('generated_output_valid')[['id']].count().rename(columns = {'id': 'generated'}), how = 'outer')).fillna(0).T<br><br>  fig1 = px.bar(<br>    valid_stats_df.apply(lambda x: 100*x\/valid_stats_df.sum(axis = 1)),<br>    orientation = 'h', <br>    title = '&lt;b&gt;LLM SQL Agent evaluation&lt;\/b&gt;: query validity',<br>    text_auto = '.1f',<br>    color_discrete_map = {'ok': '#00b38a', 'error': '#ea324c', 'too many rows': '#f2ac42'},<br>    labels = {'index': '', 'variable': 'validity', 'value': 'share of queries, %'}<br>  )<br>  fig1.show()<br><br>  accuracy_stats_df = eval_df.groupby('accuracy')[['id']].count()<br>  accuracy_stats_df['share'] = accuracy_stats_df.id*100\/accuracy_stats_df.id.sum()<br><br>  fig2 = px.bar(<br>    accuracy_stats_df[['share']],<br>    title = '&lt;b&gt;LLM SQL Agent evaluation&lt;\/b&gt;: query accuracy',<br>    text_auto = '.1f', orientation = 'h',<br>    color_discrete_sequence = ['#0077B5'],<br>    labels = {'index': '', 'variable': 'accuracy', 'value': 'share of queries, %'}<br>  )<br><br>  fig2.update_layout(showlegend = False)<br>  fig2.show()<br><br>  return eval_df<\/pre>\n<p>With that, we\u2019ve completed the evaluation setup and can now move on to the core task of improving the model\u2019s accuracy.<\/p>\n<h3>Improving accuracy: Self-reflection<\/h3>\n<p>Let\u2019s do a quick recap. We\u2019ve built and tested the first version of SQL Agent. Unfortunately, all generated queries were invalid because they were missing the output format. Let\u2019s address this\u00a0issue.<\/p>\n<p>One potential solution is self-reflection. We can make an additional call to the LLM, sharing the error and asking it to correct the bug. Let\u2019s create a function to handle generation with self-reflection.<\/p>\n<pre>reflection_user_query_tmpl = '''<br>You've got the following question: \"{question}\". <br>You've generated the SQL query: \"{query}\".<br>However, the database returned an error: \"{output}\". <br>Please, revise the query to correct mistake. <br>'''<br><br>def generate_query_reflection(question):<br>  generated_query = generate_query(question) <br>  print('Initial query:', generated_query)<br>  <br>  db_output = get_clickhouse_data(generated_query)<br>  is_valid_db_output = is_valid_output(db_output)<br>  if is_valid_db_output == 'too many rows':<br>    db_output = \"Database unexpectedly returned more than 1000 rows.\"<br><br>  if is_valid_db_output == 'ok': <br>    return generated_query<br><br>  reflection_user_query = reflection_user_query_tmpl.format(<br>    question = question,<br>    query = generated_query,<br>    output = db_output<br>  )<br>    <br>  reflection_prompt = get_llama_prompt(reflection_user_query, <br>    generate_query_system_prompt) <br>  reflection_result = chat_llm.invoke(reflection_prompt)<br><br>  try:<br>    reflected_query = reflection_result.tool_calls[0]['args']['query']<br>  except:<br>    reflected_query = ''<br>  print('Reflected query:', reflected_query)<br>  return reflected_query<\/pre>\n<p>Now, let\u2019s use our evaluation function to check whether the quality has improved. Assessing the next iteration has become effortless.<\/p>\n<pre>refl_eval_df = evaluate_sql_agent(generate_query_reflection, golden_df)<\/pre>\n<p>Wonderful! We\u2019ve achieved better results\u200a\u2014\u200a50% of the queries are now valid, and all format issues have been resolved. So, self-reflection is pretty effective.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2A4dFVkTnWy-hCsRGrr1_XWA.png?ssl=1\"><\/figure>\n<p>However, self-reflection has its limitations. When we examine the accuracy, we see that the model returns the correct answer for only one question. So, our journey is not over\u00a0yet.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2Ag7MxRb1n8G0HK1ZQaLT7Uw.png?ssl=1\"><\/figure>\n<h3>Improving accuracy: RAG<\/h3>\n<p>Another approach to improving accuracy is using RAG (retrieval-augmented generation). The idea is to identify question-and-answer pairs similar to the customer query and include them in the system prompt, enabling the LLM to generate a more accurate response.<\/p>\n<p>RAG consists of the following stages:<\/p>\n<ul>\n<li>\n<strong>Loading documents: <\/strong>importing data from available sources.<\/li>\n<li>\n<strong>Splitting documents: <\/strong>creating smaller\u00a0chunks.<\/li>\n<li>\n<strong>Storage: <\/strong>using vector stores to process and store data efficiently.<\/li>\n<li>\n<strong>Retrieval: <\/strong>extracting documents that are relevant to the\u00a0query.<\/li>\n<li>\n<strong>Generation: <\/strong>passing a question and relevant documents to LLM to generate the final\u00a0answer<strong>.<\/strong>\n<\/li>\n<\/ul>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/0%2AvZIJZP_M1gWwS0Ie.png?ssl=1\"><\/figure>\n<blockquote><p>If you\u2019d like a refresher on RAG, you can check out my previous article, <a href=\"https:\/\/medium.com\/towards-data-science\/rag-how-to-talk-to-your-data-eaf5469b83b0\">\u201cRAG: How to Talk to Your\u00a0Data.\u201d<\/a>\n<\/p><\/blockquote>\n<p>We will use the Chroma database as a local vector storage\u200a\u2014\u200ato store and retrieve embeddings.<\/p>\n<pre>from langchain_chroma import Chroma<br>vector_store = Chroma(embedding_function=embeddings)<\/pre>\n<p>Vector stores are using embeddings to find chunks that are similar to the query. For this purpose, we will use OpenAI embeddings.<\/p>\n<pre>from langchain_openai import OpenAIEmbeddings<br>embeddings = OpenAIEmbeddings(model=\"text-embedding-3-large\")<\/pre>\n<p>Since we can\u2019t use examples from our evaluation set (as they are already being used to assess quality), I\u2019ve created a separate set of question-and-answer pairs for RAG. You can find it on\u00a0<a href=\"https:\/\/github.com\/miptgirl\/miptgirl_medium\/blob\/main\/sql_agent_accuracy\/rag_set.json\">GitHub<\/a>.<\/p>\n<p>Now, let\u2019s load the set and create a list of pairs in the following format: Question: %s; Answer:\u00a0%s.<\/p>\n<pre>with open('rag_set.json', 'r') as f:<br>    rag_set = json.loads(f.read())<br>rag_set_df = pd.DataFrame(rag_set)<br><br>rag_set_df['formatted_txt'] = list(map(<br>    lambda x, y: 'Question: %s; Answer: %s' % (x, y),<br>    rag_set_df.question,<br>    rag_set_df.sql_query<br>))<br><br>rag_string_data = 'nn'.join(rag_set_df.formatted_txt)<\/pre>\n<p>Next, I used LangChain\u2019s text splitter by character to create chunks, with each question-and-answer pair as a separate chunk. Since we are splitting the text semantically, no overlap is necessary.<\/p>\n<pre>from langchain_text_splitters import CharacterTextSplitter<br><br>text_splitter = CharacterTextSplitter(<br>    separator=\"nn\",<br>    chunk_size=1, # to split by character without merging<br>    chunk_overlap=0,<br>    length_function=len,<br>    is_separator_regex=False,<br>)<br><br>texts = text_splitter.create_documents([rag_string_data])<\/pre>\n<p>The final step is to load the chunks into our vector\u00a0storage.<\/p>\n<pre>document_ids = vector_store.add_documents(documents=texts)<br>print(vector_store._collection.count())<br># 32<\/pre>\n<p>Now, we can test the retrieval to see the results. They look quite similar to the customer question.<\/p>\n<pre>question = 'What was the share of users using Windows yesterday?'<br>retrieved_docs = vector_store.similarity_search(question, 3)<br>context = \"nn\".join(map(lambda x: x.page_content, retrieved_docs))<br>print(context)<br><br># Question: What was the share of users using Windows the day before yesterday?; <br># Answer: select 100*uniqExactIf(user_id, os = 'Windows')\/uniqExact(user_id) as windows_share from ecommerce.sessions where (action_date = today() - 2) format TabSeparatedWithNames<br># Question: What was the share of users using Windows in the last week?; <br># Answer: select 100*uniqExactIf(user_id, os = 'Windows')\/uniqExact(user_id) as windows_share from ecommerce.sessions where (action_date &gt;= today() - 7) and (action_date &lt; today()) format TabSeparatedWithNames<br># Question: What was the share of users using Android yesterday?; <br># Answer: select 100*uniqExactIf(user_id, os = 'Android')\/uniqExact(user_id) as android_share from ecommerce.sessions where (action_date = today() - 1) format TabSeparatedWithNames<\/pre>\n<p>Let\u2019s adjust the system prompt to include the examples we retrieved.<\/p>\n<pre>generate_query_system_prompt_with_examples_tmpl = '''<br>You are a senior data analyst with more than 10 years of experience writing complex SQL queries. <br>There are two tables in the database you're working with with the following schemas. <br><br>Table: ecommerce.users <br>Description: customers of the online shop<br>Fields: <br>- user_id (integer) - unique identifier of customer, for example, 1000004 or 3000004<br>- country (string) - country of residence, for example, \"Netherlands\" or \"United Kingdom\"<br>- is_active (integer) - 1 if customer is still active and 0 otherwise<br>- age (integer) - customer age in full years, for example, 31 or 72<br><br>Table: ecommerce.sessions <br>Description: sessions of usage the online shop<br>Fields: <br>- user_id (integer) - unique identifier of customer, for example, 1000004 or 3000004<br>- session_id (integer) - unique identifier of session, for example, 106 or 1023<br>- action_date (date) - session start date, for example, \"2021-01-03\" or \"2024-12-02\"<br>- session_duration (integer) - duration of session in seconds, for example, 125 or 49<br>- os (string) - operation system that customer used, for example, \"Windows\" or \"Android\"<br>- browser (string) - browser that customer used, for example, \"Chrome\" or \"Safari\"<br>- is_fraud (integer) - 1 if session is marked as fraud and 0 otherwise<br>- revenue (float) - income in USD (the sum of purchased items), for example, 0.0 or 1506.7<br><br><br>Write a query in ClickHouse SQL to answer the following question. <br>Add \"format TabSeparatedWithNames\" at the end of the query to get data from ClickHouse database in the right format. <br>Answer questions following the instructions and providing all the needed information and sharing your reasoning. <br><br>Examples of questions and answers: <br>{examples}<br>'''<\/pre>\n<p>Once again, let\u2019s create the generate query function with\u00a0RAG.<\/p>\n<pre>def generate_query_rag(question):<br>  retrieved_docs = vector_store.similarity_search(question, 3)<br>  context = context = \"nn\".join(map(lambda x: x.page_content, retrieved_docs))<br>  <br>  prompt = get_llama_prompt(question, <br>    generate_query_system_prompt_with_examples_tmpl.format(examples = context))<br>  result = chat_llm.invoke(prompt)<br>  <br>  try:<br>    generated_query = result.tool_calls[0]['args']['query']<br>  except:<br>    generated_query = ''<br>  return generated_query<\/pre>\n<p>As usual, let\u2019s use our evaluation function to test the new approach.<\/p>\n<pre>rag_eval_df = evaluate_sql_agent(generate_query_rag, golden_df)<\/pre>\n<p>We can see a significant improvement, increasing from 1 to 6 correct answers out of 10. It\u2019s still not ideal, but we\u2019re moving in the right direction.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2Az4p7j76ldB5B9RJUqgQu_g.png?ssl=1\"><\/figure>\n<p>We can also experiment with combining two approaches: RAG and self-reflection.<\/p>\n<pre>def generate_query_rag_with_reflection(question):<br>  generated_query = generate_query_rag(question) <br>  <br>  db_output = get_clickhouse_data(generated_query)<br>  is_valid_db_output = is_valid_output(db_output)<br>  if is_valid_db_output == 'too many rows':<br>      db_output = \"Database unexpectedly returned more than 1000 rows.\"<br><br>  if is_valid_db_output == 'ok': <br>      return generated_query<br><br>  reflection_user_query = reflection_user_query_tmpl.format(<br>    question = question,<br>    query = generated_query,<br>    output = db_output<br>  )<br>  <br>  reflection_prompt = get_llama_prompt(reflection_user_query, generate_query_system_prompt) <br>  reflection_result = chat_llm.invoke(reflection_prompt)<br><br>  try:<br>    reflected_query = reflection_result.tool_calls[0]['args']['query']<br>  except:<br>    reflected_query = ''<br>  return reflected_query<br><br>rag_refl_eval_df = evaluate_sql_agent(generate_query_rag_with_reflection, <br>  golden_df)<\/pre>\n<p>We can see another slight improvement: we\u2019ve completely eliminated invalid SQL queries (thanks to self-reflection) and increased the number of correct answers to 7 out of\u00a010.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2AbVjppjpB1L6Fd9rr1AmDQA.png?ssl=1\"><\/figure>\n<p>That\u2019s it. It\u2019s been quite a journey. We started with 0 valid SQL queries and have now achieved 70% accuracy.<\/p>\n<blockquote><p>You can find the complete code on\u00a0<a href=\"https:\/\/github.com\/miptgirl\/miptgirl_medium\/blob\/main\/sql_agent_accuracy\/sql_agent_poc.ipynb\">GitHub<\/a>.<\/p><\/blockquote>\n<h3>Summary<\/h3>\n<p>In this article, we explored the iterative process of improving accuracy for LLM applications.<\/p>\n<ul>\n<li>We built an evaluation set and the scoring criteria that allowed us to compare different iterations and understand whether we were moving in the right direction.<\/li>\n<li>We leveraged self-reflection to allow the LLM to correct its mistakes and significantly reduce the number of invalid SQL\u00a0queries.<\/li>\n<li>Additionally, we implemented Retrieval-Augmented Generation (RAG) to further enhance the quality, achieving an accuracy rate of\u00a060\u201370%.<\/li>\n<\/ul>\n<p>While this is a solid result, it still falls short of the 90%+ accuracy threshold typically expected for production applications. To achieve such a high bar, we need to use fine-tuning, which will be the topic of the next\u00a0article.<\/p>\n<blockquote><p>Thank you a lot for reading this article. I hope this article was insightful for you. If you have any follow-up questions or comments, please leave them in the comments\u00a0section.<\/p><\/blockquote>\n<h3>Reference<\/h3>\n<p><em>All the images are produced by the author unless otherwise stated.<\/em><\/p>\n<p>This article is inspired by the <a href=\"https:\/\/www.deeplearning.ai\/short-courses\/improving-accuracy-of-llm-applications\/\">\u201cImproving Accuracy of LLM Applications\u201d<\/a> short course from DeepLearning.AI.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/medium.com\/_\/stat?event=post.clientViewed&amp;referrerSource=full_rss&amp;postId=791d79b0af9b\" width=\"1\" height=\"1\" alt=\"\"><\/p>\n<hr>\n<p><a href=\"https:\/\/towardsdatascience.com\/from-prototype-to-production-enhancing-llm-accuracy-791d79b0af9b\">From Prototype to Production: Enhancing LLM Accuracy<\/a> was originally published in <a href=\"https:\/\/towardsdatascience.com\/\">Towards Data Science<\/a> on Medium, where people are continuing the conversation by highlighting and responding to this story.<\/p>\n<\/div>\n<p> \t<BR><br \/>\n <BR><\/BR><br \/>\n    Mariya Mansurova<br \/>\n \t<BR><br \/>\n<BR><\/BR><br \/>\n<a href=\"https:\/\/medium.com\/m\/global-identity-2?redirectUrl=https%3A%2F%2Ftowardsdatascience.com%2Ffrom-prototype-to-production-enhancing-llm-accuracy-791d79b0af9b\">Go to original source<\/a><br \/>\n \t<BR><br \/>\n <BR><\/BR><\/p>\n","protected":false},"excerpt":{"rendered":"<p>From Prototype to Production: Enhancing LLM Accuracy Implementing evaluation frameworks to optimize accuracy in real-world applications Image created by DALL-E\u00a03 Building a prototype for an LLM application is surprisingly straightforward. You can often create a functional first version within just a few hours. This initial prototype will likely provide results that look legitimate and be [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[62,69,83,240,87,404],"tags":[807,134,806],"class_list":["post-702","post","type-post","status-publish","format-standard","hentry","category-aimldsaimlds","category-artificial-intelligence","category-data-science","category-editors-pick","category-llm","category-sql","tag-accuracy","tag-llm","tag-prototype"],"_links":{"self":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/702"}],"collection":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/comments?post=702"}],"version-history":[{"count":0,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/702\/revisions"}],"wp:attachment":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/media?parent=702"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/categories?post=702"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/tags?post=702"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}