{"id":1821,"date":"2025-02-13T07:02:22","date_gmt":"2025-02-13T07:02:22","guid":{"rendered":"https:\/\/mailitics.com\/index.php\/2025\/02\/13\/how-to-measure-the-reliability-of-a-large-language-models-response\/"},"modified":"2025-02-13T07:02:22","modified_gmt":"2025-02-13T07:02:22","slug":"how-to-measure-the-reliability-of-a-large-language-models-response","status":"publish","type":"post","link":"https:\/\/mailitics.com\/index.php\/2025\/02\/13\/how-to-measure-the-reliability-of-a-large-language-models-response\/","title":{"rendered":"How to Measure the Reliability of a Large Language Model\u2019s Response"},"content":{"rendered":"<div>\n<p class=\"wp-block-paragraph\">The basic principle of Large Language Models (LLMs) is simple: predict the next word (or token) in a sequence based on statistical patterns in the training data. Yet this seemingly simple capability proves remarkably powerful, enabling tasks such as text summarization, idea generation, brainstorming, code generation, information processing, and content creation. That said, LLMs do not have any memory, nor do they actually \u201cunderstand\u201d anything; they simply carry out their basic function: <em>predicting the next word<\/em>.<\/p>\n<p class=\"wp-block-paragraph\">Next-word prediction is probabilistic: the LLM selects each word from a probability distribution. In the process, LLMs often generate false, fabricated, or inconsistent content as they try to produce coherent responses, filling gaps with plausible-looking but incorrect information. 
This phenomenon is called hallucination, an inevitable, well-known feature of LLMs that makes it necessary to validate and corroborate their outputs.<\/p>\n<p class=\"wp-block-paragraph\">Retrieval-augmented generation (RAG) methods, which make an LLM work with external knowledge sources, reduce hallucinations to some extent, but they cannot completely eradicate them. Although advanced RAG systems can provide in-text citations and URLs, verifying these references can be tedious and time-consuming. Therefore, we need an objective criterion for assessing the reliability or trustworthiness of an LLM\u2019s response, whether it is generated from the model\u2019s own knowledge or from an external knowledge base (RAG).<\/p>\n<p class=\"wp-block-paragraph\">In this article, we will discuss how the output of an LLM can be assessed by a trustworthy language model, which assigns a trustworthiness score to the LLM\u2019s output. We will first show how to use a trustworthy language model to score an LLM\u2019s answer and explain the score. Subsequently, we will develop an example RAG with LlamaParse and <a href=\"https:\/\/towardsdatascience.com\/tag\/llamaindex\/\" title=\"Llamaindex\">LlamaIndex<\/a> that assesses the RAG\u2019s answers for trustworthiness.<\/p>\n<p class=\"wp-block-paragraph\">The entire code of this article is available in the Jupyter notebook on <a href=\"https:\/\/github.com\/umairalipathan1980\/Trustworthy-LLM\">GitHub<\/a>.<\/p>\n<h2 class=\"wp-block-heading\"><strong>Assigning a Trustworthiness Score to an LLM\u2019s Answer<\/strong><\/h2>\n<p class=\"wp-block-paragraph\">To demonstrate how we can assign a trustworthiness score to an <a href=\"https:\/\/towardsdatascience.com\/tag\/llm\/\" title=\"Llm\">LLM<\/a>\u2019s response, I will use <a href=\"https:\/\/cleanlab.ai\/blog\/trustworthy-language-model\/\">Cleanlab\u2019s Trustworthy Language Model (TLM)<\/a>. 
The TLM uses a combination of <strong>uncertainty quantification<\/strong> and <strong>consistency analysis<\/strong> to compute trustworthiness scores and explanations for LLM responses.<\/p>\n<p class=\"wp-block-paragraph\"><a href=\"https:\/\/cleanlab.ai\/\">Cleanlab<\/a> offers a free trial API key, which can be obtained by creating an account on its website. We first need to install Cleanlab\u2019s Python client:<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">pip install --upgrade cleanlab-studio<\/code><\/pre>\n<p class=\"wp-block-paragraph\">Cleanlab supports several proprietary models such as \u2018<em>gpt-4o<\/em>\u2019, \u2018<em>gpt-4o-mini<\/em>\u2019, \u2018<em>o1-preview<\/em>\u2019, \u2018<em>claude-3-sonnet<\/em>\u2019, \u2018<em>claude-3.5-sonnet<\/em>\u2019, \u2018<em>claude-3.5-sonnet-v2<\/em>\u2019 and others. Here is how TLM assigns a trustworthiness score to gpt-4o\u2019s answer. The trustworthiness score ranges from 0 to 1, where higher values indicate greater trustworthiness.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">from cleanlab_studio import Studio\nstudio = Studio(\"&lt;CLEANLAB_API_KEY&gt;\")  # Replace with your Cleanlab API key\ntlm = studio.TLM(options={\"log\": [\"explanation\"], \"model\": \"gpt-4o\"}) # GPT, Claude, etc\n# Set the prompt\nout = tlm.prompt(\"How many vowels are there in the word 'Abracadabra'.?\")\n# The TLM response contains the actual output 'response', trustworthiness score and explanation\nprint(f\"Model's response = {out['response']}\")\nprint(f\"Trustworthiness score = {out['trustworthiness_score']}\")\nprint(f\"Explanation = {out['log']['explanation']}\")\n<\/code><\/pre>\n<p class=\"wp-block-paragraph\">The above code tested the response of gpt-4o for the question \u201c<em>How many vowels are there in the word \u2018Abracadabra\u2019.?<\/em>\u201d. 
The TLM\u2019s output contains the model\u2019s answer (response), a trustworthiness score, and an explanation. Here is the output of this code.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">Model's response = The word \"Abracadabra\" contains 6 vowels. The vowels are: A, a, a, a, a, and a.\nTrustworthiness score = 0.6842228802750124\nExplanation = This response is untrustworthy due to a lack of consistency in possible responses from the model. Here's one inconsistent alternate response that the model considered (which may not be accurate either):\n5.\n<\/code><\/pre>\n<p class=\"wp-block-paragraph\">The word contains five vowels, so gpt-4o\u2019s answer of six is wrong, and the TLM flags the inconsistency with a comparatively low trustworthiness score. Even an advanced language model can hallucinate on such a simple task. Here is the response and trustworthiness score for the same question for <em>claude-3.5-sonnet-v2<\/em>.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">Model's response = Let me count the vowels in 'Abracadabra':\nA-b-r-a-c-a-d-a-b-r-a\n\nThe vowels are: A, a, a, a, a\n\nThere are 5 vowels in the word 'Abracadabra'.\nTrustworthiness score = 0.9378276048845285\nExplanation = Did not find a reason to doubt trustworthiness.\n<\/code><\/pre>\n<p class=\"wp-block-paragraph\"><em>claude-3.5-sonnet-v2<\/em> produces the correct output. 
Let\u2019s compare the two models\u2019 responses to another question.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">from cleanlab_studio import Studio\n# display and Markdown live in IPython.display (IPython.core.display is deprecated for this)\nfrom IPython.display import display, Markdown\n\n# Initialize the Cleanlab Studio with API key\nstudio = Studio(\"&lt;CLEANLAB_API_KEY&gt;\")  # Replace with your actual API key\n\n# List of models to evaluate\nmodels = [\"gpt-4o\", \"claude-3.5-sonnet-v2\"]\n\n# Define the prompt\nprompt_text = \"Which one of 9.11 and 9.9 is bigger?\"\n\n# Loop through each model and evaluate\nfor model in models:\n   tlm = studio.TLM(options={\"log\": [\"explanation\"], \"model\": model})\n   out = tlm.prompt(prompt_text)\n  \n   md_content = f\"\"\"\n## Model: {model}\n\n**Response:** {out['response']}\n\n**Trustworthiness Score:** {out['trustworthiness_score']}\n\n**Explanation:** {out['log']['explanation']}\n\n---\n\"\"\"\n   display(Markdown(md_content))\n<\/code><\/pre>\n<p class=\"wp-block-paragraph\">Here are the responses of the two models:<\/p>\n<figure class=\"wp-block-image alignwide size-large\"><img data-recalc-dims=\"1\" data-dominant-color=\"f4f4f4\" data-has-transparency=\"true\" style=\"--dominant-color: #f4f4f4;\" loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"883\" src=\"https:\/\/i0.wp.com\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/Screenshot-2025-02-12-at-10.44.12%25E2%2580%25AFAM-1024x883.png?resize=1024%2C883&#038;ssl=1\" alt=\"\" class=\"wp-image-597792 has-transparency\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/Screenshot-2025-02-12-at-10.44.12\u202fAM-1024x883.png 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/Screenshot-2025-02-12-at-10.44.12\u202fAM-300x259.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/Screenshot-2025-02-12-at-10.44.12\u202fAM-768x663.png 768w, 
https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/Screenshot-2025-02-12-at-10.44.12\u202fAM-1536x1325.png 1536w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/Screenshot-2025-02-12-at-10.44.12\u202fAM.png 1894w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\"><figcaption class=\"wp-element-caption\">Wrong outputs generated by gpt-4o and claude-3.5-sonnet-v2, reflected in low trustworthiness scores<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">We can also generate a trustworthiness score for open-source LLMs. Let\u2019s check the recent, much-hyped open-source LLM: DeepSeek-R1. I will use <em>DeepSeek-R1-Distill-Llama-70B<\/em>, based on Meta\u2019s <em>Llama-3.3-70B-Instruct<\/em> model and distilled from DeepSeek\u2019s larger 671-billion parameter Mixture of Experts (MoE) model. <a href=\"https:\/\/www.ibm.com\/think\/topics\/knowledge-distillation\">Knowledge distillation<\/a> is a <a href=\"https:\/\/towardsdatascience.com\/tag\/machine-learning\/\" title=\"Machine Learning\">Machine Learning<\/a> technique that aims to transfer the learnings of a large pre-trained model, the \u201cteacher model,\u201d to a smaller \u201cstudent model.\u201d<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">import os\nimport streamlit as st\nfrom cleanlab_studio import Studio\nfrom IPython.display import display, Markdown\nfrom langchain_groq.chat_models import ChatGroq\nos.environ[\"GROQ_API_KEY\"]=st.secrets[\"GROQ_API_KEY\"]\n# Initialize the DeepSeek-R1 distilled Llama model served by Groq\ngroq_llm = ChatGroq(model=\"deepseek-r1-distill-llama-70b\", temperature=0.5)\nprompt = \"Which one of 9.11 and 9.9 is bigger?\"\n# Get the response from the model\nresponse = groq_llm.invoke(prompt)\n# Initialize Cleanlab's studio (never hard-code a real key in shared code)\nstudio = Studio(\"&lt;CLEANLAB_API_KEY&gt;\")  # Replace with your actual API key\ncleanlab_tlm = studio.TLM(options={\"log\": [\"explanation\"]})  # for explanations\n# Get the output containing trustworthiness score and explanation\noutput = cleanlab_tlm.get_trustworthiness_score(prompt, 
response=response.content.strip())\nmd_content = f\"\"\"\n## Model: {groq_llm.model_name}\n**Response:** {response.content.strip()}\n**Trustworthiness Score:** {output['trustworthiness_score']}\n**Explanation:** {output['log']['explanation']}\n---\n\"\"\"\ndisplay(Markdown(md_content))\n<\/code><\/pre>\n<p class=\"wp-block-paragraph\">Here is the output of the <em>deepseek-r1-distill-llama-70b<\/em> model.<\/p>\n<figure class=\"wp-block-image alignwide size-large\"><img data-recalc-dims=\"1\" data-dominant-color=\"f5f5f5\" data-has-transparency=\"true\" style=\"--dominant-color: #f5f5f5;\" loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"560\" src=\"https:\/\/i0.wp.com\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/Screenshot-2025-02-12-at-10.46.07%25E2%2580%25AFAM-1024x560.png?resize=1024%2C560&#038;ssl=1\" alt=\"\" class=\"wp-image-597793 has-transparency\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/Screenshot-2025-02-12-at-10.46.07\u202fAM-1024x560.png 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/Screenshot-2025-02-12-at-10.46.07\u202fAM-300x164.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/Screenshot-2025-02-12-at-10.46.07\u202fAM-768x420.png 768w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/Screenshot-2025-02-12-at-10.46.07\u202fAM-1536x840.png 1536w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/Screenshot-2025-02-12-at-10.46.07\u202fAM.png 1888w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\"><figcaption class=\"wp-element-caption\">The correct output of the deepseek-r1-distill-llama-70b model with a high trustworthiness score<\/figcaption><\/figure>\n<h2 class=\"wp-block-heading\"><strong>Developing a Trustworthy RAG<\/strong><\/h2>\n<p class=\"wp-block-paragraph\">We will now develop a RAG pipeline to demonstrate how to measure the trustworthiness of an LLM response in RAG. 
This RAG will be developed by scraping data from given links, parsing it into markdown format, and creating a vector store.<\/p>\n<p class=\"wp-block-paragraph\">The following libraries need to be installed for the code that follows.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">pip install llama-parse llama-index-core llama-index-embeddings-huggingface llama-index-llms-cleanlab requests beautifulsoup4 pdfkit nest-asyncio<\/code><\/pre>\n<p class=\"wp-block-paragraph\">To render HTML into PDF format, we also need to install the <em>wkhtmltopdf<\/em> command-line tool from <a href=\"https:\/\/wkhtmltopdf.org\/downloads.html\">its website<\/a>.<\/p>\n<p class=\"wp-block-paragraph\">The following libraries will be imported:<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">from llama_parse import LlamaParse\nimport requests\nfrom bs4 import BeautifulSoup\nimport pdfkit\nfrom llama_index.core import Settings\nfrom llama_index.embeddings.huggingface import HuggingFaceEmbedding\nfrom llama_index.core import VectorStoreIndex, SimpleDirectoryReader\nfrom llama_index.llms.cleanlab import CleanlabTLM\nfrom typing import Dict, List, ClassVar\nfrom llama_index.core.instrumentation.events import BaseEvent\nfrom llama_index.core.instrumentation.event_handlers import BaseEventHandler\nfrom llama_index.core.instrumentation import get_dispatcher\nfrom llama_index.core.instrumentation.events.llm import LLMCompletionEndEvent\nimport nest_asyncio\nimport os\n# Allow nested event loops so async parsing calls work inside a Jupyter notebook\nnest_asyncio.apply()\n<\/code><\/pre>\n<p class=\"wp-block-paragraph\">The next steps involve scraping data from the given URLs using Python\u2019s <em>BeautifulSoup<\/em> library, saving the scraped data as PDF file(s) using <em>pdfkit<\/em>, and parsing the PDF(s) into markdown using <em>LlamaParse<\/em>, a genAI-native document parsing platform built with LLMs and for LLM use 
cases.<\/p>\n<p class=\"wp-block-paragraph\">We will first configure the LLM to be used by CleanlabTLM and the embedding model (the <em>Hugging Face<\/em> embedding model <em>BAAI\/bge-small-en-v1.5<\/em>) that will compute the embeddings of the scraped data for the vector store.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">options = {\n   \"model\": \"gpt-4o\",\n   \"max_tokens\": 512,\n   \"log\": [\"explanation\"]\n}\n# Get your free API key from https:\/\/cleanlab.ai\/\nllm = CleanlabTLM(api_key=\"&lt;CLEANLAB_API_KEY&gt;\", options=options)\nSettings.llm = llm\nSettings.embed_model = HuggingFaceEmbedding(\n   model_name=\"BAAI\/bge-small-en-v1.5\"\n)<\/code><\/pre>\n<p class=\"wp-block-paragraph\">We will now define a custom event handler, <em>GetTrustworthinessScore<\/em>, that is derived from a base event handler class. This handler is triggered at the end of an LLM completion and extracts a trustworthiness score from the response metadata. 
The handler must also be registered with LlamaIndex\u2019s dispatcher so that it receives these events. A helper function, <em>display_response<\/em>, displays the LLM\u2019s response along with its trustworthiness score.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\"># Event Handler for Trustworthiness Score\nclass GetTrustworthinessScore(BaseEventHandler):\n   events: ClassVar[List[BaseEvent]] = []\n   trustworthiness_score: float = 0.0\n   @classmethod\n   def class_name(cls) -&gt; str:\n       return \"GetTrustworthinessScore\"\n   def handle(self, event: BaseEvent) -&gt; Dict:\n       if isinstance(event, LLMCompletionEndEvent):\n           self.trustworthiness_score = event.response.additional_kwargs.get(\"trustworthiness_score\", 0.0)\n           self.events.append(event)\n       return {}\n# Register the handler with the root dispatcher so it receives LLM events\nevent_handler = GetTrustworthinessScore()\nget_dispatcher().add_event_handler(event_handler)\n# Helper function to display LLM's response\ndef display_response(response):\n   response_str = response.response\n   trustworthiness_score = event_handler.trustworthiness_score\n   print(f\"Response: {response_str}\")\n   print(f\"Trustworthiness score: {round(trustworthiness_score, 2)}\")<\/code><\/pre>\n<p class=\"wp-block-paragraph\">We will now generate PDFs by scraping data from given URLs. For demonstration, we will scrape data only from <a href=\"https:\/\/en.wikipedia.org\/wiki\/Large_language_model\">this Wikipedia article about large language models<\/a> (<em>Creative Commons Attribution-ShareAlike 4.0 License<\/em>).<\/p>\n<p class=\"wp-block-paragraph\"><strong>Note<\/strong>: Readers are advised to always double-check the status of the content\/data they are about to scrape and ensure they are allowed to do so.<\/p>\n<p class=\"wp-block-paragraph\">The following piece of code scrapes data from the given URLs by making an HTTP request and using the <em>BeautifulSoup<\/em> Python library to parse the HTML content. The HTML content is cleaned by converting protocol-relative URLs to absolute ones. 
Subsequently, the scraped content is converted into PDF file(s) using <em>pdfkit<\/em>.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">##########################################\n# PDF Generation from Multiple URLs\n##########################################\n# Configure wkhtmltopdf path (adjust to your installation)\nwkhtml_path = r'C:\\Program Files\\wkhtmltopdf\\bin\\wkhtmltopdf.exe'\nconfig = pdfkit.configuration(wkhtmltopdf=wkhtml_path)\n# Define URLs and assign document names\nurls = {\n   \"LLMs\": \"https:\/\/en.wikipedia.org\/wiki\/Large_language_model\"\n}\n# Directory to save PDFs\npdf_directory = \"PDFs\"\nos.makedirs(pdf_directory, exist_ok=True)\npdf_paths = {}\nfor doc_name, url in urls.items():\n   try:\n       print(f\"Processing {doc_name} from {url} ...\")\n       response = requests.get(url)\n       response.raise_for_status()  # Fail early on HTTP errors\n       soup = BeautifulSoup(response.text, \"html.parser\")\n       main_content = soup.find(\"div\", {\"id\": \"mw-content-text\"})\n       if main_content is None:\n           raise ValueError(\"Main content not found\")\n       # Replace protocol-relative URLs with absolute URLs\n       html_string = str(main_content).replace('src=\"\/\/', 'src=\"https:\/\/').replace('href=\"\/\/', 'href=\"https:\/\/')\n       pdf_file_path = os.path.join(pdf_directory, f\"{doc_name}.pdf\")\n       pdfkit.from_string(\n           html_string,\n           pdf_file_path,\n           options={'encoding': 'UTF-8', 'quiet': ''},\n           configuration=config\n       )\n       pdf_paths[doc_name] = pdf_file_path\n       print(f\"Saved PDF for {doc_name} at {pdf_file_path}\")\n   except Exception as e:\n       print(f\"Error processing {doc_name}: {e}\")<\/code><\/pre>\n<p class=\"wp-block-paragraph\">After generating PDF(s) from the scraped data, we parse these PDFs using <em>LlamaParse<\/em>. We set the parsing instructions to extract the content in markdown format and parse the document(s) page-wise along with the document name and page number. 
These extracted entities (pages) are referred to as <em>nodes<\/em>. The code iterates over the extracted nodes, adds the document name and page number to each node\u2019s metadata, and prepends a citation header to the node\u2019s text to facilitate later referencing.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">##########################################\n# Parse PDFs with LlamaParse and Inject Metadata\n##########################################\n\n# Define parsing instructions (if your parser supports it)\nparsing_instructions = \"\"\"Extract the document content in markdown.\nSplit the document into nodes (for example, by page).\nEnsure each node has metadata for document name and page number.\"\"\"\n      \n# Create a LlamaParse instance\nparser = LlamaParse(\n   api_key=\"&lt;LLAMACLOUD_API_KEY&gt;\",  #Replace with your actual key\n   parsing_instructions=parsing_instructions,\n   result_type=\"markdown\",\n   premium_mode=True,\n   max_timeout=600\n)\n# Directory to save combined Markdown files (one per PDF)\noutput_md_dir = os.path.join(pdf_directory, \"markdown_docs\")\nos.makedirs(output_md_dir, exist_ok=True)\n# List to hold all updated nodes for indexing\nall_nodes = []\nfor doc_name, pdf_path in pdf_paths.items():\n   try:\n       print(f\"Parsing PDF for {doc_name} from {pdf_path} ...\")\n       nodes = parser.load_data(pdf_path)  # Returns a list of nodes\n       updated_nodes = []\n       # Process each node: update metadata and inject citation header into the text.\n       for i, node in enumerate(nodes, start=1):\n           # Copy existing metadata (if any) and add our own keys.\n           new_metadata = dict(node.metadata) if node.metadata else {}\n           new_metadata[\"document_name\"] = doc_name\n           if \"page_number\" not in new_metadata:\n               new_metadata[\"page_number\"] = str(i)\n           # Build the citation header.\n           citation_header = f\"[{new_metadata['document_name']}, page {new_metadata['page_number']}]\\n\\n\"\n       
    # Prepend the citation header to the node's text.\n           updated_text = citation_header + node.text\n           new_node = node.__class__(text=updated_text, metadata=new_metadata)\n           updated_nodes.append(new_node)\n       # Save a single combined Markdown file for the document using the updated node texts.\n       combined_texts = [node.text for node in updated_nodes]\n       combined_md = \"\\n\\n---\\n\\n\".join(combined_texts)\n       md_filename = f\"{doc_name}.md\"\n       md_filepath = os.path.join(output_md_dir, md_filename)\n       with open(md_filepath, \"w\", encoding=\"utf-8\") as f:\n           f.write(combined_md)\n       print(f\"Saved combined markdown for {doc_name} to {md_filepath}\")\n       # Add the updated nodes to the global list for indexing.\n       all_nodes.extend(updated_nodes)\n       print(f\"Parsed {len(updated_nodes)} nodes from {doc_name}.\")\n   except Exception as e:\n       print(f\"Error parsing {doc_name}: {e}\")<\/code><\/pre>\n<p class=\"wp-block-paragraph\">We now create a vector store and a query engine. We define a custom prompt template to guide the LLM\u2019s behavior in answering the questions. Finally, we create a query engine with the created index to answer queries. For each query, we retrieve the top 3 nodes from the vector store based on their semantic similarity with the query. 
The LLM uses these retrieved nodes to generate the final answer.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">##########################################\n# Create Index and Query Engine\n##########################################\nfrom llama_index.core import PromptTemplate\n# Create an index from all nodes.\nindex = VectorStoreIndex.from_documents(documents=all_nodes)\n# Define a custom prompt template that restricts answers to the retrieved context.\nprompt_template = \"\"\"\nYou are an AI assistant with expertise in the subject matter.\nAnswer the question using ONLY the provided context.\nAnswer in well-formatted Markdown with bullets and sections wherever necessary.\nIf the provided context does not support an answer, respond with \"I don't know.\"\nContext:\n{context_str}\nQuestion:\n{query_str}\nAnswer:\n\"\"\"\n# Create a query engine with the custom prompt.\n# The prompt is passed as text_qa_template; an unrecognized keyword would be silently ignored.\nquery_engine = index.as_query_engine(similarity_top_k=3, llm=llm, text_qa_template=PromptTemplate(prompt_template))\nprint(\"Combined index and query engine created successfully!\")<\/code><\/pre>\n<p class=\"wp-block-paragraph\">Now let\u2019s test the RAG on some queries and look at the corresponding trustworthiness scores.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">query = \"When is mixture of experts approach used?\"\nresponse = query_engine.query(query)\ndisplay_response(response)<\/code><\/pre>\n<figure class=\"wp-block-image alignwide size-large\"><img data-recalc-dims=\"1\" data-dominant-color=\"eeeeee\" data-has-transparency=\"true\" style=\"--dominant-color: #eeeeee;\" loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"88\" src=\"https:\/\/i0.wp.com\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/Screenshot-2025-02-12-at-10.54.24%25E2%2580%25AFAM-1024x88.png?resize=1024%2C88&#038;ssl=1\" alt=\"\" class=\"wp-image-597794 has-transparency\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/Screenshot-2025-02-12-at-10.54.24\u202fAM-1024x88.png 1024w, 
https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/Screenshot-2025-02-12-at-10.54.24\u202fAM-300x26.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/Screenshot-2025-02-12-at-10.54.24\u202fAM-768x66.png 768w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/Screenshot-2025-02-12-at-10.54.24\u202fAM-1536x131.png 1536w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/Screenshot-2025-02-12-at-10.54.24\u202fAM-2048x175.png 2048w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\"><figcaption class=\"wp-element-caption\">Response to the query \u2018When is mixture of experts approach used?\u2019 (image by author)<\/figcaption><\/figure>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">query = \"How do you compare Deepseek model with OpenAI's models?\"\nresponse = query_engine.query(query)\ndisplay_response(response)<\/code><\/pre>\n<figure class=\"wp-block-image alignwide size-large\"><img data-recalc-dims=\"1\" data-dominant-color=\"f0f0f0\" data-has-transparency=\"true\" style=\"--dominant-color: #f0f0f0;\" loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"429\" src=\"https:\/\/i0.wp.com\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/Screenshot-2025-02-12-at-10.55.34%25E2%2580%25AFAM-1024x429.png?resize=1024%2C429&#038;ssl=1\" alt=\"\" class=\"wp-image-597795 has-transparency\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/Screenshot-2025-02-12-at-10.55.34\u202fAM-1024x429.png 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/Screenshot-2025-02-12-at-10.55.34\u202fAM-300x126.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/Screenshot-2025-02-12-at-10.55.34\u202fAM-768x322.png 768w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/Screenshot-2025-02-12-at-10.55.34\u202fAM-1536x643.png 1536w, 
https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/Screenshot-2025-02-12-at-10.55.34\u202fAM-2048x858.png 2048w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\"><figcaption class=\"wp-element-caption\">Response to the query \u2018How do you compare the Deepseek model with OpenAI\u2019s models?\u2019 (image by author)<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">Assigning a trustworthiness score to an LLM\u2019s response, whether generated through direct inference or RAG, helps to gauge the reliability of the AI\u2019s output and to prioritize human verification where needed. This is particularly important in critical domains where a wrong or unreliable response could have severe consequences.<\/p>\n<p class=\"wp-block-paragraph\"><em>That\u2019s all folks! If you like the article, please follow me on <\/em><a href=\"https:\/\/medium.com\/@umairali.khan\"><em>Medium<\/em><\/a><em> and <\/em><a href=\"http:\/\/www.linkedin.com\/in\/uakhan80\"><em>LinkedIn<\/em><\/a><em>.<\/em><\/p>\n<p>The post <a href=\"https:\/\/towardsdatascience.com\/how-to-measure-the-reliability-of-a-large-language-models-response\/\">How to Measure the Reliability of a Large Language Model\u2019s Response<\/a> appeared first on <a href=\"https:\/\/towardsdatascience.com\/\">Towards Data Science<\/a>.<\/p>\n<\/div>\n<p>Umair Ali Khan<br \/>\n<a href=\"https:\/\/towardsdatascience.com\/how-to-measure-the-reliability-of-a-large-language-models-response\/\">Go to original source<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>How to Measure the Reliability of a Large Language Model\u2019s Response The basic principle of Large Language Models (LLMs) is very simple: to predict the next word (or token) in a sequence of words based on statistical patterns in their training data. 
However, this seemingly simple capability turns out to be incredibly sophisticated when it [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[62,71,1730,87,70,1500,1648],"tags":[7,146,134],"class_list":["post-1821","post","type-post","status-publish","format-standard","hentry","category-aimldsaimlds","category-large-language-models","category-llamaindex","category-llm","category-machine-learning","category-model-evaluation","category-retrieval-augmented","tag-how","tag-language","tag-llm"],"_links":{"self":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/1821"}],"collection":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/comments?post=1821"}],"version-history":[{"count":0,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/1821\/revisions"}],"wp:attachment":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/media?parent=1821"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/categories?post=1821"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/tags?post=1821"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}