{"id":1377,"date":"2025-01-23T07:03:49","date_gmt":"2025-01-23T07:03:49","guid":{"rendered":"https:\/\/mailitics.com\/index.php\/2025\/01\/23\/how-to-evaluate-llm-summarization-18a040c3905d\/"},"modified":"2025-01-23T07:03:49","modified_gmt":"2025-01-23T07:03:49","slug":"how-to-evaluate-llm-summarization-18a040c3905d","status":"publish","type":"post","link":"https:\/\/mailitics.com\/index.php\/2025\/01\/23\/how-to-evaluate-llm-summarization-18a040c3905d\/","title":{"rendered":"How to Evaluate LLM Summarization"},"content":{"rendered":"<div>\n<h4>A practical and effective guide for evaluating AI summaries<\/h4>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2A4EjerlC79zcw-_zuyKioEw.jpeg?ssl=1\"><figcaption>Image from\u00a0Unsplash<\/figcaption><\/figure>\n<p><strong>Summarization<\/strong> is one of the most practical and convenient tasks enabled by LLMs. However, compared to other LLM tasks like question-answering or classification, evaluating LLMs on summarization is far more challenging.<\/p>\n<p>As a result, I have neglected evals for summarization, even though two apps I\u2019ve built rely heavily on summarization (<a href=\"https:\/\/podsmartai.com\/\"><em>Podsmart<\/em><\/a> summarizes podcasts, while <a href=\"https:\/\/www.airead.me\/\"><em>aiRead<\/em><\/a> creates personalized PDF summaries based on your highlights).<\/p>\n<p>But recently, I\u2019ve been persuaded\u200a\u2014\u200athanks to insightful posts from thought leaders in the AI industry\u200a\u2014\u200aof the critical role of evals in systematically assessing and improving LLM systems (<a href=\"https:\/\/hamel.dev\/blog\/posts\/evals\/\">link<\/a> and <a href=\"https:\/\/applied-llms.org\/\">link<\/a>). 
This motivated me to start investigating evals for summaries.<\/p>\n<p>In this article, I present an <strong>easy-to-implement, research-backed and quantitative framework to evaluate summaries,<\/strong> which improves on the Summarization metric in the <a href=\"https:\/\/docs.confident-ai.com\/docs\/getting-started\">DeepEval<\/a> framework created by Confident AI.<\/p>\n<p>I will illustrate my process with an example notebook (code on <a href=\"https:\/\/github.com\/thamsuppp\/summary-eval-article\">GitHub<\/a>), in which I evaluate a ~500-word summary of a ~2500-word article, <em>Securing the AGI Laurel: Export Controls, the Compute Gap, and China\u2019s Counterstrategy<\/em> (found <a href=\"https:\/\/www.csis.org\/analysis\/securing-agi-laurel-export-controls-compute-gap-and-chinas-counterstrategy\">here<\/a>, published in December\u00a02024).<\/p>\n<h4>Table of\u00a0Contents<\/h4>\n<p>\u2218 <a href=\"https:\/\/towardsdatascience.com\/#f083\">Why it\u2019s difficult to evaluate summarization<\/a><br \/> \u2218 <a href=\"https:\/\/towardsdatascience.com\/#05af\">What makes a good summary<\/a><br \/> \u2218 <a href=\"https:\/\/towardsdatascience.com\/#60ce\">Introduction to DeepEval<\/a><br \/> \u2218 <a href=\"https:\/\/towardsdatascience.com\/#f6d5\">DeepEval\u2019s Summarization Metric<\/a><br \/> \u2218 <a href=\"https:\/\/towardsdatascience.com\/#8959\">Improving the Summarization Metric<\/a><br \/> \u2218 <a href=\"https:\/\/towardsdatascience.com\/#6a41\">Conciseness Metrics<\/a><br \/> \u2218 <a href=\"https:\/\/towardsdatascience.com\/#d06e\">Coherence Metric<\/a><br \/> \u2218 <a href=\"https:\/\/towardsdatascience.com\/#7f21\">Putting it all together<\/a><br \/> \u2218 <a href=\"https:\/\/towardsdatascience.com\/#b3a6\">Future\u00a0Work<\/a><\/p>\n<h4><strong>Why it\u2019s difficult to evaluate summarization<\/strong><\/h4>\n<p>Before I start, let me elaborate on why I claim that summarization is a difficult task to 
evaluate.<\/p>\n<p>Firstly, a summary is inherently open-ended output (as opposed to tasks like classification or entity extraction). So, what makes a summary good depends on qualitative metrics such as fluency, coherence and consistency, which are not straightforward to measure quantitatively. Furthermore, these metrics are often subjective\u200a\u2014\u200afor example, relevance depends on the context and audience.<\/p>\n<p>Secondly, it is difficult to create gold-labelled datasets to evaluate your system\u2019s summaries against. For RAG, it is straightforward to create a dataset of synthetic question-answer pairs to evaluate the retriever (see this <a href=\"https:\/\/playbooks.capdev.govtext.gov.sg\/appendix_dataset\/\">nice walkthrough<\/a>).<\/p>\n<p>For summarization, there isn\u2019t an obvious way to generate reference summaries automatically, so we have to turn to humans to create them. While researchers have curated summarization datasets, these would not be customized to your use\u00a0case.<\/p>\n<p>Thirdly, I find that most summarization metrics in the academic literature are not suitable for practically minded AI developers to implement. Some papers trained neural summarization metrics (e.g. <a href=\"https:\/\/github.com\/google-research-datasets\/seahorse\">Seahorse<\/a>, <a href=\"https:\/\/github.com\/tingofurro\/summac\">Summac<\/a>), which are several GB in size and challenging to run at scale (<em>perhaps I\u2019m just lazy and should learn how to run HuggingFace models locally and on a GPU cluster, but still it\u2019s a barrier to entry for most<\/em>). Traditional metrics such as BLEU and ROUGE rely on exact word\/phrase overlap and were created in the pre-LLM era for extractive summarization, so they may not work well for evaluating abstractive summaries generated by LLMs, which could paraphrase the source\u00a0text.<\/p>\n<p>Nevertheless, in my experience, humans can easily distinguish a good summary from a bad one. 
One common failure mode is being vague and roundabout-y (e.g. \u2018<em>this summary describes the reasons\u00a0for\u2026<\/em>\u2019).<\/p>\n<h4><strong>What makes a good\u00a0summary<\/strong><\/h4>\n<p>So what is a good summary? Eugene Yan\u2019s <a href=\"https:\/\/eugeneyan.com\/writing\/abstractive\/\">article<\/a> offers good detail on various summary metrics. I distil them into 4 key qualities:<\/p>\n<ol>\n<li>\n<strong>Relevant<\/strong>\u200a\u2014\u200athe summary retains important points and details from the source\u00a0text<\/li>\n<li>\n<strong>Concise<\/strong>\u200a\u2014\u200athe summary is information-dense, does not repeat the same point multiple times, and is not unnecessarily verbose<\/li>\n<li>\n<strong>Coherent<\/strong>\u200a\u2014\u200athe summary is well-structured and easy to follow, not just a jumble of condensed facts<\/li>\n<li>\n<strong>Faithful<\/strong>\u200a\u2014\u200athe summary does not hallucinate information that is not supported by the source\u00a0text<\/li>\n<\/ol>\n<p>One key insight is that you can actually formulate relevance and faithfulness (the first and last qualities) as a <strong>precision and recall<\/strong> problem\u200a\u2014\u200ahow many facts from the source text are retained in the summary (<strong>recall<\/strong>), and how many facts from the summary are supported by the source text (<strong>precision<\/strong>).<\/p>\n<p>This formulation brings us back to more familiar territory of classification problems in ML, and suggests a quantitative way to evaluate summaries.<\/p>\n<p>There are some differences from standard classification. Firstly, a higher recall is better, <em>holding summary length constant<\/em>. You don\u2019t want to score 100% recall with a summary the same length as the source. Secondly, you\u2019d ideally want precision to be as close to 100% as possible\u200a\u2014\u200ahallucinating information is really bad. 
I\u2019ll come back to these\u00a0later.<\/p>\n<h4><strong>Introduction to\u00a0DeepEval<\/strong><\/h4>\n<p>You\u2019d be spoilt for choice with all the different LLM eval frameworks out there\u200a\u2014\u200afrom Braintrust to Langfuse and more. However, today I\u2019ll be using DeepEval, a user-friendly framework that lets you get started quickly, both with evals in general and with summarization specifically.<\/p>\n<p>DeepEval has easy out-of-the-box implementations of many key RAG metrics, and it has a flexible Chain-of-Thought-based LLM-as-a-judge tool called GEval that lets you define any custom criteria you want (I\u2019ll use this\u00a0later).<\/p>\n<p>Additionally, it has helpful infrastructure to organize and speed up evals: they\u2019ve nicely <strong>parallelized<\/strong> everything with async, so you can run evals on your entire dataset rapidly. They have handy features for <strong>synthetic<\/strong> data generation (<em>will cover in later articles<\/em>), and they allow you to define <strong>custom metrics<\/strong> to adapt their metrics (exactly what we\u2019re going to do today), or to define non-LLM-based eval metrics for more cost-effective &amp; robust evals (e.g. entity density,\u00a0later).<\/p>\n<h4><strong>DeepEval\u2019s Summarization Metric<\/strong><\/h4>\n<p>DeepEval\u2019s summarization metric (read more about it <a href=\"https:\/\/docs.confident-ai.com\/docs\/metrics-summarization\">here<\/a>) is a reference-free metric (i.e. no need for gold-standard summaries). It just requires the source text (passed as the input field) and the generated summary to be evaluated (passed as the actual_output field). 
As you can see, the set-up and evaluation code below is really\u00a0simple!<\/p>\n<pre>from deepeval import evaluate<br>from deepeval.metrics import SummarizationMetric<br>from deepeval.test_case import LLMTestCase<br><br># `text` is the source article and `summary` is the generated summary (both strings)<br># Create a DeepEval test case for the purposes of the evaluation<br>test_case = LLMTestCase(<br>  input = text,<br>  actual_output = summary<br>)<br><br># Instantiate the summarization metric<br>summarization_metric = SummarizationMetric(verbose_mode = True, n = 20, truths_extraction_limit = 20)<br><br># Run the evaluation on the test case<br>eval_result = evaluate([test_case], [summarization_metric])<\/pre>\n<p>The summarization metric actually evaluates two separate components under the hood: <strong>alignment<\/strong> and <strong>coverage<\/strong>. These correspond closely to the <strong>precision<\/strong> and <strong>recall<\/strong> formulation I introduced earlier!<\/p>\n<p>For alignment, the evaluator LLM generates a list of <strong>claims<\/strong> from the summary, then determines how many of these claims are supported by <strong>truths<\/strong> extracted from the source text. The proportion of supported claims is the <strong>alignment score<\/strong>.<\/p>\n<p>In the case of coverage, the LLM generates a list of assessment questions from the source text, then tries to answer the questions, using only the summary as context. The LLM is prompted to respond \u2018idk\u2019 if the answer cannot be found. Then, the LLM will determine how many of these answers are correct, to get the <strong>coverage\u00a0score<\/strong>.<\/p>\n<p>The final summarization score is the minimum of the alignment and coverage\u00a0scores.<\/p>\n<h4><strong>Improving the Summarization Metric<\/strong><\/h4>\n<p>However, while what DeepEval has done is a great starting point, there are three key issues that hinder the reliability and usefulness of the Summarization metric in its current\u00a0form.<\/p>\n<p>So I have built a <strong>custom summarization metric<\/strong> which adapts DeepEval\u2019s version. 
Below, I\u2019ll explain each problem and the corresponding solution I\u2019ve implemented to overcome\u00a0it:<\/p>\n<p><strong>1: Using yes\/no questions for the coverage metric is too simplistic<\/strong><\/p>\n<p>Currently, the assessment questions are constrained to be yes\/no questions, in which the answer to the question is yes\u200a\u2014\u200ahave a look at the questions:<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/936\/1%2AkRfVtI28jSFD2l0GTv1sow.png?ssl=1\"><figcaption>Image by\u00a0author<\/figcaption><\/figure>\n<p>There are two problems with\u00a0this:<\/p>\n<p>Firstly, by framing the questions as binary yes\/no, this limits their informativeness, especially in determining nuanced qualitative points.<\/p>\n<p>Secondly, if the LLM that answers given the summary hallucinates a \u2018yes\u2019 answer (as there are only 3 possible answers: \u2018yes\u2019, \u2018no\u2019, \u2018idk\u2019, it\u2019s not unlikely it\u2019ll hallucinate yes), the evaluator will erroneously deem this answer to be correct. It is much more difficult to hallucinate the correct answer to an open-ended question. Furthermore, if you look at the questions, they are phrased in a contrived way almost hinting that the answer is \u2018yes\u2019 (e.g. 
\u201cDoes China employ informational opacity as a strategy?\u201d), hence increasing the likelihood of a hallucinated \u2018yes\u2019.<\/p>\n<p>My solution was to ask the LLM to <strong>generate open-ended questions<\/strong> from the source text\u200a\u2014\u200ain the code, these are referred to as \u2018complex questions\u2019.<\/p>\n<p>Additionally, I ask the LLM to assign an <strong>importance rating to each question<\/strong> (so we can perhaps upweight more important questions in the coverage\u00a0score).<\/p>\n<p>Since the questions are now open-ended, I use an <strong>LLM for evaluation<\/strong>\u200a\u2014\u200aI ask the LLM to give a <strong>0\u20135 score of how similar<\/strong> the answer generated from the summary is to the answer generated with the source text (the reference answer), as well as an explanation.<\/p>\n<pre>def generate_complex_verdicts(answers):<br>    return f\"\"\"You are given a list of JSON objects. Each contains 'original_answer' and 'summary_answer'.<br>    Original answer is the correct answer to a question. <br>    Your job is to assess if the summary answer is correct, based on the model answer which is the original answer.<br>    Give a score from 0 to 5, with 0 being completely wrong, and 5 being completely correct.<br>    If the 'summary_answer' is 'idk', return a score of 0.<br><br>    Return a JSON object with the key 'verdicts', which is a list of JSON objects, with the keys: 'score', and 'reason': a concise 1 sentence explanation for the score.<br>...\"\"\"<br><br>def generate_complex_questions(text, n):<br>        return f\"\"\"Based on the given text, generate a list of {n} questions that can be answered with the information in this document.<br>        The questions should be related to the main points of this document. 
<br>        Then, provide a concise 1 sentence answer to the question, using only information that can be found in the document.<br>        Answer concisely, your answer does not need to be in full sentences.<br>        Make sure the questions are different from each other. <br>        They should cover a combination of questions on cause, impact, policy, advantages\/disadvantages, etc.<br><br>        Lastly, rate the importance of this question to the document on a scale of 1 to 5, with 1 being not important and 5 being most important. <br>        Important question means the question is related to an essential or main point of the document,<br>        and that not knowing the answer to this question would mean that the reader has not understood the document's main point at all.<br>        A less important question is one asking about a smaller detail, that is not essential to understanding the document's main point.<br><br> ...\"\"\"<\/pre>\n<p><strong>2: Extracting truths from source text for alignment is\u00a0flawed<\/strong><\/p>\n<p>Currently, for the alignment metric, a list of <strong>truths<\/strong> is extracted from the source text using an LLM (the number of truths is controlled by the truths_extraction_limit parameter). This leads to some facts\/details from the source text being omitted from the truths, which the summary\u2019s claims are then compared\u00a0against.<\/p>\n<p>To be honest, I\u2019m not sure what the team was thinking when they implemented it like this\u200a\u2014\u200aperhaps I have missed a nuance or misunderstood their intention.<\/p>\n<p>However, this leads to two problems that render the alignment score \u2018unusable\u2019 according to a <a href=\"https:\/\/github.com\/confident-ai\/deepeval\/issues\/937\">user on\u00a0GitHub<\/a>.<\/p>\n<p>Firstly, the LLM-generated list of truths is non-deterministic, hence people have reported wildly changing alignment scores. 
This inconsistency likely stems from the LLM choosing different subsets of truths each time. More critically, the truth extraction process makes the metric an unfair judge of the summary\u2019s faithfulness, because a detail from the summary could be found in the source text but not in the extracted truths. Anecdotally, all the claims that were flagged as unfaithful were in fact present in the source text, just not in the extracted truths. Additionally, <a href=\"https:\/\/github.com\/confident-ai\/deepeval\/issues\/565\">people have reported<\/a> that when the summary passed in is identical to the input, the alignment score is less than 1, which is\u00a0strange.<\/p>\n<p>To address this, I made a simple adjustment\u200a\u2014\u200awhich was to pass the <strong>entire source text<\/strong> into the LLM evaluating the summary claims, instead of the list of truths. Since all the claims are evaluated together in one LLM call, this won\u2019t significantly raise token\u00a0costs.<\/p>\n<p><strong>3: Final score being min(alignment score, coverage score) is\u00a0flawed<\/strong><\/p>\n<p>Currently, the score that is output is the minimum of the alignment and coverage scores (and there\u2019s actually no way of accessing the individual scores without placing them in the\u00a0logs).<\/p>\n<p>This is problematic, because the coverage score will likely be lower than the alignment score (if not, then there\u2019re real problems!). This means that changes in the alignment score do not affect the final score. However, that doesn\u2019t mean that we can ignore deteriorations in the alignment score (say from 1 to 0.8), which arguably signal a more severe problem with the summary (i.e. hallucinating a\u00a0claim).<\/p>\n<p>My solution was to <strong>change the final score to the F1 score<\/strong> (the harmonic mean of precision and recall), just like in ML classification, to capture the importance of both precision and recall. An extension is to change the weighting of precision &amp; recall (e.g. 
upweight precision if you think that hallucination is something to avoid at all costs\u200a\u2014\u200asee\u00a0<a href=\"https:\/\/stats.stackexchange.com\/questions\/559736\/is-there-a-metric-that-combines-recall-and-precision-other-than-the-f1-score\">here<\/a>).<\/p>\n<p>With these 3 changes, the summarization metric now better reflects the relevance and faithfulness of the generated summaries.<\/p>\n<h4><strong>Conciseness Metrics<\/strong><\/h4>\n<p>However, this still gives an incomplete picture. A summary should also be <strong>concise<\/strong> and information-dense, condensing key information into a shorter\u00a0version.<\/p>\n<p><strong>Entity density<\/strong> is a useful and cheap metric to look at. The Chain-of-Density paper shows that human-created summaries, as well as human-preferred AI-generated summaries, have an entity density of ~0.15 entities per token, striking the right balance between clarity (favoring less dense) and informativeness (favoring more\u00a0dense).<\/p>\n<p>Hence, we can create a <strong>Density Score<\/strong> which penalizes summaries with Entity Density further away from 0.15 (either too dense or not dense enough). Initial AI-generated summaries are typically less dense (0.10 or less), and the Chain-of-Density <a href=\"https:\/\/arxiv.org\/pdf\/2309.04269\">paper<\/a> shows an iterative process to increase the density of summaries. 
Ivan Leo &amp; Jason Liu wrote a good <a href=\"https:\/\/python.useinstructor.com\/blog\/2023\/11\/05\/chain-of-density\/\">article<\/a> on fine-tuning Chain-of-Density summaries using entity density as the key\u00a0metric.<\/p>\n<pre>import nltk<br>import spacy<br>nlp = spacy.load(\"en_core_web_sm\")<br><br>def get_entity_density(text):<br>  summary_tokens = nltk.word_tokenize(text)<br>  num_tokens = len(summary_tokens)<br>  # Extract entities<br>  doc = nlp(text)<br>  num_entities = len(doc.ents)<br>  entity_density = num_entities \/ num_tokens<br>  return entity_density<\/pre>\n<p>Next, I use a <strong>Sentence Vagueness<\/strong> metric to explicitly penalize vague sentences ( \u2018<em>this summary describes the reasons for\u2026\u2019<\/em>) that don\u2019t actually state the key information.<\/p>\n<p>For this, I break up the summary into sentences (similar to the alignment metric) and ask an LLM to classify if each sentence is vague or not, with the final score being the proportion of sentences classified as\u00a0vague.<\/p>\n<pre>prompt = ChatPromptTemplate.from_template(<br>    \"\"\"You are given a list of sentences from a summary of a text.<br>    For each sentence, your job is to evaluate if the sentence is vague, and hence does not help in summarizing the key points of the text.<br><br>    Vague sentences are those that do not directly mention a main point, e.g. 'this summary describes the reasons for China's AI policy'. 
<br>    Such a sentence does not mention the specific reasons, and is vague and uninformative.<br>    Sentences that use phrases such as 'the article suggests', 'the author describes', 'the text discusses' are also considered vague and verbose.<br>  ...<br>    OUTPUT:\"\"\"<br>)<br><br># Assumes BaseModel is pydantic's BaseModel, List is typing.List, and `llm` is a LangChain chat model defined elsewhere<br>class SentenceVagueness(BaseModel):<br>    sentence_id: int<br>    is_vague: bool<br>    reason: str<br><br>class SentencesVagueness(BaseModel):<br>    sentences: List[SentenceVagueness]<br><br>chain = prompt | llm.with_structured_output(SentencesVagueness)<\/pre>\n<p>Lastly, a summary that repeats the same information is inefficient, as it wastes valuable space that could have been used to convey new meaningful insights.<\/p>\n<p>Hence, we construct a <strong>Repetitiveness<\/strong> score using <strong>GEval<\/strong>. As I briefly mentioned above, GEval uses LLM-as-a-judge with chain-of-thought to evaluate any custom criteria. As detecting repeated concepts is a more complex problem, we need a more intelligent detector, i.e. an LLM. (<em>Warning: the results for this metric seemed quite unstable\u200a\u2014\u200athe LLM would change its answer when I ran it repeatedly on the same input. Some prompt engineering may help.<\/em>)<\/p>\n<pre>from deepeval.metrics import GEval<br>from deepeval.test_case import LLMTestCaseParams<br><br>repetitiveness_metric = GEval(<br>    name=\"Repetitiveness\",<br>    criteria=\"\"\"I do not want my summary to contain unnecessarily repetitive information.<br>    Return 1 if the summary does not contain unnecessarily repetitive information, and 0 if the summary contains unnecessarily repetitive information.<br>    Unnecessarily repetitive information means facts or main points that are repeated more than once. Points on the same topic, but talking about different aspects, are OK. 
In your reasoning, point out any unnecessarily repetitive points.\"\"\",<br>    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],<br>    verbose_mode = True<br>)<\/pre>\n<h4><strong>Coherence Metric<\/strong><\/h4>\n<p>Finally, we want to ensure that LLM outputs are coherent\u200a\u2014\u200ahaving a logical flow with related points together and making smooth transitions. Meta\u2019s recent Large Concept Models <a href=\"https:\/\/arxiv.org\/pdf\/2412.08821\">paper<\/a> used a local coherence metric from Parola et al. (2023)\u200a\u2014\u200athe average cosine similarity between each nth and (n+2)th sentence. It is a simple metric that is easy to implement. We find that the LLM summary has a score of ~0.45. As a sense check, if we randomly permute the sentences of the summary, the coherence score drops below\u00a00.4.<\/p>\n<pre># Imports inferred from usage (not shown in the original snippet)<br>import numpy as np<br>from scipy.spatial.distance import cosine<br>from langchain_openai import OpenAIEmbeddings<br><br># Calculate cosine similarity between each nth and (n+2)th sentence<br>def compute_coherence_score(sentences):<br>  embedding_model = OpenAIEmbeddings(model=\"text-embedding-3-small\")<br>  sentences_embeddings = embedding_model.embed_documents(sentences)<br>  sentence_similarities = []<br>  for i in range(len(sentences_embeddings) - 2):<br>    # Convert embeddings to numpy arrays<br>    emb1 = np.array(sentences_embeddings[i])<br>    emb2 = np.array(sentences_embeddings[i+2])<br>    # SciPy's cosine() returns a distance, so similarity = 1 - distance<br>    distance = cosine(emb1, emb2)<br>    similarity = 1 - distance<br>    sentence_similarities.append(similarity)<br>  coherence_score = np.mean(sentence_similarities)<br>  return coherence_score<\/pre>\n<h4><strong>Putting it all\u00a0together<\/strong><\/h4>\n<p>We can package each of the above metrics into Custom Metrics. The benefit is that you can evaluate all of them in parallel on your dataset of summaries and get all your results in one place! 
(see the <a href=\"https:\/\/github.com\/thamsuppp\/summary-eval-article\/blob\/main\/summary_eval_article_notebook.ipynb\">code notebook<\/a>)<\/p>\n<p>One caveat, though, is that for some of these metrics, like Coherence or Recall, there isn\u2019t a sense of what the \u2018optimal\u2019 value is for a summary, and we can only compare scores across different AI-generated summaries to determine better or\u00a0worse.<\/p>\n<h4><strong>Future Work<\/strong><\/h4>\n<p>What I\u2019ve introduced in this article provides a solid starting point for evaluating your summaries!<\/p>\n<p>It\u2019s not perfect though, and there are areas for future exploration and improvement.<\/p>\n<p>One area is to better test whether the summaries capture <strong>important points<\/strong> from the source text. You don\u2019t want a summary that has a high recall, but of unimportant details.<\/p>\n<p>Currently, when we generate assessment questions, we ask the LLM to rate their <strong>importance<\/strong>. However, it\u2019s hard to take those importance ratings as the ground truth either\u200a\u2014\u200aif you think about it, when LLMs summarize they essentially rate the importance of different facts too. Hence, we need a measure of <strong>importance outside the LLM.<\/strong> Of course, the ideal is to have human reference summaries, but these are expensive and not scalable. Another source of reference summaries would be reports with executive summaries (e.g. finance pitches, conclusions from slide decks, abstracts from papers). We could also use techniques like the PageRank of embeddings to identify the central concepts algorithmically.<\/p>\n<p>An interesting idea to try is generating <strong>synthetic source articles\u200a<\/strong>\u2014\u200astart with a set of main points (representing ground-truth \u201cimportant\u201d points) on a given topic, and then ask the LLM to lengthen them into a full article (run this multiple times with high temperature to generate many diverse synthetic articles!). 
Then run the full articles through the summarization process, and evaluate the summaries on retaining the original main\u00a0points.<\/p>\n<p>Last but not least, it is very important to ensure that each of the summarization metrics I\u2019ve introduced <strong>correlates with human evaluations of summary preference<\/strong>. While researchers have done so for some metrics on large summarization datasets, these findings might not generalize to your texts and\/or audience. (perhaps your company prefers a specific style of summaries e.g. with many statistics).<\/p>\n<p>For an excellent discussion on this topic, see \u2018Level 2\u2019 of Hamel Husain\u2019s <a href=\"https:\/\/hamel.dev\/blog\/posts\/evals\/#automated-evaluation-w-llms\">article on evals<\/a>. For example, if you find that LLM\u2019s Sentence Vagueness scores don\u2019t correlate well with what you consider to be vague sentences, then some prompt engineering (providing examples of vague sentences, elaborating more) can hopefully bring the correlation up.<\/p>\n<p>Although this step can be time-consuming, it is essential, in order to ensure you can trust the LLM evals. This will save you time in the long run anyway\u200a\u2014\u200awhen your LLM evals are aligned, you essentially gain an infinitely-scalable evaluator customised to your needs and preferences.<\/p>\n<p>You can speed up your human evaluation process by creating an easy-to-use Gradio annotation interface\u200a\u2014\u200aI one-shotted a decent interface using OpenAI\u00a0o1!<\/p>\n<p>In a future article, I will talk about how to actually use these insights to improve my summarization process. Two years ago I <a href=\"https:\/\/medium.com\/towards-data-science\/summarize-podcast-transcripts-and-long-texts-better-with-nlp-and-ai-e04c89d3b2cb\">wrote<\/a> about how to summarize long texts, but both LLM advances and 2 years of experience have led to my summarization methods changing dramatically.<\/p>\n<p>Thanks so much for reading! 
In case you missed it, all the code can be found in the GitHub repo <a href=\"https:\/\/github.com\/thamsuppp\/summary-eval-article\">here<\/a>. Follow me on <a href=\"https:\/\/x.com\/thamsuppp\">X\/Twitter<\/a> for more posts on\u00a0AI!<\/p>\n<p>What metrics do you use to evaluate LLM summarization? Let me know in the comments!<\/p>\n<hr>\n<p><a href=\"https:\/\/towardsdatascience.com\/how-to-evaluate-llm-summarization-18a040c3905d\">How to Evaluate LLM Summarization<\/a> was originally published in <a href=\"https:\/\/towardsdatascience.com\/\">Towards Data Science<\/a> on Medium, where people are continuing the conversation by highlighting and responding to this story.<\/p>\n<\/div>\n<p>Isaac Tham<br \/>\n<a href=\"https:\/\/medium.com\/m\/global-identity-2?redirectUrl=https%3A%2F%2Ftowardsdatascience.com%2Fhow-to-evaluate-llm-summarization-18a040c3905d\">Go to original source<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>How to Evaluate LLM Summarization A practical and effective guide for evaluating AI summaries Image from\u00a0Unsplash Summarization is one of the most practical and convenient tasks enabled by LLMs. However, compared to other LLM tasks like question-asking or classification, evaluating LLMs on summarization is far more challenging. 
And so I myself have neglected evals for [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[62,69,83,87,260,92],"tags":[572,1422,1421],"class_list":["post-1377","post","type-post","status-publish","format-standard","hentry","category-aimldsaimlds","category-artificial-intelligence","category-data-science","category-llm","category-nlp","category-thoughts-and-theory","tag-evaluate","tag-summaries","tag-summarization"],"_links":{"self":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/1377"}],"collection":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/comments?post=1377"}],"version-history":[{"count":0,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/1377\/revisions"}],"wp:attachment":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/media?parent=1377"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/categories?post=1377"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/tags?post=1377"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}