{"id":728,"date":"2024-12-21T07:04:57","date_gmt":"2024-12-21T07:04:57","guid":{"rendered":"https:\/\/mailitics.com\/index.php\/2024\/12\/21\/semantically-compress-text-to-save-on-llm-costs-0b3e62b0c43a\/"},"modified":"2024-12-21T07:04:57","modified_gmt":"2024-12-21T07:04:57","slug":"semantically-compress-text-to-save-on-llm-costs-0b3e62b0c43a","status":"publish","type":"post","link":"https:\/\/mailitics.com\/index.php\/2024\/12\/21\/semantically-compress-text-to-save-on-llm-costs-0b3e62b0c43a\/","title":{"rendered":"Semantically Compress Text to Save On LLM Costs"},"content":{"rendered":"<p>    Semantically Compress Text to Save On LLM Costs<br \/>\n \t<BR><br \/>\n<BR><\/BR><br \/>\n    <!-- no image --><br \/>\n \t<BR><br \/>\n<BR><\/BR><\/p>\n<div>\n<h4>LLMs are great\u2026 if they can fit all of your\u00a0data<\/h4>\n<figure><img decoding=\"async\" alt=\"\" src=\"https:\/\/cdn-images-1.medium.com\/max\/1024\/0*LLApz7tkaqL9eQFl\"><figcaption>Photo by <a href=\"https:\/\/unsplash.com\/@christopher__burns?utm_source=medium&amp;utm_medium=referral\">Christopher Burns<\/a> on\u00a0<a href=\"https:\/\/unsplash.com\/?utm_source=medium&amp;utm_medium=referral\">Unsplash<\/a><\/figcaption><\/figure>\n<p><em>Originally published at <\/em><a href=\"https:\/\/blog.developer.bazaarvoice.com\/2024\/10\/28\/semantically-compress-text-to-save-on-llm-costs\/\"><em>https:\/\/blog.developer.bazaarvoice.com<\/em><\/a><em> on October 28,\u00a02024.<\/em><\/p>\n<h3>Introduction<\/h3>\n<p>Large language models are fantastic tools for unstructured text, but what if your text doesn\u2019t fit in the context window? Bazaarvoice faced exactly this challenge when building our AI Review Summaries feature: millions of user reviews simply won\u2019t fit into the context window of even newer LLMs and, even if they did, it would be prohibitively expensive.<\/p>\n<p>In this post, I share how Bazaarvoice tackled this problem by compressing the input text without loss of semantics. 
Specifically, we use a multi-pass hierarchical clustering approach that lets us explicitly adjust the level of detail we want to lose in exchange for compression, regardless of the embedding model chosen. The final technique made our Review Summaries feature financially feasible and set us up to continue to scale our business in the\u00a0future.<\/p>\n<h3>The Problem<\/h3>\n<p>Bazaarvoice has been collecting user-generated product reviews for nearly 20 years, so we have <em>a lot<\/em> of data. These product reviews are completely unstructured, varying in length and content. Large language models are excellent tools for unstructured text: they can handle unstructured data and identify relevant pieces of information amongst distractors.<\/p>\n<p>LLMs have their limitations, however, and one such limitation is the context window: how many tokens (roughly the number of words) can be put into the network at once. State-of-the-art large language models, such as Anthropic\u2019s Claude 3, have extremely large context windows of up to 200,000 tokens. This means you can fit small novels into them, but the internet is still a vast, ever-growing collection of data, and our user-generated product reviews are no different.<\/p>\n<p>We hit the context window limit while building our Review Summaries feature, which summarizes all of the reviews of a specific product on a client\u2019s website. Over the past 20 years, many products have garnered thousands of reviews, which quickly overload the LLM context window. In fact, we even have products with millions of reviews, which would require immense re-engineering of LLMs to process in one\u00a0prompt.<\/p>\n<p>Even if it were technically feasible, the costs would be prohibitive. All LLM providers charge based on the number of input and output tokens. 
As we approach the context window limit for each product, of which we have millions, we can quickly run up cloud hosting bills in excess of six\u00a0figures.<\/p>\n<h3>Our Approach<\/h3>\n<p>To ship Review Summaries despite these technical and financial limitations, we focused on a rather simple insight into our data: many reviews say the same thing. In fact, the whole idea of a summary relies on this: review summaries capture the recurring insights, themes, and sentiments of the reviewers. We realized that we can capitalize on this data duplication to reduce the amount of text we need to send to the LLM, saving us from hitting the context window limit <em>and<\/em> reducing the operating cost of our\u00a0system.<\/p>\n<p>To achieve this, we needed to identify segments of text that say the same thing. Such a task is easier said than done: people often use different words or phrases to express the same\u00a0thing.<\/p>\n<p>Fortunately, the task of identifying whether two pieces of text are semantically similar has been an active area of research in the natural language processing field. The work by Agirre et al. 2013 (<em>*SEM 2013 shared task: Semantic Textual Similarity. In Second Joint Conference on Lexical and Computational Semantics<\/em>) even published a human-labeled dataset of semantically similar sentences known as the STS Benchmark. In it, they ask humans to indicate whether pairs of sentences are semantically similar or dissimilar on a scale of 0\u20135, as illustrated in the table below (from Cer et al., <a href=\"https:\/\/aclanthology.org\/S17-2001\/\"><em>SemEval-2017 Task 1: Semantic Textual Similarity Multilingual and Crosslingual Focused Evaluation<\/em><\/a>):<\/p>\n<figure><img decoding=\"async\" alt=\"\" src=\"https:\/\/cdn-images-1.medium.com\/max\/616\/0*JMgNQbYheovGnVwl\"><\/figure>\n<p>The STS Benchmark dataset is often used to evaluate how well a text embedding model can associate semantically similar sentences in its high-dimensional space. 
Specifically, Pearson\u2019s correlation is used to measure how well the similarities computed by the embedding model agree with the human judgements.<\/p>\n<p>Thus, we can use such an embedding model to identify semantically similar phrases from product reviews, and then remove repeated phrases before sending them to the\u00a0LLM.<\/p>\n<p>Our approach is as\u00a0follows:<\/p>\n<ul>\n<li>First, product reviews are segmented into sentences.<\/li>\n<li>An embedding vector is computed for each sentence using a network that performs well on the STS Benchmark.<\/li>\n<li>Agglomerative clustering is used on all embedding vectors for each\u00a0product.<\/li>\n<li>An example sentence\u200a\u2014\u200athe one closest to the cluster centroid\u200a\u2014\u200ais retained from each cluster to send to the LLM, and other sentences within each cluster are\u00a0dropped.<\/li>\n<li>Any small clusters are considered outliers, and those are randomly sampled for inclusion in the LLM\u00a0prompt.<\/li>\n<li>The number of sentences each cluster represents is included in the LLM prompt to ensure the weight of each sentiment is considered.<\/li>\n<\/ul>\n<p>This may seem straightforward when written in a bulleted list, but there were some devils in the details we had to sort out before we could trust this approach.<\/p>\n<h3>Embedding Model Evaluation<\/h3>\n<p>First, we had to ensure the model we used effectively embedded text in a space where semantically similar sentences are close, and semantically dissimilar ones are far away. To do this, we simply used the STS Benchmark dataset and computed the Pearson correlation for the models we wanted to consider. 
We use AWS as a cloud provider, so naturally we wanted to evaluate their <a href=\"https:\/\/docs.aws.amazon.com\/bedrock\/latest\/userguide\/titan-embedding-models.html\">Titan Text Embedding<\/a> models.<\/p>\n<p>Below is a table showing the Pearson\u2019s correlation on the STS Benchmark for different Titan Embedding models:<\/p>\n<p><iframe loading=\"lazy\" src=\"https:\/\/cdn.embedly.com\/widgets\/media.html?url=https%3A%2F%2Fdatawrapper.dwcdn.net%2F2aNSh%2F&amp;type=text%2Fhtml&amp;schema=dwcdn&amp;display_name=Datawrapper&amp;src=https%3A%2F%2Fdatawrapper.dwcdn.net%2F2aNSh%2F1%2F\" width=\"600\" height=\"284\" frameborder=\"0\" scrolling=\"no\"><a href=\"https:\/\/medium.com\/media\/bdfdeecddae18578dffbbfc75b88949c\/href\">https:\/\/medium.com\/media\/bdfdeecddae18578dffbbfc75b88949c\/href<\/a><\/iframe><\/p>\n<p>(State of the art is visible\u00a0<a href=\"https:\/\/paperswithcode.com\/sota\/semantic-textual-similarity-on-sts-benchmark\">here<\/a>)<\/p>\n<p>So AWS\u2019s embedding models are quite good at embedding semantically similar sentences. This was great news for us\u200a\u2014\u200awe can use these models off the shelf and their cost is extremely low.<\/p>\n<h3>Semantically Similar Clustering<\/h3>\n<p>The next challenge we faced was: how can we enforce semantic similarity during clustering? Ideally, no cluster would have two sentences whose semantic similarity is less than humans can accept\u200a\u2014\u200aa score of 4 in the table above. Those scores, however, do not directly translate to the embedding distances, which is what is needed for agglomerative clustering thresholds.<\/p>\n<p>To deal with this issue, we again turned to the STS benchmark dataset. 
We computed the distances for all pairs in the training dataset, and fit a polynomial from the scores to the distance thresholds.<\/p>\n<figure><img decoding=\"async\" alt=\"\" src=\"https:\/\/cdn-images-1.medium.com\/max\/640\/0*iDY9f8vRYO10pd1u\"><figcaption>Image by\u00a0author<\/figcaption><\/figure>\n<p>This polynomial lets us compute the distance threshold needed to meet any semantic similarity target. For Review Summaries, we selected a score of 3.5, so nearly all clusters contain sentences that are \u201croughly\u201d to \u201cmostly\u201d equivalent or\u00a0more.<\/p>\n<p>It\u2019s worth noting that this can be done on any embedding network. This lets us experiment with different embedding networks as they become available, and quickly swap them out should we desire without worrying that the clusters will have semantically dissimilar sentences.<\/p>\n<h3>Multi-Pass Clustering<\/h3>\n<p>Up to this point, we knew we could trust our semantic compression, but it wasn\u2019t clear how much compression we could get from our data. As expected, the amount of compression varied across different products, clients, and industries.<\/p>\n<p>Without loss of semantic information, i.e., a hard threshold of 4, we only achieved a compression ratio of 1.18 (i.e., a space savings of\u00a015%).<\/p>\n<p>Clearly lossless compression wasn\u2019t going to be enough to make this feature financially viable.<\/p>\n<p>Our distance selection approach discussed above, however, provided an interesting possibility here: we can slowly increase the amount of information loss by repeatedly running the clustering at lower thresholds for remaining data.<\/p>\n<p>The approach is as\u00a0follows:<\/p>\n<ul>\n<li>Run the clustering with a threshold selected from score = 4. This is considered lossless.<\/li>\n<li>Select any outlying clusters, i.e., those with only a few vectors. These are considered \u201cnot compressed\u201d and used for the next phase. 
We chose to re-run clustering on any clusters with size less than\u00a010.<\/li>\n<li>Run clustering again with a threshold selected from score = 3. This is not lossless, but not so\u00a0bad.<\/li>\n<li>Select any clusters with size less than\u00a010.<\/li>\n<li>Repeat as desired, progressively lowering the score threshold.<\/li>\n<\/ul>\n<p>So, at each pass of the clustering, we\u2019re accepting more information loss in exchange for more compression, without muddying the lossless representative phrases we selected during the first\u00a0pass.<\/p>\n<p>In addition, such an approach is extremely useful not only for Review Summaries, where we want a high level of semantic similarity at the cost of less compression, but also for other use cases where we may care less about semantic information loss and more about spending less on prompt\u00a0inputs.<\/p>\n<p>In practice, there is still a significant number of clusters with only a single vector in them, even after dropping the score threshold a number of times. These are considered outliers, and are randomly sampled for inclusion in the final prompt. We select the sample size so that the final prompt reaches, but does not exceed, 25,000\u00a0tokens.<\/p>\n<h3>Ensuring Authenticity<\/h3>\n<p>The multi-pass clustering and random outlier sampling permit semantic information loss in exchange for a smaller prompt to send to the LLM. This raises the question: how good are our summaries?<\/p>\n<p>At Bazaarvoice, we know authenticity is a requirement for consumer trust, and our Review Summaries must stay authentic to truly represent all voices captured in the reviews. Any lossy compression approach runs the risk of misrepresenting or excluding the consumers who took time to author a\u00a0review.<\/p>\n<p>To ensure our compression technique was valid, we measured this directly. 
Specifically, for each product, we sampled a number of reviews, and then used <a href=\"https:\/\/www.youtube.com\/watch?v=WWwYCAIYzQk\">LLM Evals<\/a> to determine whether the summary was representative of and relevant to each review. This gives us a hard metric to evaluate and balance our compression against.<\/p>\n<h3>Results<\/h3>\n<p>Over the past 20 years, we have collected nearly a billion user-generated reviews and needed to generate summaries for tens of millions of products. Many of these products have thousands of reviews, and some have millions, which would exhaust the context windows of LLMs and run the price up considerably.<\/p>\n<p>Using our approach above, however, we reduced the input text size by <strong>97.7%<\/strong> (a compression ratio of <strong>42<\/strong>), letting us scale this solution to all products and any amount of review volume in the future.<br \/>In addition, the cost of generating summaries for our entire billion-scale dataset was reduced by <strong>82.4%<\/strong>. 
This includes the cost of embedding the sentence data and storing them in a database.<\/p>\n<hr>\n<p><a href=\"https:\/\/towardsdatascience.com\/semantically-compress-text-to-save-on-llm-costs-0b3e62b0c43a\">Semantically Compress Text to Save On LLM Costs<\/a> was originally published in <a href=\"https:\/\/towardsdatascience.com\/\">Towards Data Science<\/a> on Medium, where people are continuing the conversation by highlighting and responding to this story.<\/p>\n<\/div>\n<p>Lou Kratz<br \/>\n<a href=\"https:\/\/medium.com\/m\/global-identity-2?redirectUrl=https%3A%2F%2Ftowardsdatascience.com%2Fsemantically-compress-text-to-save-on-llm-costs-0b3e62b0c43a\">Go to original source<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Semantically Compress Text to Save On LLM Costs LLMs are great\u2026 if they can fit all of your\u00a0data Photo by Christopher Burns on\u00a0Unsplash Originally published at https:\/\/blog.developer.bazaarvoice.com on October 28,\u00a02024. Introduction Large language models are fantastic tools for unstructured text, but what if your text doesn\u2019t fit in the context window? 
Bazaarvoice faced exactly this [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[62,350,832,71,70,833],"tags":[835,836,834],"class_list":["post-728","post","type-post","status-publish","format-standard","hentry","category-aimldsaimlds","category-clustering","category-cost-reduction","category-large-language-models","category-machine-learning","category-semantics","tag-our","tag-reviews","tag-text"],"_links":{"self":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/728"}],"collection":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/comments?post=728"}],"version-history":[{"count":0,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/728\/revisions"}],"wp:attachment":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/media?parent=728"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/categories?post=728"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/tags?post=728"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}