{"id":283,"date":"2024-11-30T07:02:40","date_gmt":"2024-11-30T07:02:40","guid":{"rendered":"https:\/\/mailitics.com\/index.php\/2024\/11\/30\/how-did-open-food-facts-use-open-source-llms-to-enhance-ingredients-extraction-d74dfe02e0e4\/"},"modified":"2024-11-30T07:02:40","modified_gmt":"2024-11-30T07:02:40","slug":"how-did-open-food-facts-use-open-source-llms-to-enhance-ingredients-extraction-d74dfe02e0e4","status":"publish","type":"post","link":"https:\/\/mailitics.com\/index.php\/2024\/11\/30\/how-did-open-food-facts-use-open-source-llms-to-enhance-ingredients-extraction-d74dfe02e0e4\/","title":{"rendered":"How Did Open Food Facts Fix OCR-Extracted Ingredients Using Open-Source LLMs?"},"content":{"rendered":"<p>    How Did Open Food Facts Fix OCR-Extracted Ingredients Using Open-Source LLMs?<br \/>\n \t<BR><br \/>\n<BR><\/BR><br \/>\n    <!-- no image --><br \/>\n \t<BR><br \/>\n<BR><\/BR><\/p>\n<div>\n<h4>Delve into an end-to-end Machine Learning project to improve the quality of the Open Food Facts\u00a0database<\/h4>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2A-6WhpiGzUAb__bj8tmFwUQ.png?ssl=1\"><figcaption>Image generated with\u00a0Flux1<\/figcaption><\/figure>\n<p>Open Food Facts\u2019 purpose is to create the largest <strong>open-source food database in the world<\/strong>. 
To this day, it has collected over 3 million products and their information thanks to its contributors.<\/p>\n<p>Nutritional values, Eco-Score, product origins, and more: a variety of data defines each product and gives consumers and researchers insight into what they put on their\u00a0plates.<\/p>\n<p>This information is provided by the community of users and contributors, who actively add product data, take pictures, and fill in any missing data in the database through the <a href=\"https:\/\/play.google.com\/store\/apps\/details?id=org.openfoodfacts.scanner&amp;hl=en_US&amp;pli=1\">mobile\u00a0app<\/a>.<\/p>\n<p>Using the product picture, Open Food Facts extracts the ingredients list, typically located on the back of the packaging, through Optical Character Recognition (OCR). The product composition is then parsed and added to the database.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/400\/0%2A095wbwP-o4N6ykFG.jpg?ssl=1\"><figcaption>List of ingredients on the product packaging<\/figcaption><\/figure>\n<p>However, the text extraction often goes\u00a0wrong\u2026<\/p>\n<pre>Ingr\u00e9dients: Jambon do porc, sel, dextrose, ar\u00f4me naturels, antioxydant: E316, conservateur: E250<br>                     ^<\/pre>\n<p>These typos may seem minor, but when the list is parsed to extract individual ingredients, <strong>such errors create unrecognized ingredients,<\/strong> which harm the quality of the database. 
Light reflections, folded packaging, low-quality pictures, and other factors all complicate the ingredient parsing\u00a0process.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/400\/0%2Az0OscAwbw4e7tzPM.jpg?ssl=1\"><\/figure>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/300\/0%2AkcE1i9RReIsLpNDO.jpg?ssl=1\"><figcaption>Examples of packaging pictures where the OCR fails (from the Open Food Facts database)<\/figcaption><\/figure>\n<p><strong>Open Food Facts has tried to solve this issue for years using Regular Expressions and existing solutions such as Elasticsearch\u2019s corrector, without success. Until recently.<\/strong><\/p>\n<p>Thanks to the latest advancements in artificial intelligence, we now have access to powerful <strong>Large Language Models<\/strong>, also called\u00a0<strong>LLMs<\/strong>.<\/p>\n<p>By training our own model, we created the <strong>Ingredients Spellcheck <\/strong>and managed to not only outperform proprietary LLMs such as <strong>GPT-4o<\/strong> or <strong>Claude 3.5 Sonnet <\/strong>on this task, but also to reduce the number of unrecognized ingredients in the database by\u00a0<strong>11%<\/strong>.<\/p>\n<p><strong>This article walks you through the different stages of the project and shows you how we managed to improve the quality of the database using Machine Learning.<\/strong><\/p>\n<p>Enjoy the\u00a0reading!<\/p>\n<h3>Define the\u00a0problem<\/h3>\n<p>When a product is added by a contributor, its pictures go through a series of processes to extract all relevant information. One crucial step is the extraction of the <strong>list of ingredients<\/strong>.<\/p>\n<p>When a word is identified as an ingredient, it is cross-referenced with a <strong>taxonomy <\/strong>that contains a predefined list of recognized ingredients. 
If the word matches an entry in the taxonomy, it is tagged as an ingredient and added to the product\u2019s information.<\/p>\n<p>This tagging process ensures that ingredients are standardized and easily searchable, providing accurate data for consumers and analysis\u00a0tools.<\/p>\n<p><strong>But if an ingredient is not recognized, the process\u00a0fails.<\/strong><\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2Alke6K2JoJHOEP9p4f-LurQ.png?ssl=1\"><figcaption>The ingredient \u201cJambon do porc\u201d (Pork ham) was not recognized by the parser (from the Product Edition\u00a0page)<\/figcaption><\/figure>\n<p>For this reason, we introduced an additional layer to the process: the <strong>Ingredients Spellcheck<\/strong>, designed to correct ingredient lists before they are processed by the ingredient parser.<\/p>\n<p>A simpler approach would be the <a href=\"https:\/\/norvig.com\/spell-correct.html\">Peter Norvig algorithm<\/a>, which processes each word by applying a series of character deletions, transpositions, additions, and replacements to identify potential corrections.<\/p>\n<p>However, this method proved to be insufficient for our use case, for several\u00a0reasons:<\/p>\n<ul>\n<li>\n<strong>Special Characters and Formatting<\/strong>: Elements like commas, brackets, and percentage signs hold critical importance in ingredient lists, influencing product composition and allergen labeling <em>(e.g., \u201csalt (1.2%)\u201d).<\/em>\n<\/li>\n<li>\n<strong>Multilingual Challenges<\/strong>: the database contains products from all over the world, in a wide variety of languages. 
This further complicates a basic character-based approach like Norvig\u2019s, which is language-agnostic.<\/li>\n<\/ul>\n<p>Instead, we turned to the latest advancements in Machine Learning, particularly <strong>Large Language Models (LLMs)<\/strong>, which excel in a wide variety of <strong>Natural Language Processing (NLP)<\/strong> tasks, including spelling correction.<\/p>\n<p>This is the path we decided to\u00a0take.<\/p>\n<h3>Evaluate<\/h3>\n<blockquote><p>You can\u2019t improve what you don\u2019t\u00a0measure.<\/p><\/blockquote>\n<p><strong>What is a good correction? And how do we measure the performance of the corrector, LLM or\u00a0non-LLM?<\/strong><\/p>\n<p>Our first step is to understand and catalog the diversity of errors the Ingredient Parser encounters.<\/p>\n<p>Additionally, it\u2019s essential to assess whether an error should even be corrected in the first place. Sometimes, trying to correct mistakes could do more harm than\u00a0good:<\/p>\n<pre>flour, salt (1!2%)<br># Is it 1.2% or 12%?...<\/pre>\n<p>For these reasons, we created the <strong>Spellcheck Guidelines<\/strong>, a set of rules that limits the corrections. <em>These guidelines will serve us in many ways throughout the project, from the dataset generation to the model evaluation.<\/em><\/p>\n<p>The guidelines were notably used to create the <a href=\"https:\/\/huggingface.co\/datasets\/openfoodfacts\/spellcheck-benchmark\"><strong>Spellcheck Benchmark<\/strong><\/a>, a curated dataset containing approximately 300 manually corrected lists of ingredients.<\/p>\n<p>This benchmark is the <strong>cornerstone of the project<\/strong>. 
It enables us to evaluate any solution, Machine Learning or simple heuristic, on our use\u00a0case.<\/p>\n<p>It goes hand in hand with the <strong>Evaluation algorithm<\/strong>, a custom solution we developed that transforms a set of corrections into measurable metrics.<\/p>\n<h4>The Evaluation Algorithm<\/h4>\n<p>Most of the existing metrics and evaluation algorithms for text-related tasks compute the similarity between a reference and a prediction, such as <a href=\"https:\/\/en.wikipedia.org\/wiki\/BLEU\">BLEU<\/a> or <a href=\"https:\/\/en.wikipedia.org\/wiki\/ROUGE_(metric)\">ROUGE<\/a> scores for language translation or summarization.<\/p>\n<p>However, in our case, these metrics fall\u00a0short.<\/p>\n<p>We want to evaluate how well the Spellcheck algorithm recognizes and fixes the right words in a list of ingredients. Therefore, we adapt the <strong>Precision<\/strong> and <strong>Recall<\/strong> metrics for our\u00a0task:<\/p>\n<blockquote><p>\n<strong>Precision <\/strong>= Right corrections by the model \/ Total corrections made by the\u00a0model<\/p><\/blockquote>\n<blockquote><p>\n<strong>Recall <\/strong>= Right corrections by the model \/ Total number of\u00a0errors<\/p><\/blockquote>\n<p>However, we don\u2019t have a fine-grained view of which words were supposed to be corrected\u2026 We only have access\u00a0to:<\/p>\n<ul>\n<li><strong>The <em>original<\/em>: the list of ingredients as present in the database;<\/strong><\/li>\n<li><strong>The <em>reference<\/em>: how we expect this list to be corrected;<\/strong><\/li>\n<li><strong>The <em>prediction<\/em>: the correction from the\u00a0model.<\/strong><\/li>\n<\/ul>\n<blockquote><p>Is there any way to calculate the number of errors that were correctly fixed, the ones that were missed by the Spellcheck, and finally the words that were wrongly changed?<\/p><\/blockquote>\n<p>The answer is\u00a0yes!<\/p>\n<pre>Original:       \"Th cat si on the fride,\"<br>Reference:      \"The cat is on the 
fridge.\"<br>Prediction:     \"Th big cat is in the fridge.\"<\/pre>\n<p>With the example above, we can easily spot which words were supposed to be corrected: \u201cThe\u201d, \u201cis\u201d and \u201cfridge.\u201d; and which words were wrongly corrected: \u201con\u201d changed into \u201cin\u201d. Finally, we see that an additional word was added: \u201cbig\u201d.<\/p>\n<p>If we align these 3 sequences in pairs, original-reference and original-prediction, we can detect which words were supposed to be corrected, and those that weren\u2019t. This alignment problem, known as <a href=\"https:\/\/en.wikipedia.org\/wiki\/Sequence_alignment\">Sequence Alignment<\/a>, is typical in bioinformatics, where its purpose is to identify regions of similarity.<\/p>\n<p>This is a perfect analogy for our spellcheck evaluation task.<\/p>\n<pre>Original:       \"Th    -   cat   si   on   the   fride,\"<br>Reference:      \"The   -   cat   is   on   the   fridge.\"<br>                  1    0    0    1    0    0     1<br><br>Original:       \"Th    -   cat   si   on   the   fride,\"<br>Prediction:     \"Th   big  cat   is   in   the   fridge.\"<br>                  0    1    0    1    1    0     1<br>                FN    FP         TP   FP         TP<\/pre>\n<p>By labeling each pair with 1 if the word changed and 0 otherwise, we can calculate how often the model correctly fixes mistakes <strong>(True Positives, TP)<\/strong>, incorrectly changes correct words <strong>(False Positives, FP)<\/strong>, and misses errors that should have been corrected <strong>(False Negatives, FN).<\/strong><\/p>\n<p>In other words, we can calculate the <strong>Precision <\/strong>and <strong>Recall <\/strong>of the Spellcheck!<\/p>\n<p>We now have a robust algorithm that is capable of evaluating any Spellcheck solution!<\/p>\n<p>You can find the algorithm in the <a href=\"https:\/\/github.com\/openfoodfacts\/openfoodfacts-ai\/blob\/develop\/spellcheck\/src\/spellcheck\/evaluation\/evaluator.py\">project 
repository<\/a>.<\/p>\n<h3>Large Language\u00a0Models<\/h3>\n<p>Large Language Models (LLMs) have proved to be a great help in tackling Natural Language tasks in various industries.<\/p>\n<p>They constitute a path we have to explore for our use\u00a0case.<\/p>\n<blockquote><p>Many LLM providers brag about the performance of their models on leaderboards, but how do they perform on correcting errors in lists of ingredients? So we evaluated them!<\/p><\/blockquote>\n<p>We evaluated <strong>GPT-3.5 <\/strong>and <strong>GPT-4o <\/strong>from <strong>OpenAI<\/strong>, <strong>Claude-Sonnet-3.5 <\/strong>from <strong>Anthropic<\/strong>, and <strong>Gemini-1.5-Flash<\/strong> from <strong>Google <\/strong>using our custom benchmark and evaluation algorithm.<\/p>\n<p>We wrote detailed prompt instructions to orient the corrections towards our custom guidelines.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2AgS97vKb5scseuWGdu4k57Q.png?ssl=1\"><figcaption>LLMs evaluation on our benchmark (image from\u00a0author)<\/figcaption><\/figure>\n<p><strong>GPT-3.5-Turbo<\/strong> delivered the best performance compared to other models, both in terms of metrics and manual review. Special mention goes to <strong>Claude-Sonnet-3.5<\/strong>, which showed impressive error corrections (high Recall), but often provided additional irrelevant explanations, lowering its Precision.<\/p>\n<p>Great! We have an LLM that works! Time to create the feature in the\u00a0app!<\/p>\n<p><strong><em>Well,<\/em><\/strong> not so\u00a0fast\u2026<\/p>\n<p>Using private LLMs raises several challenges:<\/p>\n<ol>\n<li>\n<strong>Lack of Ownership<\/strong>: We become dependent on the providers and their models. New model versions are released frequently, altering the model\u2019s behavior. 
This instability, which arises primarily because the model is designed for general purposes rather than our specific task, complicates long-term maintenance.<\/li>\n<li>\n<strong>Model Deletion Risk<\/strong>: We have no safeguards against providers removing older models. For instance, GPT-3.5 is slowly being replaced by more performant models, despite being the best model for this\u00a0task!<\/li>\n<li>\n<strong>Performance Limitations<\/strong>: The performance of a private LLM is constrained by its prompts. In other words, our only way of improving outputs is through better prompts since we cannot modify the core weights of the model by training it on our own\u00a0data.<\/li>\n<\/ol>\n<p><strong><em>For these reasons, we chose to focus our efforts on open-source solutions that would provide us with complete control and outperform general\u00a0LLMs.<\/em><\/strong><\/p>\n<h3>Train our own\u00a0model<\/h3>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2A01OI_L5U0UkasdviWDOPxg.png?ssl=1\"><figcaption>The model training workflow: from dataset extraction to model training (image from\u00a0author)<\/figcaption><\/figure>\n<p>Any machine learning solution starts with data. In our case, the data consists of corrected lists of ingredients.<\/p>\n<p>However, not all lists of ingredients are equal. Some are free of unrecognized ingredients, while others are so unreadable that there would be no point in correcting them.<\/p>\n<p>Therefore, we struck a balance by choosing lists of ingredients with between <strong>10 and 40 percent of unrecognized ingredients<\/strong>. 
We also ensured there were no duplicates within the dataset, or between the dataset and the benchmark, to prevent any data leakage during the evaluation stage.<\/p>\n<p>We extracted 6,000 uncorrected lists from the Open Food Facts database using <a href=\"https:\/\/duckdb.org\/\">DuckDB<\/a>, a fast in-process SQL tool capable of processing millions of rows in under a\u00a0second.<\/p>\n<p>However, those extracted lists are not corrected yet, and manually annotating them would take too much time and resources\u2026<\/p>\n<p><strong>Fortunately, we have access to LLMs we already evaluated on this exact task. Therefore, we prompted GPT-3.5-Turbo, the best model on our benchmark, to correct every list in accordance with our guidelines.<\/strong><\/p>\n<p>The process took less than an hour and cost nearly\u00a0<strong>$2<\/strong>.<\/p>\n<p>We then manually reviewed the dataset using <a href=\"https:\/\/argilla.io\/\">Argilla<\/a>, an open-source annotation tool specialized in Natural Language Processing tasks. This process ensures the dataset is of sufficient quality to train a reliable\u00a0model.<\/p>\n<p><strong>We now have at our disposal a <\/strong><a href=\"https:\/\/huggingface.co\/datasets\/openfoodfacts\/spellcheck-dataset\"><strong>training dataset<\/strong><\/a><strong> and an <\/strong><a href=\"https:\/\/huggingface.co\/datasets\/openfoodfacts\/spellcheck-benchmark\"><strong>evaluation benchmark<\/strong><\/a><strong> to train our own model on the Spellcheck task.<\/strong><\/p>\n<h4>Training<\/h4>\n<p>For this stage, we decided to go with <strong>Sequence-to-Sequence Language Models<\/strong>. 
In other words, these models take a text as input and return a text as output, which suits the spellcheck process.<\/p>\n<p>Several models fit this role, such as the <strong>T5 family<\/strong> developed by Google in 2020, or the current open-source LLMs such as <strong>Llama<\/strong> or <strong>Mistral<\/strong>, which are designed for text generation and following instructions.<\/p>\n<p>The model training consists of a succession of steps, each requiring different resource allocations, such as cloud GPUs, data validation and logging. For this reason, we decided to orchestrate the training using <a href=\"https:\/\/metaflow.org\/\">Metaflow<\/a>, a pipeline orchestrator designed for Data Science and Machine Learning projects.<\/p>\n<p>The training pipeline is composed as\u00a0follows:<\/p>\n<ul>\n<li>Configurations and hyperparameters are imported into the pipeline from YAML config\u00a0files;<\/li>\n<li>The training job is launched in the cloud using <a href=\"https:\/\/aws.amazon.com\/sagemaker\/\">AWS SageMaker<\/a>, along with the set of model hyperparameters and the custom modules such as the evaluation algorithm. Once the job is done, the model artifact is stored in an AWS S3 bucket. All training details are tracked using <a href=\"https:\/\/www.comet.com\/site\/\">Comet\u00a0ML<\/a>;<\/li>\n<li>The fine-tuned model is then evaluated on the <strong>benchmark <\/strong>using the evaluation algorithm. Depending on the model size, this process can take an extremely long time. 
Therefore, we used <a href=\"https:\/\/github.com\/vllm-project\/vllm\">vLLM<\/a>, a Python library designed to accelerate LLM inference;<\/li>\n<li>The predictions against the benchmark, also stored in AWS S3, are sent to <a href=\"https:\/\/argilla.io\/\">Argilla<\/a> for human evaluation.<\/li>\n<\/ul>\n<p>After many iterations between refining the data and training the model, we achieved performance <strong>comparable to proprietary LLMs <\/strong>on the Spellcheck task, scoring an F1-Score of\u00a0<strong>0.65<\/strong>.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/935\/1%2AfdGAj-K2IiQVgPP6zMYzCg.png?ssl=1\"><\/figure>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2Am_EdG5UeBREYmKrVSr14pQ.png?ssl=1\"><figcaption>LLMs evaluation on our benchmark (image from\u00a0author)<\/figcaption><\/figure>\n<p>The model, a fine-tuned <a href=\"https:\/\/huggingface.co\/mistralai\/Mistral-7B-v0.3\">Mistral-7B-Base-v0.3<\/a>, is publicly available on the Hugging Face platform, along with its <a href=\"https:\/\/huggingface.co\/datasets\/openfoodfacts\/spellcheck-dataset\">dataset<\/a> and evaluation <a href=\"https:\/\/huggingface.co\/datasets\/openfoodfacts\/spellcheck-benchmark\">benchmark<\/a>.<\/p>\n<p><a href=\"https:\/\/huggingface.co\/openfoodfacts\/spellcheck-mistral-7b\">openfoodfacts\/spellcheck-mistral-7b \u00b7 Hugging Face<\/a><\/p>\n<p>Furthermore, <strong>we estimated that the Spellcheck reduced the number of unrecognized ingredients by 11%, which is promising!<\/strong><\/p>\n<p>Now comes the final phase of the project: integrating the model into Open Food\u00a0Facts.<\/p>\n<h3>Deployment &amp; Integration<\/h3>\n<p><strong>Our model is\u00a0big!<\/strong><\/p>\n<p>7 billion parameters, which means <strong>14 GB <\/strong>of memory required to run it in <strong>float16<\/strong>, 
without considering the <strong>20% overhead\u00a0factor.<\/strong><\/p>\n<p>Additionally, large models often mean <strong>low throughput during inference<\/strong>, which can make them inappropriate for real-time serving. We need GPUs with large memory to run this model in production, such as the <a href=\"https:\/\/www.nvidia.com\/en-us\/data-center\/l4\/\">Nvidia L4<\/a>, which is equipped with 24 GB of\u00a0VRAM.<\/p>\n<p><strong>But running these instances in the cloud is quite expensive\u2026<\/strong><\/p>\n<p>However, one way to provide a real-time experience for our users without keeping GPU instances running 24\/7 is <strong>batch inference<\/strong>.<\/p>\n<p>Lists of ingredients are processed in batches by the model on a regular basis, then stored in the database. <strong>This way, we pay only for the resources used during the batch processing!<\/strong><\/p>\n<h4>Batch Job<\/h4>\n<p>We developed a batch processing system to handle large-scale text processing using LLMs efficiently with <a href=\"https:\/\/cloud.google.com\/batch\/docs\/get-started\">Google Batch\u00a0Job<\/a>.<\/p>\n<figure><img decoding=\"async\" alt=\"\" src=\"https:\/\/cdn-images-1.medium.com\/max\/729\/0*nevKLqwBo6wmAOQz\"><figcaption>Batch processing system (image from\u00a0author)<\/figcaption><\/figure>\n<p>The process begins by extracting data from the Open Food Facts database using DuckDB, which processes <strong>43 GB of data in under 2\u00a0minutes!<\/strong><\/p>\n<p>The extracted data is then sent to a Google Bucket, triggering a Google Batch\u00a0Job.<\/p>\n<p>This job uses a prebuilt Docker image containing all necessary dependencies and algorithms. 
To optimize the resource-intensive LLM processing, we reuse vLLM, achieving impressive performance: <strong>correcting 10,000 lists of ingredients in only 20 minutes with an L4\u00a0GPU<\/strong>!<\/p>\n<p>After successful processing, the corrected data is saved in an intermediate database containing the predictions of all models in Open Food Facts, served by <a href=\"https:\/\/github.com\/openfoodfacts\/robotoff\">Robotoff<\/a>.<\/p>\n<p>When a contributor modifies a product\u2019s details, they\u2019re presented with the spellcheck corrections, <strong>ensuring users remain the key decision-makers in Open Food Facts\u2019 data quality\u00a0process.<\/strong><\/p>\n<p>This system allows Open Food Facts to leverage advanced AI capabilities for improving data quality while preserving its community-driven approach.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2A5o9gFYzHlgBTnutt1MnE1g.png?ssl=1\"><figcaption>Batch jobs in GCP (image from\u00a0author)<\/figcaption><\/figure>\n<h3>Conclusion<\/h3>\n<p>In this article, we walked you through the development and the integration of the <strong>Ingredients Spellcheck<\/strong>, an LLM-powered feature to correct OCR-extracted lists of ingredients.<\/p>\n<p>We first developed a set of rules, the <strong>Spellcheck Guidelines<\/strong>, to restrict the corrections. We then created a benchmark of corrected lists of ingredients that, along with a custom evaluation algorithm, lets us evaluate any solution to the\u00a0problem.<\/p>\n<p>With this setup, we evaluated various private LLMs and determined that <strong>GPT-3.5-Turbo<\/strong> was the most suitable model for our specific use case. 
However, we also demonstrated that relying on a private LLM imposes significant limitations, including lack of ownership and restricted opportunities to improve or fine-tune such a large model (reportedly 175 billion parameters).<\/p>\n<p>To address these challenges, we decided to <strong>develop our own model<\/strong>, fine-tuning it on synthetically corrected texts extracted from the database. After several iterations and experiments, we successfully achieved good performance with an <strong>open-source model<\/strong>. Not only did we match the performance of private LLMs, we also solved the ownership problem we were facing, giving us full control over our\u00a0model.<\/p>\n<p>We then <strong>integrated this model into Open Food Facts <\/strong>using <strong>batch inference deployment, <\/strong>enabling us to process thousands of lists on a regular basis. The predictions are stored in the Robotoff database as <strong>Insights <\/strong>before being validated by contributors, <strong>leaving ownership of OFF data quality to contributors<\/strong>.<\/p>\n<h3>Next step<\/h3>\n<p><strong>The Spellcheck integration is still a work in progress<\/strong>. We are working on designing the user interface to propose ML-generated corrections and let contributors accept, reject, or modify corrections. <strong>We expect to fully integrate the feature by the end of the\u00a0year.<\/strong><\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2AdCG8KVAH_jLmapejAprfRQ.png?ssl=1\"><figcaption>Spellcheck corrections validated by users, built in a <a href=\"https:\/\/huggingface.co\/spaces\/openfoodfacts\/ingredients-spellcheck-annotate\">Hugging Face\u00a0Space<\/a><\/figcaption><\/figure>\n<p>Additionally, we plan to continue refining the model through iterative improvements. Its performance can be significantly enhanced by improving the quality of the training data and incorporating user feedback. 
This approach will allow us to fine-tune the model continuously, ensuring it remains highly effective and aligned with real-world use\u00a0cases.<\/p>\n<p>The model, along with its datasets, can be found in the official <a href=\"https:\/\/huggingface.co\/openfoodfacts\">Hugging Face repository<\/a>. The code used to develop this model is available in the <a href=\"https:\/\/github.com\/openfoodfacts\/openfoodfacts-ai\/tree\/develop\/spellcheck\">OpenFoodFacts-ai\/spellcheck<\/a> GitHub repository.<\/p>\n<p>Thank you for reading this far! We hope you enjoyed\u00a0it.<\/p>\n<p>If you would also like to contribute to Open Food Facts, you\u00a0can:<\/p>\n<ul>\n<li>Contribute to the Open Food Facts <a href=\"https:\/\/github.com\/openfoodfacts\">GitHub<\/a>: explore open issues that align with your\u00a0skills,<\/li>\n<li>Download the Open Food Facts <a href=\"https:\/\/world.openfoodfacts.org\/open-food-facts-mobile-app?utm_source=off&amp;utf_medium=web&amp;utm_campaign=search_and_links_promo_en\">mobile app<\/a>: add new products to the database or improve existing ones by simply scanning their barcodes,<\/li>\n<li>Join the Open Food Facts <a href=\"https:\/\/slack.openfoodfacts.org\/\">Slack<\/a> and start discussing with other contributors in the OFF community.<\/li>\n<\/ul>\n<p>We can\u2019t wait to see you join the community!<\/p>\n<p>Don\u2019t hesitate to check out our other articles:<\/p>\n<p><a href=\"https:\/\/medium.com\/@jeremyarancio\/duckdb-open-food-facts-the-largest-open-food-database-in-the-palm-of-your-hand-0d4ab30d0701\">DuckDB &amp; Open Food Facts: the largest open food database in the palm of your hand \ud83e\udd86\ud83c\udf4a<\/a><\/p>\n<hr>\n<p><a 
href=\"https:\/\/towardsdatascience.com\/how-did-open-food-facts-use-open-source-llms-to-enhance-ingredients-extraction-d74dfe02e0e4\">How Did Open Food Facts Fix OCR-Extracted Ingredients Using Open-Source LLMs?<\/a> was originally published in <a href=\"https:\/\/towardsdatascience.com\/\">Towards Data Science<\/a> on Medium, where people are continuing the conversation by highlighting and responding to this story.<\/p>\n<\/div>\n<p> \t<BR><br \/>\n <BR><\/BR><br \/>\n    Jeremy Arancio<br \/>\n \t<BR><br \/>\n<BR><\/BR><br \/>\n<a href=\"https:\/\/medium.com\/m\/global-identity-2?redirectUrl=https%3A%2F%2Ftowardsdatascience.com%2Fhow-did-open-food-facts-use-open-source-llms-to-enhance-ingredients-extraction-d74dfe02e0e4\">Go to original source<\/a><br \/>\n \t<BR><br \/>\n <BR><\/BR><\/p>\n","protected":false},"excerpt":{"rendered":"<p>How Did Open Food Facts Fix OCR-Extracted Ingredients Using Open-Source LLMs? Delve into an end-to-end Machine Learning project to improve the quality of the Open Food Facts\u00a0database Image generated with\u00a0Flux1 Open Food Facts\u2019 purpose is to create the largest open-source food database in the world. 
To this day, it has collected over 3 million products [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[151,62,87,70,260,259],"tags":[261,262,9],"class_list":["post-283","post","type-post","status-publish","format-standard","hentry","category-ai","category-aimldsaimlds","category-llm","category-machine-learning","category-nlp","category-ocr","tag-food","tag-ingredients","tag-open"],"_links":{"self":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/283"}],"collection":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/comments?post=283"}],"version-history":[{"count":0,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/283\/revisions"}],"wp:attachment":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/media?parent=283"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/categories?post=283"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/tags?post=283"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}