{"id":401,"date":"2024-12-06T07:00:28","date_gmt":"2024-12-06T07:00:28","guid":{"rendered":"https:\/\/mailitics.com\/index.php\/2024\/12\/06\/multimodal-rag-process-any-file-type-with-ai-e6921342c903\/"},"modified":"2024-12-06T07:00:28","modified_gmt":"2024-12-06T07:00:28","slug":"multimodal-rag-process-any-file-type-with-ai-e6921342c903","status":"publish","type":"post","link":"https:\/\/mailitics.com\/index.php\/2024\/12\/06\/multimodal-rag-process-any-file-type-with-ai-e6921342c903\/","title":{"rendered":"Multimodal RAG: Process Any File Type with AI"},"content":{"rendered":"<p>    Multimodal RAG: Process Any File Type with AI<br \/>\n \t<BR><br \/>\n<BR><\/BR><br \/>\n    <!-- no image --><br \/>\n \t<BR><br \/>\n<BR><\/BR><\/p>\n<div>\n<h4>A beginner-friendly guide with example (Python)\u00a0code<\/h4>\n<p>This is the third article in a <a href=\"https:\/\/shawhin.medium.com\/list\/multimodal-ai-fe9521d0e77a\">larger series<\/a> on multimodal AI. In the previous posts, we discussed <a href=\"https:\/\/towardsdatascience.com\/multimodal-models-llms-that-can-see-and-hear-5c6737c981d3\">multimodal LLMs<\/a> and <a href=\"https:\/\/towardsdatascience.com\/multimodal-embeddings-an-introduction-5dc36975966f\">embedding models<\/a>, respectively. In this article, we will combine these ideas to enable the development of multimodal RAG systems. I\u2019ll start by reviewing key concepts and then share example code for implementing such a\u00a0system.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2AKUfCT0odhqkTsPzf1ljx6A.png?ssl=1\"><figcaption>Image from\u00a0Canva.<\/figcaption><\/figure>\n<p>Language models like GPT, LLaMA, and Claude learn a tremendous amount of world knowledge via their pre-training. 
This makes them powerful tools for solving custom problems and answering complex questions.<\/p>\n<p>However, <strong>there is knowledge that even the most advanced language models are ignorant of<\/strong>. This includes proprietary information within organizations, events that occurred after a model&#8217;s pre-training data collection, and specialized knowledge that is not prevalent on the internet.<\/p>\n<p>Although this ignorance limits a model\u2019s out-of-the-box capabilities, there is <strong>a popular technique to overcome these limitations<\/strong>: retrieval augmented generation (or RAG for\u00a0short).<\/p>\n<h3><strong>What is\u00a0RAG?<\/strong><\/h3>\n<p><strong>RAG<\/strong> is an approach for<strong> improving a model\u2019s response quality by dynamically providing the relevant context<\/strong> for a given prompt. Here\u2019s an example of when this might be\u00a0helpful.<\/p>\n<p>Say, I forgot the name of a Python library a colleague mentioned in yesterday\u2019s meeting. This isn\u2019t something ChatGPT can help me with because it does not know the meeting\u2019s contents.<\/p>\n<p>However, RAG could help with this by taking my question (e.g. \u201cWhat was the name of that Python library that Rachel mentioned in yesterday\u2019s meeting?\u201d), automatically pulling the meeting transcript, then providing my original query and the transcript to an\u00a0LLM.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2ASTVyqpJkhoKZWYdR-2-xqA.png?ssl=1\"><figcaption>Basic design of a RAG system. Image by\u00a0author.<\/figcaption><\/figure>\n<h3><strong>Multimodal RAG<\/strong><\/h3>\n<p>Although improving LLMs with RAG unlocks several practical use cases, there are some situations where relevant information exists in non-text formats, e.g., images, videos, charts, and tables. 
In such cases, we can go one step further and build <strong>multimodal RAG systems<\/strong>, <strong>AI systems capable of processing text and non-text\u00a0data<\/strong>.<\/p>\n<p>Multimodal RAG enables more sophisticated inferences beyond what is conveyed by text alone. For example, it could analyze someone\u2019s facial expressions and speech tonality to give a richer context to a meeting\u2019s transcription.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2AmqRCTYThFcZmGcVtw6s1cw.png?ssl=1\"><figcaption>Basic design of a Multimodal RAG system. Image by\u00a0author.<\/figcaption><\/figure>\n<h3><strong>3 Levels of\u00a0MRAG<\/strong><\/h3>\n<p>While there are several ways to implement a multimodal RAG (MRAG) system, here I will focus on three basic strategies at increasing levels of sophistication.<\/p>\n<ol>\n<li>Translate modalities to\u00a0text.<\/li>\n<li>Text-only retrieval +\u00a0MLLM<\/li>\n<li>Multimodal retrieval +\u00a0MLLM<\/li>\n<\/ol>\n<p>The following discussion assumes <strong>you already have a basic understanding of RAG and multimodal models<\/strong>. The following articles discussed these topics: <a href=\"https:\/\/towardsdatascience.com\/how-to-improve-llms-with-rag-abdc132f76ac\">RAG<\/a>, <a href=\"https:\/\/towardsdatascience.com\/multimodal-models-llms-that-can-see-and-hear-5c6737c981d3\">Multimodal LLMs<\/a>, and <a href=\"https:\/\/towardsdatascience.com\/multimodal-embeddings-an-introduction-5dc36975966f\">Multimodal Embeddings<\/a>.<\/p>\n<h4><strong>Level 1: Translate modalities to\u00a0text<\/strong><\/h4>\n<p>A simple way to make a RAG system multimodal is by <strong>translating new modalities to text before storing them in the knowledge base<\/strong>. 
This could be as simple as converting meeting recordings into text transcripts, using an existing multimodal LLM (MLLM) to generate image captions, or converting tables to a readable text format (e.g.,\u00a0.csv or\u00a0.json).<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/839\/1%2A7QqhRIlnU7TQsCMnVDb6KA.png?ssl=1\"><figcaption>Visual overview of Level 1 of MRAG. Image by\u00a0author.<\/figcaption><\/figure>\n<p>The key upside of this approach is that it <strong>requires minimal changes to an existing RAG system<\/strong>. Additionally, by explicitly generating text representations of non-text modalities, one has better control over the features of the data to extract. For instance, captions of analytical figures may include both a description and key insights.<\/p>\n<p>Of course, the downside of this strategy is that the <strong>model\u2019s responses cannot directly use non-textual data<\/strong>, which means that the translation from, say, image to text can create a critical information bottleneck.<\/p>\n<h4><strong>Level 2: Text-only retrieval +\u00a0MLLM<\/strong><\/h4>\n<p>Another approach is to generate text representations of all items in the knowledge base, e.g., descriptions and meta-tags, for retrieval, but to <strong>pass the original modality to a multimodal LLM (MLLM)<\/strong>. For example, image metadata is used for the retrieval step, and the associated image is passed to a model for inference.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/825\/1%2AJoUZLYezY3q95zngSmDJIA.png?ssl=1\"><figcaption>Visual overview of Level 2 of MRAG. Image by\u00a0author.<\/figcaption><\/figure>\n<p>This maintains many of the benefits of Level 1 while mitigating its limitations. 
Namely, text features of items in the knowledge base can be optimized for search, but the downstream model can use the full richness of each item\u2019s original modality.<\/p>\n<p>The key difference with this approach is that it requires an <strong>MLLM<\/strong>, which is <strong>an LLM capable of processing non-text data<\/strong>. This unlocks more advanced reasoning capabilities, as demonstrated by models like GPT-4o or LLaMA 3.2\u00a0Vision.<\/p>\n<h4><strong>Level 3: Multimodal retrieval +\u00a0MLLM<\/strong><\/h4>\n<p>Although we could use keyword-based search in the retrieval process for Level 1 and Level 2, it is common practice to use so-called <strong>vector search<\/strong>. This consists of <strong>generating vector representations (i.e., embeddings)<\/strong> of items in the knowledge base and then <strong>performing a search by computing similarity scores<\/strong> between an input query and each item in the knowledge base.<\/p>\n<p>Traditionally, this requires that the query and knowledge base items are text-based. However, as we saw in the <a href=\"https:\/\/towardsdatascience.com\/multimodal-embeddings-an-introduction-5dc36975966f\">previous article<\/a> of this series, there exist <strong>multimodal embedding models<\/strong> that <strong>generate aligned vector representations of both text and non-text\u00a0data<\/strong>.<\/p>\n<p>Therefore, we can use multimodal embeddings to perform multimodal retrieval. This works the same way as text-based vector search, but now the embedding space co-locates similar concepts independent of their original modality. The results of such a retrieval strategy can then be passed directly to an\u00a0MLLM.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/839\/1%2AYwMdXXTGBMj9QSjAwkojdA.png?ssl=1\"><figcaption>Visual overview of Level 3 of MRAG. 
Image by\u00a0author.<\/figcaption><\/figure>\n<h3><strong>Example Code: Multimodal Blog Question-Answering Assistant<\/strong><\/h3>\n<p>With a basic understanding of how Multimodal RAG works, let\u2019s see how we can build such a system. Here, I will create a question-answering assistant that can access the text and figures from the previous two blogs in this\u00a0series.<\/p>\n<p>The Python code for this example is freely available at the <a href=\"https:\/\/github.com\/ShawhinT\/YouTube-Blog\/tree\/main\/multimodal-ai\/3-multimodal-rag\">GitHub\u00a0repo<\/a>.<\/p>\n<h4>Imports &amp; Data\u00a0Loading<\/h4>\n<p>We start by importing a few handy libraries and\u00a0modules.<\/p>\n<pre>import json<br>from transformers import CLIPProcessor, CLIPTextModelWithProjection<br>from torch import load, matmul, argsort<br>from torch.nn.functional import softmax<\/pre>\n<p>Next, we\u2019ll import text and image chunks from the <a href=\"https:\/\/towardsdatascience.com\/multimodal-models-llms-that-can-see-and-hear-5c6737c981d3\">Multimodal LLMs<\/a> and <a href=\"https:\/\/towardsdatascience.com\/multimodal-embeddings-an-introduction-5dc36975966f\">Multimodal Embeddings<\/a> blog posts. These are saved in\u00a0.json files, which can be loaded into Python as a list of dictionaries.<\/p>\n<pre># load text chunks<br>with open('data\/text_content.json', 'r', encoding='utf-8') as f:<br>        text_content_list = json.load(f)<br><br># load images<br>with open('data\/image_content.json', 'r', encoding='utf-8') as f:<br>        image_content_list = json.load(f)<\/pre>\n<p>While I won\u2019t review the data preparation process here, the code I used is on the <a href=\"https:\/\/github.com\/ShawhinT\/YouTube-Blog\/blob\/main\/multimodal-ai\/3-multimodal-rag\/1-data_prep.ipynb\">GitHub\u00a0repo<\/a>.<\/p>\n<p>We will also load the multimodal embeddings (from CLIP) for each item in <em>text_content_list <\/em>and<em> image_content_list<\/em>. 
These are saved as PyTorch\u00a0tensors.<\/p>\n<pre># load embeddings<br>text_embeddings = load('data\/text_embeddings.pt', weights_only=True)<br>image_embeddings = load('data\/image_embeddings.pt', weights_only=True)<br><br>print(text_embeddings.shape)<br>print(image_embeddings.shape)<br><br># &gt;&gt; torch.Size([86, 512])<br># &gt;&gt; torch.Size([17, 512])<\/pre>\n<p>Printing the shapes of these tensors, we see that each item is represented by a 512-dimensional embedding, and that we have 86 text chunks and 17\u00a0images.<\/p>\n<h4>Multimodal Search<\/h4>\n<p>With our knowledge base loaded, we can now define a query for vector search. This will consist of translating an input query into an embedding using CLIP. We do this similarly to the examples from the <a href=\"https:\/\/towardsdatascience.com\/multimodal-embeddings-an-introduction-5dc36975966f\">previous\u00a0post<\/a>.<\/p>\n<pre># query<br>query = \"What is CLIP's contrastive loss function?\"<br><br># embed query (4 steps)<br># 1) load model<br>model = CLIPTextModelWithProjection.from_pretrained(\"openai\/clip-vit-base-patch16\")<br># 2) load data processor<br>processor = CLIPProcessor.from_pretrained(\"openai\/clip-vit-base-patch16\")<br># 3) pre-process text<br>inputs = processor(text=[query], return_tensors=\"pt\", padding=True)<br># 4) compute embeddings with CLIP<br>outputs = model(**inputs)<br><br># extract embedding<br>query_embed = outputs.text_embeds<br>print(query_embed.shape)<br><br># &gt;&gt; torch.Size([1, 512])<\/pre>\n<p>Printing the shape, we see we have a single vector representing the\u00a0query.<\/p>\n<p>To perform a vector search over the knowledge base, we need to do the following.<\/p>\n<ol>\n<li>Compute similarities between the query embedding and all the text and image embeddings.<\/li>\n<li>Rescale the similarities to range from 0 to 1 via the softmax function.<\/li>\n<li>Sort the scaled similarities and return the top k\u00a0results.<\/li>\n<li>Finally, filter the results to only keep items 
above a pre-defined similarity threshold.<\/li>\n<\/ol>\n<p>Here\u2019s what that looks like in code for the text\u00a0chunks.<\/p>\n<pre># define k and similarity threshold<br>k = 5<br>threshold = 0.05<br><br># multimodal search over articles<br>text_similarities = matmul(query_embed, text_embeddings.T)<br><br># rescale similarities via softmax<br>temp = 0.25<br>text_scores = softmax(text_similarities\/temp, dim=1)<br><br># return top k filtered text results<br>isorted_scores = argsort(text_scores, descending=True)[0]<br>sorted_scores = text_scores[0][isorted_scores]<br><br>itop_k_filtered = [idx.item() <br>                    for idx, score in zip(isorted_scores, sorted_scores) <br>                    if score.item() &gt;= threshold][:k]<br>top_k = [text_content_list[i] for i in itop_k_filtered]<br><br>print(top_k)<\/pre>\n<pre># top k results<br><br>[{'article_title': 'Multimodal Embeddings: An Introduction',<br>  'section': 'Contrastive Learning',<br>  'text': 'Two key aspects of CL contribute to its effectiveness'}]<\/pre>\n<p>Above, we see the top text results. Notice we only have one item, even though <em>k<\/em>=5. 
This is because the 2nd-5th items were below the 0.05 threshold.<\/p>\n<p>Interestingly, this item doesn\u2019t seem helpful to our initial query of <em>\u201cWhat is CLIP\u2019s contrastive loss function?\u201d<\/em> This highlights <strong>one of the key challenges of vector search<\/strong>: <em>items similar to a given query may not necessarily help answer\u00a0it<\/em>.<\/p>\n<p>One way we can mitigate this issue is having less stringent restrictions on our search results by increasing <em>k<\/em> and lowering the similarity <em>threshold<\/em>, then hoping the LLM can work out what\u2019s helpful vs.\u00a0not.<\/p>\n<p>To do this, I\u2019ll first package the vector search steps into a Python function.<\/p>\n<pre>def similarity_search(query_embed, target_embeddings, content_list, <br>                      k=5, threshold=0.05, temperature=0.5):<br>    \"\"\"<br>       Perform similarity search over embeddings and return top k results.<br>    \"\"\"<br>    # Calculate similarities (matmul and softmax are imported at the top)<br>    similarities = matmul(query_embed, target_embeddings.T)<br>    <br>    # Rescale similarities via softmax<br>    scores = softmax(similarities\/temperature, dim=1)<br>    <br>    # Get sorted indices and scores<br>    sorted_indices = scores.argsort(descending=True)[0]<br>    sorted_scores = scores[0][sorted_indices]<br>    <br>    # Filter by threshold and get top k<br>    filtered_indices = [<br>        idx.item() for idx, score in zip(sorted_indices, sorted_scores) <br>        if score.item() &gt;= threshold<br>    ][:k]<br>    <br>    # Get corresponding content items and scores<br>    top_results = [content_list[i] for i in filtered_indices]<br>    result_scores = [scores[0][i].item() for i in filtered_indices]<br>    <br>    return top_results, result_scores<\/pre>\n<p>Then, set more inclusive search parameters.<\/p>\n<pre># search over text chunks<br>text_results, text_scores = similarity_search(query_embed, text_embeddings, <br>                   
 text_content_list, k=15, threshold=0.01, temperature=0.25)<br><br># search over images<br>image_results, image_scores = similarity_search(query_embed, image_embeddings, <br>                    image_content_list, k=5, threshold=0.25, temperature=0.5)<\/pre>\n<p>This results in 15 text results and 1 image\u00a0result.<\/p>\n<pre>1 - Two key aspects of CL contribute to its effectiveness<br>2 - To make a class prediction, we must extract the image logits and evaluate <br>which class corresponds to the maximum.<br>3 - Next, we can import a version of the clip model and its associated data <br>processor. Note: the processor handles tokenizing input text and image <br>preparation.<br>4 - The basic idea behind using CLIP for 0-shot image classification is to <br>pass an image into the model along with a set of possible class labels. Then, <br>a classification can be made by evaluating which text input is most similar to <br>the input image.<br>5 - We can then match the best image to the input text by extracting the text <br>logits and evaluating the image corresponding to the maximum.<br>6 - The code for these examples is freely available on the GitHub repository.<br>7 - We see that (again) the model nailed this simple example. But let\u2019s try <br>some trickier examples.<br>8 - Next, we\u2019ll preprocess the image\/text inputs and pass them into the model.<br>9 - Another practical application of models like CLIP is multimodal RAG, which <br>consists of the automated retrieval of multimodal context to an LLM. In the <br>next article of this series, we will see how this works under the hood and <br>review a concrete example.<br>10 - Another application of CLIP is essentially the inverse of Use Case 1. <br>Rather than identifying which text label matches an input image, we can <br>evaluate which image (in a set) best matches a text input (i.e. 
query)\u2014in <br>other words, performing a search over images.<br>11 - This has sparked efforts toward expanding LLM functionality to include <br>multiple modalities.<br>12 - GPT-4o \u2014 Input: text, images, and audio. Output: text.FLUX \u2014 Input: text. <br>Output: images.Suno \u2014 Input: text. Output: audio.<br>13 - The standard approach to aligning disparate embedding spaces is <br>contrastive learning (CL). A key intuition of CL is to represent different <br>views of the same information similarly [5].<br>14 - While the model is less confident about this prediction with a 54.64% <br>probability, it correctly implies that the image is not a meme.<br>15 - [8] Mini-Omni2: Towards Open-source GPT-4o with Vision, Speech and Duplex <br>Capabilities<\/pre>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/800\/1%2Arq89PAcqQ_lHgkYhkf5T4g.png?ssl=1\"><figcaption>Image search\u00a0result.<\/figcaption><\/figure>\n<h4>Prompting MLLM<\/h4>\n<p>Although most of these text item results do not seem helpful to our query, the image result is exactly what we\u2019re looking for. 
Nevertheless, given these search results, let\u2019s see how LLaMA 3.2 Vision responds to this\u00a0query.<\/p>\n<p>We will first structure the search results as well-formatted strings.<\/p>\n<pre>text_context = \"\"<br>for text in text_results:<br>    text_context += \"**Article title:** \" + text['article_title'] + \"\\n\"<br>    text_context += \"**Section:**  \" + text['section'] + \"\\n\"<br>    text_context += \"**Snippet:** \" + text['text'] + \"\\n\\n\"<\/pre>\n<pre>image_context = \"\"<br>for image in image_results:<br>    image_context += \"**Article title:** \" + image['article_title'] + \"\\n\"<br>    image_context += \"**Section:**  \" + image['section'] + \"\\n\"<br>    image_context += \"**Image Path:**  \" + image['image_path'] + \"\\n\"<br>    image_context += \"**Image Caption:** \" + image['caption'] + \"\\n\\n\"<\/pre>\n<p>Note the metadata that accompanies each text and image item. This will help LLaMA better understand the context of the\u00a0content.<\/p>\n<p>Next, we combine the text and image results in a\u00a0prompt.<\/p>\n<pre># construct prompt template<br>prompt = f\"\"\"Given the query \"{query}\" and the following relevant snippets:<br><br>{text_context}<br>{image_context}<br><br>Please provide a concise and accurate answer to the query, incorporating <br>relevant information from the provided snippets where possible.<br><br>\"\"\"<\/pre>\n<p>The final prompt is quite long, so I won\u2019t print it here. 
However, it is fully displayed in the <a href=\"https:\/\/github.com\/ShawhinT\/YouTube-Blog\/blob\/main\/multimodal-ai\/3-multimodal-rag\/2-mrag_example.ipynb\">example notebook<\/a> on\u00a0GitHub.<\/p>\n<p>Finally, we can use <a href=\"https:\/\/ollama.com\/\">ollama<\/a> to pass this prompt to LLaMA 3.2\u00a0Vision.<\/p>\n<pre>import ollama<br><br>ollama.pull('llama3.2-vision')<br><br>response = ollama.chat(<br>    model='llama3.2-vision',<br>    messages=[{<br>        'role': 'user',<br>        'content': prompt,<br>        'images': [image[\"image_path\"] for image in image_results]<br>    }]<br>)<br><br>print(response['message']['content'])<\/pre>\n<pre>The image depicts a contrastive loss function for aligning text and image <br>representations in multimodal models. The function is designed to minimize the <br>difference between the similarity of positive pairs (text-image) and negative <br>pairs (text-text or image-image). This loss function is commonly used in CLIP, <br>which stands for Contrastive Language-Image Pre-training.<br><br>**Key Components:**<br><br>*   **Positive Pairs:** Text-image pairs where the text describes an image.<br>*   **Negative Pairs:** Text-text or image-image pairs that do not belong to <br>the same class.<br>*   **Contrastive Loss Function:** Calculates the difference between positive <br>and negative pairs' similarities.<br><br>**How it Works:**<br><br>1.  **Text-Image Embeddings:** Generate embeddings for both text and images <br>using a multimodal encoder (e.g., CLIP).<br>2.  **Positive Pair Similarity:** Calculate the similarity score between each <br>text-image pair.<br>3.  **Negative Pair Similarity:** Calculate the similarity scores between all <br>negative pairs.<br>4.  
**Contrastive Loss Calculation:** Compute the contrastive loss by <br>minimizing the difference between positive and negative pairs' similarities.<br><br>**Benefits:**<br><br>*   **Multimodal Alignment:** Aligns text and image representations for better <br>understanding of visual content from text descriptions.<br>*   **Improved Performance:** Enhances performance in downstream tasks like <br>image classification, retrieval, and generation.<\/pre>\n<p>The model correctly picks up that the image contains the information it needs and explains the general intuition of how it works. However, it <strong>misunderstands the meaning of positive and negative pairs<\/strong>, thinking that a negative pair corresponds to a pair of the same modality.<\/p>\n<p>While we went through the implementation details step-by-step, I packaged everything into a nice UI using Gradio in this <a href=\"https:\/\/github.com\/ShawhinT\/YouTube-Blog\/blob\/main\/multimodal-ai\/3-multimodal-rag\/3-mrag_UI.ipynb\">notebook<\/a> on the GitHub\u00a0repo.<\/p>\n<p><a href=\"https:\/\/github.com\/ShawhinT\/YouTube-Blog\/tree\/main\/multimodal-ai\/3-multimodal-rag\">YouTube-Blog\/multimodal-ai\/3-multimodal-rag at main \u00b7 ShawhinT\/YouTube-Blog<\/a><\/p>\n<h3>Conclusion<\/h3>\n<p>Multimodal RAG systems can synthesize knowledge stored in a variety of formats, expanding what\u2019s possible with AI. Here, we reviewed 3 simple strategies for developing such a system and then saw an example implementation of a multimodal blog QA assistant.<\/p>\n<p>Although the example worked well enough for this demonstration, there are clear limitations to the search process. 
A few techniques that may improve this include using a <strong>reranker to refine similarity search<\/strong> results and improving search quality via <strong>fine-tuned multimodal embeddings<\/strong>.<\/p>\n<p>If you want to see future posts on these topics, let me know in the comments\u00a0\ud83d\ude42<\/p>\n<p><strong>More on Multimodal models\u00a0\ud83d\udc47<\/strong><\/p>\n<p><a href=\"https:\/\/shawhin.medium.com\/list\/fe9521d0e77a\">Multimodal AI<\/a><\/p>\n<p><strong>My website<\/strong>: <a href=\"https:\/\/www.shawhintalebi.com\/\">https:\/\/www.shawhintalebi.com\/<\/a><\/p>\n<p>[1] <a href=\"https:\/\/towardsdatascience.com\/how-to-improve-llms-with-rag-abdc132f76ac\">RAG<\/a><\/p>\n<p>[2] <a href=\"https:\/\/towardsdatascience.com\/multimodal-models-llms-that-can-see-and-hear-5c6737c981d3\">Multimodal LLMs<\/a><\/p>\n<p>[3] <a href=\"https:\/\/towardsdatascience.com\/multimodal-embeddings-an-introduction-5dc36975966f\">Multimodal Embeddings<\/a><\/p>\n<hr>\n<p><a href=\"https:\/\/towardsdatascience.com\/multimodal-rag-process-any-file-type-with-ai-e6921342c903\">Multimodal RAG: Process Any File Type with AI<\/a> was originally published in <a href=\"https:\/\/towardsdatascience.com\/\">Towards Data Science<\/a> on Medium, where people are continuing the conversation by highlighting and responding to this story.<\/p>\n<\/div>\n<p> \t<BR><br \/>\n <BR><\/BR><br \/>\n    Shaw Talebi<br \/>\n \t<BR><br \/>\n<BR><\/BR><br \/>\n<a href=\"https:\/\/medium.com\/m\/global-identity-2?redirectUrl=https%3A%2F%2Ftowardsdatascience.com%2Fmultimodal-rag-process-any-file-type-with-ai-e6921342c903\">Go to original source<\/a><br \/>\n \t<BR><br \/>\n <BR><\/BR><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Multimodal RAG: Process Any File Type with AI A 
beginner-friendly guide with example (Python)\u00a0code This is the third article in a larger series on multimodal AI. In the previous posts, we discussed multimodal LLMs and embedding models, respectively. In this article, we will combine these ideas to enable the development of multimodal RAG systems. I\u2019ll [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[151,62,166,70,475,157],"tags":[98,476,362],"class_list":["post-401","post","type-post","status-publish","format-standard","hentry","category-ai","category-aimldsaimlds","category-hands-on-tutorials","category-machine-learning","category-multimodal-rag","category-python","tag-ai","tag-multimodal","tag-rag"],"_links":{"self":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/401"}],"collection":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/comments?post=401"}],"version-history":[{"count":0,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/401\/revisions"}],"wp:attachment":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/media?parent=401"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/categories?post=401"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/tags?post=401"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}