{"id":1588,"date":"2025-02-01T07:03:28","date_gmt":"2025-02-01T07:03:28","guid":{"rendered":"https:\/\/mailitics.com\/index.php\/2025\/02\/01\/fine-tuning-multimodal-embedding-models-bf007b1c5da5\/"},"modified":"2025-02-01T07:03:28","modified_gmt":"2025-02-01T07:03:28","slug":"fine-tuning-multimodal-embedding-models-bf007b1c5da5","status":"publish","type":"post","link":"https:\/\/mailitics.com\/index.php\/2025\/02\/01\/fine-tuning-multimodal-embedding-models-bf007b1c5da5\/","title":{"rendered":"Fine-tuning Multimodal Embedding Models"},"content":{"rendered":"<div>\n<h4>Adapting CLIP to YouTube Data (with Python\u00a0Code)<\/h4>\n<p>This is the 4th article in a larger series on <a href=\"https:\/\/shawhin.medium.com\/list\/multimodal-ai-fe9521d0e77a\">multimodal AI<\/a>. In the previous post, we discussed <a href=\"https:\/\/towardsdatascience.com\/multimodal-rag-process-any-file-type-with-ai-e6921342c903\">multimodal RAG<\/a> systems, which can retrieve and synthesize information from different data modalities (e.g. text, images, audio). There, we saw how we could implement such a system using CLIP. One issue with this approach, however, is that vector search results from a general-purpose embedding model (like CLIP) <strong>may perform poorly in domain-specific use cases<\/strong>. 
In this article, I\u2019ll discuss how we can mitigate these issues via fine-tuning multimodal embedding models.<\/p>\n<figure><img decoding=\"async\" alt=\"\" src=\"https:\/\/cdn-images-1.medium.com\/max\/1024\/0*Y524lLKsF5spVr0K\"><figcaption>Photo by <a href=\"https:\/\/unsplash.com\/@markuswinkler?utm_source=medium&amp;utm_medium=referral\">Markus Winkler<\/a> on\u00a0<a href=\"https:\/\/unsplash.com\/?utm_source=medium&amp;utm_medium=referral\">Unsplash<\/a><\/figcaption><\/figure>\n<p><iframe loading=\"lazy\" src=\"https:\/\/cdn.embedly.com\/widgets\/media.html?src=https%3A%2F%2Fwww.youtube.com%2Fembed%2FW4s6b2ZM6kI%3Ffeature%3Doembed&amp;display_name=YouTube&amp;url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3DW4s6b2ZM6kI&amp;image=https%3A%2F%2Fi.ytimg.com%2Fvi%2FW4s6b2ZM6kI%2Fhqdefault.jpg&amp;type=text%2Fhtml&amp;schema=youtube\" width=\"854\" height=\"480\" frameborder=\"0\" scrolling=\"no\"><a href=\"https:\/\/medium.com\/media\/51e4418b1f33079dd5bd6cfce1e2b8e1\/href\">https:\/\/medium.com\/media\/51e4418b1f33079dd5bd6cfce1e2b8e1\/href<\/a><\/iframe><\/p>\n<p><strong>Multimodal embeddings<\/strong> represent multiple data modalities in the same vector space such that similar concepts are co-located. A visual example of this is shown below, where semantically <strong>similar items<\/strong> (e.g. a picture of a dog and its corresponding caption) <strong>are close<\/strong>, while <strong>dissimilar items<\/strong> (e.g. a picture of a cat and a caption describing a dog) <strong>are far\u00a0apart<\/strong>.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/749\/1%2Ac6UUQUk10H61dy4lCjsrtA.png?ssl=1\"><figcaption>Stock photos from Canva. 
Image by\u00a0author.<\/figcaption><\/figure>\n<p>A popular multimodal embedding model is CLIP, which was trained on a massive corpus of image-caption pairs using <a href=\"https:\/\/towardsdatascience.com\/multimodal-embeddings-an-introduction-5dc36975966f#09fe\">contrastive learning<\/a>. The key insight from <strong>CLIP<\/strong> was that such a model <strong>unlocks 0-shot abilities such as image classification, search, and captioning<\/strong> [1].<\/p>\n<p>One limitation here is that CLIP\u2019s 0-shot abilities <strong>may not transfer well to domains involving specialized information<\/strong> e.g. architectural drawings, medical imaging, and technical jargon. In such cases, we can improve CLIP\u2019s performance through fine-tuning.<\/p>\n<h3><strong>Fine-tuning CLIP<\/strong><\/h3>\n<p>Fine-tuning involves <strong>adapting a model to a particular use case through additional training<\/strong>. This is powerful because it enables us to build on top of existing state-of-the-art models to develop powerful specialized models with relatively small\u00a0data.<\/p>\n<p>We can do this with CLIP through the following key\u00a0steps.<\/p>\n<ol>\n<li>Collect text-image training\u00a0pairs<\/li>\n<li>Pre-process training\u00a0data<\/li>\n<li>Define Evals<\/li>\n<li>Fine-tune the\u00a0model<\/li>\n<li>Evaluate the\u00a0model<\/li>\n<\/ol>\n<p>I will discuss each of these steps in the context of a concrete example. If you are curious about what this looks like for text embedding (i.e. 
text-text pairs), I did that in a previous <a href=\"https:\/\/medium.com\/@shawhin\/fine-tuning-text-embeddings-f913b882b11c\">blog\u00a0post<\/a>.<\/p>\n<p><a href=\"https:\/\/shawhin.medium.com\/fine-tuning-text-embeddings-f913b882b11c\">Fine-Tuning Text Embeddings For Domain-Specific Search<\/a><\/p>\n<h3><strong>Example: Fine-tuning CLIP on YouTube Titles and Thumbnails<\/strong><\/h3>\n<p>Here, I will fine-tune CLIP on titles and thumbnails from my <a href=\"https:\/\/www.youtube.com\/@ShawhinTalebi\">YouTube channel<\/a>. At the end of this, we will have a model that can take title-thumbnail pairs and return a similarity score. This can be used for practical applications such as <strong>matching title ideas to an existing thumbnail<\/strong> or <strong>performing search over a thumbnail library<\/strong>.<\/p>\n<p>The <a href=\"https:\/\/github.com\/ShawhinT\/YouTube-Blog\/tree\/main\/multimodal-ai\/4-ft-mm-embeddings\">example code<\/a> is freely available on GitHub, and the <a href=\"https:\/\/huggingface.co\/datasets\/shawhin\/yt-title-thumbnail-pairs\">dataset<\/a> and <a href=\"https:\/\/huggingface.co\/shawhin\/clip-title-thumbnail-embeddings\">fine-tuned model<\/a> are freely available on the Hugging Face Hub. You can use this code and data to train your own models. If you end up publishing any work using this dataset, please cite the original source\u00a0\ud83d\ude42<\/p>\n<p><a href=\"https:\/\/github.com\/ShawhinT\/YouTube-Blog\/tree\/main\/multimodal-ai\/4-ft-mm-embeddings\">GitHub Repo<\/a> | <a href=\"https:\/\/huggingface.co\/datasets\/shawhin\/yt-title-thumbnail-pairs\">Dataset<\/a> | <a href=\"https:\/\/huggingface.co\/shawhin\/clip-title-thumbnail-embeddings\">Fine-tuned Model<\/a><\/p>\n<h3><strong>Step 1: Collect Text-Image Training\u00a0Pairs<\/strong><\/h3>\n<p>The first (and most important) step of any fine-tuning process is data collection. 
Here, I extracted title-thumbnail pairs from my channel in a 2-step\u00a0process.<\/p>\n<p>First, I used YouTube\u2019s search API to <strong>extract the video IDs<\/strong> for all the videos on my channel. Second, I used YouTube\u2019s video API to <strong>extract the title and thumbnail URL<\/strong> of each of my long-form videos (i.e. longer than 3\u00a0min).<\/p>\n<pre># imports<br>from top_secret import my_key<br>import requests<br>from isodate import parse_duration<br><br>import pandas as pd<br>import numpy as np<br>from sentence_transformers import SentenceTransformer<br>from datasets import DatasetDict, Dataset<\/pre>\n<pre>channel_id = 'UCa9gErQ9AE5jT2DZLjXBIdA' # my YouTube channel ID<br>page_token = None # initialize page token<br>url = 'https:\/\/www.googleapis.com\/youtube\/v3\/search' # YouTube search API <br><br># extract video data across multiple search result pages<br>video_id_list = []<br><br>while page_token != 0:<br>    params = {<br>        \"key\": my_key, <br>        'channelId': channel_id, <br>        'part': [\"snippet\",\"id\"], <br>        'order': \"date\", <br>        'maxResults':50, <br>        'pageToken': page_token<br>    }<br>    response = requests.get(url, params=params)<br><br>    for raw_item in dict(response.json())['items']:<br>        <br>        # only execute for youtube videos<br>        if raw_item['id']['kind'] != \"youtube#video\":<br>            continue<br><br>        # grab video ids<br>        video_id_list.append(raw_item['id']['videoId'])<br><br>    try:<br>        # grab next page token<br>        page_token = dict(response.json())['nextPageToken']<br>    except KeyError:<br>        # no next page token means this was the last page, so end the while loop<br>        page_token = 0<\/pre>\n<p>Note that you will need a YouTube API key to run the above Python code, which you can create using the <a href=\"https:\/\/console.cloud.google.com\/\">Google Cloud Console<\/a>. 
To adapt this to your channel, you just need to change the <em>channel_id<\/em> variable.<\/p>\n<pre># extract video titles and thumbnails<br>url = \"https:\/\/www.googleapis.com\/youtube\/v3\/videos\"<br>video_data_list = []<br><br>for video_id in video_id_list:<br><br>    params = {<br>        \"part\": [\"snippet\",\"contentDetails\"],<br>        \"id\": video_id,  <br>        \"key\": my_key,  <br>    }<br>    response = requests.get(url, params=params)<br>    <br>    raw_dict = dict(response.json())['items'][0]<br><br>    # only process videos longer than 3 minutes<br>    iso_duration = raw_dict['contentDetails'][\"duration\"]<br>    if parse_duration(iso_duration).total_seconds() &lt; 180:<br>        continue<br>    <br>    # extract video data<br>    video_data = {}<br>    video_data['video_id'] = video_id<br>    video_data['title'] = raw_dict['snippet']['title']<br>    video_data['thumbnail_url'] = raw_dict['snippet']['thumbnails']['high']['url']<br><br>    # append data to list<br>    video_data_list.append(video_data)<\/pre>\n<p>As an additional step, I <strong>created negative thumbnail-title pairs<\/strong>. We can use these during training to show the model not only which embeddings should be close together (i.e. positive pairs), but also which embeddings should be far apart (i.e. negative\u00a0pairs).<\/p>\n<p>To do this, I computed the similarity between all possible title pairs using the Sentence Transformers library. 
Then for each positive pair, I matched the least similar title as a negative example (ensuring there were no duplicates).<\/p>\n<pre># store data in dataframe<br>df = pd.DataFrame(video_data_list)<br><br># Load the model<br>model = SentenceTransformer(\"all-mpnet-base-v2\")<br><br># Encode all titles<br>embeddings = model.encode(df['title'].to_list())<br><br># compute similarities<br>similarities = model.similarity(embeddings, embeddings)<br><br># match each title with its least similar title as the negative example<br>similarities_argsorted = np.argsort(similarities.numpy(), axis=1)<br>negative_pair_index_list = []<br><br>for i in range(len(similarities)):<br><br>    # Start with the smallest similarity index for the current row<br>    j = 0<br>    index = int(similarities_argsorted[i][j])<br><br>    # Ensure the index is unique<br>    while index in negative_pair_index_list:<br>        j += 1  # Move to the next smallest index<br>        index = int(similarities_argsorted[i][j])  # Fetch next smallest index<br><br>    negative_pair_index_list.append(index)<br><br># add negative pairs to df<br>df['title_neg'] = df['title'].iloc[negative_pair_index_list].values<\/pre>\n<p>Finally, I created a <strong>train-valid-test split<\/strong> and pushed the dataset to the Hugging Face\u00a0Hub.<\/p>\n<pre># Shuffle the dataset<br>df = df.sample(frac=1, random_state=42).reset_index(drop=True)<br><br># Split into train, validation, and test sets<br>train_frac = 0.7<br>valid_frac = 0.15<br>test_frac = 0.15<br><br># define train and validation size<br>train_size = int(train_frac * len(df))<br>valid_size = int(valid_frac * len(df))<br><br># create train, validation, and test datasets<br>df_train = df[:train_size]<br>df_valid = df[train_size:train_size + valid_size]<br>df_test = df[train_size + valid_size:]<br><br># Convert the pandas DataFrames back to Hugging Face Datasets<br>train_ds = Dataset.from_pandas(df_train)<br>valid_ds = Dataset.from_pandas(df_valid)<br>test_ds = 
Dataset.from_pandas(df_test)<br><br># Combine into a DatasetDict<br>dataset_dict = DatasetDict({<br>    'train': train_ds,<br>    'valid': valid_ds,<br>    'test': test_ds<br>})<\/pre>\n<pre># push data to hub<br>dataset_dict.push_to_hub(\"shawhin\/yt-title-thumbnail-pairs\")<\/pre>\n<h3><strong>Step 2: Pre-process Training\u00a0Pairs<\/strong><\/h3>\n<p>Although we have all the data we need for fine-tuning, it is still not in a suitable format for training. More specifically, we need to <strong>convert our image URLs to PIL image objects <\/strong>and<strong> organize our data into (anchor, positive,<\/strong> <strong>negative) triplets<\/strong>,<strong> <\/strong>i.e., a thumbnail, its corresponding title, and negative title, respectively.<\/p>\n<p>We can process all three data splits (i.e. train, valid, and test) in the following way using the Hugging Face Datasets\u00a0library.<\/p>\n<pre>import requests<br>from datasets import load_dataset<br>from PIL import Image<br><br># load dataset<br>dataset = load_dataset(\"shawhin\/yt-title-thumbnail-pairs\")<br><br># define preprocessing function<br>def preprocess(batch):<br>    \"\"\"<br>        Download thumbnails as PIL images and rename columns (no augmentations)<br>    \"\"\"<br>    # get images from urls<br>    image_list = [Image.open(requests.get(url, stream=True).raw) <br>                      for url in batch[\"thumbnail_url\"]]<br><br>    # return columns with standard names<br>    return {<br>        \"anchor\": image_list,       <br>        \"positive\": batch[\"title\"],  <br>        \"negative\": batch[\"title_neg\"]<br>    }<br><br># remove columns not relevant to training<br>columns_to_remove = [col for col in dataset['train'].column_names <br>                        if col not in ['anchor', 'positive', 'negative']]<br># apply transformations<br>dataset = dataset.map(preprocess, batched=True, <br>                         remove_columns=columns_to_remove)<\/pre>\n<p>It\u2019s important that we order our columns as (anchor, positive, negative) triplets because <strong>this is 
the format expected by the loss function<\/strong> we will use during training (which I learned the hard\u00a0way).<\/p>\n<h3><strong>Step 3: Define\u00a0Evals<\/strong><\/h3>\n<p>Training involves optimizing a model&#8217;s parameters to minimize a loss function. However, this value (i.e. a contrastive loss) is rarely helpful in <strong>assessing the model\u2019s performance on a downstream task<\/strong> (e.g. matching titles to thumbnails).<\/p>\n<p>A quantity that is more insightful, in this case, is the model\u2019s ability to correctly <strong>match a given thumbnail to the correct title<\/strong> among several candidates. This is denoted <strong>Recall@1<\/strong>.<\/p>\n<p>We can implement an evaluator compatible with the Sentence Transformers library to compute this metric. Since the code is quite long, I won\u2019t paste it here, but the curious reader can find it in Cell 12 of <a href=\"https:\/\/github.com\/ShawhinT\/YouTube-Blog\/blob\/main\/multimodal-ai\/4-ft-mm-embeddings\/2-finetune_clip_sbert.ipynb\">this notebook<\/a>.<\/p>\n<pre># function to create new evaluator given data split<br>def create_recall_evaluator(set_name, k=1):<br>    \"\"\"<br>        Create Recall@k evaluator for \"train\", \"valid\", or \"test\" split<br>    \"\"\"<br><br>    return ImageTextRetrievalEvaluator(<br>        images=dataset[f\"{set_name}\"][\"anchor\"],<br>        texts=dataset[f\"{set_name}\"][\"positive\"],<br>        name=f\"yt-title-thumbnail-{set_name}\",<br>        k=k<br>    )<br><br># Create new evaluator with Recall@k<br>evaluator_recall_train = create_recall_evaluator(\"train\", k=1)<br>evaluator_recall_valid = create_recall_evaluator(\"valid\", k=1)<br><br>print(\"Train:\", evaluator_recall_train(model))<br>print(\"Valid:\", evaluator_recall_valid(model))<br><br># &gt;&gt; Train: {'yt-title-thumbnail-train_Recall@1': 0.660377358490566}<br># &gt;&gt; Valid: {'yt-title-thumbnail-valid_Recall@1': 0.6363636363636364}<\/pre>\n<p>We can see the model already 
has decent performance out-of-the-box, with correct titles being matched 66% of the time on the training set and 64% on the validation\u00a0set.<\/p>\n<h3><strong>Step 4: Fine-tune the\u00a0Model<\/strong><\/h3>\n<p>There are <strong>3 key things<\/strong> we must do before training the model. Namely, choose which parameters to train, pick a loss function, and set hyperparameters.<\/p>\n<h4><strong>Trainable Parameters<\/strong><\/h4>\n<p>The key limitation of this project is that I\u2019ve only posted 76 YouTube videos (as of writing this). With the validation and test splits, this leaves<strong> only 53 examples for training<\/strong>.<\/p>\n<p>Since we have so few training examples, <strong>limiting the number of parameters we train is a good idea<\/strong>. In this case, I only train the final projection layer of the model, which maps the text and image embeddings into a shared vector space. This is about 1.4M parameters total.<\/p>\n<pre># import model<br>from sentence_transformers import SentenceTransformer<br>model = SentenceTransformer(\"sentence-transformers\/clip-ViT-L-14\")<br><br># pick specific layers to train (note: you can add more layers to this list)<br>trainable_layers_list = ['projection']<br><br># Apply freezing configuration<br>for name, param in model.named_parameters():<br>    <br>    # freeze all params<br>    param.requires_grad = False<br><br>    # unfreeze layers in trainable_layers_list<br>    if any(layer in name for layer in trainable_layers_list):<br>        param.requires_grad = True<\/pre>\n<pre># Count total and trainable parameters<br>total_params = sum(p.numel() for p in model.parameters())<br>trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)<br><br>print(f\"Total parameters: {total_params:,}\")<br>print(f\"Trainable parameters: {trainable_params:,}\")<br>print(f\"% of trainable parameters: {100*trainable_params\/total_params:.2f}%\")<br><br># &gt;&gt; Total parameters: 427,616,513<br># &gt;&gt; Trainable parameters: 1,376,256<br># &gt;&gt; % of 
trainable parameters: 0.32%<\/pre>\n<h4><strong>Loss function<\/strong><\/h4>\n<p>Here, I use the <a href=\"https:\/\/sbert.net\/docs\/package_reference\/sentence_transformer\/losses.html#multiplenegativesrankingloss\">Multiple Negatives Ranking Loss<\/a> from the Sentence Transformers library (which works with single negatives like in this case). It works by <strong>maximizing the similarity between positive pairs<\/strong> while <strong>minimizing the similarity between negative pairs<\/strong>. Here\u2019s what the loss function looks like for the single negative case\u00a0[2].<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/732\/1%2A5Rw76x_hN5cQUz5VAwILcQ.png?ssl=1\"><figcaption>Multiple negatives loss function (with only 1 negative). Image by\u00a0author.<\/figcaption><\/figure>\n<pre>from sentence_transformers.losses import MultipleNegativesRankingLoss<br><br># define loss<br>loss = MultipleNegativesRankingLoss(model)<\/pre>\n<h4><strong>Hyperparameters<\/strong><\/h4>\n<p>For hyperparameters, I experimented with a handful of choices manually and picked the choice with the best validation loss and Recall@1 performance. 
Here are the final\u00a0choices.<\/p>\n<pre>from sentence_transformers import SentenceTransformerTrainingArguments<br><br># hyperparameters<br>num_epochs = 2<br>batch_size = 16<br>lr = 1e-4<br>finetuned_model_name = \"clip-title-thumbnail-embeddings\"<br><br>train_args = SentenceTransformerTrainingArguments(<br>    output_dir=f\"models\/{finetuned_model_name}\",<br>    num_train_epochs=num_epochs,<br>    per_device_train_batch_size=batch_size,<br>    per_device_eval_batch_size=batch_size,<br>    learning_rate=lr,<br>    # Evaluation settings<br>    eval_strategy=\"epoch\",<br>    eval_steps=1,<br>    logging_steps=1,<br>)<\/pre>\n<p>With our loss and hyperparameters defined, we can train the model using the SentenceTransformerTrainer().<\/p>\n<pre>from sentence_transformers import SentenceTransformerTrainer<br><br>trainer = SentenceTransformerTrainer(<br>    model=model,<br>    args=train_args,<br>    train_dataset=dataset[\"train\"],<br>    eval_dataset=dataset[\"valid\"],<br>    loss=loss,<br>    evaluator=[evaluator_recall_train, evaluator_recall_valid],<br>)<br>trainer.train()<\/pre>\n<p><strong>Model training is an<\/strong> <strong>iterative process<\/strong> where you may explore dozens of models for different choices of trainable parameters, loss functions, and hyperparameters.<\/p>\n<p>However, I highly recommend <strong>keeping these experiments as simple as possible<\/strong>. If you find yourself spending too much time tweaking training args to get your model to converge, there\u2019s probably something fundamentally wrong with your data (speaking from experience \ud83d\ude05).<\/p>\n<h3><strong>Step 5: Evaluate the\u00a0Model<\/strong><\/h3>\n<p>As a final step, we can evaluate the model\u2019s Recall@1 score on the testing set. 
These data were not used for training or hyperparameter tuning, so they give us an unbiased assessment of the\u00a0model.<\/p>\n<pre>evaluator_recall_test = create_recall_evaluator(\"test\")<br><br>print(\"Train:\", evaluator_recall_train(model))<br>print(\"Valid:\", evaluator_recall_valid(model))<br>print(\"Test:\", evaluator_recall_test(model))<br><br># &gt;&gt; Train: {'yt-title-thumbnail-train_Recall@1': 0.8490566037735849}<br># &gt;&gt; Valid: {'yt-title-thumbnail-valid_Recall@1': 0.9090909090909091}<br># &gt;&gt; Test: {'yt-title-thumbnail-test_Recall@1': 0.75}<\/pre>\n<p>We see that the model performs well across all three datasets with <strong>75% Recall@1 on the test set<\/strong>. In other words, 75% of the time, the model correctly matches a given thumbnail to its original title. Additionally, Recall@1 on the validation set increases by 27 percentage points (from 64% to\u00a091%)!<\/p>\n<h3><strong>What\u2019s Next?<\/strong><\/h3>\n<p>Multimodal embedding models, like CLIP, unlock countless 0-shot use cases such as image classification and retrieval. Here, we saw how we can fine-tune such a model to adapt it to a specialized domain (i.e. my YouTube titles and thumbnails).<\/p>\n<p>Although CLIP is a small model by today\u2019s standards (~430M parameters) and our training dataset was tiny, <strong>the final model still demonstrated strong performance on this task<\/strong>. 
This highlights the power of fine-tuning.<\/p>\n<p>If you have any questions or suggestions for future content, let me know in the comments\u00a0\ud83d\ude42<\/p>\n<p><strong>More on Multimodal AI\u00a0\ud83d\udc47<\/strong><\/p>\n<p><a href=\"https:\/\/shawhin.medium.com\/list\/fe9521d0e77a\">Multimodal AI<\/a><\/p>\n<p><strong>\ud83d\uddde\ufe0f Get exclusive access to AI resources and project ideas<\/strong>: <a href=\"https:\/\/the-data-entrepreneurs.kit.com\/shaw\">https:\/\/the-data-entrepreneurs.kit.com\/shaw<\/a><\/p>\n<p><strong>\ud83e\uddd1\u200d\ud83c\udf93 Learn AI in 6 weeks by building it<\/strong>: <a href=\"https:\/\/maven.com\/shaw-talebi\/ai-builders-bootcamp?promoCode=AI25\">https:\/\/maven.com\/shaw-talebi\/ai-builders-bootcamp?promoCode=AI25<\/a><\/p>\n<h3>References<\/h3>\n<p>[1] <a href=\"https:\/\/arxiv.org\/abs\/2103.00020\">arXiv:2103.00020<\/a><strong> [cs.CV]<\/strong><\/p>\n<p>[2] <a href=\"https:\/\/arxiv.org\/abs\/1705.00652\">arXiv:1705.00652<\/a><strong> [cs.CL]<\/strong><\/p>\n<hr>\n<p><a href=\"https:\/\/medium.com\/towards-data-science\/fine-tuning-multimodal-embedding-models-bf007b1c5da5\">Fine-tuning Multimodal Embedding Models<\/a> was originally published in <a href=\"https:\/\/towardsdatascience.com\/\">Towards Data Science<\/a> on Medium, where people are continuing the conversation by highlighting and responding to this story.<\/p>\n<\/div>\n<p>Shaw Talebi<br \/>\n<a href=\"https:\/\/medium.com\/towards-data-science\/fine-tuning-multimodal-embedding-models-bf007b1c5da5\">Go to original source<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Fine-tuning Multimodal Embedding Models Adapting CLIP to YouTube Data 
(with Python\u00a0Code) This is the 4th article in a larger series on multimodal AI. In the previous post, we discussed multimodal RAG systems, which can retrieve and synthesize information from different data modalities (e.g. text, images, audio). There, we saw how we could implement such a [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[151,62,1112,70,1590,1287],"tags":[1591,135,1113],"class_list":["post-1588","post","type-post","status-publish","format-standard","hentry","category-ai","category-aimldsaimlds","category-fine-tuning","category-machine-learning","category-sentence-transformers","category-transformers","tag-clip","tag-fine","tag-tuning"],"_links":{"self":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/1588"}],"collection":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/comments?post=1588"}],"version-history":[{"count":0,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/1588\/revisions"}],"wp:attachment":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/media?parent=1588"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/categories?post=1588"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/tags?post=1588"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}