{"id":866,"date":"2024-12-30T07:03:01","date_gmt":"2024-12-30T07:03:01","guid":{"rendered":"https:\/\/mailitics.com\/index.php\/2024\/12\/30\/segmenting-water-in-satellite-images-using-paligemma-b172dc0cf55d\/"},"modified":"2024-12-30T07:03:01","modified_gmt":"2024-12-30T07:03:01","slug":"segmenting-water-in-satellite-images-using-paligemma-b172dc0cf55d","status":"publish","type":"post","link":"https:\/\/mailitics.com\/index.php\/2024\/12\/30\/segmenting-water-in-satellite-images-using-paligemma-b172dc0cf55d\/","title":{"rendered":"Segmenting Water in Satellite Images Using Paligemma"},"content":{"rendered":"<p>    Segmenting Water in Satellite Images Using Paligemma<br \/>\n \t<BR><br \/>\n<BR><\/BR><br \/>\n    <!-- no image --><br \/>\n \t<BR><br \/>\n<BR><\/BR><\/p>\n<div>\n<h4>Some insights on using Google\u2019s latest Vision Language\u00a0Model<\/h4>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/506\/1%2AaG0wm_9RuieOSxFYwxqN2g.png?ssl=1\"><figcaption>Hutt Lagoon, Australia. Depending on the season, time of day, and cloud coverage, this lake changes from red to pink or purple. Source: Google\u00a0Maps.<\/figcaption><\/figure>\n<p>Multimodal models are architectures that simultaneously integrate and process different data types, such as text, images, and audio. Some examples include CLIP and DALL-E from OpenAI, both released in 2021. CLIP understands images and text jointly, allowing it to perform tasks like zero-shot image classification. DALL-E, on the other hand, generates images from textual descriptions, allowing the automation and enhancement of creative processes in gaming, advertising, and literature, among other\u00a0sectors.<\/p>\n<p>Visual language models (VLMs) are a special case of multimodal models. VLMs generate language based on visual inputs. One prominent example is Paligemma, which Google introduced in May 2024. 
Paligemma can be used for Visual Question Answering, object detection, and image segmentation.<\/p>\n<p>Some blog posts explore the capabilities of Paligemma in object detection, such as this excellent read from Roboflow:<\/p>\n<p><a href=\"https:\/\/blog.roboflow.com\/how-to-fine-tune-paligemma\/\">Fine-tune PaliGemma for Object Detection with Custom Data<\/a><\/p>\n<p>However, at the time I wrote this post, the existing documentation on preparing data to use Paligemma for object segmentation was vague. That is why I wanted to evaluate how easy it is to use Paligemma for this task. Here, I share my experience.<\/p>\n<h3>A brief introduction to Paligemma<\/h3>\n<p>Before going into detail on the use case, let\u2019s briefly revisit the inner workings of Paligemma.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/592\/1%2AcO0kQhmvh0iYdWga5jdmaA.png?ssl=1\"><figcaption>Architecture of Paligemma2. Source: <a href=\"https:\/\/arxiv.org\/abs\/2412.03555\">https:\/\/arxiv.org\/abs\/2412.03555<\/a><\/figcaption><\/figure>\n<p>Paligemma combines a <a href=\"https:\/\/arxiv.org\/abs\/2303.15343\">SigLIP-So400m vision encoder<\/a> with a <a href=\"https:\/\/developers.googleblog.com\/en\/gemma-explained-overview-gemma-model-family-architectures\/\">Gemma language model<\/a> to process images and text (see figure above). In the new version of Paligemma released in December 2024, the vision encoder can process images at three different resolutions: 224px, 448px, or 896px. The vision encoder takes an image and outputs a sequence of image tokens, which are linearly projected and concatenated with the input text tokens. The combined token sequence is then processed by the Gemma language model, which outputs text tokens. 
The Gemma model comes in different sizes, from 2B to 27B parameters.<\/p>\n<p>An example of model output is shown in the following figure.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/584\/1%2AeZpxhyVuoaS6A7AB-OiSjQ.png?ssl=1\"><figcaption>Example of an object segmentation output. Source: <a href=\"https:\/\/arxiv.org\/abs\/2412.03555\">https:\/\/arxiv.org\/abs\/2412.03555<\/a><\/figcaption><\/figure>\n<p>The Paligemma model was trained on various datasets such as <a href=\"https:\/\/paperswithcode.com\/dataset\/webli\">WebLI<\/a>, <a href=\"https:\/\/storage.googleapis.com\/openimages\/web\/index.html\">OpenImages<\/a>, <a href=\"https:\/\/github.com\/google-research-datasets\/wit\">WIT<\/a>, and others (see this <a href=\"https:\/\/www.kaggle.com\/models\/google\/paligemma\">Kaggle model card<\/a> for more details). This means that Paligemma can identify objects without fine-tuning. However, such abilities are limited. That\u2019s why Google recommends fine-tuning Paligemma for domain-specific use\u00a0cases.<\/p>\n<h4>Input format<\/h4>\n<p>To fine-tune Paligemma, the input data needs to be in JSONL format. In a JSONL dataset, each line is a separate JSON object, like a list of individual records. Each JSON object contains the following keys:<\/p>\n<p><strong>Image:<\/strong> The image\u2019s file\u00a0name.<\/p>\n<p><strong>Prefix: <\/strong>This specifies the task you want the model to\u00a0perform.<\/p>\n<p><strong>Suffix:<\/strong> This provides the ground truth that the model learns to predict.<\/p>\n<p>Depending on the task, you must change the JSON object&#8217;s prefix and suffix accordingly. 
Here are some examples:<\/p>\n<ul>\n<li><strong>Image captioning:<\/strong><\/li>\n<\/ul>\n<pre>{\"image\": \"some_filename.png\", <br> \"prefix\": \"caption en\",<br> \"suffix\": \"This is an image of a big, white boat traveling in the ocean.\"<br>}<\/pre>\n<p>The prefix \"caption en\" indicates that the model should generate an English caption for the image.<\/p>\n<ul>\n<li><strong>Question answering:<\/strong><\/li>\n<\/ul>\n<pre>{\"image\": \"another_filename.jpg\", <br> \"prefix\": \"How many people are in the image?\",<br> \"suffix\": \"ten\"<br>}<\/pre>\n<ul>\n<li><strong>Object detection:<\/strong><\/li>\n<\/ul>\n<pre>{\"image\": \"filename.jpeg\", <br> \"prefix\": \"detect airplane\",<br> \"suffix\": \"&lt;loc0055&gt;&lt;loc0115&gt;&lt;loc1023&gt;&lt;loc1023&gt; airplane\"<br>}<\/pre>\n<p>The four &lt;loc&gt; tokens encode the bounding box as y_min, x_min, y_max, x_max, each normalized to the range 0 to 1023.<\/p>\n<p>If you have several categories to be detected, separate them with a semicolon (;) in both the prefix and the\u00a0suffix.<\/p>\n<p>A complete and clear explanation of how to prepare the data for object detection in Paligemma can be found in <a href=\"https:\/\/blog.roboflow.com\/how-to-fine-tune-paligemma\/\">this Roboflow\u00a0post<\/a>.<\/p>\n<ul>\n<li><strong>Image segmentation:<\/strong><\/li>\n<\/ul>\n<pre>{\"image\": \"filename.jpeg\", <br> \"prefix\": \"segment airplane\",<br> \"suffix\": \"&lt;loc0055&gt;&lt;loc0115&gt;&lt;loc1023&gt;&lt;loc1023&gt;&lt;seg063&gt;&lt;seg108&gt;&lt;seg045&gt;&lt;seg028&gt;&lt;seg056&gt;&lt;seg052&gt;&lt;seg114&gt;&lt;seg005&gt;&lt;seg042&gt;&lt;seg023&gt;&lt;seg084&gt;&lt;seg064&gt;&lt;seg086&gt;&lt;seg077&gt;&lt;seg090&gt;&lt;seg054&gt; airplane\" <br>}<\/pre>\n<p>Note that for segmentation, apart from the object\u2019s bounding box coordinates, you need to specify 16 extra segmentation tokens representing a mask that fits within the bounding box. 
According to <a href=\"https:\/\/github.com\/google-research\/big_vision\/blob\/main\/big_vision\/configs\/proj\/paligemma\/README.md#tokenizer\">Google\u2019s Big Vision repository<\/a>, those tokens are codewords with 128 entries (&lt;seg000&gt;\u2026&lt;seg127&gt;). How do we obtain these values? In my personal experience, it was challenging and frustrating to get them without proper documentation. But I\u2019ll give more details\u00a0later.<\/p>\n<p>If you are interested in learning more about Paligemma, I recommend these\u00a0blogs:<\/p>\n<ul>\n<li><a href=\"https:\/\/huggingface.co\/blog\/paligemma2\">Welcome PaliGemma 2 &#8211; New vision language models by Google<\/a><\/li>\n<li><a href=\"https:\/\/www.datature.io\/blog\/introducing-paligemma-googles-latest-visual-language-model\">Introducing PaliGemma: Google&#8217;s Latest Visual Language Model<\/a><\/li>\n<\/ul>\n<h3>Satellite images of water\u00a0bodies<\/h3>\n<p>As mentioned above, Paligemma was trained on different datasets. Therefore, this model is expected to be good at segmenting \u201ctraditional\u201d objects such as cars, people, or animals. But what about segmenting objects in satellite images? This question led me to explore Paligemma\u2019s capabilities for segmenting water in satellite images.<\/p>\n<p>Kaggle\u2019s <a href=\"https:\/\/www.kaggle.com\/datasets\/franciscoescobar\/satellite-images-of-water-bodies\">Satellite Image of Water Bodies dataset<\/a> is suitable for this purpose. This dataset contains 2841 images with their corresponding masks.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/890\/1%2Aerw_D-Y0KxKSfvy5pafoJg.png?ssl=1\"><figcaption>Here&#8217;s an example of the water bodies dataset: The RGB image is shown on the left, while the corresponding mask appears on the\u00a0right.<\/figcaption><\/figure>\n<p>Some masks in this dataset were incorrect, and others needed further preprocessing. 
Faulty examples include masks in which every pixel was labeled as water even though water covered only a small part of the original image. Other masks did not correspond to their RGB images. In addition, when an image is rotated, the empty areas outside the rotated frame appear in some masks as if they were\u00a0water.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/433\/1%2AUQQCzEH0soY6T7N_989EQQ.png?ssl=1\"><figcaption>Example of a rotated mask. When reading this image in Python, the area outside the image appears as if it were water. In this case, image rotation is needed to correct this mask. Image made by the\u00a0author.<\/figcaption><\/figure>\n<p>Given these data limitations, I selected a sample of 164 images for which the masks did not have any of the problems mentioned above. This set of images is used to fine-tune Paligemma.<\/p>\n<h4>Preparing the JSONL\u00a0dataset<\/h4>\n<p>As explained in the previous section, Paligemma needs entries that represent the object\u2019s bounding box coordinates in normalized image-space (&lt;loc0000&gt;\u2026&lt;loc1023&gt;) plus 16 extra segmentation tokens, each drawn from a codebook of 128 codewords (&lt;seg000&gt;\u2026&lt;seg127&gt;). Obtaining the bounding box coordinates in the desired format was easy, thanks to <a href=\"https:\/\/blog.roboflow.com\/how-to-fine-tune-paligemma\/\">Roboflow\u2019s explanation<\/a>. But how do we obtain the segmentation codewords from the masks? There was no clear documentation or examples in the Big Vision repository that I could use for my use case. I naively thought that the process of creating the segmentation tokens was similar to that of making the bounding boxes. However, this led to an incorrect representation of the water masks, which in turn produced wrong predictions.<\/p>\n<p>Around the time I wrote this post (beginning of December 2024), Google announced the second version of Paligemma. 
Following this event, Roboflow published <a href=\"https:\/\/blog.roboflow.com\/fine-tune-paligemma-2\/\">a nice overview<\/a> of preparing data to fine-tune Paligemma2 for different applications, including image segmentation. I used part of their code to finally obtain the correct segmentation codewords. What was my mistake? Well, first of all, the masks need to be resized to a tensor of shape [None, 64, 64, 1]. A pre-trained variational auto-encoder (VAE) then converts the resized annotation masks into text labels. Although the usage of a VAE model was briefly mentioned in the Big Vision repository, there are no explanations or examples of how to use\u00a0it.<\/p>\n<p>The workflow I use to prepare the data to fine-tune Paligemma is shown\u00a0below:<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/891\/1%2AiCXuZzahO_9iGwTNQmmAfQ.png?ssl=1\"><figcaption>Steps to convert one original mask from the filtered <a href=\"https:\/\/www.kaggle.com\/datasets\/franciscoescobar\/satellite-images-of-water-bodies\">water bodies dataset<\/a> to a JSON object. This process is repeated over the 164 images of the train set and the 21 images of the test dataset to build the JSONL\u00a0dataset.<\/figcaption><\/figure>\n<p>As you can see, preparing the data for Paligemma takes many steps, so I don\u2019t share code snippets here. However, if you want to explore the code, you can visit <a href=\"https:\/\/github.com\/anamabo\/SegmentWaterWithPaligemma\">this GitHub repository<\/a>. The script <em>convert.py<\/em> contains all the steps in the workflow shown above. 
I also added the selected images so you can play with this script immediately.<\/p>\n<p>When decoding the segmentation codewords back into segmentation masks, we can check how these masks cover the water bodies in the\u00a0images:<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2AAGnPsJFen8ykU6yoaEVqFA.png?ssl=1\"><figcaption>Resulting masks when decoding the segmentation codewords in the train set. Image made by the author using <a href=\"https:\/\/github.com\/anamabo\/SegmentWaterWithPaligemma\/blob\/main\/finetune_paligemma_for_segmentation.ipynb\">this Notebook<\/a>.<\/figcaption><\/figure>\n<h3>How good is Paligemma at segmenting water in satellite images?<\/h3>\n<p>Before fine-tuning Paligemma, I tried the segmentation capabilities of the models uploaded to Hugging Face. This platform has <a href=\"https:\/\/huggingface.co\/spaces\/big-vision\/paligemma\">a demo<\/a> where you can upload images and interact with different Paligemma models.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2AUOf1poS5PE64cBpz9IzdmA.gif?ssl=1\"><figcaption>The default Paligemma model segmenting water in satellite images.<\/figcaption><\/figure>\n<p>The current version of Paligemma is generally good at segmenting water in satellite images, but it\u2019s not perfect. Let\u2019s see if we can improve these\u00a0results!<\/p>\n<p>There are two ways to fine-tune Paligemma, either through <a href=\"https:\/\/huggingface.co\/blog\/paligemma#using-transformers-1\">Hugging Face\u2019s Transformers library<\/a> or by using Big Vision and JAX. I chose the latter. Big Vision provides a <a href=\"https:\/\/colab.research.google.com\/github\/google-research\/big_vision\/blob\/main\/big_vision\/configs\/proj\/paligemma\/finetune_paligemma.ipynb\">Colab notebook<\/a>, which I modified for my use case. 
You can open it by going to my <a href=\"https:\/\/github.com\/anamabo\/SegmentWaterWithPaligemma?tab=readme-ov-file\">GitHub repository<\/a>:<\/p>\n<p><a href=\"https:\/\/github.com\/anamabo\/SegmentWaterWithPaligemma\/blob\/main\/finetune_paligemma_for_segmentation.ipynb\">SegmentWaterWithPaligemma\/finetune_paligemma_for_segmentation.ipynb at main \u00b7 anamabo\/SegmentWaterWithPaligemma<\/a><\/p>\n<p>I used a <em>batch size<\/em> of 8 and a <em>learning rate<\/em> of 0.003. I ran the training loop twice, which translates to 158 training steps. The total running time using a T4 GPU machine was 24\u00a0minutes.<\/p>\n<p>The results were not as expected. Paligemma did not produce predictions for some images, and for others, the resulting masks were far from the ground truth. I also obtained segmentation codewords with more than 16 tokens for two\u00a0images.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/530\/1%2A1jdVzbu4xmFoSa89umsjrw.png?ssl=1\"><figcaption>Fine-tuning results for the images where the model produced predictions. Image made by the\u00a0author.<\/figcaption><\/figure>\n<p>It\u2019s worth mentioning that I used the first Paligemma version. Perhaps the results would improve with Paligemma2 or with further tuning of the batch size or learning rate. In any case, these experiments are out of the scope of this\u00a0post.<\/p>\n<p>The demo results show that the default Paligemma model is better at segmenting water than my fine-tuned model. In my opinion, U-Net is a better architecture if the aim is to build a model specialized in segmenting objects. 
For more information on how to train such a model, you can read my previous blog\u00a0post:<\/p>\n<p><a href=\"https:\/\/towardsdatascience.com\/detecting-clouds-with-ai-b553e6576af6\">Detecting Clouds with AI<\/a><\/p>\n<h4>Other limitations:<\/h4>\n<p>I want to mention some other challenges I encountered when fine-tuning Paligemma using Big Vision and\u00a0JAX.<\/p>\n<ul>\n<li>Setting up different model configurations is difficult because there\u2019s still little documentation on those parameters.<\/li>\n<li>The first version of Paligemma was trained to handle images of different aspect ratios resized to 224&#215;224. Make sure to resize your input images to exactly this size; otherwise, exceptions are raised.<\/li>\n<li>When fine-tuning with Big Vision and JAX, you might run into GPU-related JAX problems. Ways to overcome this issue\u00a0are:<\/li>\n<\/ul>\n<p>a. Reducing the samples in your training and validation datasets.<\/p>\n<p>b. Increasing the batch size from 8 to 16 or\u00a0higher.<\/p>\n<ul>\n<li>The fine-tuned model has a size of about 5 GB. Make sure to have enough space in your Drive to store\u00a0it.<\/li>\n<\/ul>\n<h3>Takeaway messages<\/h3>\n<p>Discovering a new AI model is exciting, especially in this age of multimodal algorithms transforming our society. However, working with state-of-the-art models can sometimes be challenging due to the lack of available documentation. Therefore, the launch of a new AI model should be accompanied by comprehensive documentation to ensure its smooth and widespread adoption, especially among professionals who are still inexperienced in this\u00a0area.<\/p>\n<p>Despite the difficulties I encountered fine-tuning Paligemma, the current pre-trained models are capable of zero-shot object detection and image segmentation, which can be used for many applications, including assisted ML labeling.<\/p>\n<p>Are you using Paligemma in your Computer Vision projects? 
Share your experience fine-tuning this model in the comments!<\/p>\n<p>I hope you enjoyed this post. Once more, thanks for\u00a0reading!<\/p>\n<p>You can contact me via LinkedIn\u00a0at:<\/p>\n<p><a href=\"https:\/\/www.linkedin.com\/in\/camartinezbarbosa\/\">https:\/\/www.linkedin.com\/in\/camartinezbarbosa\/<\/a><\/p>\n<p><em>Acknowledgments: I want to thank Jos\u00e9 Celis-Gil for all the fruitful discussions on data preprocessing and modeling.<\/em><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/medium.com\/_\/stat?event=post.clientViewed&amp;referrerSource=full_rss&amp;postId=b172dc0cf55d\" width=\"1\" height=\"1\" alt=\"\"><\/p>\n<hr>\n<p><a href=\"https:\/\/towardsdatascience.com\/segmenting-water-in-satellite-images-using-paligemma-b172dc0cf55d\">Segmenting Water in Satellite Images Using Paligemma<\/a> was originally published in <a href=\"https:\/\/towardsdatascience.com\/\">Towards Data Science<\/a> on Medium, where people are continuing the conversation by highlighting and responding to this story.<\/p>\n<\/div>\n<p> \t<BR><br \/>\n <BR><\/BR><br \/>\n    Dr. Carmen Adriana Mart\u00ednez Barbosa<br \/>\n \t<BR><br \/>\n<BR><\/BR><br \/>\n<a href=\"https:\/\/medium.com\/m\/global-identity-2?redirectUrl=https%3A%2F%2Ftowardsdatascience.com%2Fsegmenting-water-in-satellite-images-using-paligemma-b172dc0cf55d\">Go to original source<\/a><br \/>\n \t<BR><br \/>\n <BR><\/BR><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Segmenting Water in Satellite Images Using Paligemma Some insights on using Google\u2019s latest Vision Language\u00a0Model Hutt Lagoon, Australia. Depending on the season, time of day, and cloud coverage, this lake changes from red to pink or purple. Source: Google\u00a0Maps. 
Multimodal models are architectures that simultaneously integrate and process different data types, such as text, images, [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[62,221,70,1003,1002,1001],"tags":[1005,146,1004],"class_list":["post-866","post","type-post","status-publish","format-standard","hentry","category-aimldsaimlds","category-computer-vision","category-machine-learning","category-paligemma","category-satellite-imagery","category-visual-language-model","tag-images","tag-language","tag-paligemma"],"_links":{"self":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/866"}],"collection":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/comments?post=866"}],"version-history":[{"count":0,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/866\/revisions"}],"wp:attachment":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/media?parent=866"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/categories?post=866"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/tags?post=866"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}