{"id":1638,"date":"2025-02-04T07:02:59","date_gmt":"2025-02-04T07:02:59","guid":{"rendered":"https:\/\/mailitics.com\/index.php\/2025\/02\/04\/show-and-tell-e1a1142456e2\/"},"modified":"2025-02-04T07:02:59","modified_gmt":"2025-02-04T07:02:59","slug":"show-and-tell-e1a1142456e2","status":"publish","type":"post","link":"https:\/\/mailitics.com\/index.php\/2025\/02\/04\/show-and-tell-e1a1142456e2\/","title":{"rendered":"Show and Tell"},"content":{"rendered":"<div>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" data-dominant-color=\"565258\" data-has-transparency=\"false\" style=\"--dominant-color: #565258;\" fetchpriority=\"high\" decoding=\"async\" width=\"2560\" height=\"1707\" class=\"wp-image-597161 not-transparent\" src=\"https:\/\/i0.wp.com\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/0PwKPzH0bOaxiLRQH-scaled.jpg?resize=2560%2C1707&#038;ssl=1\" alt=\"Photo by St\u00e5le Grut on Unsplash\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/0PwKPzH0bOaxiLRQH-scaled.jpg 2560w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/0PwKPzH0bOaxiLRQH-300x200.jpg 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/0PwKPzH0bOaxiLRQH-1024x683.jpg 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/0PwKPzH0bOaxiLRQH-768x512.jpg 768w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/0PwKPzH0bOaxiLRQH-1536x1024.jpg 1536w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/0PwKPzH0bOaxiLRQH-2048x1365.jpg 2048w\" sizes=\"(max-width: 2560px) 100vw, 2560px\"><figcaption class=\"wp-element-caption\">Photo by <a href=\"https:\/\/unsplash.com\/@stalebg?utm_source=medium&amp;utm_medium=referral\">St\u00e5le Grut<\/a> on <a 
href=\"https:\/\/unsplash.com\/?utm_source=medium&amp;utm_medium=referral\">Unsplash<\/a><\/figcaption><\/figure>\n<h2 class=\"wp-block-heading\">Introduction<\/h2>\n<p class=\"wp-block-paragraph\">Natural Language Processing and <a href=\"https:\/\/towardsdatascience.com\/tag\/computer-vision\/\" title=\"Computer Vision\">Computer Vision<\/a> used to be two completely different fields. Well, at least back when I started learning machine learning and deep learning, it felt like there were multiple paths to follow, and each of them, including NLP and Computer Vision, led to a completely different world. Over time, AI has become more and more advanced, and intersections between fields of study have become more common, including between the two I just mentioned.<\/p>\n<p class=\"wp-block-paragraph\">Today, many language models are capable of generating images from a given prompt. That\u2019s one example of the bridge between NLP and Computer Vision. But I guess I\u2019ll save it for an upcoming article as it is a bit more complex. Instead, in this article I am going to discuss a simpler one: image captioning. As the name suggests, this is essentially a technique where a model accepts an image and returns a text that describes it.<\/p>\n<p class=\"wp-block-paragraph\">One of the earliest papers on this topic is &#8220;<em>Show and Tell: A Neural Image Caption Generator<\/em>&#8221;, written by Vinyals et al. back in 2015 [1]. In this article, I will focus on implementing the <a href=\"https:\/\/towardsdatascience.com\/tag\/deep-learning\/\" title=\"Deep Learning\">Deep Learning<\/a> model proposed in the paper using PyTorch. Note that I won\u2019t actually demonstrate the training process here as that\u2019s a topic of its own. 
Let me know in the comments if you want a separate tutorial on that.<\/p>\n<hr class=\"wp-block-separator has-alpha-channel-opacity\">\n<h2 class=\"wp-block-heading\">Image Captioning Framework<\/h2>\n<p class=\"wp-block-paragraph\">Generally speaking, image captioning can be done by combining two types of models: the one specialized to process images and another one capable of processing sequences. I believe you already know what kind of models work best for the two tasks \u2013 yes, you\u2019re right, those are CNN and RNN, respectively. The idea here is that the CNN is utilized to encode the input image (hence this part is called <em>encoder<\/em>), whereas the RNN is used for generating a sequence of words based on the features encoded by the CNN (hence the RNN part is called <em>decoder<\/em>).<\/p>\n<p class=\"wp-block-paragraph\">It is discussed in the paper that the authors attempted to do so using GoogLeNet (a.k.a., Inception V1) for the encoder and LSTM for the decoder. In fact, the use of GoogLeNet is not explicitly mentioned, yet based on the illustration provided in the paper it seems like the architecture used in the encoder is adopted from the original GoogLeNet paper [2]. The figure below shows what the proposed architecture looks like.<\/p>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" loading=\"lazy\" data-dominant-color=\"ebebec\" data-has-transparency=\"true\" style=\"--dominant-color: #ebebec;\" decoding=\"async\" width=\"867\" height=\"691\" class=\"wp-image-597162 has-transparency\" src=\"https:\/\/i0.wp.com\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1kKTqOvW7PgvE7vVZKDjAJg.png?resize=867%2C691&#038;ssl=1\" alt=\"Figure 1. 
The image captioning model proposed in [1], where the encoder part (the leftmost block) implements the GoogLeNet model [2].\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1kKTqOvW7PgvE7vVZKDjAJg.png 867w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1kKTqOvW7PgvE7vVZKDjAJg-300x239.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1kKTqOvW7PgvE7vVZKDjAJg-768x612.png 768w\" sizes=\"(max-width: 867px) 100vw, 867px\"><figcaption class=\"wp-element-caption\">Figure 1. The image captioning model proposed in [1], where the encoder part (the leftmost block) implements the GoogLeNet model [2].<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">Talking more specifically about the connection between the encoder and the decoder, there are several methods available for connecting the two, namely <em>init-inject<\/em>, <em>pre-inject<\/em>, <em>par-inject<\/em> and <em>merge<\/em>, as mentioned in [3]. In the case of the <em>Show and Tell<\/em> paper, authors used <em>pre-inject<\/em>, a method where the features extracted by the encoder are perceived as the 0th word in the caption. Later in the inference phase, we expect the decoder to generate a caption based solely on these image features.<\/p>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" loading=\"lazy\" data-dominant-color=\"f3f3f3\" data-has-transparency=\"false\" style=\"--dominant-color: #f3f3f3;\" decoding=\"async\" width=\"1255\" height=\"692\" class=\"wp-image-597163 not-transparent\" src=\"https:\/\/i0.wp.com\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1lIqALUziG9p9abVaosyyVA.png?resize=1255%2C692&#038;ssl=1\" alt=\"Figure 2. The four methods possible to be used to connect the encoder and the decoder part of an image captioning model [3]. 
In our case we are going to use the pre-inject method (b).\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1lIqALUziG9p9abVaosyyVA.png 1255w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1lIqALUziG9p9abVaosyyVA-300x165.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1lIqALUziG9p9abVaosyyVA-1024x565.png 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1lIqALUziG9p9abVaosyyVA-768x423.png 768w\" sizes=\"(max-width: 1255px) 100vw, 1255px\"><figcaption class=\"wp-element-caption\">Figure 2. The four methods possible to be used to connect the encoder and the decoder part of an image captioning model [3]. In our case we are going to use the <em>pre-inject<\/em> method (b).<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">As we already understood the theory behind the image captioning model, we can now jump into the code!<\/p>\n<hr class=\"wp-block-separator has-alpha-channel-opacity\">\n<h1 class=\"wp-block-heading\">Implementation<\/h1>\n<p class=\"wp-block-paragraph\">I\u2019ll break the implementation part into three sections: the Encoder, the Decoder, and the combination of the two. Before we actually get into them, we need to import the modules and initialize the required parameters in advance. 
Look at the Codeblock 1 below to see the modules I use.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\"># Codeblock 1\nimport torch  #(1)\nimport torch.nn as nn  #(2)\nimport torchvision.models as models  #(3)\nfrom torchvision.models import GoogLeNet_Weights  #(4)<\/code><\/pre>\n<p class=\"wp-block-paragraph\">Let\u2019s break down these imports quickly: the line marked with <code>#(1)<\/code> is used for basic operations, line <code>#(2)<\/code> is for initializing neural network layers, line <code>#(3)<\/code> is for loading various deep learning models, and <code>#(4)<\/code> is the pretrained weights for the GoogLeNet model.<\/p>\n<p class=\"wp-block-paragraph\">Talking about the parameter configuration, <code>EMBED_DIM<\/code> and <code>LSTM_HIDDEN_DIM<\/code> are the only two parameters mentioned in the paper, which are both set to 512 as shown at line <code>#(1)<\/code> and <code>#(2)<\/code> in the Codeblock 2 below. The <code>EMBED_DIM<\/code> variable essentially indicates the feature vector size representing a single token in the caption. In this case, we can simply think of a single token as an individual word. Meanwhile, <code>LSTM_HIDDEN_DIM<\/code> is a variable representing the hidden state size inside the LSTM cell. This paper does not mention how many times this RNN-based layer is repeated, but based on the diagram in Figure 1, it seems like it only implements a single LSTM cell. 
Thus, at line <code>#(3)<\/code> I set the <code>NUM_LSTM_LAYERS<\/code> variable to 1.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\"># Codeblock 2\nEMBED_DIM       = 512    #(1)\nLSTM_HIDDEN_DIM = 512    #(2)\nNUM_LSTM_LAYERS = 1      #(3)\n\nIMAGE_SIZE      = 224    #(4)\nIN_CHANNELS     = 3      #(5)\n\nSEQ_LENGTH      = 30     #(6)\nVOCAB_SIZE      = 10000  #(7)\n\nBATCH_SIZE      = 1<\/code><\/pre>\n<p class=\"wp-block-paragraph\">The next two parameters are related to the input image, namely <code>IMAGE_SIZE<\/code> (<code>#(4)<\/code>) and <code>IN_CHANNELS<\/code> (<code>#(5)<\/code>). Since we are about to use GoogLeNet for the encoder, we need to match its original input shape (3\u00d7224\u00d7224). Besides the image, we also need to configure the parameters for the caption. Here we assume that the caption length is no more than 30 words (<code>#(6)<\/code>) and the number of unique words in the dictionary is 10000 (<code>#(7)<\/code>). Lastly, the <code>BATCH_SIZE<\/code> parameter is needed because PyTorch layers expect their inputs to carry a batch dimension. Just to make things simple, the number of image-caption pairs within a single batch is set to 1.<\/p>\n<h3 class=\"wp-block-heading\">GoogLeNet Encoder<\/h3>\n<p class=\"wp-block-paragraph\">It is actually possible to use any kind of CNN-based model for the encoder. I found on the internet that [4] uses DenseNet, [5] uses Inception V3, and [6] utilizes ResNet for similar tasks. However, since my goal is to reproduce the model proposed in the paper as closely as possible, I am using the pretrained GoogLeNet model instead. 
Before we get into the encoder implementation, let\u2019s see what the GoogLeNet architecture looks like using the following code.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\"># Codeblock 3\nmodels.googlenet()<\/code><\/pre>\n<p class=\"wp-block-paragraph\">The resulting output is very long as it lists literally all layers inside the architecture. Here I truncate the output since I only want you to focus on the last layer (the <code>fc<\/code> layer marked with <code>#(1)<\/code> in the Codeblock 3 Output below). You can see that this linear layer maps a feature vector of size 1024 into 1000. Normally, in a standard image classification task, each of these 1000 neurons corresponds to a specific class. So, for example, if you want to perform a 5-class classification task, you would need to modify this layer such that it projects the outputs to 5 neurons only. In our case, we need to make this layer produce a feature vector of length 512 (<code>EMBED_DIM<\/code>). With this, the input image will later be represented as a 512-dimensional vector after being processed by the GoogLeNet model. 
This feature vector size will exactly match with the token embedding dimension, allowing it to be treated as a part of our word sequence.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-csharp\"># Codeblock 3 Output\nGoogLeNet(\n  (conv1): BasicConv2d(\n    (conv): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)\n    (bn): BatchNorm2d(64, eps=0.001, momentum=0.1, affine=True, track_running_stats=True)\n  )\n  (maxpool1): MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=True)\n  (conv2): BasicConv2d(\n    (conv): Conv2d(64, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)\n    (bn): BatchNorm2d(64, eps=0.001, momentum=0.1, affine=True, track_running_stats=True)\n  )\n\n  .\n  .\n  .\n  .\n\n  (avgpool): AdaptiveAvgPool2d(output_size=(1, 1))\n  (dropout): Dropout(p=0.2, inplace=False)\n  (fc): Linear(in_features=1024, out_features=1000, bias=True)  #(1)\n)<\/code><\/pre>\n<p class=\"wp-block-paragraph\">Now let\u2019s actually load and modify the GoogLeNet model, which I do in the <code>InceptionEncoder<\/code> class below.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\"># Codeblock 4a\nclass InceptionEncoder(nn.Module):\n    def __init__(self, fine_tune):  #(1)\n        super().__init__()\n        self.googlenet = models.googlenet(weights=GoogLeNet_Weights.IMAGENET1K_V1)  #(2)\n        self.googlenet.fc = nn.Linear(in_features=self.googlenet.fc.in_features,  #(3)\n                                      out_features=EMBED_DIM)  #(4)\n\n        if fine_tune == True:       #(5)\n            for param in self.googlenet.parameters():\n                param.requires_grad = True\n        else:\n            for param in self.googlenet.parameters():\n                param.requires_grad = False\n\n        for param in self.googlenet.fc.parameters():\n            param.requires_grad = True<\/code><\/pre>\n<p class=\"wp-block-paragraph\">The first thing we do in the above code 
is to load the model using <code>models.googlenet()<\/code>. It is mentioned in the paper that the model is already pretrained on the ImageNet dataset. Thus, we need to pass <code>GoogLeNet_Weights.IMAGENET1K_V1<\/code> into the <code>weights<\/code> parameter, as shown at line <code>#(2)<\/code> in Codeblock 4a. Next, at line <code>#(3)<\/code> we access the classification head through the <code>fc<\/code> attribute and replace the existing linear layer with a new one whose output dimension is 512 (<code>EMBED_DIM<\/code>) (<code>#(4)<\/code>). Since this GoogLeNet model is already trained, we don\u2019t need to train it from scratch. Instead, we can perform either <em>fine-tuning<\/em> or <em>transfer learning<\/em> in order to adapt it to the image captioning task.<\/p>\n<p class=\"wp-block-paragraph\">In case you\u2019re not yet familiar with the two terms as I use them here, <em>fine-tuning<\/em> is a method where we update the weights of the entire model. On the other hand, <em>transfer learning<\/em> is a technique where we only update the weights of the layers we replaced (in this case the last fully-connected layer), while freezing the weights of the existing layers. To support both, I implement a flag named <code>fine_tune<\/code> at line <code>#(1)<\/code>, which makes the model perform fine-tuning whenever it is set to <code>True<\/code> (<code>#(5)<\/code>).<\/p>\n<p class=\"wp-block-paragraph\">The <code>forward()<\/code> method is pretty straightforward since all we do here is pass the input image through the modified GoogLeNet model. See Codeblock 4b below for the details. 
Additionally, here I also print out the tensor dimension before and after processing so that you can better understand how the <code>InceptionEncoder<\/code> model works.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\"># Codeblock 4b\n    def forward(self, images):\n        print(f'original\\t: {images.size()}')\n        features = self.googlenet(images)\n        print(f'after googlenet\\t: {features.size()}')\n\n        return features<\/code><\/pre>\n<p class=\"wp-block-paragraph\">To test whether our encoder works properly, we can pass a dummy tensor of size 1\u00d73\u00d7224\u00d7224 through the network as demonstrated in Codeblock 5. This tensor dimension simulates a single RGB image of size 224\u00d7224. You can see in the resulting output that our image now becomes a one-dimensional feature vector of length 512.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\"># Codeblock 5\ninception_encoder = InceptionEncoder(fine_tune=True)\n\nimages = torch.randn(BATCH_SIZE, IN_CHANNELS, IMAGE_SIZE, IMAGE_SIZE)\nfeatures = inception_encoder(images)<\/code><\/pre>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-csharp\"># Codeblock 5 Output\noriginal         : torch.Size([1, 3, 224, 224])\nafter googlenet  : torch.Size([1, 512])<\/code><\/pre>\n<h3 class=\"wp-block-heading\">LSTM Decoder<\/h3>\n<p class=\"wp-block-paragraph\">Now that we have successfully implemented the encoder, we are going to create the LSTM decoder, which I demonstrate in Codeblock 6a and 6b. What we need to do first is to initialize the required layers, namely an embedding layer (<code>#(1)<\/code>), the LSTM layer itself (<code>#(2)<\/code>), and a standard linear layer (<code>#(3)<\/code>). The first one (<code>nn.Embedding<\/code>) is responsible for mapping every single token into a 512 (<code>EMBED_DIM<\/code>)-dimensional vector. 
Meanwhile, the LSTM layer produces one hidden-state vector per timestep, and each of these vectors is mapped into a 10000 (<code>VOCAB_SIZE<\/code>)-dimensional vector by the linear layer. Later on, the values contained in this vector will represent how likely each word in the dictionary is to be chosen.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\"># Codeblock 6a\nclass LSTMDecoder(nn.Module):\n    def __init__(self):\n        super().__init__()\n\n        #(1)\n        self.embedding = nn.Embedding(num_embeddings=VOCAB_SIZE,\n                                      embedding_dim=EMBED_DIM)\n        #(2)\n        self.lstm = nn.LSTM(input_size=EMBED_DIM, \n                            hidden_size=LSTM_HIDDEN_DIM, \n                            num_layers=NUM_LSTM_LAYERS, \n                            batch_first=True)\n        #(3)        \n        self.linear = nn.Linear(in_features=LSTM_HIDDEN_DIM, \n                                out_features=VOCAB_SIZE)<\/code><\/pre>\n<p class=\"wp-block-paragraph\">Next, let\u2019s define the flow of the network using the following code.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\"># Codeblock 6b\n    def forward(self, features, captions):                 #(1)\n        print(f'features original\\t: {features.size()}')\n        features = features.unsqueeze(1)                   #(2)\n        print(f\"after unsqueeze\\t\\t: {features.shape}\")\n\n        print(f'captions original\\t: {captions.size()}')\n        captions = self.embedding(captions)                #(3)\n        print(f\"after embedding\\t\\t: {captions.shape}\")\n\n        captions = torch.cat([features, captions], dim=1)  #(4)\n        print(f\"after concat\\t\\t: {captions.shape}\")\n\n        captions, _ = self.lstm(captions)                  #(5)\n        print(f\"after lstm\\t\\t: {captions.shape}\")\n\n        captions = self.linear(captions)                   #(6)\n        print(f\"after linear\\t\\t: {captions.shape}\")\n\n        return captions<\/code><\/pre>\n<p class=\"wp-block-paragraph\">You can see in the above code that the <code>forward()<\/code> method of the <code>LSTMDecoder<\/code> class accepts two inputs: <code>features<\/code> and <code>captions<\/code>, where the former is the image that has been processed by the <code>InceptionEncoder<\/code>, while the latter is the caption of the corresponding image serving as the ground truth (<code>#(1)<\/code>). The idea here is that we perform the <em>pre-inject<\/em> operation by prepending the <code>features<\/code> tensor to <code>captions<\/code> using the code at line <code>#(4)<\/code>. However, keep in mind that we need to adjust the shape of both tensors beforehand. To do so, we have to insert a new dimension at axis 1 of the image features (<code>#(2)<\/code>). Meanwhile, the shape of the <code>captions<\/code> tensor will align with our requirement right after being processed by the embedding layer (<code>#(3)<\/code>). Once <code>features<\/code> and <code>captions<\/code> have been concatenated, we pass the resulting tensor through the LSTM layer (<code>#(5)<\/code>) before it is eventually processed by the linear layer (<code>#(6)<\/code>). Look at the testing code below to better understand the flow of the two tensors.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\"># Codeblock 7\nlstm_decoder = LSTMDecoder()\n\nfeatures = torch.randn(BATCH_SIZE, EMBED_DIM)  #(1)\ncaptions = torch.randint(0, VOCAB_SIZE, (BATCH_SIZE, SEQ_LENGTH))  #(2)\n\ncaptions = lstm_decoder(features, captions)<\/code><\/pre>\n<p class=\"wp-block-paragraph\">In Codeblock 7, I assume that <code>features<\/code> is a dummy tensor that represents the output of the <code>InceptionEncoder<\/code> model (<code>#(1)<\/code>). 
Meanwhile, <code>captions<\/code> is the tensor representing a sequence of tokenized words, which in this case I initialize with random integers from 0 to 9999 (the upper bound of 10000, <code>VOCAB_SIZE<\/code>, is exclusive) and a length of 30 (<code>SEQ_LENGTH<\/code>) (<code>#(2)<\/code>).<\/p>\n<p class=\"wp-block-paragraph\">We can see in the output below that the <code>features<\/code> tensor initially has the dimension of 1\u00d7512 (<code>#(1)<\/code>). This tensor shape changes to 1\u00d71\u00d7512 after being processed with the <code>unsqueeze()<\/code> operation (<code>#(2)<\/code>). The additional dimension in the middle (1) allows the tensor to be treated as a feature vector corresponding to a single timestep, which is necessary for compatibility with the LSTM layer. As for the <code>captions<\/code> tensor, its shape changes from 1\u00d730 (<code>#(3)<\/code>) to 1\u00d730\u00d7512 (<code>#(4)<\/code>), indicating that every single word is now represented as a 512-dimensional vector.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-csharp\"># Codeblock 7 Output\nfeatures original : torch.Size([1, 512])       #(1)\nafter unsqueeze   : torch.Size([1, 1, 512])    #(2)\ncaptions original : torch.Size([1, 30])        #(3)\nafter embedding   : torch.Size([1, 30, 512])   #(4)\nafter concat      : torch.Size([1, 31, 512])   #(5)\nafter lstm        : torch.Size([1, 31, 512])   #(6)\nafter linear      : torch.Size([1, 31, 10000]) #(7)<\/code><\/pre>\n<p class=\"wp-block-paragraph\">After the <em>pre-inject<\/em> operation is performed, our tensor has the dimension of 1\u00d731\u00d7512, where the <code>features<\/code> tensor becomes the token at the 0th timestep in the sequence (<code>#(5)<\/code>). 
The following figure illustrates this idea.<\/p>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" data-dominant-color=\"e6d6bf\" data-has-transparency=\"true\" style=\"--dominant-color: #e6d6bf;\" loading=\"lazy\" decoding=\"async\" width=\"1830\" height=\"1113\" class=\"wp-image-597164 has-transparency\" src=\"https:\/\/i0.wp.com\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1RhGAyYwE16KBpEtLGCz5_A.png?resize=1830%2C1113&#038;ssl=1\" alt=\"Figure 3. What the resulting tensor looks like after the pre-injection operation. [3].\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1RhGAyYwE16KBpEtLGCz5_A.png 1830w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1RhGAyYwE16KBpEtLGCz5_A-300x182.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1RhGAyYwE16KBpEtLGCz5_A-1024x623.png 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1RhGAyYwE16KBpEtLGCz5_A-768x467.png 768w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1RhGAyYwE16KBpEtLGCz5_A-1536x934.png 1536w\" sizes=\"auto, (max-width: 1830px) 100vw, 1830px\"><figcaption class=\"wp-element-caption\">Figure 3. What the resulting tensor looks like after the pre-injection operation. [3].<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">Next, we pass the tensor through the LSTM layer. In this particular case the output tensor dimension remains the same. However, it is important to note that the tensor shapes at line <code>#(5)<\/code> and <code>#(6)<\/code> in the above output are actually determined by different parameters. The dimensions match here only because <code>EMBED_DIM<\/code> and <code>LSTM_HIDDEN_DIM<\/code> are both set to 512. If we used a different value for <code>LSTM_HIDDEN_DIM<\/code>, the size of the last axis of the LSTM output would change accordingly. 
Finally, we project each of the 31 LSTM outputs to a vector of size 10000, which holds one score per token in the vocabulary and will later be used to decide which token is predicted (<code>#(7)<\/code>).<\/p>\n<h3 class=\"wp-block-heading\">GoogLeNet Encoder + LSTM Decoder<\/h3>\n<p class=\"wp-block-paragraph\">At this point, we have successfully created both the encoder and the decoder parts of the image captioning model. What I am going to do next is to combine them in the <code>ShowAndTell<\/code> class below.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\"># Codeblock 8a\nclass ShowAndTell(nn.Module):\n    def __init__(self):\n        super().__init__()\n        self.encoder = InceptionEncoder(fine_tune=True)  #(1)\n        self.decoder = LSTMDecoder()     #(2)\n\n    def forward(self, images, captions):\n        features = self.encoder(images)  #(3)\n        print(f\"after encoder\\t: {features.shape}\")\n\n        captions = self.decoder(features, captions)      #(4)\n        print(f\"after decoder\\t: {captions.shape}\")\n\n        return captions<\/code><\/pre>\n<p class=\"wp-block-paragraph\">I think the above code is pretty straightforward. In the <code>__init__()<\/code> method, we only need to initialize the <code>InceptionEncoder<\/code> as well as the <code>LSTMDecoder<\/code> models (<code>#(1)<\/code> and <code>#(2)<\/code>). Here I assume that we are about to perform fine-tuning rather than transfer learning, so I set the <code>fine_tune<\/code> parameter to <code>True<\/code>. Theoretically speaking, fine-tuning is better than transfer learning if you have a relatively large dataset since it works by re-adjusting the weights of the entire model. However, if your dataset is rather small, you should go with transfer learning instead \u2013 but that\u2019s just the theory. 
It\u2019s definitely a good idea to experiment with both options to see which works best in your case.<\/p>\n<p class=\"wp-block-paragraph\">Still with the above codeblock, we configure the <code>forward()<\/code> method to accept image-caption pairs as input. With this configuration, the method can only be used for training. Here we first process the raw image with the GoogLeNet inside the encoder block (<code>#(3)<\/code>). Afterwards, we pass the extracted features as well as the tokenized captions into the decoder block and let it produce another token sequence (<code>#(4)<\/code>). In actual training, this output is then compared with the ground truth to compute the error. This error value is used to compute gradients through backpropagation, which determine how the weights in the network are updated.<\/p>\n<p class=\"wp-block-paragraph\">It is important to know that we cannot use the <code>forward()<\/code> method to perform inference, since it requires the ground-truth caption as input, so we need a separate one for that. 
In this case, I am going to implement the inference code in the <code>generate()<\/code> method below.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\"># Codeblock 8b\n    def generate(self, images):  #(1)\n        features = self.encoder(images)              #(2)\n        print(f\"after encoder\\t\\t: {features.shape}\\n\")\n\n        words = []  #(3)\n        states = None  # LSTM (hidden, cell) states; None means start from zeros\n        for i in range(SEQ_LENGTH):                  #(4)\n            print(f\"iteration #{i}\")\n            features = features.unsqueeze(1)\n            print(f\"after unsqueeze\\t\\t: {features.shape}\")\n\n            # Carry the states over so each step remembers the image and previous words\n            features, states = self.decoder.lstm(features, states)\n            print(f\"after lstm\\t\\t: {features.shape}\")\n\n            features = features.squeeze(1)           #(5)\n            print(f\"after squeeze\\t\\t: {features.shape}\")\n\n            probs = self.decoder.linear(features)    #(6)\n            print(f\"after linear\\t\\t: {probs.shape}\")\n\n            _, word = probs.max(dim=1)  #(7)\n            print(f\"after max\\t\\t: {word.shape}\")\n\n            words.append(word.item())  #(8)\n\n            if word == 1:  #(9)\n                break\n\n            features = self.decoder.embedding(word)  #(10)\n            print(f\"after embedding\\t\\t: {features.shape}\\n\")\n\n        return words       #(11)<\/code><\/pre>\n<p class=\"wp-block-paragraph\">Instead of taking two inputs like the previous one, the <code>generate()<\/code> method takes a raw image as the only input (<code>#(1)<\/code>). Since we want the features extracted from the image to be the initial input token, we first need to process the raw input image with the encoder block prior to actually generating the subsequent tokens (<code>#(2)<\/code>). Next, we allocate an empty list for storing the token sequence to be produced later (<code>#(3)<\/code>). Alongside it, we initialize <code>states<\/code> to <code>None<\/code> so that the LSTM starts from a zero state, and we then feed the returned hidden and cell states back in at every iteration; without this, each step would forget the image and the words generated so far. 
The tokens themselves are generated one by one, so we wrap the entire process inside a <code>for<\/code> loop, which runs for at most 30 (<code>SEQ_LENGTH<\/code>) iterations (<code>#(4)<\/code>).<\/p>\n<p class=\"wp-block-paragraph\">The steps done inside the loop are algorithmically similar to the ones we discussed earlier. However, since the LSTM cell here generates a single token at a time, the process requires the tensor to be treated a bit differently from the one passed through the <code>forward()<\/code> method of the <code>LSTMDecoder<\/code> class back in Codeblock 6b. The first difference you might notice is the <code>squeeze()<\/code> operation (<code>#(5)<\/code>), which is basically just a technical step that removes the timestep axis so that the subsequent layer does the linear projection correctly (<code>#(6)<\/code>). Then, we take the index of the highest value in the resulting vector, which corresponds to the token most likely to come next (<code>#(7)<\/code>), and append it to the list we allocated earlier (<code>#(8)<\/code>). The loop breaks whenever the predicted index is the <em>stop token<\/em>, which I assume here sits at index 1 of the <code>probs<\/code> vector (<code>#(9)<\/code>). Otherwise, if the model has not predicted the <em>stop token<\/em>, it converts the last predicted word into its 512 (<code>EMBED_DIM<\/code>)-dimensional vector (<code>#(10)<\/code>), allowing it to be used as the input features for the next iteration. Lastly, the generated word sequence is returned once the loop is completed (<code>#(11)<\/code>).<\/p>\n<p class=\"wp-block-paragraph\">We are going to simulate the forward pass for the training phase using Codeblock 9 below. Here I pass two tensors through the <code>show_and_tell<\/code> model (<code>#(1)<\/code>), representing a raw image of size 3\u00d7224\u00d7224 (<code>#(2)<\/code>) and a sequence of tokenized words (<code>#(3)<\/code>), respectively. 
Based on the resulting output, the model works as expected: both input tensors pass through the <code>InceptionEncoder<\/code> and the <code>LSTMDecoder<\/code> parts of the network.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\"># Codeblock 9\nshow_and_tell = ShowAndTell()  #(1)\n\nimages = torch.randn(BATCH_SIZE, IN_CHANNELS, IMAGE_SIZE, IMAGE_SIZE)  #(2)\ncaptions = torch.randint(0, VOCAB_SIZE, (BATCH_SIZE, SEQ_LENGTH))      #(3)\n\ncaptions = show_and_tell(images, captions)<\/code><\/pre>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-csharp\"># Codeblock 9 Output\nafter encoder : torch.Size([1, 512])\nafter decoder : torch.Size([1, 31, 10000])<\/code><\/pre>\n<p class=\"wp-block-paragraph\">Now, let\u2019s assume that our <code>show_and_tell<\/code> model has already been trained on an image captioning dataset and is thus ready to be used for inference. Codeblock 10 below shows how to do it. Here we set the model to <code>eval()<\/code> mode (<code>#(1)<\/code>), initialize the input image (<code>#(2)<\/code>), and pass it through the model using the <code>generate()<\/code> method (<code>#(3)<\/code>).<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\"># Codeblock 10\nshow_and_tell.eval()  #(1)\n\nimages = torch.randn(BATCH_SIZE, IN_CHANNELS, IMAGE_SIZE, IMAGE_SIZE)  #(2)\n\nwith torch.no_grad():\n    generated_tokens = show_and_tell.generate(images)  #(3)<\/code><\/pre>\n<p class=\"wp-block-paragraph\">The flow of the tensor can be seen in the output below. 
I truncated the output because it simply repeats the same token generation steps up to 30 times.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-csharp\"># Codeblock 10 Output\nafter encoder    : torch.Size([1, 512])\n\niteration #0\nafter unsqueeze  : torch.Size([1, 1, 512])\nafter lstm       : torch.Size([1, 1, 512])\nafter squeeze    : torch.Size([1, 512])\nafter linear     : torch.Size([1, 10000])\nafter max        : torch.Size([1])\nafter embedding  : torch.Size([1, 512])\n\niteration #1\nafter unsqueeze  : torch.Size([1, 1, 512])\nafter lstm       : torch.Size([1, 1, 512])\nafter squeeze    : torch.Size([1, 512])\nafter linear     : torch.Size([1, 10000])\nafter max        : torch.Size([1])\nafter embedding  : torch.Size([1, 512])\n\n.\n.\n.\n.<\/code><\/pre>\n<p class=\"wp-block-paragraph\">To see what the resulting caption looks like, we can just print out the <code>generated_tokens<\/code> list as shown below. Keep in mind that this sequence is still in the form of token indices. In a post-processing stage, these indices need to be mapped back to the words they represent.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\"># Codeblock 11\ngenerated_tokens<\/code><\/pre>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-csharp\"># Codeblock 11 Output\n[5627,\n 3906,\n 2370,\n 2299,\n 4952,\n 9933,\n 402,\n 7775,\n 602,\n 4414,\n 8667,\n 6774,\n 9345,\n 8750,\n 3680,\n 4458,\n 1677,\n 5998,\n 8572,\n 9556,\n 7347,\n 6780,\n 9672,\n 2596,\n 9218,\n 1880,\n 4396,\n 6168,\n 7999,\n 454]<\/code><\/pre>\n<hr class=\"wp-block-separator has-alpha-channel-opacity\">\n<h2 class=\"wp-block-heading\">Ending<\/h2>\n<p class=\"wp-block-paragraph\">With the above output, we\u2019ve reached the end of our discussion on image captioning. Since then, many other researchers have proposed improvements for this task. 
So, in an upcoming article I plan to discuss a state-of-the-art method on this topic.<\/p>\n<p class=\"wp-block-paragraph\">Thanks for reading. I hope you learned something new today!<\/p>\n<p class=\"wp-block-paragraph\"><em>By the way, you can also find the code used in this article <a href=\"https:\/\/github.com\/MuhammadArdiPutra\/medium_articles\/blob\/main\/Show%20and%20Tell.ipynb\">here<\/a>.<\/em><\/p>\n<hr class=\"wp-block-separator has-alpha-channel-opacity\">\n<h2 class=\"wp-block-heading\">References<\/h2>\n<p class=\"wp-block-paragraph\">[1] Oriol Vinyals et al. Show and Tell: A Neural Image Caption Generator. arXiv. <a href=\"https:\/\/arxiv.org\/pdf\/1411.4555\">https:\/\/arxiv.org\/pdf\/1411.4555<\/a> [Accessed November 13, 2024].<\/p>\n<p class=\"wp-block-paragraph\">[2] Christian Szegedy et al. Going Deeper with Convolutions. arXiv. <a href=\"https:\/\/arxiv.org\/pdf\/1409.4842\">https:\/\/arxiv.org\/pdf\/1409.4842<\/a> [Accessed November 13, 2024].<\/p>\n<p class=\"wp-block-paragraph\">[3] Marc Tanti et al. Where to put the Image in an Image Caption Generator. arXiv. <a href=\"https:\/\/arxiv.org\/pdf\/1703.09137\">https:\/\/arxiv.org\/pdf\/1703.09137<\/a> [Accessed November 13, 2024].<\/p>\n<p class=\"wp-block-paragraph\">[4] Stepan Ulyanin. Captioning Images with CNN and RNN, using PyTorch. Medium. <a href=\"https:\/\/medium.com\/@stepanulyanin\/captioning-images-with-pytorch-bc592e5fd1a3\">https:\/\/medium.com\/@stepanulyanin\/captioning-images-with-pytorch-bc592e5fd1a3<\/a> [Accessed November 16, 2024].<\/p>\n<p class=\"wp-block-paragraph\">[5] Saketh Kotamraju. How to Build an Image-Captioning Model in Pytorch. Towards Data Science. 
<a href=\"https:\/\/towardsdatascience.com\/how-to-build-an-image-captioning-model-in-pytorch-29b9d8fe2f8c\">https:\/\/towardsdatascience.com\/how-to-build-an-image-captioning-model-in-pytorch-29b9d8fe2f8c<\/a> [Accessed November 16, 2024].<\/p>\n<p class=\"wp-block-paragraph\">[6] Code with Aarohi. Image Captioning using CNN and RNN | Image Captioning using Deep Learning. YouTube. <a href=\"https:\/\/www.youtube.com\/watch?v=htNmFL2BG34\">https:\/\/www.youtube.com\/watch?v=htNmFL2BG34<\/a> [Accessed November 16, 2024].<\/p>\n<p>The post <a href=\"https:\/\/towardsdatascience.com\/show-and-tell-e1a1142456e2\/\">Show and Tell<\/a> appeared first on <a href=\"https:\/\/towardsdatascience.com\/\">Towards Data Science<\/a>.<\/p>\n<\/div>\n<p> \t<BR><br \/>\n <BR><\/BR><br \/>\n    Muhammad Ardi<br \/>\n \t<BR><br \/>\n<BR><\/BR><br \/>\n<a href=\"https:\/\/towardsdatascience.com\/show-and-tell-e1a1142456e2\/\">Go to original source<\/a><br \/>\n \t<BR><br \/>\n <BR><\/BR><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Show and Tell Photo by St\u00e5le Grut on Unsplash Introduction Natural Language Processing and Computer Vision used to be two completely different fields. 
Well, at least back when I started to learn machine learning and deep learning, I feel like there are multiple paths to follow, and each of them, including NLP and Computer Vision, [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[62,69,221,88,166,260],"tags":[845,548,1621],"class_list":["post-1638","post","type-post","status-publish","format-standard","hentry","category-aimldsaimlds","category-artificial-intelligence","category-computer-vision","category-deep-learning","category-hands-on-tutorials","category-nlp","tag-image","tag-more","tag-two"],"_links":{"self":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/1638"}],"collection":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/comments?post=1638"}],"version-history":[{"count":0,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/1638\/revisions"}],"wp:attachment":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/media?parent=1638"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/categories?post=1638"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/tags?post=1638"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}