{"id":2287,"date":"2025-03-08T07:02:20","date_gmt":"2025-03-08T07:02:20","guid":{"rendered":"https:\/\/mailitics.com\/index.php\/2025\/03\/08\/image-captioning-transformer-mode-on\/"},"modified":"2025-03-08T07:02:20","modified_gmt":"2025-03-08T07:02:20","slug":"image-captioning-transformer-mode-on","status":"publish","type":"post","link":"https:\/\/mailitics.com\/index.php\/2025\/03\/08\/image-captioning-transformer-mode-on\/","title":{"rendered":"Image Captioning, Transformer Mode On"},"content":{"rendered":"<div>\n<h2 class=\"wp-block-heading\">Introduction<\/h2>\n<p class=\"wp-block-paragraph\">In my previous article, I discussed one of the earliest <a href=\"https:\/\/towardsdatascience.com\/tag\/deep-learning\/\" title=\"Deep Learning\">Deep Learning<\/a> approaches for image captioning. If you\u2019re interested in reading it, you can find the link to that article at the end of this one.<\/p>\n<p class=\"wp-block-paragraph\">Today, I would like to talk about <a href=\"https:\/\/towardsdatascience.com\/tag\/image-captioning\/\" title=\"Image Captioning\">Image Captioning<\/a> again, but this time with a more advanced neural network architecture. The deep learning model I am going to talk about is the one proposed in the paper titled \u201c<em>CPTR: Full Transformer Network for Image Captioning<\/em>,\u201d written by Liu <em>et al.<\/em> back in 2021 [1]. Specifically, here I will reproduce the model proposed in the paper and explain the underlying theory behind the architecture. 
However, keep in mind that I won\u2019t actually demonstrate the training process since I only want to focus on the model architecture.<\/p>\n<h3 class=\"wp-block-heading\">The idea behind CPTR<\/h3>\n<p class=\"wp-block-paragraph\">In fact, the main idea of the CPTR architecture is exactly the same as the earlier image captioning model, as both use the encoder-decoder structure. Previously, in the paper titled \u201c<em>Show and Tell: A Neural Image Caption Generator<\/em>\u201d [2], the models used are GoogLeNet (a.k.a. Inception V1) and LSTM for the two components, respectively. The illustration of the model proposed in the <em>Show and Tell<\/em> paper is shown in the following figure.<\/p>\n<figure class=\"wp-block-image size-full\"><img data-recalc-dims=\"1\" data-dominant-color=\"eaebec\" data-has-transparency=\"true\" style=\"--dominant-color: #eaebec;\" loading=\"lazy\" decoding=\"async\" width=\"720\" height=\"574\" src=\"https:\/\/i0.wp.com\/towardsdatascience.com\/wp-content\/uploads\/2025\/03\/CPTR-1.png?resize=720%2C574&#038;ssl=1\" alt=\"\" class=\"wp-image-599183 has-transparency\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/03\/CPTR-1.png 720w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/03\/CPTR-1-300x239.png 300w\" sizes=\"auto, (max-width: 720px) 100vw, 720px\"><figcaption class=\"wp-element-caption\">Figure 1. The neural network architecture for image captioning proposed in the Show and Tell paper [2].<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">Despite having the same encoder-decoder structure, what makes CPTR different from the previous approach is the basis of the encoder and the decoder themselves. In CPTR, we combine the encoder part of the ViT (Vision <a href=\"https:\/\/towardsdatascience.com\/tag\/transformer\/\" title=\"Transformer\">Transformer<\/a>) model with the decoder part of the original Transformer model. 
The use of a transformer-based architecture for both components is essentially where the name CPTR comes from: CaPtion TransformeR.<\/p>\n<p class=\"wp-block-paragraph\">Note that the discussions in this article are going to be closely related to ViT and Transformer, so I highly recommend you read my previous articles about these two topics if you\u2019re not yet familiar with them. You can find the links at the end of this article.<\/p>\n<p class=\"wp-block-paragraph\">Figure 2 shows what the original ViT architecture looks like. Everything inside the green box is the encoder part of the architecture to be adopted as the CPTR encoder.<\/p>\n<figure class=\"wp-block-image size-full\"><img data-recalc-dims=\"1\" data-dominant-color=\"d7ddd5\" data-has-transparency=\"true\" style=\"--dominant-color: #d7ddd5;\" loading=\"lazy\" decoding=\"async\" width=\"720\" height=\"504\" src=\"https:\/\/i0.wp.com\/towardsdatascience.com\/wp-content\/uploads\/2025\/03\/CPTR-2.png?resize=720%2C504&#038;ssl=1\" alt=\"\" class=\"wp-image-599184 has-transparency\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/03\/CPTR-2.png 720w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/03\/CPTR-2-300x210.png 300w\" sizes=\"auto, (max-width: 720px) 100vw, 720px\"><figcaption class=\"wp-element-caption\">Figure 2. The Vision Transformer (ViT) architecture [3].<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">Next, Figure 3 displays the original Transformer architecture. 
The components enclosed in the blue box are the layers that we are going to implement in the CPTR decoder.<\/p>\n<figure class=\"wp-block-image size-full\"><img data-recalc-dims=\"1\" data-dominant-color=\"dcdfe0\" data-has-transparency=\"true\" style=\"--dominant-color: #dcdfe0;\" loading=\"lazy\" decoding=\"async\" width=\"554\" height=\"810\" src=\"https:\/\/i0.wp.com\/towardsdatascience.com\/wp-content\/uploads\/2025\/03\/CPTR-3.png?resize=554%2C810&#038;ssl=1\" alt=\"\" class=\"wp-image-599185 has-transparency\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/03\/CPTR-3.png 554w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/03\/CPTR-3-205x300.png 205w\" sizes=\"auto, (max-width: 554px) 100vw, 554px\"><figcaption class=\"wp-element-caption\">Figure 3. The original Transformer architecture [4].<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">If we combine the components inside the green and blue boxes above, we are going to obtain the architecture shown in Figure 4 below. This is exactly what the CPTR model we are going to implement looks like. 
The idea here is that the ViT Encoder (green) works by encoding the input image into a specific tensor representation which will then be used as the basis of the Transformer Decoder (blue) to generate the corresponding caption.<\/p>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" data-dominant-color=\"e1e4df\" data-has-transparency=\"true\" style=\"--dominant-color: #e1e4df;\" loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"608\" src=\"https:\/\/i0.wp.com\/towardsdatascience.com\/wp-content\/uploads\/2025\/03\/CPTR-4-1024x608.png?resize=1024%2C608&#038;ssl=1\" alt=\"\" class=\"wp-image-599186 has-transparency\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/03\/CPTR-4-1024x608.png 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/03\/CPTR-4-300x178.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/03\/CPTR-4-768x456.png 768w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/03\/CPTR-4.png 1080w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\"><figcaption class=\"wp-element-caption\">Figure 4. The CPTR architecture [5].<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">That\u2019s pretty much everything you need to know for now. I\u2019ll explain more about the details as we go through the implementation.<\/p>\n<h2 class=\"wp-block-heading\">Module imports &amp; parameter configuration<\/h2>\n<p class=\"wp-block-paragraph\">As always, the first thing we need to do in the code is to import the required modules. In this case, we only import torch and torch.nn since we are about to implement the model from scratch.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\"># Codeblock 1\nimport torch\nimport torch.nn as nn<\/code><\/pre>\n<p class=\"wp-block-paragraph\">Next, we are going to initialize some parameters in Codeblock 2. 
If you have read my previous article about image captioning with GoogLeNet and LSTM, you\u2019ll notice that here, we have a lot more parameters to initialize. In this article, I want to reproduce the CPTR model as closely as possible to the original one, so the parameters mentioned in the paper will be used in this implementation.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\"># Codeblock 2\nBATCH_SIZE         = 1              #(1)\n\nIMAGE_SIZE         = 384            #(2)\nIN_CHANNELS        = 3              #(3)\n\nSEQ_LENGTH         = 30             #(4)\nVOCAB_SIZE         = 10000          #(5)\n\nEMBED_DIM          = 768            #(6)\nPATCH_SIZE         = 16             #(7)\nNUM_PATCHES        = (IMAGE_SIZE\/\/PATCH_SIZE) ** 2  #(8)\nNUM_ENCODER_BLOCKS = 12             #(9)\nNUM_DECODER_BLOCKS = 4              #(10)\nNUM_HEADS          = 12             #(11)\nHIDDEN_DIM         = EMBED_DIM * 4  #(12)\nDROP_PROB          = 0.1            #(13)<\/code><\/pre>\n<p class=\"wp-block-paragraph\">The first parameter I want to explain is the <code>BATCH_SIZE<\/code>, which is written at the line marked with <code>#(1)<\/code>. The number assigned to this variable is not very important in our case since we are not actually going to train this model. This parameter is set to 1 because the PyTorch layers used here expect input tensors to carry a batch dimension, and I assume that we only have a single sample in a batch.<\/p>\n<p class=\"wp-block-paragraph\">Next, remember that in the case of image captioning we are dealing with images and texts simultaneously. This essentially means that we need to set the parameters for the two. It is mentioned in the paper that the model accepts an RGB image of size 384\u00d7384 for the encoder input. Hence, we assign the values for the <code>IMAGE_SIZE<\/code> and <code>IN_CHANNELS<\/code> variables based on this information (<code>#(2)<\/code> and <code>#(3)<\/code>). 
On the other hand, the paper does not mention the parameters for the captions. So, here I assume that the length of the caption is no more than 30 words (<code>#(4)<\/code>), with the vocabulary size estimated at 10000 unique words (<code>#(5)<\/code>).<\/p>\n<p class=\"wp-block-paragraph\">The remaining parameters are related to the model configuration. Here we set the <code>EMBED_DIM<\/code> variable to 768 (<code>#(6)<\/code>). On the encoder side, this number indicates the length of the feature vector that represents each 16\u00d716 image patch (<code>#(7)<\/code>). The same concept also applies to the decoder side, but in that case the feature vector will represent a single word in the caption. Talking more specifically about the <code>PATCH_SIZE<\/code> parameter, we are going to use the value to compute the total number of patches in the input image. Since the image has the size of 384\u00d7384, there are 384 \/ 16 = 24 patches along each side, which gives 24 \u00d7 24 = 576 patches in total (<code>#(8)<\/code>).<\/p>\n<p class=\"wp-block-paragraph\">When it comes to using an encoder-decoder architecture, it is possible to specify the number of encoder and decoder blocks to be used. Using more blocks typically allows the model to perform better in terms of accuracy, yet in return, it will require more computational power. The authors of this paper decided to stack 12 encoder blocks (<code>#(9)<\/code>) and 4 decoder blocks (<code>#(10)<\/code>). Next, since CPTR is a transformer-based model, it is necessary to specify the number of attention heads within the attention blocks inside the encoders and the decoders. In this case, the authors use 12 attention heads (<code>#(11)<\/code>). The value for the <code>HIDDEN_DIM<\/code> parameter is not mentioned anywhere in the paper. However, according to the ViT and the Transformer papers, this parameter is configured to be 4 times larger than <code>EMBED_DIM<\/code> (<code>#(12)<\/code>). The dropout rate is not mentioned in the paper either. 
Hence, I arbitrarily set <code>DROP_PROB<\/code> to 0.1 (<code>#(13)<\/code>).<\/p>\n<h2 class=\"wp-block-heading\">Encoder<\/h2>\n<p class=\"wp-block-paragraph\">Now that the modules and parameters have been set up, we will get into the encoder part of the network. In this section we are going to implement and explain every single component inside the green box in Figure 4 one by one.<\/p>\n<h3 class=\"wp-block-heading\">Patch embedding<\/h3>\n<figure class=\"wp-block-image size-full\"><img data-recalc-dims=\"1\" data-dominant-color=\"f4f3f1\" data-has-transparency=\"true\" style=\"--dominant-color: #f4f3f1;\" loading=\"lazy\" decoding=\"async\" width=\"900\" height=\"537\" src=\"https:\/\/i0.wp.com\/towardsdatascience.com\/wp-content\/uploads\/2025\/03\/CPTR-5.png?resize=900%2C537&#038;ssl=1\" alt=\"\" class=\"wp-image-599187 has-transparency\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/03\/CPTR-5.png 900w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/03\/CPTR-5-300x179.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/03\/CPTR-5-768x458.png 768w\" sizes=\"auto, (max-width: 900px) 100vw, 900px\"><figcaption class=\"wp-element-caption\">Figure 5. Dividing the input image into patches and converting them into vectors [5].<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">You can see in Figure 5 above that the first step is to divide the input image into patches. This is essentially done because instead of focusing on local patterns like CNNs, ViT captures global context by learning the relationships between these patches. We can model this process with the <code>Patcher<\/code> class shown in Codeblock 3 below. 
For the sake of simplicity, here I also include the process inside the <em>patch embedding<\/em> block within the same class.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\"># Codeblock 3\nclass Patcher(nn.Module):\n   def __init__(self):\n       super().__init__()\n\n       #(1)\n       self.unfold = nn.Unfold(kernel_size=PATCH_SIZE, stride=PATCH_SIZE)\n\n       #(2)\n       self.linear_projection = nn.Linear(in_features=IN_CHANNELS*PATCH_SIZE*PATCH_SIZE,\n                                          out_features=EMBED_DIM)\n      \n   def forward(self, images):\n       print(f'images\\t\\t: {images.size()}')\n       images = self.unfold(images)  #(3)\n       print(f'after unfold\\t: {images.size()}')\n      \n       images = images.permute(0, 2, 1)  #(4)\n       print(f'after permute\\t: {images.size()}')\n      \n       features = self.linear_projection(images)  #(5)\n       print(f'after lin proj\\t: {features.size()}')\n      \n       return features<\/code><\/pre>\n<p class=\"wp-block-paragraph\">The patching itself is done using the <code>nn.Unfold<\/code> layer (<code>#(1)<\/code>). Here we need to set both the <code>kernel_size<\/code> and <code>stride<\/code> parameters to <code>PATCH_SIZE<\/code> (16) so that the resulting patches do not overlap with each other. This layer also automatically flattens these patches once it is applied to the input image. Meanwhile, the <code>nn.Linear<\/code> layer (<code>#(2)<\/code>) is employed to perform linear projection, i.e., the process done by the <em>patch embedding<\/em> block. By setting the <code>out_features<\/code> parameter to <code>EMBED_DIM<\/code>, this layer will map every single flattened patch into a feature vector of length 768.<\/p>\n<p class=\"wp-block-paragraph\">The entire process should make more sense once you read the <code>forward()<\/code> method. You can see at line <code>#(3)<\/code> in the same codeblock that the input image is directly processed by the unfold layer. 
Next, we need to process the resulting tensor with the <code>permute()<\/code> method (<code>#(4)<\/code>) to swap axes 1 and 2 (counting the batch axis as axis 0) before feeding it to the <code>linear_projection<\/code> layer (<code>#(5)<\/code>). Additionally, here I also print out the tensor dimension after each layer so that you can better understand the transformation made at each step.<\/p>\n<p class=\"wp-block-paragraph\">In order to check if our <code>Patcher<\/code> class works properly, we can just pass a dummy tensor through the network. Look at Codeblock 4 below to see how I do it.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\"># Codeblock 4\npatcher  = Patcher()\n\nimages   = torch.randn(BATCH_SIZE, IN_CHANNELS, IMAGE_SIZE, IMAGE_SIZE)\nfeatures = patcher(images)<\/code><\/pre>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\"># Codeblock 4 Output\nimages         : torch.Size([1, 3, 384, 384])\nafter unfold   : torch.Size([1, 768, 576])  #(1)\nafter permute  : torch.Size([1, 576, 768])  #(2)\nafter lin proj : torch.Size([1, 576, 768])  #(3)<\/code><\/pre>\n<p class=\"wp-block-paragraph\">The tensor I passed above represents an RGB image of size 384\u00d7384. Here we can see that after the unfold operation is performed, the tensor dimension changed to 1\u00d7768\u00d7576 (<code>#(1)<\/code>), denoting the flattened 3\u00d716\u00d716 patch (3 \u00d7 16 \u00d7 16 = 768 values) for each of the 576 patches. Unfortunately, this output shape does not match what we need. Remember that in ViT, we perceive image patches as a sequence, so we need to swap axes 1 and 2 because, with the batch at axis 0, axis 1 conventionally represents the sequence (temporal) axis, while axis 2 holds the feature vector of each timestep. After the <code>permute()<\/code> operation is performed, our tensor has the dimension of 1\u00d7576\u00d7768 (<code>#(2)<\/code>). 
Lastly, we pass this tensor through the linear projection layer, after which the tensor shape remains the same since <code>EMBED_DIM<\/code> (768) happens to equal the length of each flattened patch (<code>#(3)<\/code>). Despite having the same dimension, the final tensor is no longer raw pixel data: it is a projection whose trainable weights can be adjusted during training to produce a more useful representation of each patch.<\/p>\n<h3 class=\"wp-block-heading\">Learnable positional embedding<\/h3>\n<figure class=\"wp-block-image size-full\"><img data-recalc-dims=\"1\" data-dominant-color=\"fafaf9\" data-has-transparency=\"true\" style=\"--dominant-color: #fafaf9;\" loading=\"lazy\" decoding=\"async\" width=\"900\" height=\"537\" src=\"https:\/\/i0.wp.com\/towardsdatascience.com\/wp-content\/uploads\/2025\/03\/CPTR-6.png?resize=900%2C537&#038;ssl=1\" alt=\"\" class=\"wp-image-599188 has-transparency\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/03\/CPTR-6.png 900w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/03\/CPTR-6-300x179.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/03\/CPTR-6-768x458.png 768w\" sizes=\"auto, (max-width: 900px) 100vw, 900px\"><figcaption class=\"wp-element-caption\">Figure 6. Injecting the learnable positional embeddings into the embedded image patches [5].<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">After the input image has successfully been converted into a sequence of patches, the next thing to do is to inject the so-called <em>positional embedding<\/em> tensor. This is essentially done because a transformer without positional embedding is permutation-invariant, meaning that it treats the input sequence as if its order does not matter. 
Interestingly, since an image is not a literal sequence, we should set the positional embedding to be <em>learnable<\/em> such that it will be able to somewhat reorder the patch sequence in whatever way works best for representing the spatial information. However, keep in mind that the term \u201creordering\u201d here does not mean that we physically rearrange the sequence. Rather, it does so by adjusting the embedding weights.<\/p>\n<p class=\"wp-block-paragraph\">The implementation is pretty simple. All we need to do is initialize a tensor using <code>nn.Parameter<\/code> whose dimension is set to match the output of the <code>Patcher<\/code> model, i.e., 576\u00d7768. I also write <code>requires_grad=True<\/code> to make it explicit that the tensor is trainable, although this is already the default for <code>nn.Parameter<\/code>. Look at Codeblock 5 below for the details.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\"># Codeblock 5\nclass LearnableEmbedding(nn.Module):\n   def __init__(self):\n       super().__init__()\n       self.learnable_embedding = nn.Parameter(torch.randn(size=(NUM_PATCHES, EMBED_DIM)),\n                                               requires_grad=True)\n      \n   def forward(self):\n       pos_embed = self.learnable_embedding\n       print(f'learnable embedding\\t: {pos_embed.size()}')\n      \n       return pos_embed<\/code><\/pre>\n<p class=\"wp-block-paragraph\">Now let\u2019s run the following codeblock to see whether our <code>LearnableEmbedding<\/code> class works properly. 
You can see in the printed output that it successfully created the positional embedding tensor as expected.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\"># Codeblock 6\nlearnable_embedding = LearnableEmbedding()\n\npos_embed = learnable_embedding()<\/code><\/pre>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-\"># Codeblock 6 Output\nlearnable embedding : torch.Size([576, 768])<\/code><\/pre>\n<h3 class=\"wp-block-heading\">The main encoder block<\/h3>\n<figure class=\"wp-block-image size-full\"><img data-recalc-dims=\"1\" data-dominant-color=\"f8f7f6\" data-has-transparency=\"true\" style=\"--dominant-color: #f8f7f6;\" loading=\"lazy\" decoding=\"async\" width=\"900\" height=\"537\" src=\"https:\/\/i0.wp.com\/towardsdatascience.com\/wp-content\/uploads\/2025\/03\/CPTR-7.png?resize=900%2C537&#038;ssl=1\" alt=\"\" class=\"wp-image-599189 has-transparency\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/03\/CPTR-7.png 900w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/03\/CPTR-7-300x179.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/03\/CPTR-7-768x458.png 768w\" sizes=\"auto, (max-width: 900px) 100vw, 900px\"><figcaption class=\"wp-element-caption\">Figure 7. The main encoder block [5].<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">The next thing we are going to do is to construct the main encoder block displayed in Figure 7 above. Here you can see that this block consists of several sub-components, namely <em>self-attention<\/em>, <em>layer norm<\/em>, FFN (Feed-Forward Network), and another <em>layer norm<\/em>. 
Codeblock 7a below shows how I initialize these layers inside the <code>__init__()<\/code> method of the <code>EncoderBlock<\/code> class.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\"># Codeblock 7a\nclass EncoderBlock(nn.Module):\n   def __init__(self):\n       super().__init__()\n      \n       #(1)\n       self.self_attention = nn.MultiheadAttention(embed_dim=EMBED_DIM,\n                                                   num_heads=NUM_HEADS,\n                                                   batch_first=True,  #(2)\n                                                   dropout=DROP_PROB)\n      \n       self.layer_norm_0 = nn.LayerNorm(EMBED_DIM)  #(3)\n      \n       self.ffn = nn.Sequential(  #(4)\n           nn.Linear(in_features=EMBED_DIM, out_features=HIDDEN_DIM),\n           nn.GELU(),\n           nn.Dropout(p=DROP_PROB),\n           nn.Linear(in_features=HIDDEN_DIM, out_features=EMBED_DIM),\n       )\n      \n       self.layer_norm_1 = nn.LayerNorm(EMBED_DIM)  #(5)<\/code><\/pre>\n<p class=\"wp-block-paragraph\">I\u2019ve previously mentioned that the idea of ViT is to capture the relationships between patches within an image. This process is done by the <em>multihead attention<\/em> layer I initialize at line <code>#(1)<\/code> in the above codeblock. One thing to keep in mind here is that we need to set the <code>batch_first<\/code> parameter to <code>True<\/code> (<code>#(2)<\/code>). This is essentially done so that the attention layer will be compatible with our tensor shape, in which the batch dimension (<code>batch_size<\/code>) is at the 0th axis of the tensor. Next, the two layer normalization layers need to be initialized separately, as shown at lines <code>#(3)<\/code> and <code>#(5)<\/code>. 
Lastly, we initialize the FFN block at line <code>#(4)<\/code>, where the layers stacked using <code>nn.Sequential<\/code> follow the structure defined in the following equation.<\/p>\n<figure class=\"wp-block-image size-full\"><img data-recalc-dims=\"1\" data-dominant-color=\"eaeaea\" data-has-transparency=\"false\" style=\"--dominant-color: #eaeaea;\" loading=\"lazy\" decoding=\"async\" width=\"582\" height=\"52\" src=\"https:\/\/i0.wp.com\/towardsdatascience.com\/wp-content\/uploads\/2025\/03\/CPTR-8.png?resize=582%2C52&#038;ssl=1\" alt=\"\" class=\"wp-image-599190 not-transparent\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/03\/CPTR-8.png 582w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/03\/CPTR-8-300x27.png 300w\" sizes=\"auto, (max-width: 582px) 100vw, 582px\"><figcaption class=\"wp-element-caption\">Figure 8. The operations done inside the FFN block [1].<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">Now that the <code>__init__()<\/code> method is complete, we will continue with the <code>forward()<\/code> method. 
Let\u2019s take a look at Codeblock 7b below.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\"># Codeblock 7b\n   def forward(self, features):  #(1)\n      \n       residual = features  #(2)\n       print(f'features &amp; residual\\t: {residual.size()}')\n      \n       #(3)\n       features, self_attn_weights = self.self_attention(query=features,\n                                                         key=features,\n                                                         value=features)\n       print(f'after self attention\\t: {features.size()}')\n       print(f\"self attn weights\\t: {self_attn_weights.shape}\")\n      \n       features = self.layer_norm_0(features + residual)  #(4)\n       print(f'after norm\\t\\t: {features.size()}')\n      \n\n       residual = features\n       print(f'\\nfeatures &amp; residual\\t: {residual.size()}')\n      \n       features = self.ffn(features)  #(5)\n       print(f'after ffn\\t\\t: {features.size()}')\n      \n       features = self.layer_norm_1(features + residual)\n       print(f'after norm\\t\\t: {features.size()}')\n      \n       return features<\/code><\/pre>\n<p class=\"wp-block-paragraph\">Here you can see that the input tensor is named <code>features<\/code> (<code>#(1)<\/code>). I name it this way because the input of the <code>EncoderBlock<\/code> is the image that has already been processed with <code>Patcher<\/code> and <code>LearnableEmbedding<\/code>, instead of a raw image. Before doing anything, notice in the encoder block that there is a branch separated from the main flow which then returns back to the normalization layer. This branch is commonly known as a <em>residual connection<\/em>. To implement this, we need to store the original input tensor in the <code>residual<\/code> variable as I demonstrate at line <code>#(2)<\/code>. With the original input kept in <code>residual<\/code>, we are now ready to process it with the multihead attention layer (<code>#(3)<\/code>). 
Since this is a <em>self<\/em>-attention (not a <em>cross<\/em>-attention), the <code>query<\/code>, <code>key<\/code>, and <code>value<\/code> inputs for this layer are all derived from the <code>features<\/code> tensor. The layer normalization operation is then performed at line <code>#(4)<\/code>, where the input to this layer is the sum of the attention output and the residual connection. The remaining steps are basically the same as what I just explained, except that here we replace the self-attention block with FFN (<code>#(5)<\/code>).<\/p>\n<p class=\"wp-block-paragraph\">In the following codeblock, I\u2019ll test the <code>EncoderBlock<\/code> class by passing a dummy tensor of size 1\u00d7576\u00d7768, simulating an output tensor from the previous operations.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\"># Codeblock 8\nencoder_block = EncoderBlock()\n\nfeatures = torch.randn(BATCH_SIZE, NUM_PATCHES, EMBED_DIM)\nfeatures = encoder_block(features)<\/code><\/pre>\n<p class=\"wp-block-paragraph\">Below is what the tensor dimension looks like throughout the entire process inside the model.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\"># Codeblock 8 Output\nfeatures &amp; residual  : torch.Size([1, 576, 768])  #(1)\nafter self attention : torch.Size([1, 576, 768])\nself attn weights    : torch.Size([1, 576, 576])  #(2)\nafter norm           : torch.Size([1, 576, 768])\n\nfeatures &amp; residual  : torch.Size([1, 576, 768])\nafter ffn            : torch.Size([1, 576, 768])  #(3)\nafter norm           : torch.Size([1, 576, 768])  #(4)<\/code><\/pre>\n<p class=\"wp-block-paragraph\">Here you can see that the final output tensor (<code>#(4)<\/code>) has the same size as the input (<code>#(1)<\/code>), allowing us to stack multiple encoder blocks without having to worry about messing up the tensor dimensions. 
Not only that, the size of the tensor also appears to be unchanged from the beginning all the way to the last layer. In fact, there are actually lots of transformations performed inside the attention block, but we just can\u2019t see them since the entire process is done internally by the <code>nn.MultiheadAttention<\/code> layer. One of the tensors produced in the layer that we can observe is the attention weight (<code>#(2)<\/code>). This weight matrix has the size of 576\u00d7576 because <code>nn.MultiheadAttention<\/code> averages the weights over the heads by default, and it stores information regarding the relationships between one patch and every other patch in the image. Furthermore, changes in tensor dimension also happen inside the FFN layer. The feature vector of each patch, which has the initial length of 768, is expanded to 3072 and immediately shrunk back to 768 again (<code>#(3)<\/code>). However, this transformation is not printed since the process is wrapped with <code>nn.Sequential<\/code> back at line <code>#(4)<\/code> in Codeblock 7a.<\/p>\n<h3 class=\"wp-block-heading\">ViT encoder<\/h3>\n<figure class=\"wp-block-image size-full\"><img data-recalc-dims=\"1\" data-dominant-color=\"f0eeec\" data-has-transparency=\"true\" style=\"--dominant-color: #f0eeec;\" loading=\"lazy\" decoding=\"async\" width=\"900\" height=\"537\" src=\"https:\/\/i0.wp.com\/towardsdatascience.com\/wp-content\/uploads\/2025\/03\/CPTR-9.png?resize=900%2C537&#038;ssl=1\" alt=\"\" class=\"wp-image-599191 has-transparency\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/03\/CPTR-9.png 900w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/03\/CPTR-9-300x179.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/03\/CPTR-9-768x458.png 768w\" sizes=\"auto, (max-width: 900px) 100vw, 900px\"><figcaption class=\"wp-element-caption\">Figure 9. 
The entire ViT Encoder in the CPTR architecture [5].<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">Now that we have finished implementing all encoder components, we will assemble them to construct the actual ViT Encoder. We are going to do it in the <code>Encoder<\/code> class in Codeblock 9.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\"># Codeblock 9\nclass Encoder(nn.Module):\n   def __init__(self):\n       super().__init__()\n       self.patcher = Patcher()  #(1)\n       self.learnable_embedding = LearnableEmbedding()  #(2)\n\n       #(3)\n       self.encoder_blocks = nn.ModuleList(EncoderBlock() for _ in range(NUM_ENCODER_BLOCKS))\n  \n   def forward(self, images):  #(4)\n       print(f'images\\t\\t\\t: {images.size()}')\n      \n       features = self.patcher(images)  #(5)\n       print(f'after patcher\\t\\t: {features.size()}')\n      \n       features = features + self.learnable_embedding()  #(6)\n       print(f'after learn embed\\t: {features.size()}')\n      \n       for i, encoder_block in enumerate(self.encoder_blocks):\n           features = encoder_block(features)  #(7)\n           print(f\"after encoder block #{i}\\t: {features.shape}\")\n\n       return features<\/code><\/pre>\n<p class=\"wp-block-paragraph\">Inside the <code>__init__()<\/code> method, what we need to do is to initialize all components we created earlier, i.e., <code>Patcher<\/code> (<code>#(1)<\/code>), <code>LearnableEmbedding<\/code> (<code>#(2)<\/code>), and <code>EncoderBlock<\/code> (<code>#(3)<\/code>). In this case, the <code>EncoderBlock<\/code> is initialized inside <code>nn.ModuleList<\/code> since we want to repeat it <code>NUM_ENCODER_BLOCKS<\/code> (12) times. As for the <code>forward()<\/code> method, it first accepts a raw image as the input (<code>#(4)<\/code>). 
We then process it with the <code>patcher<\/code> layer (<code>#(5)<\/code>) to divide the image into small patches and transform them with the linear projection operation. The learnable positional embedding tensor is then injected into the resulting output by element-wise addition (<code>#(6)<\/code>). Lastly, we pass it into the 12 encoder blocks sequentially with a simple for loop (<code>#(7)<\/code>).<\/p>\n<p class=\"wp-block-paragraph\">Now, in Codeblock 10, I am going to pass a dummy image through the entire encoder. Note that since I want to focus on the flow of this Encoder class, I re-run the previous classes we created earlier with the <code>print()<\/code> functions commented out so that the outputs will look neat.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\"># Codeblock 10\nencoder = Encoder()\n\nimages = torch.randn(BATCH_SIZE, IN_CHANNELS, IMAGE_SIZE, IMAGE_SIZE)\nfeatures = encoder(images)<\/code><\/pre>\n<p class=\"wp-block-paragraph\">And below is what the flow of the tensor looks like. Here, we can see that our dummy input image successfully passed through all layers in the network, including the encoder blocks that we repeat 12 times. The resulting output tensor is now context-aware, meaning that it already contains information about the relationships between patches within the image. 
Therefore, this tensor is now ready to be processed further with the decoder, which will later be discussed in the subsequent section.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\"># Codeblock 10 Output\nimages                  : torch.Size([1, 3, 384, 384])\nafter patcher           : torch.Size([1, 576, 768])\nafter learn embed       : torch.Size([1, 576, 768])\nafter encoder block #0  : torch.Size([1, 576, 768])\nafter encoder block #1  : torch.Size([1, 576, 768])\nafter encoder block #2  : torch.Size([1, 576, 768])\nafter encoder block #3  : torch.Size([1, 576, 768])\nafter encoder block #4  : torch.Size([1, 576, 768])\nafter encoder block #5  : torch.Size([1, 576, 768])\nafter encoder block #6  : torch.Size([1, 576, 768])\nafter encoder block #7  : torch.Size([1, 576, 768])\nafter encoder block #8  : torch.Size([1, 576, 768])\nafter encoder block #9  : torch.Size([1, 576, 768])\nafter encoder block #10 : torch.Size([1, 576, 768])\nafter encoder block #11 : torch.Size([1, 576, 768])<\/code><\/pre>\n<h3 class=\"wp-block-heading\">ViT encoder (alternative)<\/h3>\n<p class=\"wp-block-paragraph\">I want to show you something before we talk about the decoder. If you think that our approach above is too complicated, it is actually possible for you to use <code>nn.TransformerEncoderLayer<\/code> from PyTorch so that you don\u2019t need to implement the <code>EncoderBlock<\/code> class from scratch. 
To do so, I am going to reimplement the <code>Encoder<\/code> class, but this time I\u2019ll name it <code>EncoderTorch<\/code>.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\"># Codeblock 11\nclass EncoderTorch(nn.Module):\n   def __init__(self):\n       super().__init__()\n       self.patcher = Patcher()\n       self.learnable_embedding = LearnableEmbedding()\n      \n       #(1)\n       encoder_block = nn.TransformerEncoderLayer(d_model=EMBED_DIM,\n                                                  nhead=NUM_HEADS,\n                                                  dim_feedforward=HIDDEN_DIM,\n                                                  dropout=DROP_PROB,\n                                                  batch_first=True)\n      \n       #(2)\n       self.encoder_blocks = nn.TransformerEncoder(encoder_layer=encoder_block,\n                                                   num_layers=NUM_ENCODER_BLOCKS)\n  \n   def forward(self, images):\n       print(f'images\\t\\t\\t: {images.size()}')\n      \n       features = self.patcher(images)\n       print(f'after patcher\\t\\t: {features.size()}')\n      \n       features = features + self.learnable_embedding()\n       print(f'after learn embed\\t: {features.size()}')\n      \n       features = self.encoder_blocks(features)  #(3)\n       print(f'after encoder blocks\\t: {features.size()}')\n\n       return features<\/code><\/pre>\n<p class=\"wp-block-paragraph\">In the above codeblock, instead of using the <code>EncoderBlock<\/code> class, we use <code>nn.TransformerEncoderLayer<\/code> (<code>#(1)<\/code>), which creates a single encoder block based on the parameters we pass to it. To repeat it multiple times, we can just use <code>nn.TransformerEncoder<\/code> and pass a number to the <code>num_layers<\/code> parameter (<code>#(2)<\/code>). 
With this approach, we don\u2019t need to write the forward pass in a loop like we did earlier (<code>#(3)<\/code>).<\/p>\n<p class=\"wp-block-paragraph\">The testing code in Codeblock 12 below is exactly the same as the one in Codeblock 10, except that here I use the <code>EncoderTorch<\/code> class. You can also see here that the final output shape is the same as the previous one.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\"># Codeblock 12\nencoder_torch = EncoderTorch()\n\nimages = torch.randn(BATCH_SIZE, IN_CHANNELS, IMAGE_SIZE, IMAGE_SIZE)\nfeatures = encoder_torch(images)<\/code><\/pre>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\"># Codeblock 12 Output\nimages               : torch.Size([1, 3, 384, 384])\nafter patcher        : torch.Size([1, 576, 768])\nafter learn embed    : torch.Size([1, 576, 768])\nafter encoder blocks : torch.Size([1, 576, 768])<\/code><\/pre>\n<h2 class=\"wp-block-heading\">Decoder<\/h2>\n<p class=\"wp-block-paragraph\">Now that we have successfully created the encoder part of the CPTR architecture, we will talk about the decoder. In this section I am going to implement every single component inside the blue box in Figure 4. Based on the figure, we can see that the decoder accepts two inputs, i.e., the image caption ground truth (the lower part of the blue box) and the sequence of embedded patches produced by the encoder (the arrow coming from the green box). It is important to know that the architecture drawn in Figure 4 is intended to illustrate the training phase, where the entire caption ground truth is fed into the decoder. Later in the inference phase, we only provide a &lt;BOS&gt; (<em>Beginning of Sentence<\/em>) token for the caption input. The decoder will then predict each word sequentially based on the given image and the previously generated words. 
This process is commonly known as an <em>autoregressive<\/em> mechanism.<\/p>\n<h3 class=\"wp-block-heading\">Sinusoidal positional embedding<\/h3>\n<figure class=\"wp-block-image size-full\"><img data-recalc-dims=\"1\" data-dominant-color=\"fafaf9\" data-has-transparency=\"true\" style=\"--dominant-color: #fafaf9;\" loading=\"lazy\" decoding=\"async\" width=\"900\" height=\"537\" src=\"https:\/\/i0.wp.com\/towardsdatascience.com\/wp-content\/uploads\/2025\/03\/CPTR-10.png?resize=900%2C537&#038;ssl=1\" alt=\"\" class=\"wp-image-599192 has-transparency\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/03\/CPTR-10.png 900w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/03\/CPTR-10-300x179.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/03\/CPTR-10-768x458.png 768w\" sizes=\"auto, (max-width: 900px) 100vw, 900px\"><figcaption class=\"wp-element-caption\">Figure 10. Where the sinusoidal positional embedding component is located in the decoder [5].<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">If you take a look at the CPTR model, you\u2019ll see that the first step in the decoder is to convert each word into the corresponding feature vector representation using the <em>word embedding<\/em> block. However, since this step is very easy, we are going to implement it later. Now let\u2019s assume that this word vectorization process is already done, so we can move to the positional embedding part.<\/p>\n<p class=\"wp-block-paragraph\">As I\u2019ve mentioned earlier, since transformer is permutation-invariant by nature, we need to apply positional embedding to the input sequence. Different from the previous one, here we use the so-called <em>sinusoidal positional embedding<\/em>. We can think of it like a method to label each word vector by assigning numbers obtained from a sinusoidal wave. 
By doing so, we can expect our model to understand word order thanks to the information given by the wave patterns.<\/p>\n<p class=\"wp-block-paragraph\">If you go back to Codeblock 6 Output, you\u2019ll see that the positional embedding tensor in the encoder has the size of <code>NUM_PATCHES<\/code> \u00d7 <code>EMBED_DIM<\/code> (576\u00d7768). What we basically want to do in the decoder is to create a tensor having the size of <code>SEQ_LENGTH<\/code> \u00d7 <code>EMBED_DIM<\/code> (30\u00d7768), whose values are computed based on the equation shown in Figure 11. This tensor is then set to be non-trainable because a sequence of words must maintain a fixed order to preserve its meaning.<\/p>\n<figure class=\"wp-block-image size-full\"><img data-recalc-dims=\"1\" data-dominant-color=\"f4f4f4\" data-has-transparency=\"false\" style=\"--dominant-color: #f4f4f4;\" loading=\"lazy\" decoding=\"async\" width=\"900\" height=\"149\" src=\"https:\/\/i0.wp.com\/towardsdatascience.com\/wp-content\/uploads\/2025\/03\/CPTR-11.png?resize=900%2C149&#038;ssl=1\" alt=\"\" class=\"wp-image-599193 not-transparent\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/03\/CPTR-11.png 900w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/03\/CPTR-11-300x50.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/03\/CPTR-11-768x127.png 768w\" sizes=\"auto, (max-width: 900px) 100vw, 900px\"><figcaption class=\"wp-element-caption\">Figure 11. The equation for creating sinusoidal positional encoding proposed in the Transformer paper [6].<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">I will only explain the following code briefly because I have already discussed it more thoroughly in my previous article about Transformer. Generally speaking, what we do here is create the sine and cosine waves using <code>torch.sin()<\/code> (<code>#(1)<\/code>) and <code>torch.cos()<\/code> (<code>#(2)<\/code>). 
The resulting two tensors are then interleaved using the code at lines <code>#(3)<\/code> and <code>#(4)<\/code>, so that the sine values occupy the even indices and the cosine values occupy the odd indices.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\"># Codeblock 13\nclass SinusoidalEmbedding(nn.Module):\n   def forward(self):\n       pos = torch.arange(SEQ_LENGTH).reshape(SEQ_LENGTH, 1)\n       print(f\"pos\\t\\t: {pos.shape}\")\n      \n       i = torch.arange(0, EMBED_DIM, 2)\n       denominator = torch.pow(10000, i\/EMBED_DIM)\n       print(f\"denominator\\t: {denominator.shape}\")\n      \n       even_pos_embed = torch.sin(pos\/denominator)  #(1)\n       odd_pos_embed  = torch.cos(pos\/denominator)  #(2)\n       print(f\"even_pos_embed\\t: {even_pos_embed.shape}\")\n      \n       stacked = torch.stack([even_pos_embed, odd_pos_embed], dim=2)  #(3)\n       print(f\"stacked\\t\\t: {stacked.shape}\")\n\n       pos_embed = torch.flatten(stacked, start_dim=1, end_dim=2)  #(4)\n       print(f\"pos_embed\\t: {pos_embed.shape}\")\n      \n       return pos_embed<\/code><\/pre>\n<p class=\"wp-block-paragraph\">Now we can check if the <code>SinusoidalEmbedding<\/code> class above works properly by running Codeblock 14 below. As expected, here you can see that the resulting tensor has the size of 30\u00d7768. 
This dimension matches with the tensor obtained by the process done in the <em>word embedding<\/em> block, allowing them to be summed in an element-wise manner.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\"># Codeblock 14\nsinusoidal_embedding = SinusoidalEmbedding()\npos_embed = sinusoidal_embedding()<\/code><\/pre>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\"># Codeblock 14 Output\npos            : torch.Size([30, 1])\ndenominator    : torch.Size([384])\neven_pos_embed : torch.Size([30, 384])\nstacked        : torch.Size([30, 384, 2])\npos_embed      : torch.Size([30, 768])<\/code><\/pre>\n<h3 class=\"wp-block-heading\">Look-ahead mask<\/h3>\n<figure class=\"wp-block-image size-full\"><img data-recalc-dims=\"1\" data-dominant-color=\"faf9f9\" data-has-transparency=\"true\" style=\"--dominant-color: #faf9f9;\" loading=\"lazy\" decoding=\"async\" width=\"900\" height=\"537\" src=\"https:\/\/i0.wp.com\/towardsdatascience.com\/wp-content\/uploads\/2025\/03\/CPTR-12.png?resize=900%2C537&#038;ssl=1\" alt=\"\" class=\"wp-image-599194 has-transparency\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/03\/CPTR-12.png 900w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/03\/CPTR-12-300x179.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/03\/CPTR-12-768x458.png 768w\" sizes=\"auto, (max-width: 900px) 100vw, 900px\"><figcaption class=\"wp-element-caption\">Figure 12. A look-ahead mask needs to be applied to the masked-self attention layer [5].<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">The next thing I am going to talk about in the decoder is the <em>masked self-attention<\/em> layer highlighted in the above figure. I am not going to code the attention mechanism from scratch. 
Rather, I\u2019ll only implement the so-called <em>look-ahead mask<\/em>, which will be useful for the self-attention layer so that it doesn\u2019t attend to the subsequent words in the caption during the training phase.<\/p>\n<p class=\"wp-block-paragraph\">The way to do it is pretty easy: we just need to create a triangular matrix whose size matches that of the attention weight matrix, i.e., <code>SEQ_LENGTH<\/code> \u00d7 <code>SEQ_LENGTH<\/code> (30\u00d730). Look at the <code>create_mask()<\/code> function below for the details.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\"># Codeblock 15\ndef create_mask(seq_length):\n   mask = torch.tril(torch.ones((seq_length, seq_length)))  #(1)\n   mask[mask == 0] = -float('inf')  #(2)\n   mask[mask == 1] = 0  #(3)\n   return mask<\/code><\/pre>\n<p class=\"wp-block-paragraph\">Even though creating a triangular matrix can simply be done with <code>torch.tril()<\/code> and <code>torch.ones()<\/code> (<code>#(1)<\/code>), here we need to make a little modification by changing the 0 values to <em>-inf<\/em> (<code>#(2)<\/code>) and the 1s to 0 (<code>#(3)<\/code>). This is done because the <code>nn.MultiheadAttention<\/code> layer adds a float mask to the attention scores before the softmax. By assigning <em>-inf<\/em> to the subsequent words, their attention weights become zero after the softmax, so the attention mechanism completely ignores them. Again, the internal process inside an attention layer has also been discussed in detail in<a href=\"https:\/\/towardsdatascience.com\/paper-walkthrough-attention-is-all-you-need-80399cdc59e1\"> my previous article about transformer<\/a>.<\/p>\n<p class=\"wp-block-paragraph\">Now I am going to run the function with <code>seq_length=7<\/code> so that you can see what the mask actually looks like. 
Later in the complete flow, we need to set the <code>seq_length<\/code> parameter to <code>SEQ_LENGTH<\/code> (30) so that it matches with the actual caption length.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\"># Codeblock 16\nmask_example = create_mask(seq_length=7)\nmask_example<\/code><\/pre>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\"># Codeblock 16 Output\ntensor([[0., -inf, -inf, -inf, -inf, -inf, -inf],\n       [0., 0., -inf, -inf, -inf, -inf, -inf],\n       [0., 0., 0., -inf, -inf, -inf, -inf],\n       [0., 0., 0., 0., -inf, -inf, -inf],\n       [0., 0., 0., 0., 0., -inf, -inf],\n       [0., 0., 0., 0., 0., 0., -inf],\n       [0., 0., 0., 0., 0., 0., 0.]])<\/code><\/pre>\n<h3 class=\"wp-block-heading\">The main decoder block<\/h3>\n<figure class=\"wp-block-image size-full\"><img data-recalc-dims=\"1\" data-dominant-color=\"f7f6f5\" data-has-transparency=\"true\" style=\"--dominant-color: #f7f6f5;\" loading=\"lazy\" decoding=\"async\" width=\"900\" height=\"537\" src=\"https:\/\/i0.wp.com\/towardsdatascience.com\/wp-content\/uploads\/2025\/03\/CPTR-13.png?resize=900%2C537&#038;ssl=1\" alt=\"\" class=\"wp-image-599195 has-transparency\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/03\/CPTR-13.png 900w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/03\/CPTR-13-300x179.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/03\/CPTR-13-768x458.png 768w\" sizes=\"auto, (max-width: 900px) 100vw, 900px\"><figcaption class=\"wp-element-caption\">Figure 13. The main decoder block [5].<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">We can see in the above figure that the structure of the decoder block is a bit longer than that of the encoder block. It seems like everything is nearly the same, except that the decoder part has a <em>cross-attention<\/em> mechanism and an additional layer normalization step placed after it. 
This cross-attention layer can actually be perceived as the bridge between the encoder and the decoder, as it is employed to capture the relationships between each word in the caption and every single patch in the input image. The two arrows coming from the encoder are the <em>key<\/em> and <em>value<\/em> inputs for the attention layer, whereas the <em>query<\/em> is derived from the previous layer in the decoder itself. Look at the Codeblock 17a and 17b below to see the implementation of the entire decoder block.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\"># Codeblock 17a\nclass DecoderBlock(nn.Module):\n   def __init__(self):\n       super().__init__()\n      \n       #(1)\n       self.self_attention = nn.MultiheadAttention(embed_dim=EMBED_DIM,\n                                                   num_heads=NUM_HEADS,\n                                                   batch_first=True,\n                                                   dropout=DROP_PROB)\n       #(2)\n       self.layer_norm_0 = nn.LayerNorm(EMBED_DIM)\n       #(3)\n       self.cross_attention = nn.MultiheadAttention(embed_dim=EMBED_DIM,\n                                                    num_heads=NUM_HEADS,\n                                                    batch_first=True,\n                                                    dropout=DROP_PROB)\n\n       #(4)\n       self.layer_norm_1 = nn.LayerNorm(EMBED_DIM)\n      \n       #(5)      \n       self.ffn = nn.Sequential(\n           nn.Linear(in_features=EMBED_DIM, out_features=HIDDEN_DIM),\n           nn.GELU(),\n           nn.Dropout(p=DROP_PROB),\n           nn.Linear(in_features=HIDDEN_DIM, out_features=EMBED_DIM),\n       )\n      \n       #(6)\n       self.layer_norm_2 = nn.LayerNorm(EMBED_DIM)<\/code><\/pre>\n<p class=\"wp-block-paragraph\">In the <code>__init__()<\/code> method, we first initialize both self-attention (<code>#(1)<\/code>) and cross-attention (<code>#(3)<\/code>) layers with 
<code>nn.MultiheadAttention<\/code>. These two layers appear to be exactly the same now, but later you\u2019ll see the difference in the <code>forward()<\/code> method. The three layer normalization operations are initialized separately as shown at lines <code>#(2)<\/code>, <code>#(4)<\/code> and <code>#(6)<\/code>, since each of them will contain different normalization parameters. Lastly, the <code>ffn<\/code> layer (<code>#(5)<\/code>) is exactly the same as the one in the encoder, which basically follows the equation back in Figure 8.<\/p>\n<p class=\"wp-block-paragraph\">As for the <code>forward()<\/code> method below, it starts by accepting three inputs: <code>features<\/code>, <code>captions<\/code>, and <code>attn_mask<\/code>, which denote the tensor coming from the encoder, the tensor from the decoder itself, and a look-ahead mask, respectively (<code>#(1)<\/code>). The remaining steps are somewhat similar to those of the <code>EncoderBlock<\/code>, except that here we use the multihead attention block twice. The first attention mechanism takes <code>captions<\/code> as the <code>query<\/code>, <code>key<\/code>, and <code>value<\/code> parameters (<code>#(2)<\/code>). This is done because we want the layer to capture the context within the <code>captions<\/code> tensor itself, hence the name <em>self-attention<\/em>. Here we also need to pass the <code>attn_mask<\/code> parameter to this layer so that it cannot see the subsequent words during the training phase. The second attention mechanism is different (<code>#(3)<\/code>). Since we want to combine the information from the encoder and the decoder, we need to pass the <code>captions<\/code> tensor as the <code>query<\/code>, whereas the <code>features<\/code> tensor will be passed as the <code>key<\/code> and <code>value<\/code>, hence the name <em>cross-attention<\/em>. 
A look-ahead mask is not necessary in the cross-attention layer since the model is allowed to see the entire input image at once, both during training and at inference, rather than looking at the patches one by one. Once the tensor has been processed by the two attention layers, we pass it through the feed-forward network (<code>#(4)<\/code>). Lastly, don\u2019t forget to create the residual connections and apply the layer normalization steps after each sub-component.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\"># Codeblock 17b\n   def forward(self, features, captions, attn_mask):  #(1)\n       print(f\"attn_mask\\t\\t: {attn_mask.shape}\")\n       residual = captions\n       print(f\"captions &amp; residual\\t: {captions.shape}\")\n      \n       #(2)\n       captions, self_attn_weights = self.self_attention(query=captions,\n                                                         key=captions,\n                                                         value=captions,\n                                                         attn_mask=attn_mask)\n       print(f\"after self attention\\t: {captions.shape}\")\n       print(f\"self attn weights\\t: {self_attn_weights.shape}\")\n      \n       captions = self.layer_norm_0(captions + residual)\n       print(f\"after norm\\t\\t: {captions.shape}\")\n      \n      \n       print(f\"\\nfeatures\\t\\t: {features.shape}\")\n       residual = captions\n       print(f\"captions &amp; residual\\t: {captions.shape}\")\n      \n       #(3)\n       captions, cross_attn_weights = self.cross_attention(query=captions,\n                                                           key=features,\n                                                           value=features)\n       print(f\"after cross attention\\t: {captions.shape}\")\n       print(f\"cross attn weights\\t: {cross_attn_weights.shape}\")\n      \n       captions = self.layer_norm_1(captions + residual)\n       print(f\"after norm\\t\\t: {captions.shape}\")\n      \n       residual = captions\n       print(f\"\\ncaptions &amp; residual\\t: {captions.shape}\")\n      \n       captions = self.ffn(captions)  #(4)\n       print(f\"after ffn\\t\\t: {captions.shape}\")\n      \n       captions = self.layer_norm_2(captions + residual)\n       print(f\"after norm\\t\\t: {captions.shape}\")\n      \n       return captions\n\n<\/code><\/pre>\n<p class=\"wp-block-paragraph\">Now that the <code>DecoderBlock<\/code> class is complete, we can test it with the following code.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\"># Codeblock 18\ndecoder_block = DecoderBlock()\n\nfeatures = torch.randn(BATCH_SIZE, NUM_PATCHES, EMBED_DIM)  #(1)\ncaptions = torch.randn(BATCH_SIZE, SEQ_LENGTH, EMBED_DIM)   #(2)\nlook_ahead_mask = create_mask(seq_length=SEQ_LENGTH)  #(3)\n\ncaptions = decoder_block(features, captions, look_ahead_mask)<\/code><\/pre>\n<p class=\"wp-block-paragraph\">Here we assume that <code>features<\/code> is a tensor containing a sequence of patch embeddings produced by the <code>encoder<\/code> (<code>#(1)<\/code>), while <code>captions<\/code> is a sequence of embedded words (<code>#(2)<\/code>). The <code>seq_length<\/code> parameter of the look-ahead mask is set to <code>SEQ_LENGTH<\/code> (30) to match the number of words in the caption (<code>#(3)<\/code>). 
The tensor dimensions after each step are displayed in the following output.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\"># Codeblock 18 Output\nattn_mask             : torch.Size([30, 30])\ncaptions &amp; residual   : torch.Size([1, 30, 768])\nafter self attention  : torch.Size([1, 30, 768])\nself attn weights     : torch.Size([1, 30, 30])    #(1)\nafter norm            : torch.Size([1, 30, 768])\n\nfeatures              : torch.Size([1, 576, 768])\ncaptions &amp; residual   : torch.Size([1, 30, 768])\nafter cross attention : torch.Size([1, 30, 768])\ncross attn weights    : torch.Size([1, 30, 576])   #(2)\nafter norm            : torch.Size([1, 30, 768])\n\ncaptions &amp; residual   : torch.Size([1, 30, 768])\nafter ffn             : torch.Size([1, 30, 768])\nafter norm            : torch.Size([1, 30, 768])<\/code><\/pre>\n<p class=\"wp-block-paragraph\">Here we can see that our <code>DecoderBlock<\/code> class works properly as it successfully processed the input tensors all the way to the last layer in the network. Here I want you to take a closer look at the attention weights at lines <code>#(1)<\/code> and <code>#(2)<\/code>. Based on these two lines, we can confirm that our decoder implementation is correct since the attention weight produced by the self-attention layer has the size of 30\u00d730 (<code>#(1)<\/code>), which basically means that this layer really captured the context within the input caption. Meanwhile, the attention weight matrix generated by the cross-attention layer has the size of 30\u00d7576 (<code>#(2)<\/code>), indicating that it successfully captured the relationships between the words and the patches. 
This essentially implies that after cross-attention operation is performed, the resulting captions tensor has been enriched with the information from the image.<\/p>\n<h3 class=\"wp-block-heading\">Transformer decoder<\/h3>\n<figure class=\"wp-block-image size-full\"><img data-recalc-dims=\"1\" data-dominant-color=\"f3f2f1\" data-has-transparency=\"true\" style=\"--dominant-color: #f3f2f1;\" loading=\"lazy\" decoding=\"async\" width=\"900\" height=\"537\" src=\"https:\/\/i0.wp.com\/towardsdatascience.com\/wp-content\/uploads\/2025\/03\/CPTR-14.png?resize=900%2C537&#038;ssl=1\" alt=\"\" class=\"wp-image-599196 has-transparency\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/03\/CPTR-14.png 900w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/03\/CPTR-14-300x179.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/03\/CPTR-14-768x458.png 768w\" sizes=\"auto, (max-width: 900px) 100vw, 900px\"><figcaption class=\"wp-element-caption\">Figure 14. The entire Transformer Decoder in the CPTR architecture [5].<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">Now that we have successfully created all components for the entire decoder, what I am going to do next is to put them together into a single class. 
Look at Codeblock 19a and 19b below to see how I do that.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\"># Codeblock 19a\nclass Decoder(nn.Module):\n   def __init__(self):\n       super().__init__()\n\n       #(1)\n       self.embedding = nn.Embedding(num_embeddings=VOCAB_SIZE,\n                                     embedding_dim=EMBED_DIM)\n\n       #(2)\n       self.sinusoidal_embedding = SinusoidalEmbedding()\n\n       #(3)\n       self.decoder_blocks = nn.ModuleList(DecoderBlock() for _ in range(NUM_DECODER_BLOCKS))\n\n       #(4)\n       self.linear = nn.Linear(in_features=EMBED_DIM,\n                               out_features=VOCAB_SIZE)<\/code><\/pre>\n<p class=\"wp-block-paragraph\">If you compare this <code>Decoder<\/code> class with the <code>Encoder<\/code> class from Codeblock 9, you\u2019ll notice that they are somewhat similar in terms of the structure. In the encoder, we convert image patches into vectors using <code>Patcher<\/code>, while in the decoder we convert every single word in the caption into a vector using the <code>nn.Embedding<\/code> layer (<code>#(1)<\/code>), which I haven\u2019t explained yet. Afterward, we initialize the positional embedding layer, where for the decoder we use the <em>sinusoidal<\/em> rather than the <em>trainable<\/em> one (<code>#(2)<\/code>). Next, we stack multiple decoder blocks using <code>nn.ModuleList<\/code> (<code>#(3)<\/code>). The linear layer written at line <code>#(4)<\/code>, which doesn\u2019t exist in the encoder, is needed here since it is responsible for mapping each of the embedded words into a vector of length <code>VOCAB_SIZE<\/code> (10000). 
Later on, this vector will contain the logit of every word in the dictionary, and what we need to do afterward is just to take the index containing the highest value, i.e., the most likely word to be predicted.<\/p>\n<p class=\"wp-block-paragraph\">The flow of the tensors within the <code>forward()<\/code> method itself is also pretty similar to the one in the <code>Encoder<\/code> class. In Codeblock 19b below we pass <code>features<\/code>, <code>captions<\/code>, and <code>attn_mask<\/code> as the input (<code>#(1)<\/code>). Keep in mind that in this case the <code>captions<\/code> tensor contains the raw word sequence (as integer token indices), so we need to vectorize these words with the embedding layer beforehand (<code>#(2)<\/code>). Next, we inject the sinusoidal positional embedding tensor using the code at line <code>#(3)<\/code> before eventually passing it through the four decoder blocks sequentially (<code>#(4)<\/code>). Finally, we pass the resulting tensor through the last linear layer to obtain the <code>prediction<\/code> logits (<code>#(5)<\/code>).<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\"># Codeblock 19b\n   def forward(self, features, captions, attn_mask):  #(1)\n       print(f\"features\\t\\t: {features.shape}\")\n       print(f\"captions\\t\\t: {captions.shape}\")\n      \n       captions = self.embedding(captions)  #(2)\n       print(f\"after embedding\\t\\t: {captions.shape}\")\n      \n       captions = captions + self.sinusoidal_embedding()  #(3)\n       print(f\"after sin embed\\t\\t: {captions.shape}\")\n      \n       for i, decoder_block in enumerate(self.decoder_blocks):\n           captions = decoder_block(features, captions, attn_mask)  #(4)\n           print(f\"after decoder block #{i}\\t: {captions.shape}\")\n      \n       captions = self.linear(captions)  #(5)\n       print(f\"after linear\\t\\t: {captions.shape}\")\n      \n       return captions<\/code><\/pre>\n<p class=\"wp-block-paragraph\">At this point you might be wondering why we don\u2019t implement the softmax 
activation function as drawn in the illustration. This is because during the training phase, softmax is typically included within the loss function (for example, PyTorch\u2019s <code>nn.CrossEntropyLoss<\/code> expects raw logits), whereas in the inference phase, the index of the largest value will remain the same regardless of whether softmax is applied.<\/p>\n<p class=\"wp-block-paragraph\">Now let\u2019s run the following testing code to check whether there are errors in our implementation. Previously I mentioned that the <code>captions<\/code> input of the <code>Decoder<\/code> class is a raw word sequence. To simulate this, we can simply create a sequence of random integers in the range from 0 up to (but not including) <code>VOCAB_SIZE<\/code> (10000) with the length of <code>SEQ_LENGTH<\/code> (30) words (<code>#(1)<\/code>).<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\"># Codeblock 20\ndecoder = Decoder()\n\nfeatures = torch.randn(BATCH_SIZE, NUM_PATCHES, EMBED_DIM)\ncaptions = torch.randint(0, VOCAB_SIZE, (BATCH_SIZE, SEQ_LENGTH))  #(1)\n\ncaptions = decoder(features, captions, look_ahead_mask)<\/code><\/pre>\n<p class=\"wp-block-paragraph\">And below is what the resulting output looks like. 
Here you can see in the last line that the linear layer produced a tensor of size 30\u00d710000, indicating that our decoder model is now capable of predicting the logit scores for each word in the vocabulary across all 30 sequence positions.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\"># Codeblock 20 Output\nfeatures               : torch.Size([1, 576, 768])\ncaptions               : torch.Size([1, 30])\nafter embedding        : torch.Size([1, 30, 768])\nafter sin embed        : torch.Size([1, 30, 768])\nafter decoder block #0 : torch.Size([1, 30, 768])\nafter decoder block #1 : torch.Size([1, 30, 768])\nafter decoder block #2 : torch.Size([1, 30, 768])\nafter decoder block #3 : torch.Size([1, 30, 768])\nafter linear           : torch.Size([1, 30, 10000])<\/code><\/pre>\n<h3 class=\"wp-block-heading\">Transformer decoder (alternative)<\/h3>\n<p class=\"wp-block-paragraph\">It is actually also possible to make the code simpler by replacing the <code>DecoderBlock<\/code> class with the <code>nn.TransformerDecoderLayer<\/code>, just like what we did in the ViT Encoder. 
Below is what the code looks like if we use this approach instead.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\"># Codeblock 21\nclass DecoderTorch(nn.Module):\n   def __init__(self):\n       super().__init__()\n       self.embedding = nn.Embedding(num_embeddings=VOCAB_SIZE,\n                                     embedding_dim=EMBED_DIM)\n      \n       self.sinusoidal_embedding = SinusoidalEmbedding()\n      \n       #(1)\n       decoder_block = nn.TransformerDecoderLayer(d_model=EMBED_DIM,\n                                                  nhead=NUM_HEADS,\n                                                  dim_feedforward=HIDDEN_DIM,\n                                                  dropout=DROP_PROB,\n                                                  batch_first=True)\n      \n       #(2)\n       self.decoder_blocks = nn.TransformerDecoder(decoder_layer=decoder_block,\n                                                   num_layers=NUM_DECODER_BLOCKS)\n      \n       self.linear = nn.Linear(in_features=EMBED_DIM,\n                               out_features=VOCAB_SIZE)\n      \n   def forward(self, features, captions, tgt_mask):\n       print(f\"features\\t\\t: {features.shape}\")\n       print(f\"captions\\t\\t: {captions.shape}\")\n      \n       captions = self.embedding(captions)\n       print(f\"after embedding\\t\\t: {captions.shape}\")\n      \n       captions = captions + self.sinusoidal_embedding()\n       print(f\"after sin embed\\t\\t: {captions.shape}\")\n      \n       #(3)\n       captions = self.decoder_blocks(tgt=captions,\n                                      memory=features,\n                                      tgt_mask=tgt_mask)\n       print(f\"after decoder blocks\\t: {captions.shape}\")\n      \n       captions = self.linear(captions)\n       print(f\"after linear\\t\\t: {captions.shape}\")\n      \n       return captions<\/code><\/pre>\n<p class=\"wp-block-paragraph\">The main difference you will see in the 
<code>__init__()<\/code> method is the use of <code>nn.TransformerDecoderLayer<\/code> and <code>nn.TransformerDecoder<\/code> at lines <code>#(1)<\/code> and <code>#(2)<\/code>, where the former initializes a single decoder block and the latter stacks <code>NUM_DECODER_BLOCKS<\/code> copies of it. Next, the <code>forward()<\/code> method is mostly similar to the one in the <code>Decoder<\/code> class, except that the forward propagation through the four decoder blocks happens inside a single call without needing an explicit loop (<code>#(3)<\/code>). One thing that you need to pay attention to in the <code>decoder_blocks<\/code> layer is that the tensor coming from the encoder (features) must be passed as the argument for the <code>memory<\/code> parameter. Meanwhile, the tensor from the decoder itself (captions) has to be passed as the argument for the <code>tgt<\/code> parameter.<\/p>\n<p class=\"wp-block-paragraph\">The testing code for the <code>DecoderTorch<\/code> model below is basically the same as the one written in Codeblock 20. 
Here you can see that this model also generates the final output tensor of size 30\u00d710000.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\"># Codeblock 22\ndecoder_torch = DecoderTorch()\n\nfeatures = torch.randn(BATCH_SIZE, NUM_PATCHES, EMBED_DIM)\ncaptions = torch.randint(0, VOCAB_SIZE, (BATCH_SIZE, SEQ_LENGTH))\n\ncaptions = decoder_torch(features, captions, look_ahead_mask)<\/code><\/pre>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\"># Codeblock 22 Output\nfeatures             : torch.Size([1, 576, 768])\ncaptions             : torch.Size([1, 30])\nafter embedding      : torch.Size([1, 30, 768])\nafter sin embed      : torch.Size([1, 30, 768])\nafter decoder blocks : torch.Size([1, 30, 768])\nafter linear         : torch.Size([1, 30, 10000])<\/code><\/pre>\n<h2 class=\"wp-block-heading\">The entire CPTR model<\/h2>\n<p class=\"wp-block-paragraph\">Finally, it\u2019s time to put the encoder and the decoder parts we just created into a single class to actually construct the CPTR architecture. You can see in Codeblock 23 below that the implementation is very simple. All we need to do here is initialize the encoder (<code>#(1)<\/code>) and the decoder (<code>#(2)<\/code>) components, then pass the raw images and the corresponding caption ground truths as well as the look-ahead mask to the <code>forward()<\/code> method (<code>#(3)<\/code>). 
Additionally, it is also possible for you to replace the <code>Encoder<\/code> and the <code>Decoder<\/code> with <code>EncoderTorch<\/code> and <code>DecoderTorch<\/code>, respectively.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\"># Codeblock 23\nclass EncoderDecoder(nn.Module):\n   def __init__(self):\n       super().__init__()\n       self.encoder = Encoder()  #EncoderTorch()  #(1)\n       self.decoder = Decoder()  #DecoderTorch()  #(2)\n      \n   def forward(self, images, captions, look_ahead_mask):  #(3)\n       print(f\"images\\t\\t\\t: {images.shape}\")\n       print(f\"captions\\t\\t: {captions.shape}\")\n      \n       features = self.encoder(images)\n       print(f\"after encoder\\t\\t: {features.shape}\")\n      \n       captions = self.decoder(features, captions, look_ahead_mask)\n       print(f\"after decoder\\t\\t: {captions.shape}\")\n      \n       return captions<\/code><\/pre>\n<p class=\"wp-block-paragraph\">We can do the testing by passing dummy tensors through it. See Codeblock 24 below for the details. In this case, images is basically just a tensor of random numbers with dimensions 1\u00d73\u00d7384\u00d7384 (<code>#(1)<\/code>), while captions is a tensor of size 1\u00d730 containing random integers (<code>#(2)<\/code>).<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\"># Codeblock 24\nencoder_decoder = EncoderDecoder()\n\nimages = torch.randn(BATCH_SIZE, IN_CHANNELS, IMAGE_SIZE, IMAGE_SIZE)  #(1)\ncaptions = torch.randint(0, VOCAB_SIZE, (BATCH_SIZE, SEQ_LENGTH))  #(2)\n\ncaptions = encoder_decoder(images, captions, look_ahead_mask)<\/code><\/pre>\n<p class=\"wp-block-paragraph\">Below is what the output looks like. 
We can see here that our input images and captions successfully went through all layers in the network, which basically means that the CPTR model we created is now ready to actually be trained on image captioning datasets.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\"># Codeblock 24 Output\nimages         : torch.Size([1, 3, 384, 384])\ncaptions       : torch.Size([1, 30])\nafter encoder  : torch.Size([1, 576, 768])\nafter decoder  : torch.Size([1, 30, 10000])<\/code><\/pre>\n<h2 class=\"wp-block-heading\">Ending<\/h2>\n<p class=\"wp-block-paragraph\">That was pretty much everything about the theory and implementation of the CaPtion TransformeR architecture. Let me know what deep learning architecture I should implement next. Feel free to leave a comment if you spot any mistakes in this article!<\/p>\n<p class=\"wp-block-paragraph\"><em>The code used in this article is available in <\/em><a href=\"https:\/\/github.com\/MuhammadArdiPutra\/medium_articles\/blob\/main\/Image%20Captioning%2C%20Transformer%20Mode%20On.ipynb\"><em>my GitHub repo<\/em><\/a><em>. Here\u2019s the link to my previous article about <\/em><a href=\"https:\/\/towardsdatascience.com\/show-and-tell-e1a1142456e2\/\"><em>image captioning<\/em><\/a><em>, <\/em><a href=\"https:\/\/towardsdatascience.com\/paper-walkthrough-vision-transformer-vit-c5dcf76f1a7a\"><em>Vision Transformer (ViT)<\/em><\/a><em>, and the original <\/em><a href=\"https:\/\/towardsdatascience.com\/paper-walkthrough-attention-is-all-you-need-80399cdc59e1\"><em>Transformer<\/em><\/a><em>.<\/em><\/p>\n<h2 class=\"wp-block-heading\">References<\/h2>\n<p class=\"wp-block-paragraph\">[1] Wei Liu <em>et al.<\/em> CPTR: Full Transformer Network for Image Captioning. Arxiv.<a href=\"https:\/\/arxiv.org\/pdf\/2101.10804\"> https:\/\/arxiv.org\/pdf\/2101.10804<\/a> [Accessed November 16, 2024].<\/p>\n<p class=\"wp-block-paragraph\">[2] Oriol Vinyals <em>et al. 
<\/em>Show and Tell: A Neural Image Caption Generator. Arxiv.<a href=\"https:\/\/arxiv.org\/pdf\/1411.4555\"> https:\/\/arxiv.org\/pdf\/1411.4555<\/a> [Accessed December 3, 2024].<\/p>\n<p class=\"wp-block-paragraph\">[3] Image originally created by author based on: Alexey Dosovitskiy <em>et al.<\/em> An Image is Worth 16\u00d716 Words: Transformers for Image Recognition at Scale. Arxiv.<a href=\"https:\/\/arxiv.org\/pdf\/2010.11929\"> https:\/\/arxiv.org\/pdf\/2010.11929<\/a> [Accessed December 3, 2024].<\/p>\n<p class=\"wp-block-paragraph\">[4] Image originally created by author based on [6].<\/p>\n<p class=\"wp-block-paragraph\">[5] Image originally created by author based on [1].<\/p>\n<p class=\"wp-block-paragraph\">[6] Ashish Vaswani <em>et al.<\/em> Attention Is All You Need. Arxiv.<a href=\"https:\/\/arxiv.org\/pdf\/1706.03762\"> https:\/\/arxiv.org\/pdf\/1706.03762<\/a> [Accessed December 3, 2024].<\/p>\n<p>The post <a href=\"https:\/\/towardsdatascience.com\/image-captioning-transformer-mode-on\/\">Image Captioning, Transformer Mode On<\/a> appeared first on <a href=\"https:\/\/towardsdatascience.com\/\">Towards Data Science<\/a>.<\/p>\n<\/div>\n<p>Muhammad Ardi<br \/>\n<a href=\"https:\/\/towardsdatascience.com\/image-captioning-transformer-mode-on\/\">Go to original source<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Image Captioning, Transformer Mode On Introduction In my previous article, I discussed one of the earliest Deep Learning approaches for image captioning. If you\u2019re interested in reading it, you can find the link to that article at the end of this one. 
Today, I would like to talk about Image Captioning again, but this time [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[62,69,67,88,1961,70,1962],"tags":[1964,845,1963],"class_list":["post-2287","post","type-post","status-publish","format-standard","hentry","category-aimldsaimlds","category-artificial-intelligence","category-deep-dives","category-deep-learning","category-image-captioning","category-machine-learning","category-transformer","tag-captioning","tag-image","tag-transformer"],"_links":{"self":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/2287"}],"collection":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/comments?post=2287"}],"version-history":[{"count":0,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/2287\/revisions"}],"wp:attachment":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/media?parent=2287"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/categories?post=2287"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/tags?post=2287"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}