{"id":3594,"date":"2025-05-06T07:04:48","date_gmt":"2025-05-06T07:04:48","guid":{"rendered":"https:\/\/mailitics.com\/index.php\/2025\/05\/06\/the-cnn-that-challenges-vit\/"},"modified":"2025-05-06T07:04:48","modified_gmt":"2025-05-06T07:04:48","slug":"the-cnn-that-challenges-vit","status":"publish","type":"post","link":"https:\/\/mailitics.com\/index.php\/2025\/05\/06\/the-cnn-that-challenges-vit\/","title":{"rendered":"The CNN That Challenges ViT"},"content":{"rendered":"<div>\n<h2 class=\"wp-block-heading\"><mdspan datatext=\"el1744736702000\" class=\"mdspan-comment\">Introduction<\/mdspan><\/h2>\n<p class=\"wp-block-paragraph\">The rise of ViT (Vision Transformer) has led many of us to think that CNNs are obsolete. But is this really true?<\/p>\n<p class=\"wp-block-paragraph\">It is widely believed that the impressive performance of ViT comes primarily from its transformer-based architecture. However, researchers from Meta argued that this is not entirely true. If we take a closer look at the architectural design, ViT introduced radical changes not only to the structure of the network but also to the model configuration. Meta\u2019s researchers hypothesized that it is perhaps not the structure that makes ViT superior, but its configuration. To test this, they applied the ViT configuration parameters to the ResNet architecture from 2015.<\/p>\n<p class=\"wp-block-paragraph\">And the results supported their hypothesis.<\/p>\n<hr class=\"wp-block-separator has-alpha-channel-opacity is-style-dotted\">\n<p class=\"wp-block-paragraph\">In this article I am going to talk about ConvNeXt, which was first proposed in the paper titled \u201c<em>A ConvNet for the 2020s<\/em>\u201d written by Liu <em>et al.<\/em> [1] back in 2022. 
Here I\u2019ll also implement it from scratch with PyTorch so that you can get a better understanding of the changes made to the original ResNet. The official ConvNeXt implementation is available in the authors\u2019 GitHub repository [2], but I find it too complex to explain line by line. Thus, I decided to write my own version so that I can explain it in my own style, which I believe is more beginner-friendly. As a disclaimer, my implementation might not perfectly replicate the original one, but I think it is still a useful learning resource. So, after reading this article, I recommend checking the original code, especially if you\u2019re planning to use ConvNeXt in your own project.<\/p>\n<hr class=\"wp-block-separator has-alpha-channel-opacity is-style-dotted\">\n<h2 class=\"wp-block-heading\">The Hyperparameter Tuning<\/h2>\n<p class=\"wp-block-paragraph\">What the authors essentially did in the research was hyperparameter tuning on the ResNet model. Generally speaking, there were five aspects they experimented with: <em>macro design<\/em>, <em>ResNeXt<\/em>, <em>inverted bottleneck<\/em>, <em>large kernel<\/em>, and <em>micro design<\/em>. We can see the experimental results for these aspects in the following figure.<\/p>\n<figure class=\"wp-block-image\"><img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/05\/1jVLy6bFa6ldR3NNfpodS8g.png?ssl=1\" alt=\"\" class=\"wp-image-603177\"><figcaption class=\"wp-element-caption\">Figure 1. The hyperparameter tuning results done on the original ResNet architecture [1].<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">There were two ResNet variants used in their experiments: ResNet-50 and ResNet-200 (shown in purple and gray, respectively). Let\u2019s now focus on the results obtained from tuning the ResNet-50 architecture. 
Based on the figure, we can see that this model initially obtained 78.8% accuracy on the ImageNet dataset. The authors tuned this model until it eventually reached 82.0%, surpassing the then state-of-the-art Swin-T architecture, which only achieved 81.3% (the orange bar). This tuned version of the ResNet model is what the paper proposes as ConvNeXt. Their experiments on ResNet-200 confirm that the previous results are valid, since its tuned version, i.e., ConvNeXt-B, also surpasses the performance of Swin-B (the larger variant of Swin-T).<\/p>\n<h3 class=\"wp-block-heading\">Macro Design<\/h3>\n<p class=\"wp-block-paragraph\">The first change made to the original ResNet was in the <em>macro design<\/em>. If we take a closer look at Figure 2 below, we can see that a ResNet model essentially consists of four main stages, namely <em>conv2_x<\/em>, <em>conv3_x<\/em>, <em>conv4_x <\/em>and <em>conv5_x<\/em>, each of which comprises multiple <em>bottleneck<\/em> blocks. For ResNet-50 specifically, the bottleneck block is repeated 3, 4, 6 and 3 times in the four stages, respectively. Later on, I\u2019ll refer to these numbers as the <em>stage ratio<\/em>.<\/p>\n<figure class=\"wp-block-image alignwide\"><img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/05\/1XpgH0qrZ4uzUrzxbQw7nfA.png?ssl=1\" alt=\"\" class=\"wp-image-603183\"><figcaption class=\"wp-element-caption\">Figure 2. The ResNet architecture variants\u00a0[3].<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">The authors of the ConvNeXt paper changed this stage ratio to follow the Swin-T architecture, i.e., 1:1:3:1, which turns the block counts of ResNet-50 from (3, 4, 6, 3) into (3, 3, 9, 3). Well, it\u2019s actually 2:2:6:2 if you look at the architectural details from the original Swin Transformer paper in Figure 3, but that is just a multiple of the same ratio. By applying this configuration, the authors obtained a 0.6% improvement (from 78.8% to 79.4%). 
Thus, they decided to use 1:1:3:1 stage ratio for the upcoming experiments.<\/p>\n<figure class=\"wp-block-image alignwide\"><img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/05\/19LdnTmb2t9aRTDjWPelqxA.png?ssl=1\" alt=\"\" class=\"wp-image-603182\"><figcaption class=\"wp-element-caption\">Figure 3. The Swin Transformer architecture variants\u00a0[4].<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">Still related to macro design, changes were also made to the first convolution layer of ResNet. If you go back to Figure 2 (the <em>conv1<\/em> row), you\u2019ll see that it originally uses 7\u00d77 kernel with stride 2, which reduces the image size from 224\u00d7224 to 112\u00d7112. Being inspired by Swin Transformer, authors also wanted to treat the input image as non-overlapping patches. Thus, they changed the kernel size to 4\u00d74 and the stride to 4. This idea was actually adopted from the original ViT, where it uses 16\u00d716 kernel with stride 16. One thing you need to know in ConvNeXt is that the resulting patches are treated as a standard image rather than a sequence. With this modification, the accuracy slightly improved from 79.4% to 79.5%. Hence, authors used this configuration for the first convolution layer in the next experiments.<\/p>\n<h3 class=\"wp-block-heading\">ResNeXt-ification<\/h3>\n<p class=\"wp-block-paragraph\">As the macro design is done, the next thing authors did was to adopt the ResNeXt architecture, which was first proposed in a paper titled \u201c<em>Aggregated Residual Transformations for Deep Neural Networks<\/em>\u201d [5]. The idea of ResNeXt is that it basically applies group convolution to the bottleneck blocks of the ResNet architecture. 
In case you\u2019re not yet familiar with group convolution, it essentially works by separating input channels into groups and performing convolution operations within each group independently, allowing faster computation as the number of groups increases. ConvNeXt adopts this idea by setting the number of groups to be the same as the number of kernels. This approach, which is commonly known as <em>depthwise convolution<\/em>, enables the network to obtain the lowest possible computational complexity. However, it is important to note that increasing the number of convolution groups like this leads to a reduction in accuracy as it lowers the model capacity to learn. Thus, the drop in accuracy to 78.3% was expected.<\/p>\n<p class=\"wp-block-paragraph\">That wasn\u2019t the end of the ResNeXt-ification section, though. In fact, the ResNeXt paper gives us a guidance that if we increase the number of groups, we also need to expand the width of the network, i.e., add more channels. Thus, ConvNeXt authors readjusted the number of kernels based on the one used in Swin-T. You can see in Figure 2 and 3 that ResNet originally uses 64, 128, 256 and 512 kernels in each stage, whereas Swin-T uses 96, 192, 384, and 768. Such an increase in the model width allows the network to significantly push the accuracy to 80.5%.<\/p>\n<h3 class=\"wp-block-heading\">Inverted Bottleneck<\/h3>\n<p class=\"wp-block-paragraph\">Still with Figure 2, it is also seen that ResNet-50, ResNet-101, and ResNet-152 share the exact same bottleneck structure. For instance, the block at stage <em>conv5_x<\/em> consists of 3 convolution layers with 512, 512, and 2048 kernels, where the input of the first convolution is either 1024 (coming from the <em>conv4_x<\/em> stage) or 2048 (from the previous block in the <em>conv5_x<\/em> stage itself). These ResNet variations essentially follow the <em>wide \u2192 narrow \u2192 wide<\/em> structure, which is the reason that this block is called <em>bottleneck<\/em>. 
Instead of using a structure like this, ConvNeXt employs the inverted version of bottleneck, where it follows the <em>narrow \u2192 wide \u2192 narrow<\/em> structure adopted from the feed-forward layer of the Transformer architecture. In Figure 4 below (a) is the <em>bottleneck<\/em> block used in ResNet and (b) is the so-called <em>inverted bottleneck<\/em> block. By using this structure, the model accuracy increased from 80.5% to 80.6%.<\/p>\n<figure class=\"wp-block-image\"><img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/05\/1AX881phAfF4DmZk6UrTD6g.png?ssl=1\" alt=\"\" class=\"wp-image-603179\"><figcaption class=\"wp-element-caption\">Figure 4. The bottleneck block in ResNeXt (a), the inverted bottleneck block (b), and the ConvNeXt block (c)\u00a0[1].<\/figcaption><\/figure>\n<h3 class=\"wp-block-heading\">Kernel Size<\/h3>\n<p class=\"wp-block-paragraph\">The next exploration was done on the kernel size inside the inverted bottleneck block. Before experimenting with different kernel sizes, further modification was done to the structure of the block, where authors swapped the order of the first and second layer such that the depthwise convolution is now placed at the beginning of the block as seen in Figure 4 (c). Thanks to this modification, the block is now called <em>ConvNeXt block<\/em> as it no longer completely resembles the original inverted bottleneck structure. This idea was actually adopted from Transformer, where the MSA (Multihead Self-Attention) layer is placed before the MLP layers. In the case of ConvNeXt, the depthwise convolution acts as the replacement of MSA, while the linear layers in MLP Transformers are replaced by pointwise convolutions. Simply moving up the depthwise convolution like this reduced the accuracy from 80.6% to 79.9%. 
However, this temporary drop is acceptable because the change also reduces the FLOPs, and the kernel size experiments that follow are meant to recover the accuracy.<\/p>\n<p class=\"wp-block-paragraph\">The kernel size experiments were applied only to the depthwise convolution layer, leaving the remaining pointwise convolutions unchanged. Here the authors tried several kernel sizes and found that 7\u00d77 worked best, as it recovered the accuracy back to 80.6% at a lower computational cost than the inverted bottleneck from the previous step (4.2 vs 4.6 GFLOPS). Interestingly, this kernel size matches the window dimensions in the Swin Transformer architecture, i.e., the size of the local window within which self-attention is computed. You can actually see this in Figure 3, where the window sizes in Swin Transformer variants are all 7\u00d77.<\/p>\n<h3 class=\"wp-block-heading\">Micro Design<\/h3>\n<p class=\"wp-block-paragraph\">The final aspect tuned in the paper is the so-called <em>micro design<\/em>, which essentially refers to the layer-level details of the network. Similar to the previous ones, the choices made here are mostly adopted from Transformers. The authors initially replaced ReLU with GELU. Even though the accuracy remained the same with this replacement (80.6%), they decided to keep this activation function for the subsequent experiments. The accuracy finally increased after the number of activation functions was reduced. Instead of applying GELU after each convolution layer in the ConvNeXt block, this activation function was placed only between the two pointwise convolutions. This modification boosted the accuracy to 81.3%, at which point the score was already on par with the Swin-T architecture while still having lower GFLOPS (4.2 vs 4.5).<\/p>\n<p class=\"wp-block-paragraph\">Next, it is a common practice to use the <em>Conv-BN-ReLU<\/em> structure in CNN-based architectures, which is exactly what ResNet implements as well. 
Instead of following this convention, authors decided to implement only a single batch normalization layer, which is placed before the first pointwise convolution layer. This change improved the accuracy to 81.4%, surpassing the accuracy of Swin-T by a little bit. Despite this achievement, parameter tuning was still continued by replacing batch norm with layer norm, which again raised the accuracy by 0.1% to 81.5%. All the modifications related to micro design resulted in the architecture shown in Figure 5 (the rightmost image). Here you can see how a ConvNeXt block differs from Swin Transformer and ResNet blocks.<\/p>\n<figure class=\"wp-block-image\"><img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/05\/1x_2FqNNvmAzS9uq_PubJeA.png?ssl=1\" alt=\"\" class=\"wp-image-603180\"><figcaption class=\"wp-element-caption\">Figure 5. What the Swin-T, ResNet-50 and ConvNeXt-T blocks look like at the initial stage\u00a0[1].<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">The last thing the authors did related to the micro design was applying separate downsampling layers. In the original ResNet architecture, the spatial dimension of a tensor reduces by half when we move from one stage to another. You can see in Figure 2 that initially ResNet accepts input of size 224\u00d7224 which then shrinks to 112\u00d7112, 56\u00d756, 28\u00d728, 14\u00d714, and 7\u00d77 at stage <em>conv1<\/em>, <em>conv2_x<\/em>, <em>conv3_x<\/em>, <em>conv4_x<\/em> and <em>conv5_x<\/em>, respectively. Especially in <em>conv2_x<\/em> and the subsequent ones, the spatial dimension reduction is done by changing the stride parameter of the pointwise convolution to 2. Instead of doing so, ConvNeXt performs downsampling by placing another convolution layer right before the element-wise summation operation within the block. 
The kernel size and stride of this layer are set to 2, simulating a non-overlapping sliding window. In fact, it is mentioned in the paper that naively using this separate downsampling layer caused training to diverge. Nevertheless, the authors managed to solve this issue by adding layer normalization layers at several parts of the network, i.e., before each downsampling layer, after the <em>stem<\/em> stage and after the global average pooling layer (right before the final output layer). With this tuning, the authors boosted the accuracy to 82.0%, which is noticeably higher than Swin-T (81.3%) while still having the exact same GFLOPS (4.5).<\/p>\n<p class=\"wp-block-paragraph\">And that\u2019s basically all the modifications made to the original ResNet to create the ConvNeXt architecture. Don\u2019t worry if it still feels a bit unclear for now. I believe things will become clearer as we get into the code.<\/p>\n<hr class=\"wp-block-separator has-alpha-channel-opacity is-style-dotted\">\n<h2 class=\"wp-block-heading\">ConvNeXt Implementation<\/h2>\n<p class=\"wp-block-paragraph\">Figure 6 below displays the details of the entire ConvNeXt-T architecture, whose components we will implement one by one. Here you can also see how it differs from ResNet-50 and Swin-T, the two models that are comparable to ConvNeXt-T.<\/p>\n<figure class=\"wp-block-image\"><img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/05\/1Posc_QEd2WO7PjOtgFJuTw.png?ssl=1\" alt=\"\" class=\"wp-image-603181\"><figcaption class=\"wp-element-caption\">Figure 6. The details of the ResNet-50, ConvNeXt-T, and Swin-T architectures [1].<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">When it comes to the implementation, the first thing we need to do is to import the required modules. 
The only two we import here are the base <code>torch<\/code> module and its <code>nn<\/code> submodule for loading neural network layers.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\"># Codeblock 1\nimport torch\nimport torch.nn as nn<\/code><\/pre>\n<h3 class=\"wp-block-heading\">ConvNeXt Block<\/h3>\n<p class=\"wp-block-paragraph\">Now let\u2019s start with the ConvNeXt block. You can see in Figure 6 that the block structures in <em>res2<\/em>, <em>res3<\/em>, <em>res4<\/em>, and <em>res5<\/em> stages are basically the same, in which all of those correspond to the rightmost illustration in Figure 5. Thanks to these identical structures, we can implement them in a single class and use it repeatedly. Look at the Codeblock 2a and 2b below to see how I do that.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\"># Codeblock 2a\nclass ConvNeXtBlock(nn.Module):\n    def __init__(self, num_channels):         #(1)\n        super().__init__()\n        hidden_channels = num_channels * 4    #(2)\n\n        \n        self.conv0 = nn.Conv2d(in_channels=num_channels,         #(3) \n                               out_channels=num_channels,        #(4)\n                               kernel_size=7,    #(5)\n                               stride=1,\n                               padding=3,        #(6)\n                               groups=num_channels)              #(7)\n        \n        self.norm = nn.LayerNorm(normalized_shape=num_channels)  #(8)\n        \n        self.conv1 = nn.Conv2d(in_channels=num_channels,         #(9)\n                               out_channels=hidden_channels, \n                               kernel_size=1, \n                               stride=1, \n                               padding=0)\n        \n        self.gelu = nn.GELU()  #(10)\n        \n        self.conv2 = nn.Conv2d(in_channels=hidden_channels,      #(11)\n                               out_channels=num_channels, \n          
                     kernel_size=1, \n                               stride=1, \n                               padding=0)<\/code><\/pre>\n<p class=\"wp-block-paragraph\">I decided to name this class <code>ConvNeXtBlock<\/code>. You can see at line <code>#(1)<\/code> in the above codeblock that this class accepts <code>num_channels<\/code> as its only parameter, which denotes both the number of input and output channels. Remember that a ConvNeXt block follows the pattern of the inverted bottleneck structure, i.e., <em>narrow<\/em> \u2192 <em>wide<\/em> \u2192 <em>narrow<\/em>. If you take a closer look at Figure 6, you\u2019ll notice that the <em>wide<\/em> part is 4 times larger than the <em>narrow<\/em> part. Thus, we set the value of the <code>hidden_channels<\/code> variable accordingly (<code>#(2)<\/code>).<\/p>\n<p class=\"wp-block-paragraph\">Next, we initialize 3 convolution layers, which I refer to as <code>conv0<\/code>, <code>conv1<\/code> and <code>conv2<\/code>. Each of these convolution layers has its own specification. For <code>conv0<\/code>, the number of input and output channels is the same, which is why both its <code>in_channels<\/code> and <code>out_channels<\/code> parameters are set to <code>num_channels<\/code> (<code>#(3\u20134)<\/code>). We set the kernel size of this layer to 7\u00d77 (<code>#(5)<\/code>). Given this specification, we need to set the padding size to 3 in order to retain the spatial dimension (<code>#(6)<\/code>). Don\u2019t forget to set the <code>groups<\/code> parameter to <code>num_channels<\/code> because we want this to be a depthwise convolution layer (<code>#(7)<\/code>). On the other hand, the <code>conv1<\/code> layer (<code>#(9)<\/code>) is responsible for increasing the number of channels, whereas the subsequent <code>conv2<\/code> layer (<code>#(11)<\/code>) shrinks the tensor back to the original channel count. 
It is important to note that <code>conv1<\/code> and <code>conv2<\/code> both use a 1\u00d71 kernel, which essentially means that they only combine information along the channel dimension. Additionally, here we also need to initialize layer norm (<code>#(8)<\/code>) and the GELU activation function (<code>#(10)<\/code>) as the replacements for batch norm and ReLU.<\/p>\n<p class=\"wp-block-paragraph\">Now that all layers required in the ConvNeXtBlock have been initialized, what we need to do next is to define the flow of the tensor in the <code>forward()<\/code> method below.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\"># Codeblock 2b\n    def forward(self, x):\n        # The print statements trace the tensor shape after each step\n        residual = x                 #(1)\n        print(f'x &amp; residual\\t: {x.size()}')\n        \n        x = self.conv0(x)\n        print(f'after conv0\\t: {x.size()}')\n        \n        x = x.permute(0, 2, 3, 1)    #(2)\n        print(f'after permute\\t: {x.size()}')\n        \n        x = self.norm(x)\n        print(f'after norm\\t: {x.size()}')\n        \n        x = x.permute(0, 3, 1, 2)    #(3)\n        print(f'after permute\\t: {x.size()}')\n        \n        x = self.conv1(x)\n        print(f'after conv1\\t: {x.size()}')\n        \n        x = self.gelu(x)\n        print(f'after gelu\\t: {x.size()}')\n        \n        x = self.conv2(x)\n        print(f'after conv2\\t: {x.size()}')\n        \n        x = x + residual             #(4)\n        print(f'after summation\\t: {x.size()}')\n        \n        return x<\/code><\/pre>\n<p class=\"wp-block-paragraph\">What we basically do in the above code is pass the tensor sequentially through each layer we defined earlier. However, there are two things I need to highlight here. First, we need to store the original input tensor in the <code>residual<\/code> variable (<code>#(1)<\/code>), which will skip over all operations within the ConvNeXt block. 
Secondly, remember that layer norm is commonly used for sequential data, which typically has a different shape from that of image data. In PyTorch, <code>nn.LayerNorm<\/code> with <code>normalized_shape=num_channels<\/code> normalizes over the last dimension of the tensor, so the channel axis has to come last. For this reason, we need to permute the tensor such that the shape becomes <em>(N, H, W, C)<\/em> (<code>#(2)<\/code>) before we actually perform the layer normalization operation. Afterwards, don\u2019t forget to permute this tensor back to <em>(N, C, H, W)<\/em> (<code>#(3)<\/code>). The resulting tensor is then passed through the remaining layers before being summed with the residual connection (<code>#(4)<\/code>).<\/p>\n<p class=\"wp-block-paragraph\">To check if our <code>ConvNeXtBlock<\/code> class works properly, we can test it using Codeblock 3 below. Here we are going to simulate the block used in the <em>res2<\/em> stage. So, we set the <code>num_channels<\/code> parameter to 96 (<code>#(1)<\/code>) and create a dummy tensor that represents a batch containing a single 56\u00d756 feature map (<code>#(2)<\/code>).<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\"># Codeblock 3\nconvnext_block_test = ConvNeXtBlock(num_channels=96)  #(1)\nx_test = torch.rand(1, 96, 56, 56)  #(2)\n\nout_test = convnext_block_test(x_test)<\/code><\/pre>\n<p class=\"wp-block-paragraph\">Below is what the resulting output looks like. Looking at the internal flow, all the layers we stacked earlier appear to work properly. At line <code>#(1)<\/code> in the output below we can see that the tensor dimension changed to 1\u00d756\u00d756\u00d796 <em>(N, H, W, C) <\/em>after being permuted. This tensor size then changed back to 1\u00d796\u00d756\u00d756 <em>(N, C, H, W) <\/em>after the second permute operation (<code>#(2)<\/code>). Next, the <em>conv1<\/em> layer expanded the number of channels to 4 times that of the input (<code>#(3)<\/code>), which was then reduced back to the original channel count (<code>#(4)<\/code>). 
Here you can see that the tensor shapes at the input and the output of the block are exactly the same, allowing us to stack as many ConvNeXt blocks as we want.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\"># Codeblock 3 Output\nx &amp; residual    : torch.Size([1, 96, 56, 56])\nafter conv0     : torch.Size([1, 96, 56, 56])  \nafter permute   : torch.Size([1, 56, 56, 96])    #(1)\nafter norm      : torch.Size([1, 56, 56, 96])\nafter permute   : torch.Size([1, 96, 56, 56])    #(2)\nafter conv1     : torch.Size([1, 384, 56, 56])   #(3)\nafter gelu      : torch.Size([1, 384, 56, 56])\nafter conv2     : torch.Size([1, 96, 56, 56])    #(4)\nafter summation : torch.Size([1, 96, 56, 56])<\/code><\/pre>\n<hr class=\"wp-block-separator has-alpha-channel-opacity is-style-dotted\">\n<h3 class=\"wp-block-heading\">ConvNeXt Block Transition<\/h3>\n<p class=\"wp-block-paragraph\">The next component I want to implement is the one I refer to as the <em>ConvNeXt block transition<\/em>. The idea of this block is similar to the ConvNeXt block we implemented earlier, except that this transition block is used when we are about to move from one stage to the next. More specifically, this block will later be employed as the first ConvNeXt block in each stage (except <em>res2<\/em>). The reason I implement it in a separate class is that some of its details differ from the ConvNeXt block. Additionally, it is worth noting that the term <em>transition<\/em> is not officially used in the paper. Rather, it\u2019s just the word I use on my own to describe this idea. I also used this technique back when I wrote about the smaller ResNet versions, i.e., ResNet-18 and ResNet-34. 
Click on the link at reference number [6] at the end of this article if you\u2019re interested to read that one.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\"># Codeblock 4a\nclass ConvNeXtBlockTransition(nn.Module):\n    def __init__(self, in_channels, out_channels):  #(1)\n        super().__init__()\n        hidden_channels = out_channels * 4\n        \n        self.projection = nn.Conv2d(in_channels=in_channels,      #(2) \n                                    out_channels=out_channels, \n                                    kernel_size=1, \n                                    stride=2,\n                                    padding=0)\n        \n        self.conv0 = nn.Conv2d(in_channels=in_channels, \n                               out_channels=out_channels, \n                               kernel_size=7,\n                               stride=1,\n                               padding=3,\n                               groups=in_channels)\n        \n        self.norm0 = nn.LayerNorm(normalized_shape=out_channels)\n        \n        self.conv1 = nn.Conv2d(in_channels=out_channels, \n                               out_channels=hidden_channels, \n                               kernel_size=1, \n                               stride=1, \n                               padding=0)\n        \n        self.gelu = nn.GELU()\n        \n        self.conv2 = nn.Conv2d(in_channels=hidden_channels, \n                               out_channels=out_channels, \n                               kernel_size=1, \n                               stride=1,\n                               padding=0)\n        \n        self.norm1 = nn.LayerNorm(normalized_shape=out_channels)  #(3)\n\n        self.downsample = nn.Conv2d(in_channels=out_channels,     #(4)\n                                    out_channels=out_channels, \n                                    kernel_size=2, \n                                    stride=2)<\/code><\/pre>\n<p 
class=\"wp-block-paragraph\">The first difference you might notice here is the signature of the <code>__init__()<\/code> method, where the number of input and output channels is now split into two parameters, as seen at line <code>#(1)<\/code> in Codeblock 4a. This is done because the block needs to take the output tensor from the previous stage, which has a different number of channels from the tensor to be produced for the subsequent stage. Referring to Figure 6, for example, if we were to create the first ConvNeXt block in the <em>res3<\/em> stage, we need to configure it such that it accepts a tensor of 96 channels from <em>res2 <\/em>and returns another tensor with 192 channels.<\/p>\n<p class=\"wp-block-paragraph\">Secondly, here we implement the <em>separate downsample layer<\/em> I explained earlier (<code>#(4)<\/code>) alongside the corresponding layer norm to be placed before it (<code>#(3)<\/code>). As the name suggests, this layer is employed to reduce the spatial dimension of the image by half.<\/p>\n<p class=\"wp-block-paragraph\">Third, we initialize the so-called <em>projection layer<\/em> at line <code>#(2)<\/code>. In the ConvNeXtBlock we created earlier, this layer is not necessary because the input and output tensors have exactly the same shape. In the case of the <em>transition<\/em> block, the image spatial dimension is reduced by half, while at the same time the number of output channels is doubled. 
This <em>projection<\/em> layer is responsible for adjusting the dimension of the residual connection so that it matches the one from the main flow, allowing the element-wise summation to be performed.<\/p>\n<p class=\"wp-block-paragraph\">The <code>forward()<\/code> method in Codeblock 4b below is also similar to the one that belongs to the <code>ConvNeXtBlock<\/code> class, except that here the residual connection needs to be processed with the projection layer (<code>#(1)<\/code>) while the main tensor needs to be downsampled (<code>#(2)<\/code>) before the summation is done at line <code>#(3)<\/code>.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\"># Codeblock 4b\n    def forward(self, x):\n        # The print statements trace the tensor shape after each step\n        print(f'original\\t\\t: {x.size()}')\n\n        residual = self.projection(x)  #(1)\n        print(f'residual after proj\\t: {residual.size()}')\n        \n        x = self.conv0(x)\n        print(f'after conv0\\t\\t: {x.size()}')\n        \n        x = x.permute(0, 2, 3, 1)\n        print(f'after permute\\t\\t: {x.size()}')\n        \n        x = self.norm0(x)\n        print(f'after norm0\\t\\t: {x.size()}')\n        \n        x = x.permute(0, 3, 1, 2)\n        print(f'after permute\\t\\t: {x.size()}')\n        \n        x = self.conv1(x)\n        print(f'after conv1\\t\\t: {x.size()}')\n        \n        x = self.gelu(x)\n        print(f'after gelu\\t\\t: {x.size()}')\n        \n        x = self.conv2(x)\n        print(f'after conv2\\t\\t: {x.size()}')\n\n        x = x.permute(0, 2, 3, 1)\n        print(f'after permute\\t\\t: {x.size()}')\n        \n        x = self.norm1(x)\n        print(f'after norm1\\t\\t: {x.size()}')\n        \n        x = x.permute(0, 3, 1, 2)\n        print(f'after permute\\t\\t: {x.size()}')\n        \n        x = self.downsample(x)  #(2)\n        print(f'after downsample\\t: {x.size()}')\n        \n        x = x + residual  #(3)\n        print(f'after summation\\t\\t: {x.size()}')\n        \n        return x<\/code><\/pre>\n<p 
class=\"wp-block-paragraph\">Now let\u2019s test the <code>ConvNeXtBlockTransition<\/code> class above using the following codeblock. Suppose we are about to implement the first ConvNeXt block in stage <em>res3<\/em>. To do so, we can simply instantiate the transition block with <code>in_channels=96<\/code> and <code>out_channels=192<\/code> before eventually passing a dummy tensor of size 1\u00d796\u00d756\u00d756 through it.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\"># Codeblock 5\nconvnext_block_transition_test = ConvNeXtBlockTransition(in_channels=96, \n                                                         out_channels=192)\nx_test = torch.rand(1, 96, 56, 56)\n\nout_test = convnext_block_transition_test(x_test)<\/code><\/pre>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-markup\"># Codeblock 5 Output\noriginal            : torch.Size([1, 96, 56, 56])\nresidual after proj : torch.Size([1, 192, 28, 28])  #(1)\nafter conv0         : torch.Size([1, 192, 56, 56])  #(2)\nafter permute       : torch.Size([1, 56, 56, 192])\nafter norm0         : torch.Size([1, 56, 56, 192])\nafter permute       : torch.Size([1, 192, 56, 56])\nafter conv1         : torch.Size([1, 768, 56, 56])\nafter gelu          : torch.Size([1, 768, 56, 56])\nafter conv2         : torch.Size([1, 192, 56, 56])  #(3)\nafter permute       : torch.Size([1, 56, 56, 192])\nafter norm1         : torch.Size([1, 56, 56, 192])\nafter permute       : torch.Size([1, 192, 56, 56])\nafter downsample    : torch.Size([1, 192, 28, 28])  #(4)\nafter summation     : torch.Size([1, 192, 28, 28])  #(5)<\/code><\/pre>\n<p class=\"wp-block-paragraph\">You can see in the resulting output that our projection layer directly maps the 1\u00d796\u00d756\u00d756 residual tensor to 1\u00d7192\u00d728\u00d728 as shown at line <code>#(1)<\/code>. Meanwhile, the main tensor <code>x<\/code> needs to be processed by the other layers we initialized earlier to achieve this shape. 
The steps we performed from line <code>#(2)<\/code> to <code>#(3)<\/code> on the <code>x<\/code> tensor are basically the same as those in the <code>ConvNeXtBlock<\/code> class. At this point the number of channels already matches what we need (192). The spatial dimensions are then halved once the tensor is processed by the <code>downsample<\/code> layer (<code>#(4)<\/code>). Now that the shapes of <code>x<\/code> and <code>residual<\/code> match, we can finally perform the element-wise summation (<code>#(5)<\/code>).<\/p>\n<hr class=\"wp-block-separator has-alpha-channel-opacity is-style-dotted\">\n<h3 class=\"wp-block-heading\">The Entire ConvNeXt Architecture<\/h3>\n<p class=\"wp-block-paragraph\">Now that the <code>ConvNeXtBlock<\/code> and <code>ConvNeXtBlockTransition<\/code> classes are ready to use, we can start to construct the entire ConvNeXt architecture. Before we do that, I would like to introduce some config parameters first. See Codeblock 6 below.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\"># Codeblock 6\nIN_CHANNELS  = 3     #(1)\nIMAGE_SIZE   = 224   #(2)\n\nNUM_BLOCKS   = [3, 3, 9, 3]         #(3)\nOUT_CHANNELS = [96, 192, 384, 768]  #(4)\nNUM_CLASSES  = 1000  #(5)<\/code><\/pre>\n<p class=\"wp-block-paragraph\">The first two describe the dimensions of the input image. As shown at lines <code>#(1)<\/code> and <code>#(2)<\/code>, here we set <code>IN_CHANNELS<\/code> to 3 and <code>IMAGE_SIZE<\/code> to 224 since by default ConvNeXt accepts a batch of RGB images of that size. The next ones are related to the model configuration. In this case, I set the number of ConvNeXt blocks in each stage to <code>[3, 3, 9, 3]<\/code> (<code>#(3)<\/code>) and the corresponding number of output channels to <code>[96, 192, 384, 768]<\/code> (<code>#(4)<\/code>) since I want to implement the ConvNeXt-T variant. 
You can change these numbers according to the configurations provided by the original paper, shown in Figure 7. Finally, we set the number of neurons in the output layer to 1000, which corresponds to the number of classes in the dataset we train the model on (<code>#(5)<\/code>).<\/p>\n<figure class=\"wp-block-image\"><img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/05\/1EXPODqjCzuvV3CYMPv1H1w.png?ssl=1\" alt=\"\" class=\"wp-image-603178\"><figcaption class=\"wp-element-caption\">Figure 7. The ConvNeXt variants\u00a0[1].<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">We will now implement the entire architecture in the <code>ConvNeXt<\/code> class shown in Codeblock 7a and 7b below. The following <code>__init__()<\/code> method might seem a bit complicated at first glance, but don\u2019t worry as I\u2019ll explain it thoroughly.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\"># Codeblock 7a\nclass ConvNeXt(nn.Module):\n    def __init__(self):\n        super().__init__()\n        \n        self.stem = nn.Conv2d(in_channels=IN_CHANNELS,    #(1)\n                              out_channels=OUT_CHANNELS[0],\n                              kernel_size=4,\n                              stride=4,\n                             )\n\n        self.normstem = nn.LayerNorm(normalized_shape=OUT_CHANNELS[0])  #(2)\n        \n        #(3)\n        self.res2 = nn.ModuleList()\n        for _ in range(NUM_BLOCKS[0]):\n            self.res2.append(ConvNeXtBlock(num_channels=OUT_CHANNELS[0]))\n        \n        #(4)\n        self.res3 = nn.ModuleList([ConvNeXtBlockTransition(in_channels=OUT_CHANNELS[0], \n                                                           out_channels=OUT_CHANNELS[1])])\n        for _ in range(NUM_BLOCKS[1]-1):\n            self.res3.append(ConvNeXtBlock(num_channels=OUT_CHANNELS[1]))\n\n        #(5)\n        self.res4 = 
nn.ModuleList([ConvNeXtBlockTransition(in_channels=OUT_CHANNELS[1], \n                                                           out_channels=OUT_CHANNELS[2])])\n        for _ in range(NUM_BLOCKS[2]-1):\n            self.res4.append(ConvNeXtBlock(num_channels=OUT_CHANNELS[2]))\n\n        #(6)\n        self.res5 = nn.ModuleList([ConvNeXtBlockTransition(in_channels=OUT_CHANNELS[2], \n                                                           out_channels=OUT_CHANNELS[3])])\n        for _ in range(NUM_BLOCKS[3]-1):\n            self.res5.append(ConvNeXtBlock(num_channels=OUT_CHANNELS[3]))\n\n                \n        self.avgpool = nn.AdaptiveAvgPool2d(output_size=(1,1))  #(7)\n        self.normpool = nn.LayerNorm(normalized_shape=OUT_CHANNELS[3])  #(8)\n        self.fc = nn.Linear(in_features=OUT_CHANNELS[3],        #(9)\n                            out_features=NUM_CLASSES)\n        \n        self.relu = nn.ReLU()<\/code><\/pre>\n<p class=\"wp-block-paragraph\">The first thing we do here is initialize the <em>stem<\/em> stage (<code>#(1)<\/code>), which is essentially just a convolution layer with a 4\u00d74 kernel and stride 4. This configuration reduces the height and width of the image by a factor of 4 (from 224 to 56 in our case), where every single pixel in the output tensor represents a non-overlapping 4\u00d74 patch of the input tensor. For the subsequent stages, we need to wrap the corresponding ConvNeXt blocks with <code>nn.ModuleList()<\/code>. For stages <em>res3<\/em> (<code>#(4)<\/code>), <em>res4<\/em> (<code>#(5)<\/code>) and <em>res5<\/em> (<code>#(6)<\/code>) we place a <code>ConvNeXtBlockTransition<\/code> at the beginning of each list as a \u201cbridge\u201d between stages. We don\u2019t do this for stage <em>res2<\/em> since the tensor produced by the <em>stem<\/em> stage is already compatible with it (<code>#(3)<\/code>). 
Next, we initialize an <code>nn.AdaptiveAvgPool2d<\/code> layer, which will be used to reduce the spatial dimensions of the tensor to 1\u00d71 by averaging each channel over all of its spatial positions (<code>#(7)<\/code>). In fact, this is the exact same process used by ResNet to prepare the tensor from the last convolution layer so that it matches the shape required by the subsequent output layer (<code>#(9)<\/code>). Additionally, don\u2019t forget to initialize two layer normalization layers, which I refer to as <code>normstem<\/code> (<code>#(2)<\/code>) and <code>normpool<\/code> (<code>#(8)<\/code>). These two layers will be placed right after the <code>stem<\/code> stage and the <code>avgpool<\/code> layer, respectively.<\/p>\n<p class=\"wp-block-paragraph\">The <code>forward()<\/code> method is pretty straightforward. All we need to do in the following code is place the layers one after another. Keep in mind that since the ConvNeXt blocks are stored in lists, we need to call them iteratively with loops as seen at lines <code>#(1\u20134)<\/code>. 
Additionally, don\u2019t forget to flatten the pooled tensor (<code>#(5)<\/code>) so that it is compatible with the subsequent fully-connected layer (<code>#(6)<\/code>).<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\"># Codeblock 7b\n    def forward(self, x):\n        print(f'original\\t: {x.size()}')\n        \n        x = self.relu(self.stem(x))\n        print(f'after stem\\t: {x.size()}')\n\n        x = x.permute(0, 2, 3, 1)\n        print(f'after permute\\t: {x.size()}')\n        \n        x = self.normstem(x)\n        print(f'after normstem\\t: {x.size()}')\n        \n        x = x.permute(0, 3, 1, 2)\n        print(f'after permute\\t: {x.size()}')\n        \n        print()\n        for i, block in enumerate(self.res2):    #(1)\n            x = block(x)\n            print(f'after res2 #{i}\\t: {x.size()}')\n        \n        print()\n        for i, block in enumerate(self.res3):    #(2)\n            x = block(x)\n            print(f'after res3 #{i}\\t: {x.size()}')\n        \n        print()\n        for i, block in enumerate(self.res4):    #(3)\n            x = block(x)\n            print(f'after res4 #{i}\\t: {x.size()}')\n        \n        print()\n        for i, block in enumerate(self.res5):    #(4)\n            x = block(x)\n            print(f'after res5 #{i}\\t: {x.size()}')\n        \n        print()\n        x = self.avgpool(x)\n        print(f'after avgpool\\t: {x.size()}')\n\n        x = x.permute(0, 2, 3, 1)\n        print(f'after permute\\t: {x.size()}')\n        \n        x = self.normpool(x)\n        print(f'after normpool\\t: {x.size()}')\n        \n        x = x.permute(0, 3, 1, 2)\n        print(f'after permute\\t: {x.size()}')\n        \n        x = x.reshape(x.shape[0], -1)             #(5)\n        print(f'after reshape\\t: {x.size()}')\n        \n        x = self.fc(x)\n        print(f'after fc\\t: {x.size()}')          #(6)\n        \n        return x<\/code><\/pre>\n<p 
class=\"wp-block-paragraph\">Now for the moment of truth, let\u2019s see if we have correctly implemented the entire ConvNeXt model by running the following code. Here I pass a tensor of size 1\u00d73\u00d7224\u00d7224 to the network, simulating a batch of a single RGB image of size 224\u00d7224.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\"># Codeblock 8\nconvnext_test = ConvNeXt()\n\nx_test   = torch.rand(1, IN_CHANNELS, IMAGE_SIZE, IMAGE_SIZE)\nout_test = convnext_test(x_test)<\/code><\/pre>\n<p class=\"wp-block-paragraph\">The following output suggests that our implementation is correct, as the behavior of the network aligns with the architectural design shown in Figure 6. The spatial dimensions of the feature map get smaller as we go deeper into the network, while the number of channels increases, thanks to the <code>ConvNeXtBlockTransition<\/code> blocks we placed at the beginning of stages <em>res3<\/em> (<code>#(1)<\/code>), <em>res4<\/em> (<code>#(2)<\/code>), and <em>res5<\/em> (<code>#(3)<\/code>). 
The <code>avgpool<\/code> layer then correctly downsampled the spatial dimension to 1\u00d71 (<code>#(4)<\/code>), allowing it to be connected to the output layer (<code>#(5)<\/code>).<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-markup\"># Codeblock 8 Output\noriginal       : torch.Size([1, 3, 224, 224])\nafter stem     : torch.Size([1, 96, 56, 56])\nafter permute  : torch.Size([1, 56, 56, 96])\nafter normstem : torch.Size([1, 56, 56, 96])\nafter permute  : torch.Size([1, 96, 56, 56])\n\nafter res2 #0  : torch.Size([1, 96, 56, 56])\nafter res2 #1  : torch.Size([1, 96, 56, 56])\nafter res2 #2  : torch.Size([1, 96, 56, 56])\n\nafter res3 #0  : torch.Size([1, 192, 28, 28])  #(1)\nafter res3 #1  : torch.Size([1, 192, 28, 28])\nafter res3 #2  : torch.Size([1, 192, 28, 28])\n\nafter res4 #0  : torch.Size([1, 384, 14, 14])  #(2)\nafter res4 #1  : torch.Size([1, 384, 14, 14])\nafter res4 #2  : torch.Size([1, 384, 14, 14])\nafter res4 #3  : torch.Size([1, 384, 14, 14])\nafter res4 #4  : torch.Size([1, 384, 14, 14])\nafter res4 #5  : torch.Size([1, 384, 14, 14])\nafter res4 #6  : torch.Size([1, 384, 14, 14])\nafter res4 #7  : torch.Size([1, 384, 14, 14])\nafter res4 #8  : torch.Size([1, 384, 14, 14])\n\nafter res5 #0  : torch.Size([1, 768, 7, 7])    #(3)\nafter res5 #1  : torch.Size([1, 768, 7, 7])\nafter res5 #2  : torch.Size([1, 768, 7, 7])\n\nafter avgpool  : torch.Size([1, 768, 1, 1])    #(4)\nafter permute  : torch.Size([1, 1, 1, 768])\nafter normpool : torch.Size([1, 1, 1, 768])\nafter permute  : torch.Size([1, 768, 1, 1])\nafter reshape  : torch.Size([1, 768])\nafter fc       : torch.Size([1, 1000])         #(5)<\/code><\/pre>\n<hr class=\"wp-block-separator has-alpha-channel-opacity is-style-dotted\">\n<h2 class=\"wp-block-heading\">Ending<\/h2>\n<p class=\"wp-block-paragraph\">Well, that was pretty much everything about the theory and the implementation of the ConvNeXt architecture. 
Again, I do acknowledge that the code I demonstrate above might not fully capture everything since this article is intended to cover the general idea of the model. So, I highly recommend you read the original implementation by Meta\u2019s researchers [2] if you want to know more about the intricate details.<\/p>\n<p class=\"wp-block-paragraph\">I hope you find this article useful. Thanks for reading!<\/p>\n<p class=\"wp-block-paragraph\"><em>P.S. the notebook used in this article is available on my GitHub repo. See the link at reference number [7].<\/em><\/p>\n<hr class=\"wp-block-separator has-alpha-channel-opacity is-style-dotted\">\n<h2 class=\"wp-block-heading\">References<\/h2>\n<p class=\"wp-block-paragraph\">[1] Zhuang Liu <em>et al<\/em>. A ConvNet for the 2020s. Arxiv. <a href=\"https:\/\/arxiv.org\/pdf\/2201.03545\" rel=\"noreferrer noopener\" target=\"_blank\">https:\/\/arxiv.org\/pdf\/2201.03545<\/a> [Accessed January 18, 2025].<\/p>\n<p class=\"wp-block-paragraph\">[2] facebookresearch. ConvNeXt. GitHub. <a href=\"https:\/\/github.com\/facebookresearch\/ConvNeXt\/blob\/main\/models\/convnext.py\" rel=\"noreferrer noopener\" target=\"_blank\">https:\/\/github.com\/facebookresearch\/ConvNeXt\/blob\/main\/models\/convnext.py<\/a> [Accessed January 18, 2025].<\/p>\n<p class=\"wp-block-paragraph\">[3] Kaiming He <em>et al<\/em>. Deep Residual Learning for Image Recognition. Arxiv. <a href=\"https:\/\/arxiv.org\/pdf\/1512.03385\" rel=\"noreferrer noopener\" target=\"_blank\">https:\/\/arxiv.org\/pdf\/1512.03385<\/a> [Accessed January 18, 2025].<\/p>\n<p class=\"wp-block-paragraph\">[4] Ze Liu <em>et al<\/em>. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. Arxiv. <a href=\"https:\/\/arxiv.org\/pdf\/2103.14030\" rel=\"noreferrer noopener\" target=\"_blank\">https:\/\/arxiv.org\/pdf\/2103.14030<\/a> [Accessed January 18, 2025].<\/p>\n<p class=\"wp-block-paragraph\">[5] Saining Xie <em>et al<\/em>. 
Aggregated Residual Transformations for Deep Neural Networks. Arxiv. <a href=\"https:\/\/arxiv.org\/pdf\/1611.05431\" rel=\"noreferrer noopener\" target=\"_blank\">https:\/\/arxiv.org\/pdf\/1611.05431<\/a> [Accessed January 18, 2025].<\/p>\n<p class=\"wp-block-paragraph\">[6] <a href=\"https:\/\/medium.com\/u\/9801a58700ac\" target=\"_blank\" rel=\"noreferrer noopener\">Muhammad Ardi<\/a>. Paper Walkthrough: Residual Network (ResNet). Python in Plain English. <a href=\"https:\/\/python.plainenglish.io\/paper-walkthrough-residual-network-resnet-62af58d1c521\" rel=\"noreferrer noopener\" target=\"_blank\">https:\/\/python.plainenglish.io\/paper-walkthrough-residual-network-resnet-62af58d1c521<\/a> [Accessed January 19, 2025].<\/p>\n<p class=\"wp-block-paragraph\">[7] MuhammadArdiPutra. The CNN That Challenges ViT\u200a\u2014\u200aConvNeXt. GitHub. <a href=\"https:\/\/github.com\/MuhammadArdiPutra\/medium_articles\/blob\/main\/The%20CNN%20That%20Challenges%20ViT%20-%20ConvNeXt.ipynb\" rel=\"noreferrer noopener\" target=\"_blank\">https:\/\/github.com\/MuhammadArdiPutra\/medium_articles\/blob\/main\/The%20CNN%20That%20Challenges%20ViT%20-%20ConvNeXt.ipynb<\/a> [Accessed January 24, 2025].<\/p>\n<p class=\"wp-block-paragraph\">\n<p>The post <a href=\"https:\/\/towardsdatascience.com\/the-cnn-that-challenges-vit\/\">The CNN That Challenges ViT<\/a> appeared first on <a href=\"https:\/\/towardsdatascience.com\/\">Towards Data Science<\/a>.<\/p>\n<\/div>\n<p> \t<BR><br \/>\n <BR><\/BR><br \/>\n    Muhammad Ardi<br \/>\n \t<BR><br \/>\n<BR><\/BR><br \/>\n<a href=\"https:\/\/towardsdatascience.com\/the-cnn-that-challenges-vit\/\">Go to original source<\/a><br \/>\n \t<BR><br \/>\n <BR><\/BR><\/p>\n","protected":false},"excerpt":{"rendered":"<p>The CNN That Challenges ViT Introduction The invention of ViT (Vision Transformer) causes us to think that CNNs are obsolete.\u200a\u200aBut is this really true? 
It is widely believed that the impressive performance of ViT comes primarily from its transformer-based architecture. However, researchers from Meta argued that it\u2019s not entirely true. If we take a closer [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[62,69,221,67,88,1072,1780],"tags":[267,108,2577],"class_list":["post-3594","post","type-post","status-publish","format-standard","hentry","category-aimldsaimlds","category-artificial-intelligence","category-computer-vision","category-deep-dives","category-deep-learning","category-image-processing","category-neural-network","tag-but","tag-my","tag-vit"],"_links":{"self":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/3594"}],"collection":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/comments?post=3594"}],"version-history":[{"count":0,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/3594\/revisions"}],"wp:attachment":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/media?parent=3594"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/categories?post=3594"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/tags?post=3594"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}