{"id":2772,"date":"2025-04-01T07:02:23","date_gmt":"2025-04-01T07:02:23","guid":{"rendered":"https:\/\/mailitics.com\/index.php\/2025\/04\/01\/a-simple-implementation-of-the-attention-mechanism-from-scratch\/"},"modified":"2025-04-01T07:02:23","modified_gmt":"2025-04-01T07:02:23","slug":"a-simple-implementation-of-the-attention-mechanism-from-scratch","status":"publish","type":"post","link":"https:\/\/mailitics.com\/index.php\/2025\/04\/01\/a-simple-implementation-of-the-attention-mechanism-from-scratch\/","title":{"rendered":"A Simple Implementation of the Attention Mechanism from Scratch"},"content":{"rendered":"<div>\n<h2 class=\"wp-block-heading\"><mdspan datatext=\"el1743469379175\" class=\"mdspan-comment\">Introduction<\/mdspan><\/h2>\n<p class=\"wp-block-paragraph\">The <a href=\"https:\/\/towardsdatascience.com\/tag\/attention-mechanism\/\" title=\"Attention Mechanism\">Attention Mechanism<\/a> is often associated with the transformer architecture, but it was already used in RNNs. In Machine Translation or MT (e.g., English-Italian) tasks, when you want to predict the next Italian word, you need your model to focus on, or pay attention to, the English words that matter most for a good translation. 
<\/p>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/03\/1DtqI-w0CXu29NdXK28TPxQ.png?ssl=1\" alt=\"Attention in RNNs\" class=\"wp-image-600825\"><\/figure>\n<p class=\"wp-block-paragraph\">I will not go into the details of RNNs, but attention helped these models mitigate the vanishing gradient problem and capture more long-range dependencies among words.<\/p>\n<p class=\"wp-block-paragraph\">At a certain point, we understood that the attention mechanism was the only essential ingredient and that the RNN architecture around it was overkill. Hence, <a href=\"https:\/\/arxiv.org\/abs\/1706.03762\">Attention is All You Need!<\/a><\/p>\n<h2 class=\"wp-block-heading\">Self-Attention in Transformers<\/h2>\n<p class=\"wp-block-paragraph\">Classical attention indicates which words of the input sequence each word of the output sequence should focus on. This is important in sequence-to-sequence tasks like MT.<\/p>\n<p class=\"wp-block-paragraph\"><strong>Self-attention<\/strong> is a specific type of attention. It operates between any two elements of the same sequence and tells us how \u201ccorrelated\u201d the words of a sentence are with each other.<\/p>\n<p class=\"wp-block-paragraph\">For a given token (or word) in a sequence, self-attention generates a list of attention weights, one for every token in the sequence (the token itself included). 
This process is applied to each token in the sentence, which gives a matrix of attention weights (as in the picture).<\/p>\n<figure class=\"wp-block-image size-large is-resized\"><img decoding=\"async\" src=\"https:\/\/www.researchgate.net\/publication\/356242242\/figure\/fig3\/AS:1095657444122624@1638236499150\/Self-attention-weights-from-the-product-MSC-model-for-a-product-titled-Handmade-Vintage.ppm\" alt=\"\" style=\"width:371px;height:auto\"><\/figure>\n<p class=\"wp-block-paragraph\">This is the general idea. In practice things are a bit more complicated, because we want to add many learnable parameters to our neural network. Let\u2019s see how.<\/p>\n<h2 class=\"wp-block-heading\">K, V, Q representations<\/h2>\n<p class=\"wp-block-paragraph\">Our model input is a sentence like \u201c<em>my name is Marcello Politi<\/em>\u201d. Through <strong>tokenization<\/strong>, the sentence is converted into a list of numbers like [2, 6, 8, 3, 1].<\/p>\n<p class=\"wp-block-paragraph\">Before feeding the sentence into the transformer we need to create a dense representation for each token.<\/p>\n<p class=\"wp-block-paragraph\">How do we create this representation? We multiply each token (conceptually, its one-hot encoding) by an embedding matrix, which amounts to looking up one row of that matrix. The matrix is learned during training.<\/p>\n<p class=\"wp-block-paragraph\">Let\u2019s add some complexity now.<\/p>\n<p class=\"wp-block-paragraph\">For each token, we create 3 vectors instead of one. We call these vectors <em>key, value<\/em> and <em>query<\/em>. 
(We will see later how we create these 3 vectors.)<\/p>\n<figure class=\"wp-block-image size-full\"><img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/03\/image-195.png?ssl=1\" alt=\"\" class=\"wp-image-600812\"><\/figure>\n<p class=\"wp-block-paragraph\">Conceptually, each of these 3 vectors has a particular meaning:<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">The key vector represents the core information captured by the token.<\/li>\n<li class=\"wp-block-list-item\">The value vector captures the full information of the token.<\/li>\n<li class=\"wp-block-list-item\">The query vector is a question about token relevance for the current task.<\/li>\n<\/ul>\n<p class=\"wp-block-paragraph\">So the idea is that we focus on a particular token i and ask how important each token in the sentence is with respect to token i.<\/p>\n<p class=\"wp-block-paragraph\">This means that we take the vector q_i for token i (we ask a question regarding i) and combine it mathematically with the key vector k_j of every token in the sequence (in standard self-attention this includes j = i). This is like asking, at first glance, which tokens in the sequence look really important for understanding the meaning of token i.<\/p>\n<p class=\"wp-block-paragraph\">What is this magical mathematical operation?<\/p>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/substackcdn.com\/image\/fetch\/f_auto%2Cq_auto%3Agood%2Cfl_progressive%3Asteep\/https%253A%252F%252Fsubstack-post-media.s3.amazonaws.com%252Fpublic%252Fimages%252F8e197d7d-a2c7-4253-992b-256520cdafec_893x349.png?ssl=1\" alt=\"\"><\/figure>\n<p class=\"wp-block-paragraph\">We need to multiply (dot product) the query vector by each key vector and divide by a scaling factor, the square root of the key dimension. We do this for each key k_j. 
<\/p>\n<p class=\"wp-block-paragraph\">In this way, we obtain a score for each pair (q_i, k_j). We turn this list of scores into a probability distribution by applying a softmax operation to it. Great, now we have obtained the <strong>attention weights<\/strong>!<\/p>\n<p class=\"wp-block-paragraph\">With the attention weights, we know how important each token j is for understanding token i. So now we multiply the value vector v_j associated with each token by its weight and sum the resulting vectors. In this way we obtain the final <strong>context-aware vector<\/strong> of <strong>token_i<\/strong>.<\/p>\n<p class=\"wp-block-paragraph\">If we are computing the contextual dense vector of token_1 we calculate:<\/p>\n<p class=\"wp-block-paragraph\">z1 = a11*v1 + a12*v2 + \u2026 + a15*v5<\/p>\n<p class=\"wp-block-paragraph\">Where a1j are the computed attention weights, and v_j are the value vectors.<\/p>\n<p class=\"wp-block-paragraph\">Done! Almost\u2026<\/p>\n<p class=\"wp-block-paragraph\">I didn\u2019t cover how we obtained the vectors k, v and q of each token. We need to define some matrices w_k, w_v and w_q so that when we multiply:<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">token * w_k -&gt; k<\/li>\n<li class=\"wp-block-list-item\">token * w_q -&gt; q<\/li>\n<li class=\"wp-block-list-item\">token * w_v -&gt; v<\/li>\n<\/ul>\n<p class=\"wp-block-paragraph\">These 3 matrices are initialized at random and learned during training. This is one reason why modern models such as LLMs have so many parameters.<\/p>\n<h2 class=\"wp-block-heading\">Multi-head Self-Attention in Transformers (MHSA)<\/h2>\n<p class=\"wp-block-paragraph\">Are we sure that the previous self-attention mechanism is able to capture all important relationships among tokens (words) and create dense vectors of those tokens that really make sense?<\/p>\n<p class=\"wp-block-paragraph\">It might not always work perfectly. 
What if, to mitigate the error, we run the entire thing twice with two different sets of w_q, w_k and w_v matrices and somehow merge the 2 dense vectors obtained? That way, one self-attention pass might capture one relationship while the other captures a different one.<\/p>\n<p class=\"wp-block-paragraph\">Well, this is exactly what happens in MHSA. The case we just discussed has two heads because it has two sets of w_q, w_k and w_v matrices. We can have even more heads: 4, 8, 16, etc.<\/p>\n<p class=\"wp-block-paragraph\">The only complicated thing is that all these heads are handled in parallel: we process them all in the same computation using tensors.<\/p>\n<figure class=\"wp-block-image size-full\"><img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/03\/image-196.png?ssl=1\" alt=\"\" class=\"wp-image-600813\"><\/figure>\n<p class=\"wp-block-paragraph\">The way we merge the dense vectors of each head is simple: we concatenate them (so each head\u2019s vector must be smaller, such that the concatenation has the original dimension we wanted) and pass the resulting vector through another learnable matrix, w_o. 
<\/p>\n<h2 class=\"wp-block-heading\">Hands-on<\/h2>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">import torch<\/code><\/pre>\n<p class=\"wp-block-paragraph\">Suppose you have a sentence. After tokenization, each token (a word, for simplicity) corresponds to an index (number):<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">tokenized_sentence = torch.tensor([\n    2,  # my\n    6,  # name\n    8,  # is\n    3,  # marcello\n    1   # politi\n])\ntokenized_sentence<\/code><\/pre>\n<p class=\"wp-block-paragraph\">Before feeding the sentence into the transformer we need to create a dense representation for each token.<\/p>\n<p class=\"wp-block-paragraph\">How do we create this representation? We multiply each token by a matrix. 
This matrix is learned during training.<\/p>\n<p class=\"wp-block-paragraph\">Let\u2019s build this embedding matrix.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">torch.manual_seed(0)  # set a fixed seed for reproducibility\nembed = torch.nn.Embedding(10, 16)  # vocabulary size 10, embedding dimension 16<\/code><\/pre>\n<p class=\"wp-block-paragraph\">If we pass our tokenized sentence through the embedding layer, we obtain a dense representation of dimension 16 for each token:<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">sentence_embed = embed(tokenized_sentence).detach()\nsentence_embed<\/code><\/pre>\n<p class=\"wp-block-paragraph\">In order to use the attention mechanism we define 3 new matrices: w_q, w_k and w_v. When we multiply an input token by w_q, we obtain its vector q. 
The same goes for w_k and w_v.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">d = sentence_embed.shape[1]  # embedding dimension (16); each weight matrix has shape (16, 16)\n\nw_key = torch.rand(d, d)\nw_query = torch.rand(d, d)\nw_value = torch.rand(d, d)<\/code><\/pre>\n<p class=\"wp-block-paragraph\"><strong>Compute attention weights<\/strong><\/p>\n<p class=\"wp-block-paragraph\">Let\u2019s now compute the attention weights for only the first input token of the sentence.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">token1_embed = sentence_embed[0]\n\n# compute the three vectors associated with token1: q, k, v\nkey_1 = w_key.matmul(token1_embed)\nquery_1 = w_query.matmul(token1_embed)\nvalue_1 = w_value.matmul(token1_embed)\n\nprint(\"key vector for token1: \\n\", key_1)\nprint(\"query vector for 
token1: \\n\", query_1)\nprint(\"value vector for token1: \\n\", value_1)<\/code><\/pre>\n<p class=\"wp-block-paragraph\">We need to multiply the query vector associated with token1 (query_1) by the keys of all the tokens in the sentence.<\/p>\n<p class=\"wp-block-paragraph\">So now we need to compute the remaining keys (key_2, key_3, key_4, key_5). But wait, we can compute all of them at once by multiplying sentence_embed by the transpose of the w_key matrix.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">keys = sentence_embed.matmul(w_key.T)\nkeys[0]  # key vector of the first token; keys[1] is the second, and so on<\/code><\/pre>\n<p class=\"wp-block-paragraph\">Let\u2019s do the same thing with the values:<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">values = sentence_embed.matmul(w_value.T)\nvalues[0] 
# value vector of the first token; values[1] is the second, and so on<\/code><\/pre>\n<p class=\"wp-block-paragraph\">Let\u2019s compute the first part of the attention formula.<\/p>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/substackcdn.com\/image\/fetch\/f_auto%2Cq_auto%3Agood%2Cfl_progressive%3Asteep\/https%253A%252F%252Fsubstack-post-media.s3.amazonaws.com%252Fpublic%252Fimages%252F8e197d7d-a2c7-4253-992b-256520cdafec_893x349.png?ssl=1\" alt=\"\"><\/figure>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">import torch.nn.functional as F<\/code><\/pre>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\"># attention weights of the first token with respect to every token in the sentence\na1 = F.softmax(query_1.matmul(keys.T) \/ d**0.5, dim=0)\na1<\/code><\/pre>\n<p class=\"wp-block-paragraph\">With the attention weights we know how important each token is. 
So now we multiply the value vector associated with each token by its weight and sum the results.<\/p>\n<p class=\"wp-block-paragraph\">This gives the final context-aware vector of token_1.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">z1 = a1.matmul(values)  # weighted sum of the value vectors\nz1<\/code><\/pre>\n<p class=\"wp-block-paragraph\">In the same way we could compute the context-aware dense vectors of all the other tokens. So far we have always used the same matrices w_k, w_q, w_v. We say that we use one head.<\/p>\n<p class=\"wp-block-paragraph\">But we can have multiple triplets of matrices, that is, multiple heads. 
That\u2019s why it is called multi-head attention.<\/p>\n<p class=\"wp-block-paragraph\">The dense vectors that each head outputs for an input token are concatenated at the end and linearly transformed to get the final dense vector.<\/p>\n<h4 class=\"wp-block-heading\">Implementing Multi-head Self-Attention<\/h4>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">import torch\nimport torch.nn as nn\nimport torch.nn.functional as F\n\ntorch.manual_seed(0)  # fixed seed for reproducibility<\/code><\/pre>\n<p class=\"wp-block-paragraph\">Same steps as before\u2026<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\"># Tokenized sentence (same as before)\ntokenized_sentence = torch.tensor([2, 6, 8, 3, 1])  # [my, name, is, marcello, politi]\n\n# Embedding layer: vocab size = 10, 
embedding dim = 16\nembed = nn.Embedding(10, 16)\nsentence_embed = embed(tokenized_sentence).detach()  # Shape: [5, 16] (seq_len, embed_dim)<\/code><\/pre>\n<p class=\"wp-block-paragraph\">We\u2019ll define a multi-head attention mechanism with h heads (let\u2019s say 4 heads for this example). Each head will have its own w_q, w_k, and w_v matrices, and the outputs of all heads will be concatenated and passed through a final linear layer.<\/p>\n<p class=\"wp-block-paragraph\">Since the outputs of the heads will be concatenated, and we want a final dimension of d, the dimension of each head needs to be d\/h. Additionally, each concatenated vector will go through a linear transformation, so we need another matrix, w_output, as you can see in the formula.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">d = sentence_embed.shape[1]  # embed dimension 16\nh = 4  # Number of heads\nd_k = d \/\/ h  # Dimension per head (16 \/ 4 = 4)<\/code><\/pre>\n<p class=\"wp-block-paragraph\">Since we have 4 heads, we want 4 versions of each matrix. Instead of keeping 4 separate matrices, we add a dimension, which is the same thing but lets us do everything in one operation. 
(Imagine stacking the matrices on top of each other; it\u2019s the same thing.)<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\"># Define weight matrices for each head\nw_query = torch.rand(h, d, d_k)  # Shape: [4, 16, 4] (one d x d_k matrix per head)\nw_key = torch.rand(h, d, d_k)    # Shape: [4, 16, 4]\nw_value = torch.rand(h, d, d_k)  # Shape: [4, 16, 4]\nw_output = torch.rand(d, d)  # Final linear layer: [16, 16]<\/code><\/pre>\n<p class=\"wp-block-paragraph\">For simplicity, I\u2019m using torch\u2019s einsum. If you\u2019re not familiar with it, check out my\u00a0<a href=\"https:\/\/towardsdatascience.com\/understanding-einsteins-notation-and-einsum-multiplication-a690bd4da0b2\/\" target=\"_blank\" rel=\"noreferrer noopener\">blog post<\/a>.<\/p>\n<p class=\"wp-block-paragraph\">The einsum operation <code>torch.einsum('sd,hde-&gt;hse', sentence_embed, w_query)<\/code> in PyTorch uses letters to define how to multiply and rearrange tensors. 
Here\u2019s what each part means:<\/p>\n<ol class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">\n<strong>Input Tensors:<\/strong>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">\n<code>sentence_embed<\/code> with the notation <code>'sd'<\/code>:\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">\n<code>s<\/code> represents the number of words (sequence length), which is 5.<\/li>\n<li class=\"wp-block-list-item\">\n<code>d<\/code> represents the number of values per word (embedding size), which is 16.<\/li>\n<li class=\"wp-block-list-item\">The shape of this tensor is <code>[5, 16]<\/code>.<\/li>\n<\/ul>\n<\/li>\n<li class=\"wp-block-list-item\">\n<code>w_query<\/code> with the notation <code>'hde'<\/code>:\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">\n<code>h<\/code> represents the number of heads, which is 4.<\/li>\n<li class=\"wp-block-list-item\">\n<code>d<\/code> represents the embedding size, which again is 16.<\/li>\n<li class=\"wp-block-list-item\">\n<code>e<\/code> represents the new vector size per head (d_k), which is 4.<\/li>\n<li class=\"wp-block-list-item\">The shape of this tensor is <code>[4, 16, 4]<\/code>.<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<\/li>\n<li class=\"wp-block-list-item\">\n<strong>Output Tensor:<\/strong>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">The output has the notation <code>'hse'<\/code>:\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">\n<code>h<\/code> represents 4 heads.<\/li>\n<li class=\"wp-block-list-item\">\n<code>s<\/code> represents 5 words.<\/li>\n<li class=\"wp-block-list-item\">\n<code>e<\/code> represents 4 values per head.<\/li>\n<li class=\"wp-block-list-item\">The shape of the output tensor is <code>[4, 5, 4]<\/code>.<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<\/li>\n<\/ol>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\"># Compute Q, K, V for all tokens and all heads\n# sentence_embed: [5, 16] -&gt; Q: [4, 5, 4] (h, seq_len, d_k)\nqueries = torch.einsum('sd,hde-&gt;hse', sentence_embed, w_query)  # [h, seq_len, d_k]\nkeys = torch.einsum('sd,hde-&gt;hse', sentence_embed, w_key)       # [h, seq_len, d_k]\nvalues = torch.einsum('sd,hde-&gt;hse', sentence_embed, w_value)   # [h, seq_len, d_k]<\/code><\/pre>\n<p class=\"wp-block-paragraph\">The next einsum equation, <code>'hse,hek-&gt;hsk'<\/code>, performs a dot product between the queries (hse) and the transposed keys (hek) to obtain scores of shape [h, seq_len, seq_len], where:<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">h -&gt; Number of heads.<\/li>\n<li class=\"wp-block-list-item\">s and k -&gt; Sequence length (number of tokens).<\/li>\n<li class=\"wp-block-list-item\">e -&gt; Dimension of each head (d_k).<\/li>\n<\/ul>\n<p class=\"wp-block-paragraph\">The division by (d_k ** 0.5) scales the scores to stabilize gradients. 
Softmax is then applied to obtain attention weights:<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\"># Compute attention scores\nscores = torch.einsum('hse,hek-&gt;hsk', queries, keys.transpose(-2, -1)) \/ (d_k ** 0.5)  # [4, 5, 5]\nattention_weights = F.softmax(scores, dim=-1)  # [4, 5, 5]<\/code><\/pre>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\"># Apply attention weights\nhead_outputs = torch.einsum('hij,hjk-&gt;hik', attention_weights, values)  # [4, 5, 4]\nhead_outputs.shape<\/code><\/pre>\n<p class=\"wp-block-paragraph\">Now we concatenate the outputs of all the heads for every token:<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\"># Concatenate heads: [h, seq_len, d_k] -&gt; [seq_len, h, d_k] -&gt; [seq_len, h * d_k]\nconcat_heads = head_outputs.permute(1, 0, 2).reshape(sentence_embed.shape[0], -1)  # [5, 16]\nconcat_heads.shape<\/code><\/pre>\n<p class=\"wp-block-paragraph\">Finally, let\u2019s multiply by the last matrix, w_output, as in the formula above:<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">multihead_output = concat_heads.matmul(w_output)  # [5, 16] @ [16, 16] -&gt; [5, 16]\nprint(\"Multi-head attention output for token1:\\n\", multihead_output[0])<\/code><\/pre>\n<h2 class=\"wp-block-heading\">Final Thoughts<\/h2>\n<p class=\"wp-block-paragraph\">In this blog post I\u2019ve implemented a simple version of the attention mechanism. This is not how it is really implemented in modern frameworks, but my goal is to provide some insight so that anyone can understand how it works. 
In future articles I\u2019ll go through the entire implementation of a transformer architecture.<\/p>\n<p class=\"wp-block-paragraph\" id=\"a516\">Follow me on\u00a0<a href=\"https:\/\/towardsdatascience.com\/author\/marcellopoliti\/\">TDS<\/a>\u00a0if you like this article! <img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/s.w.org\/images\/core\/emoji\/15.0.3\/72x72\/1f601.png?ssl=1\" alt=\"\ud83d\ude01\" class=\"wp-smiley\" style=\"height: 1em; max-height: 1em;\"><\/p>\n<p class=\"wp-block-paragraph\" id=\"f739\"><img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/s.w.org\/images\/core\/emoji\/15.0.3\/72x72\/1f4bc.png?ssl=1\" alt=\"\ud83d\udcbc\" class=\"wp-smiley\" style=\"height: 1em; max-height: 1em;\">\u00a0<a href=\"https:\/\/www.linkedin.com\/in\/marcello-politi\/\" rel=\"noreferrer noopener\" target=\"_blank\">Linkedin<\/a>\u00a0| <img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/s.w.org\/images\/core\/emoji\/15.0.3\/72x72\/1f426.png?ssl=1\" alt=\"\ud83d\udc26\" class=\"wp-smiley\" style=\"height: 1em; max-height: 1em;\">\u00a0<a href=\"https:\/\/x.com\/Marcello_AI\" rel=\"noreferrer noopener\" target=\"_blank\">X (Twitter)<\/a>\u00a0|\u00a0<a href=\"https:\/\/emojiterra.com\/laptop-computer\/\" rel=\"noreferrer noopener\" target=\"_blank\"><img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/s.w.org\/images\/core\/emoji\/15.0.3\/72x72\/1f4bb.png?ssl=1\" alt=\"\ud83d\udcbb\" class=\"wp-smiley\" style=\"height: 1em; max-height: 1em;\"><\/a>\u00a0<a href=\"https:\/\/marcello-politi.super.site\/\" rel=\"noreferrer noopener\" target=\"_blank\">Website<\/a><\/p>\n<hr class=\"wp-block-separator has-alpha-channel-opacity is-style-dotted\">\n<p class=\"wp-block-paragraph\">Unless otherwise noted, images are by the author<\/p>\n<p>The post <a href=\"https:\/\/towardsdatascience.com\/a-simple-implementation-of-the-attention-mechanism-from-scratch\/\">A Simple Implementation of 
the Attention Mechanism from Scratch<\/a> appeared first on <a href=\"https:\/\/towardsdatascience.com\/\">Towards Data Science<\/a>.<\/p>\n<\/div>\n<p> \t<BR><br \/>\n <BR><\/BR><br \/>\n    Marcello Politi<br \/>\n \t<BR><br \/>\n<BR><\/BR><br \/>\n<a href=\"https:\/\/towardsdatascience.com\/a-simple-implementation-of-the-attention-mechanism-from-scratch\/\">Go to original source<\/a><br \/>\n \t<BR><br \/>\n <BR><\/BR><\/p>\n","protected":false},"excerpt":{"rendered":"<p>A Simple Implementation of the Attention Mechanism from Scratch Introduction The Attention Mechanism is often associated with the transformer architecture, but it was already used in RNNs. In Machine Translation or MT (e.g., English-Italian) tasks, when you want to predict the next Italian word, you need your model to focus, or pay attention, on the [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[62,69,2222,88,70,157,2223],"tags":[960,2224,2225],"class_list":["post-2772","post","type-post","status-publish","format-standard","hentry","category-aimldsaimlds","category-artificial-intelligence","category-attention-mechanism","category-deep-learning","category-machine-learning","category-python","category-self-attention","tag-attention","tag-sequence","tag-words"],"_links":{"self":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/2772"}],"collection":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/comments?post=2772"}],"version-history":[{"count":0,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/2772\/revisions"
}],"wp:attachment":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/media?parent=2772"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/categories?post=2772"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/tags?post=2772"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}