{"id":3371,"date":"2025-04-26T07:02:44","date_gmt":"2025-04-26T07:02:44","guid":{"rendered":"https:\/\/mailitics.com\/index.php\/2025\/04\/26\/behind-the-magic-how-tensors-drive-transformers\/"},"modified":"2025-04-26T07:02:44","modified_gmt":"2025-04-26T07:02:44","slug":"behind-the-magic-how-tensors-drive-transformers","status":"publish","type":"post","link":"https:\/\/mailitics.com\/index.php\/2025\/04\/26\/behind-the-magic-how-tensors-drive-transformers\/","title":{"rendered":"Behind the Magic: How Tensors Drive Transformers"},"content":{"rendered":"<div>\n<h2 class=\"wp-block-heading\"><mdspan datatext=\"el1745540862691\" class=\"mdspan-comment\">Introduction<\/mdspan><\/h2>\n<p class=\"wp-block-paragraph\">Transformers have changed the way artificial intelligence works, especially in understanding language and learning from data. At the core of these models are <strong>tensors<\/strong> (multi-dimensional arrays that generalize vectors and matrices). As data moves through the different parts of a Transformer, these tensors undergo a series of transformations that help the model make sense of things like sentences or images. 
Learning how tensors work inside Transformers can help you understand how today\u2019s smartest AI systems actually work and think.<\/p>\n<h2 class=\"wp-block-heading\">What This Article Covers and What It\u00a0Doesn\u2019t<br \/>\n<\/h2>\n<p class=\"wp-block-paragraph\"><img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/s.w.org\/images\/core\/emoji\/15.0.3\/72x72\/2705.png?ssl=1\" alt=\"\u2705\" class=\"wp-smiley\" style=\"height: 1em; max-height: 1em;\"> <strong>This Article IS About:<\/strong><\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">The flow of tensors from input to output within a Transformer model.<\/li>\n<li class=\"wp-block-list-item\">Ensuring dimensional coherence throughout the computational process.<\/li>\n<li class=\"wp-block-list-item\">The step-by-step transformations that tensors undergo in various Transformer layers.<\/li>\n<\/ul>\n<p class=\"wp-block-paragraph\"><img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/s.w.org\/images\/core\/emoji\/15.0.3\/72x72\/274c.png?ssl=1\" alt=\"\u274c\" class=\"wp-smiley\" style=\"height: 1em; max-height: 1em;\"> <strong>This Article IS NOT About:<\/strong><\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">A general introduction to Transformers or deep learning.<\/li>\n<li class=\"wp-block-list-item\">Detailed architecture of Transformer models.<\/li>\n<li class=\"wp-block-list-item\">Training process or hyper-parameter tuning of Transformers.<\/li>\n<\/ul>\n<h2 class=\"wp-block-heading\">How Tensors Act Within Transformers<\/h2>\n<p class=\"wp-block-paragraph\">A Transformer consists of two main components:<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">\n<strong>Encoder:<\/strong> Processes input data, capturing contextual relationships to create meaningful representations.<\/li>\n<li class=\"wp-block-list-item\">\n<strong>Decoder:<\/strong> Utilizes these representations to generate coherent 
output, predicting each element sequentially.<\/li>\n<\/ul>\n<p class=\"wp-block-paragraph\">Tensors are the fundamental data structures that pass through these components, undergoing multiple transformations whose shapes must stay consistent from one layer to the next for information to flow properly.<\/p>\n<figure class=\"wp-block-image aligncenter size-full\"><img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/04\/image-119.png?ssl=1\" alt=\"\" class=\"wp-image-601995\"><figcaption class=\"wp-element-caption\">Image From Research Paper: Transformer standard architecture<\/figcaption><\/figure>\n<h2 class=\"wp-block-heading\">Input Embedding Layer<\/h2>\n<p class=\"wp-block-paragraph\">Before entering the Transformer, raw input tokens (words, subwords, or characters) are converted into dense vector representations through the <strong>embedding layer<\/strong>. This layer functions as a lookup table that maps each token to a vector, capturing its semantic relationships with other tokens.<\/p>\n<figure class=\"wp-block-image\"><img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/04\/1KKqDEGM30qwWfG5CpZvQWQ.png?ssl=1\" alt=\"\" class=\"wp-image-602001\"><figcaption class=\"wp-element-caption\">Image by author: Tensors passing through Embedding layer<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">For a batch of five sentences, each with a sequence length of 12 tokens, and an embedding dimension of 768, the tensor shape is:<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">\n<strong>Tensor shape:<\/strong> <code>[batch_size, seq_len, embedding_dim] \u2192 [5, 12, 768]<\/code>\n<\/li>\n<\/ul>\n<p class=\"wp-block-paragraph\">After embedding, <strong>positional encoding<\/strong> is added element-wise to the embeddings, so order information is preserved without altering the tensor shape.<\/p>\n<figure class=\"wp-block-image\"><img 
data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/04\/0jL25lqc5PnSnsTuh.png?ssl=1\" alt=\"\" class=\"wp-image-602003\"><figcaption class=\"wp-element-caption\">Modified Image from Research Paper: Situation of the\u00a0workflow<\/figcaption><\/figure>\n<h2 class=\"wp-block-heading\">Multi-Head Attention Mechanism<br \/>\n<\/h2>\n<p class=\"wp-block-paragraph\">One of the most critical components of the Transformer is the <strong>Multi-Head Attention (MHA) mechanism<\/strong>. It operates on three matrices derived from input embeddings:<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\"><strong>Query (Q)<\/strong><\/li>\n<li class=\"wp-block-list-item\"><strong>Key (K)<\/strong><\/li>\n<li class=\"wp-block-list-item\"><strong>Value (V)<\/strong><\/li>\n<\/ul>\n<p class=\"wp-block-paragraph\">These matrices are generated using learnable weight matrices:<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">\n<strong>Wq, Wk, Wv<\/strong> of shape <code>[embedding_dim, d_model]<\/code> (e.g., <code>[768, 512]<\/code>).<\/li>\n<li class=\"wp-block-list-item\">The resulting Q, K, V matrices have dimensions\u00a0<br \/><code>[batch_size, seq_len, d_model]<\/code>.<\/li>\n<\/ul>\n<figure class=\"wp-block-image\"><img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/04\/10nejflECL6KXEDkFnVa8xw.png?ssl=1\" alt=\"\" class=\"wp-image-602009\"><figcaption class=\"wp-element-caption\">Image by author: Table showing the shapes\/dimensions of Embedding, Q, K, V\u00a0tensors<\/figcaption><\/figure>\n<h2 class=\"wp-block-heading\">Splitting Q, K, V into Multiple\u00a0Heads<br \/>\n<\/h2>\n<p class=\"wp-block-paragraph\">For effective parallelization and improved learning, MHA splits Q, K, and V into multiple heads. 
Suppose we have 8 attention heads:<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">Each head operates on a subspace of dimension <code>d_model \/ head_count<\/code> (here 512 \/ 8 = 64).<\/li>\n<\/ul>\n<figure class=\"wp-block-image size-full\"><img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/04\/image-120.png?ssl=1\" alt=\"\" class=\"wp-image-601996\"><figcaption class=\"wp-element-caption\">Image by author: Multihead Attention<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">The reshaped tensor dimensions are <code>[batch_size, seq_len, head_count, d_model \/ head_count]<\/code>.<\/li>\n<li class=\"wp-block-list-item\">Example: <code>[5, 12, 8, 64]<\/code> \u2192 rearranged to <code>[5, 8, 12, 64]<\/code> so that each head receives its own 64-dimensional slice of every token in the sequence.<\/li>\n<\/ul>\n<figure class=\"wp-block-image\"><img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/04\/1zE0zzeWERCqdqlTtl8VmOw.png?ssl=1\" alt=\"\" class=\"wp-image-602006\"><figcaption class=\"wp-element-caption\">Image by author: Reshaping the\u00a0tensors<\/figcaption><\/figure>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">Each head thus gets its own share of Q, K, and V: Qi, Ki, Vi.<\/li>\n<\/ul>\n<figure class=\"wp-block-image\"><img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/04\/1PGZ8nbb2zUDby4ee8t3YUQ.png?ssl=1\" alt=\"\" class=\"wp-image-602002\"><figcaption class=\"wp-element-caption\">Image by author: Each Qi, Ki, Vi sent to a different head<\/figcaption><\/figure>\n<h2 class=\"wp-block-heading\">Attention Calculation<\/h2>\n<p class=\"wp-block-paragraph\">Each head computes attention using the formula:<\/p>\n<figure class=\"wp-block-image\"><img data-recalc-dims=\"1\" 
decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/04\/1sWm_ysBMu7G81YmaAXviTQ.png?ssl=1\" alt=\"\" class=\"wp-image-602000\"><\/figure>\n<p class=\"wp-block-paragraph\">Once attention is computed for all heads, the outputs are concatenated and passed through a linear transformation, restoring the initial tensor shape.<\/p>\n<figure class=\"wp-block-image\"><img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/04\/1BejFx0U948P_NqaYkbik_Q.png?ssl=1\" alt=\"\" class=\"wp-image-602004\"><figcaption class=\"wp-element-caption\">Image by author: Concatenating the output of all\u00a0heads<\/figcaption><\/figure>\n<figure class=\"wp-block-image\"><img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/04\/0uZ6sOiaPC6Ja-wBO.png?ssl=1\" alt=\"\" class=\"wp-image-602007\"><figcaption class=\"wp-element-caption\">Modified Image From Research Paper: Situation of the\u00a0workflow<\/figcaption><\/figure>\n<h2 class=\"wp-block-heading\">Residual Connection and Normalization<\/h2>\n<p class=\"wp-block-paragraph\">After the multi-head attention mechanism, a <strong>residual connection<\/strong> is added, followed by <strong>layer normalization<\/strong>:<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">Residual connection: <code>Output = Embedding Tensor + Multi-Head Attention Output<\/code>\n<\/li>\n<li class=\"wp-block-list-item\">Normalization: <code>(Output \u2212 \u03bc) \/ \u03c3<\/code> to stabilize training<\/li>\n<li class=\"wp-block-list-item\">Tensor shape remains <code>[batch_size, seq_len, embedding_dim]<\/code>\n<\/li>\n<\/ul>\n<figure class=\"wp-block-image aligncenter is-resized\"><img data-recalc-dims=\"1\" decoding=\"async\" 
src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/04\/0km8SGQccMbUGJC5q.png?ssl=1\" alt=\"\" class=\"wp-image-601998\" style=\"width:300px;height:auto\"><figcaption class=\"wp-element-caption\">Image by author: Residual Connection<\/figcaption><\/figure>\n<h2 class=\"wp-block-heading\">Masked Multi-Head\u00a0Attention<\/h2>\n<p class=\"wp-block-paragraph\">In the decoder, <strong>Masked Multi-Head Attention<\/strong> ensures that each token attends only to itself and to earlier tokens, preventing leakage of future information.<\/p>\n<figure class=\"wp-block-image is-resized\"><img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/04\/14akssK3Kf4f5pHLMzZLaEg.png?ssl=1\" alt=\"\" class=\"wp-image-602008\" style=\"width:680px;height:auto\"><figcaption class=\"wp-element-caption\">Modified Image From Research Paper: Masked Multi-Head Attention<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">This is achieved using a lower triangular mask of shape <code>[seq_len, seq_len]<\/code> with <code>-inf<\/code> values above the diagonal. The mask is added to the attention scores before the Softmax, so future positions receive zero attention weight.<\/p>\n<figure class=\"wp-block-image aligncenter is-resized\"><img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/04\/0oIm0vSrgdWWPLUHf.png?ssl=1\" alt=\"\" class=\"wp-image-601999\" style=\"width:504px;height:auto\"><figcaption class=\"wp-element-caption\">Image by author: Mask matrix<\/figcaption><\/figure>\n<h2 class=\"wp-block-heading\">Cross-Attention in\u00a0Decoding<\/h2>\n<p class=\"wp-block-paragraph\">Since the decoder does not see the input sentence directly, it uses <strong>cross-attention<\/strong> over the encoder\u2019s output to refine its predictions. 
Here:<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">The decoder generates queries <strong>(Qd)<\/strong> from its input (<code>[batch_size, target_seq_len, embedding_dim]<\/code>).<\/li>\n<li class=\"wp-block-list-item\">The encoder output serves as keys <strong>(Ke)<\/strong> and values <strong>(Ve)<\/strong>.<\/li>\n<li class=\"wp-block-list-item\">The decoder computes attention between <strong>Qd<\/strong> and <strong>Ke<\/strong>, extracting relevant context from the encoder\u2019s output.<\/li>\n<\/ul>\n<figure class=\"wp-block-image aligncenter\"><img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/04\/19vHxYETrWQQR0B6rrltXZg.png?ssl=1\" alt=\"\" class=\"wp-image-602005\"><figcaption class=\"wp-element-caption\">Modified Image From Research Paper: Cross Head Attention<\/figcaption><\/figure>\n<h2 class=\"wp-block-heading\">Conclusion<\/h2>\n<p class=\"wp-block-paragraph\">Transformers use <strong>tensors<\/strong> to help them learn and make smart decisions. As the data moves through the network, these tensors go through different steps\u2014like being turned into numbers the model can understand (embedding), focusing on important parts (attention), staying balanced (normalization), and being passed through layers that learn patterns (feed-forward). These changes keep the data in the right shape the whole time. 
By understanding how tensors move and change, we can get a better idea of how <a href=\"https:\/\/towardsdatascience.com\/tag\/ai-models\/\" title=\"AI models\">AI models<\/a> work and how they can understand and create human-like language.<\/p>\n<p>The post <a href=\"https:\/\/towardsdatascience.com\/behind-the-magic-how-tensors-drive-transformers\/\">Behind the Magic: How Tensors Drive Transformers<\/a> appeared first on <a href=\"https:\/\/towardsdatascience.com\/\">Towards Data Science<\/a>.<\/p>\n<\/div>\n<p> \t<BR><br \/>\n <BR><\/BR><br \/>\n    Ziad SALLOUM<br \/>\n \t<BR><br \/>\n<BR><\/BR><br \/>\n<a href=\"https:\/\/towardsdatascience.com\/behind-the-magic-how-tensors-drive-transformers\/\">Go to original source<\/a><br \/>\n \t<BR><br \/>\n <BR><\/BR><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Behind the Magic: How Tensors Drive Transformers Introduction Transformers have changed the way artificial intelligence works, especially in understanding language and learning from data. At the core of these models are tensors (multi-dimensional arrays that generalize vectors and matrices). 
As data moves through the different parts of a Transformer, these tensors [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[2487,62,2488,71,1463,2489,1287],"tags":[2490,1963,648],"class_list":["post-3371","post","type-post","status-publish","format-standard","hentry","category-ai-models","category-aimldsaimlds","category-embedding","category-large-language-models","category-multi-head-attention","category-tensor","category-transformers","tag-tensors","tag-transformer","tag-transformers"],"_links":{"self":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/3371"}],"collection":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/comments?post=3371"}],"version-history":[{"count":0,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/3371\/revisions"}],"wp:attachment":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/media?parent=3371"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/categories?post=3371"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/tags?post=3371"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}