{"id":1197,"date":"2025-01-15T07:02:37","date_gmt":"2025-01-15T07:02:37","guid":{"rendered":"https:\/\/mailitics.com\/index.php\/2025\/01\/15\/deep-dive-into-kv-caching-in-mistral-7e0cea8409a1\/"},"modified":"2025-01-15T07:02:37","modified_gmt":"2025-01-15T07:02:37","slug":"deep-dive-into-kv-caching-in-mistral-7e0cea8409a1","status":"publish","type":"post","link":"https:\/\/mailitics.com\/index.php\/2025\/01\/15\/deep-dive-into-kv-caching-in-mistral-7e0cea8409a1\/","title":{"rendered":"Deep Dive into KV-Caching In Mistral"},"content":{"rendered":"<div>\n<h4>Ever wondered why the time to first token in LLMs is high but subsequent tokens are super\u00a0fast?<\/h4>\n<p>In this post, I dive into the details of KV-Caching used in Mistral, a topic I initially found quite daunting. However, as I delved deeper, it became a fascinating subject, especially when it explained why the time to first token (TTFT) in these language models is generally high\u200a\u2014\u200aa pattern I noticed during countless API calls\u00a0\ud83d\ude42.<\/p>\n<p>I\u2019ll cover:<\/p>\n<ol>\n<li><strong>What exactly is KV-Caching?<\/strong><\/li>\n<li><strong>The concept of the rolling cache\u00a0buffer<\/strong><\/li>\n<li><strong>The prefill and decode\u00a0stages<\/strong><\/li>\n<li><strong>Formulating attention masks with the help of the xFormers\u00a0library<\/strong><\/li>\n<\/ol>\n<h3><strong>KV-Caching: Avoiding Redundant Computations<\/strong><\/h3>\n<p>Imagine our input token sequence as <em>x1<\/em>, <em>x2<\/em>, <em>x3<\/em>\u00a0\u2026 <em>xt<\/em>, and we\u2019re determining the output at time step <em>t<\/em>. To find the attention output (at each transformer layer), we need the dot product of the current token\u2019s query vector with the key vectors of the current and preceding tokens. 
After normalizing via softmax, these become the attention weights over the value vectors. Here are two key observations:<\/p>\n<ol>\n<li>\n<strong>Single Token Decoding:<\/strong> Decoding happens one token at a time. We\u2019re only interested in the self-attention output for the current token, focusing solely on its query vector, not query vectors of other\u00a0tokens.<\/li>\n<li>\n<strong>Precomputed Keys and Values:<\/strong> We need the dot product with the keys of preceding tokens, which were already computed when calculating the self-attention output of the token at time step <em>t\u22121<\/em>. The same goes for the value\u00a0vectors.<\/li>\n<\/ol>\n<p>The dimensions of the key quantities are as\u00a0follows:<\/p>\n<ol>\n<li>\n<strong>Token Embedding Vectors:<\/strong>\u00a0dim<\/li>\n<li>\n<strong>Dimension of Query, Key, Value Heads:<\/strong>\u00a0head_dim<\/li>\n<li>\n<strong>Number of Query Heads:<\/strong>\u00a0n_heads<\/li>\n<li>\n<strong>Number of Key and Value Heads:<\/strong> n_kv_heads<\/li>\n<li>\n<strong>Number of Transformer Layers:<\/strong>\u00a0n_layers<\/li>\n<\/ol>\n<p>(Note: Mistral uses grouped query attention where for each token, 4 of its query vectors attend to the same key-value pair. With n_heads=32, we have n_kv_heads=32\/4=8)<\/p>\n<p><strong>In the unoptimized implementation:<\/strong><\/p>\n<p>Assuming a single transformer layer, at each time step, we calculate the query for the current token, and the key and value vectors for both the current and preceding tokens. This process involves three matrix multiplications.<\/p>\n<p>a. <strong>Query Calculation (Q):<\/strong><\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/571\/1%2ARNf4R7jSWCATzRHAB6Gy2A.png?ssl=1\"><\/figure>\n<p>b. 
<strong>Key<\/strong> <strong>Calculation (K):<\/strong><\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/614\/1%2A-zrUy3w8n4n4joqDELgUGA.png?ssl=1\"><\/figure>\n<p>c. <strong>Value<\/strong> <strong>Calculation (V):<\/strong><\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/614\/1%2AV0lL-quyUUIhR_UTwpcLYQ.png?ssl=1\"><\/figure>\n<p>Once we have the query, key and value vectors we can then proceed to compute the attention output using\u00a0\u2014<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/306\/1%2AtUC-OWJkD1HtB8yV1uNAeg.png?ssl=1\"><\/figure>\n<p><strong>In the optimized implementation:<\/strong><\/p>\n<p>However, as mentioned in point 2, the keys and values of tokens up to time step <em>t\u22121<\/em> would have already been computed when determining the output at time step <em>t\u22121<\/em>. This means we can avoid redundant computations by storing the keys and values of tokens up to time step\u00a0<em>t\u22121<\/em>.<\/p>\n<blockquote><p>Note: Mistral uses a sliding window attention mechanism, so we only attend to a specific number of previous tokens. More details on this will be covered\u00a0later.<\/p><\/blockquote>\n<p>What this means is that during decoding, we compute the key and value vectors only for the current token and not for the previous ones. So, operations (b) and (c) above are performed for just one token instead of <em>t <\/em>tokens. 
Specifically:<\/p>\n<p><strong>Key Calculation (K):<\/strong><\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/614\/1%2A-bSv9GBoYdOnn7vpQTZ5rg.png?ssl=1\"><\/figure>\n<p><strong>Value Calculation (V):<\/strong><\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/614\/1%2AfyAb8QmszuIYyNeeLeeO0g.png?ssl=1\"><\/figure>\n<p><strong>FLOPs Saved<\/strong><\/p>\n<p>Each of the key and value projections multiplies a dim-sized token embedding by a dim \u00d7 (n_kv_heads*head_dim) weight matrix, which costs about 2*dim*n_kv_heads*head_dim FLOPs per token (one multiply and one add per weight). At decoding step t, caching spares us from recomputing the keys and values of the t-1 preceding tokens, so we save 4*(t-1)*dim*n_kv_heads*head_dim FLOPs. For a sequence of length T, this adds up to savings of 4*(T*(T-1)\/2)*dim*n_kv_heads*head_dim FLOPs.<\/p>\n<p>Considering we\u2019ve assumed a single transformer layer, and knowing that Mistral utilizes 32 transformer layers, the savings are multiplied by 32. This is significant!<\/p>\n<p>For a sequence length of 10,000 tokens, with n_kv_heads=8, head_dim=4096\/32=128 and dim=4096, we get roughly 2.7e+16 FLOPs (2*10000*10000*4096*8*128*32).<\/p>\n<p>An Nvidia A100 GPU peaks at approximately 312e+12 FLOPS (floating-point operations per second), meaning that even at peak throughput we would save around 86 seconds in generating this sequence of 10,000\u00a0tokens!<\/p>\n<blockquote><p>\n<strong>Note:<\/strong> This is a simplified calculation to give an idea of the benefits, which are indeed substantial. Actual improvements will depend on various factors such as maximum feasible cache size, GPU memory, parallelization with multiple GPUs,\u00a0etc.<\/p><\/blockquote>\n<p>Now that we understand the KV cache, I\u2019ll discuss how we leverage it during output generation!<\/p>\n<h3><strong>Prefill and Decode\u00a0Stages<\/strong><\/h3>\n<p>First, let\u2019s establish some terminology used by\u00a0Mistral:<\/p>\n<ol>\n<li>\n<strong>Sliding Window Attention (SWA):<\/strong> Mistral uses SWA, meaning each token attends to itself and the previous W\u22121 tokens, where W is the window\u00a0size.<\/li>\n<li>\n<strong>KV Cache Size:<\/strong> We set our KV Cache to size W. 
This means we can store W key vectors and W value vectors per sequence in the cache. This ensures we have the necessary context to compute the self-attention output for the next\u00a0token.<\/li>\n<li>\n<strong>Chunk Size:<\/strong> We also process the user prompt in chunks of W tokens at a time (more on this in the next section on Prefill). This chunk size limits GPU memory usage. Self-attention requires K, Q, and V to be on the GPU, and these grow with the input size, making it impractical to process the entire input sequence in one\u00a0batch.<\/li>\n<\/ol>\n<blockquote><p>Note:<\/p><\/blockquote>\n<blockquote><p>Each transformer layer in Mistral has its own separate KV\u00a0Cache.<\/p><\/blockquote>\n<blockquote><p>At first, it might seem (it did to me!) that calculating and caching only the keys and values of the last <em>W-1<\/em> tokens in the input sequence would be sufficient to generate the first output token. However, that\u2019s not the case! This is because Mistral has more than one transformer layer. To compute the output from the second layer of our next token, we need the output of the last W\u22121 tokens in the first layer, which in turn depends on the last (2W\u22121) input tokens (similar to the receptive field in\u00a0CNNs!)<\/p><\/blockquote>\n<p>Mistral uses a window size of W = 4096\u00a0tokens.<\/p>\n<h3>Part 1: Prefill\u00a0Stage<\/h3>\n<p>The input to these models usually starts with user-provided tokens (the well-known user prompt \ud83d\ude0a), followed by the generation of output tokens. The stage where we populate the KV-cache with the keys and values from the user prompt, so we can use them when generating output tokens, is called the <strong>prefill stage<\/strong>. 
This is the key reason why the time to first token (TTFT) is generally high.<\/p>\n<p>To understand the workings of the prefill stage, let\u2019s walk through an\u00a0example:<\/p>\n<p>Imagine we have 3 sequences in our inference batch with user prompt token lengths of 4, 1, and 3 respectively. Suppose we have a window size W=3, and we want to generate the next 5 tokens for each sequence.<\/p>\n<p>Given:<\/p>\n<ol>\n<li>\n<strong>seqlens<\/strong> =\u00a0[4,1,3]<\/li>\n<li>\n<strong>sliding_window_size = cache_size<\/strong> =\u00a03<\/li>\n<li>\n<strong>chunk_size<\/strong> = 2 (for illustration purposes, ideally this would also be = W = 3 as mentioned before)<\/li>\n<\/ol>\n<p>In the prefill stage, since we already have all the input tokens, we can process them in parallel. With a chunk_size of 2 we require two iterations as explained below.<\/p>\n<h3><strong>Iteration 1<\/strong><\/h3>\n<p>We have a chunk size of 2, so we\u2019ll process the first 2 tokens from each sequence. This means the sequence lengths under consideration for this step are\u00a0[2,1,2].<\/p>\n<p>To batch the 3 sequences, one approach is to pad the shorter sequences to match the longest sequence. However, if the sequences vary greatly in length, padding results in a lot of wasted memory. Hence, this approach is generally not\u00a0used.<\/p>\n<p>The preferred approach is to concatenate all the sequences in the batch into a single larger sequence. We will create an appropriate attention mask so that tokens attend only to those within the same sequence.<\/p>\n<p>This implies our input shape is: [2+1+2,dim] =\u00a0[5,dim]<\/p>\n<p>We compute our Q, K, and V vectors for this input by multiplying with matrices Wq, Wk, and Wv. Assuming the number of heads = 1 for simplicity, the outputs will have the following shapes:<\/p>\n<p>a. Q: [5, head_dim]<\/p>\n<p>b. K: [5, head_dim]<\/p>\n<p>c. 
V: [5, head_dim]<\/p>\n<p>Next, we add rotary positional encodings to our Q and K\u00a0vectors.<\/p>\n<p>With these preparations, we are ready to calculate the self-attention output!<\/p>\n<h4>Step 1: Retrieve from KV-Cache and Compute Attention<\/h4>\n<p>Since this is the first chunk, we look at the KV-cache and find it empty\u200a\u2014\u200ano vectors stored there. This means there are no previous tokens to attend to, only the current token itself. Consequently, the number of key-value vectors (kv_seqlen) matches the number of query vectors (q_seqlen) in each sequence.<\/p>\n<p>To handle this, we create our mask using the BlockDiagonalCausalMask from the xFormers library like\u00a0so:<\/p>\n<pre>mask = BlockDiagonalCausalMask.from_seqlens(q_seqlen = [2,1,2], kv_seqlen=[2,1,2]).make_local_attention(window_size=3)<\/pre>\n<p>The attention mask can be visualized using<\/p>\n<pre>mask.materialize(shape=(5,5)).exp()<br># The 'shape' argument is obtained as follows: the first dimension is the total number of query vectors and the second dimension is the total number of key\/value vectors<\/pre>\n<p>and the output\u00a0is<\/p>\n<pre>[[1., 0., 0., 0., 0.],<br> [1., 1., 0., 0., 0.],<br> [0., 0., 1., 0., 0.],<br> [0., 0., 0., 1., 0.],<br> [0., 0., 0., 1., 1.]]<\/pre>\n<p>Let\u2019s understand how we obtained this mask and why it makes sense. Focus on <em>q_seqlen = [2,1,2]<\/em> and <em>kv_seqlen=[2,1,2]<\/em>.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/457\/1%2AgU16U2iOKlWcoA7G-WoSPA.png?ssl=1\"><figcaption>Image by\u00a0author<\/figcaption><\/figure>\n<p>The first sequence has 2 query vectors and 2 key-value (kv) vectors. 
The attention mask for this sequence is the 2&#215;2 matrix in the top\u00a0left:<\/p>\n<pre>[[1,0],<br>[1,1]]<\/pre>\n<p>The second element in the first row is 0 because this is a causal mask, and we do not want the first token to attend to the second token (in the\u00a0future).<\/p>\n<p>The second sequence has just 1 query and 1 kv vector, represented by the center 1&#215;1 matrix. The third sequence, similar to the first, has an identical 2&#215;2 matrix in the bottom\u00a0right.<\/p>\n<p>Notice that the attention masks for the sequences are logically concatenated along the diagonal.<\/p>\n<p>Setting the window size to 3 in our mask creation ensures that we only consider up to 3 tokens for attention per sequence.<\/p>\n<p>This mask is applied to the output of the matrix product of Q and K.T. Thus, dot products of queries and keys from different sequences are nullified by the 0s in the combined attention mask, which keeps the sequences independent, while the 0s above the diagonal within each block preserve causality.<\/p>\n<blockquote><p>Note: Under the hood, xFormers does not compute the dot products that the 0s in the attention mask would nullify.<\/p><\/blockquote>\n<p>The BlockDiagonalCausalMask in xFormers starts filling 1s from the top-left of each block, which is exactly what we need for our first\u00a0prefill.<\/p>\n<h4>Step 2: Cache\u00a0Update<\/h4>\n<p>Next, we update the cache with the computed keys and values. Our cache is initialized with W\u00d7batch_size = 3\u00d73 = 9 positions, that is, W positions for each sequence, with one such cache for keys and one for values. 
This is a rolling cache, meaning tokens in the first sequence will use up cache positions [0, 1, 2, 0, 1, 2\u00a0\u2026], tokens in the second sequence will use up cache positions [3, 4, 5, 3, 4, 5\u00a0\u2026] and tokens in the third sequence will use up cache positions [6, 7, 8, 6, 7, 8\u00a0\u2026].<\/p>\n<p>So, our KV-Cache after the first iteration (after processing 2, 1 and 2 tokens from the three sequences) looks like\u00a0this:<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2AQiYyKRfTYX_p2rFY3vb8Jw.png?ssl=1\"><figcaption>Image by\u00a0author<\/figcaption><\/figure>\n<h3>Iteration 2<\/h3>\n<p>We now move on to the remaining part of our sequences. The numbers of remaining tokens to process for the three sequences are [2, 0, 1]. In Mistral code, this stage is referred to as the \u2018subsequent prefill\u2019\u00a0stage.<\/p>\n<h4>Step 1: Retrieve from KV-Cache and Compute Attention<\/h4>\n<p>As in iteration 1, we first look at the KV-cache, but now we find entries in it. We retrieve the entries and perform an unroll\/unrotate step on them to restore the correct sequence order. Why do we do\u00a0this?<\/p>\n<p>Remember, this is a rolling cache. If we had processed, say, 5 tokens, the keys and values for the 4th and 5th tokens would occupy the first two cache positions, followed by those of the 3rd token. After unrolling, we would have the keys and values of the 3rd, 4th, and 5th tokens in that order. However, in this case, since we haven\u2019t processed more than 3 tokens, the current cache order matches the token\u00a0order.<\/p>\n<blockquote><p>Note: The reason we need to unrotate is that during the prefill stage, we process multiple tokens per sequence and we need to identify which queries should attend to which keys in the sequence. In contrast, during the decode stage (described in the following section), we process only one token of a sequence at a time. 
In that case, unrotation isn\u2019t necessary because this single token will attend to all elements in the\u00a0cache.<\/p><\/blockquote>\n<p>Currently, the number of query vectors for each sequence is [2, 0, 1]. The number of key vectors is calculated as the number of query vectors plus the number of valid entries in the\u00a0cache:<\/p>\n<p>kv_seqlen = [2+2, 0+1, 1+2] = [4, 1,\u00a03]<\/p>\n<p>We create the mask using the make_local_attention_from_bottomright() method of the BlockDiagonalMask class from xFormers:<\/p>\n<pre>BlockDiagonalMask.from_seqlens(<br>  q_seqlen=[2,0,1],<br>  kv_seqlen=[4,1,3],<br>).make_local_attention_from_bottomright(window_size=3)<\/pre>\n<p>This mask looks\u00a0like:<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/405\/1%2Ax8Ec2i0W5cbdJ2edvLj42w.png?ssl=1\"><figcaption>Image by\u00a0author<\/figcaption><\/figure>\n<p>Similar to the logic explained in Iteration 1, we have three matrices concatenated diagonally, where the rows represent the number of queries and the columns represent the number of keys in each sequence.<\/p>\n<p>Here, we need to use make_local_attention_from_bottomright() instead of make_local_attention(), as we want to start from the bottom right in each\u00a0block.<\/p>\n<h4>Step 2: Cache\u00a0Update<\/h4>\n<p>We store the computed keys and values into the cache similar to iteration 1 in a rolling fashion. 
Our updated cache then looks like\u00a0this:<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2AU35Yijr1ZPSB2JUAQ4xsfA.png?ssl=1\"><figcaption>Image by\u00a0author<\/figcaption><\/figure>\n<h3>Part 2: Decode\u00a0Stage<\/h3>\n<p>After the prefill stage, we move on to the decode stage, where we begin generating our output tokens one at a\u00a0time.<\/p>\n<p>Unlike the prefill stage, where Step 1 involves reading cache entries and computing attention and Step 2 involves updating the cache with the new entries, in the decode stage we reverse these steps. First, we update the cache with the new entries, and then we read all the entries (including the ones we just added) to compute self-attention.<\/p>\n<p>This approach works neatly because decoding happens one token at a time, and we know all entries in the cache are within our context window (of size W) and needed for self-attention.<\/p>\n<h4>Step 1: Cache\u00a0Update<\/h4>\n<p>We compute the key and value vectors for the current input token and add them to the cache. The new tokens are #4, #1 and #3 for the three sequences. 
The updated cache looks like\u00a0this:<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2A2sRw0-nHBLYho44orvJYpg.png?ssl=1\"><figcaption>Image by\u00a0author<\/figcaption><\/figure>\n<h4>Step 2: Retrieve from KV-Cache and Compute Attention<\/h4>\n<p>We now proceed to compute self-attention and the associated mask!<\/p>\n<ol>\n<li>We have one query for each sequence in the batch, so q_seqlen = [1, 1,\u00a01].<\/li>\n<li>The number of keys is the number of valid entries in the cache, given by kv_seqlen = [3, 2,\u00a03].<\/li>\n<\/ol>\n<p>In the Mistral codebase, for simplicity, they fix the attention mask shape to (W\u00d7batch_size, W\u00d7batch_size) =\u00a0(9,9).<\/p>\n<p>We create our attention mask again with xFormers like\u00a0so:<\/p>\n<pre>BlockDiagonalCausalWithOffsetPaddedKeysMask.from_seqlens(<br>  q_seqlen=[1,1,1],<br>  kv_padding=3,<br>  kv_seqlen=[3,2,3]<br>)<\/pre>\n<p>This mask looks\u00a0like:<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/452\/1%2AzcXRlHuaeMYPgH5cTV4rGQ.png?ssl=1\"><figcaption>Image by\u00a0author<\/figcaption><\/figure>\n<p>We have 3 blocks of 1&#215;3 matrices concatenated diagonally. Since we fixed our attention mask to 9&#215;9 for simplicity, our initial attention score matrix (before applying the mask) considers dot products between every query and all key slots in the cache (valid or not). This is evident, for example, in sequence 2 above, where we place a 0 in the 3rd entry of the block to invalidate that\u00a0entry.<\/p>\n<p>And that\u2019s a wrap! I hope you found this post both enjoyable and enlightening. 
Thanks for reading, and I look forward to sharing more of my learnings!<\/p>\n<h3>References<\/h3>\n<ol>\n<li>Mistral Codebase: <a href=\"https:\/\/github.com\/mistralai\/mistral-inference\/tree\/main\">https:\/\/github.com\/mistralai\/mistral-inference\/tree\/main<\/a>\n<\/li>\n<li>xFormers Codebase: <a href=\"https:\/\/github.com\/facebookresearch\/xformers\">https:\/\/github.com\/facebookresearch\/xformers<\/a>\n<\/li>\n<li>Umar Jamil\u2019s excellent overview of Mistral: <a href=\"https:\/\/www.youtube.com\/watch?v=UiX8K-xBUpE\">https:\/\/www.youtube.com\/watch?v=UiX8K-xBUpE<\/a>\n<\/li>\n<\/ol>\n<hr>\n<p><a href=\"https:\/\/towardsdatascience.com\/deep-dive-into-kv-caching-in-mistral-7e0cea8409a1\">Deep Dive into KV-Caching In Mistral<\/a> was originally published in <a href=\"https:\/\/towardsdatascience.com\/\">Towards Data Science<\/a> on Medium.<\/p>\n<\/div>\n<p>Rohit Ramaprasad<br \/>\n<a href=\"https:\/\/medium.com\/m\/global-identity-2?redirectUrl=https%3A%2F%2Ftowardsdatascience.com%2Fdeep-dive-into-kv-caching-in-mistral-7e0cea8409a1\">Go to original source<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Deep Dive into KV-Caching In Mistral Ever wondered why the time to first token in LLMs is high but subsequent tokens are super\u00a0fast? In this post, I dive into the details of KV-Caching used in Mistral, a topic I initially found quite daunting. 
However, as I delved deeper, it became a fascinating subject, especially when [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[62,77,1302,71,68,1287],"tags":[1304,246,1303],"class_list":["post-1197","post","type-post","status-publish","format-standard","hentry","category-aimldsaimlds","category-genai","category-kv-cache","category-large-language-models","category-mistral","category-transformers","tag-kv","tag-query","tag-token"],"_links":{"self":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/1197"}],"collection":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/comments?post=1197"}],"version-history":[{"count":0,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/1197\/revisions"}],"wp:attachment":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/media?parent=1197"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/categories?post=1197"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/tags?post=1197"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}