{"id":2098,"date":"2025-02-27T07:02:52","date_gmt":"2025-02-27T07:02:52","guid":{"rendered":"https:\/\/mailitics.com\/index.php\/2025\/02\/27\/llada-the-diffusion-model-that-could-redefine-language-generation\/"},"modified":"2025-02-27T07:02:52","modified_gmt":"2025-02-27T07:02:52","slug":"llada-the-diffusion-model-that-could-redefine-language-generation","status":"publish","type":"post","link":"https:\/\/mailitics.com\/index.php\/2025\/02\/27\/llada-the-diffusion-model-that-could-redefine-language-generation\/","title":{"rendered":"LLaDA: The Diffusion Model That Could Redefine Language Generation"},"content":{"rendered":"<div>\n<h2 class=\"wp-block-heading\" id=\"c233\">Introduction<\/h2>\n<p class=\"wp-block-paragraph\" id=\"5b5a\">What if we could make language models think<strong>\u00a0more like humans<\/strong>? Instead of writing one word at a time, what if they could sketch out their thoughts first, and gradually refine them?<\/p>\n<p class=\"wp-block-paragraph\" id=\"5735\">This is the idea behind Large Language <a href=\"https:\/\/towardsdatascience.com\/tag\/diffusion-models\/\" title=\"Diffusion Models\">Diffusion Models<\/a> (LLaDA): an alternative to the autoregressive text generation used in current Large Language Models (LLMs). 
Unlike traditional autoregressive models (ARMs), which predict text sequentially, left to right,\u00a0<strong>LLaDA leverages a diffusion-like process to generate text.<\/strong>\u00a0Instead of generating tokens sequentially, it\u00a0<strong>progressively refines masked text until it forms a coherent response<\/strong>.<\/p>\n<p class=\"wp-block-paragraph\" id=\"401f\">In this article, we will dive into how LLaDA works, why it matters, and how it could shape the next generation of LLMs.<\/p>\n<p class=\"wp-block-paragraph\" id=\"6c28\">I hope you enjoy the article!<\/p>\n<h2 class=\"wp-block-heading\" id=\"f47a\">The current state of LLMs<\/h2>\n<p class=\"wp-block-paragraph\" id=\"d02e\">To appreciate the innovation that LLaDA represents, we first need to understand how current large language models (LLMs) operate. Modern LLMs follow a two-step training process that has become an industry standard:<\/p>\n<ol class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">\n<strong>Pre-training<\/strong>: The model learns general language patterns and knowledge by predicting the next token in massive text datasets through self-supervised learning.<\/li>\n<li class=\"wp-block-list-item\">\n<strong>Supervised Fine-Tuning (SFT)<\/strong>: The model is refined on carefully curated data to improve its ability to follow instructions and generate useful outputs.<\/li>\n<\/ol>\n<p class=\"wp-block-paragraph\" id=\"0b6d\"><em>Note that current LLMs often use RLHF as well to further refine the weights of the model, but this is not used by LLaDA so we will skip this step here.<\/em><\/p>\n<p class=\"wp-block-paragraph\" id=\"f93c\">These models, primarily based on the Transformer architecture, generate text\u00a0<strong>one token at a time<\/strong>\u00a0using next-token prediction.<\/p>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" data-dominant-color=\"ecf2f8\" data-has-transparency=\"false\" style=\"--dominant-color: #ecf2f8;\" loading=\"lazy\" 
decoding=\"async\" width=\"1024\" height=\"683\" src=\"https:\/\/i0.wp.com\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1_F6HkX1GkSF5WYOwZ4Afo_w-1024x683.webp?resize=1024%2C683&#038;ssl=1\" alt=\"\" class=\"wp-image-598461 not-transparent\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1_F6HkX1GkSF5WYOwZ4Afo_w-1024x683.webp 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1_F6HkX1GkSF5WYOwZ4Afo_w-300x200.webp 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1_F6HkX1GkSF5WYOwZ4Afo_w-768x512.webp 768w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1_F6HkX1GkSF5WYOwZ4Afo_w-1536x1024.webp 1536w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1_F6HkX1GkSF5WYOwZ4Afo_w-2048x1366.webp 2048w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\"><figcaption class=\"wp-element-caption\">Simplified Transformer architecture for text generation (Image by the author)<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\" id=\"4a10\">Here is a simplified illustration of how data passes through such a model.\u00a0<strong>Each token is embedded into a vector and is transformed through successive transformer layers<\/strong>. In current LLMs (LLaMA, ChatGPT, DeepSeek, etc), a classification head is used only on the last token embedding to predict the next token in the sequence.<\/p>\n<p class=\"wp-block-paragraph\" id=\"7436\">This works thanks to the concept of\u00a0<strong>masked self-attention<\/strong>: each token attends to all the tokens that come before it. 
We will see later how LLaDA can get rid of the mask in its attention layers.<\/p>\n<figure class=\"wp-block-image size-full\"><img data-recalc-dims=\"1\" data-dominant-color=\"f4f4f4\" data-has-transparency=\"false\" style=\"--dominant-color: #f4f4f4;\" loading=\"lazy\" decoding=\"async\" width=\"899\" height=\"220\" src=\"https:\/\/i0.wp.com\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1_zVrq7u6Wo-i1pclptiU_xQ.webp?resize=899%2C220&#038;ssl=1\" alt=\"\" class=\"wp-image-598463 not-transparent\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1_zVrq7u6Wo-i1pclptiU_xQ.webp 899w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1_zVrq7u6Wo-i1pclptiU_xQ-300x73.webp 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1_zVrq7u6Wo-i1pclptiU_xQ-768x188.webp 768w\" sizes=\"auto, (max-width: 899px) 100vw, 899px\"><figcaption class=\"wp-element-caption\">Attention process: input embeddings are multiplied by Query, Key, and Value matrices to generate new embeddings (Image by the author, inspired by [3])<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\" id=\"0ad5\"><em>If you want to learn more about Transformers, check out my article <a href=\"https:\/\/towardsdatascience.com\/transformers-how-do-they-transform-your-data-72d69e383e0d\/#:~:text=We%20pass%20our%20data%20through,several%20(num_layers)%20encoder%20layers.\">here<\/a><\/em>.<\/p>\n<p class=\"wp-block-paragraph\" id=\"211c\">While this approach has led to impressive results, it also comes with significant limitations, some of which have motivated the development of LLaDA.<\/p>\n<h2 class=\"wp-block-heading\" id=\"e075\">Current limitations of LLMs<\/h2>\n<p class=\"wp-block-paragraph\" id=\"e4d6\">Current LLMs face several critical challenges:<\/p>\n<h3 class=\"wp-block-heading\" id=\"bff4\">Computational Inefficiency<\/h3>\n<p class=\"wp-block-paragraph\" id=\"a3d3\">Imagine having to write a novel where\u00a0<strong>you can only think 
about one word at a time,<\/strong>\u00a0and for each word, you need to reread everything you\u2019ve written so far. This is essentially how current LLMs operate: they predict one token at a time, and each new token requires processing the entire preceding sequence. Even with optimization techniques like KV caching, which avoid recomputing attention keys and values for earlier tokens, generation remains strictly sequential, so this process is\u00a0<strong>quite computationally expensive and time-consuming<\/strong>.<\/p>\n<h3 class=\"wp-block-heading\" id=\"2946\">Limited Bidirectional Reasoning<\/h3>\n<p class=\"wp-block-paragraph\" id=\"17ae\">Traditional autoregressive models (ARMs) are like writers who can never look ahead or revise what they\u2019ve written so far.\u00a0<strong>They can only predict future tokens based on past ones, which limits their ability to reason about relationships between different parts of the text.<\/strong>\u00a0As humans, we often have a general idea of what we want to say before writing it down; current LLMs lack this capability in some sense.<\/p>\n<h3 class=\"wp-block-heading\" id=\"1e8f\">Amount of data<\/h3>\n<p class=\"wp-block-paragraph\" id=\"f0bb\">Existing models require\u00a0<strong>enormous amounts of training data<\/strong>\u00a0to achieve good performance, making them resource-intensive to develop and potentially limiting their applicability in specialized domains with limited data availability.<\/p>\n<h2 class=\"wp-block-heading\" id=\"fc37\">What is LLaDA?<\/h2>\n<p class=\"wp-block-paragraph\" id=\"8cca\">LLaDA introduces a fundamentally different approach to <a href=\"https:\/\/towardsdatascience.com\/tag\/language-generation\/\" title=\"Language Generation\">language generation<\/a> by replacing traditional autoregression with a\u00a0<strong>\u201cdiffusion-based\u201d<\/strong>\u00a0process (we will see later why this is called \u201cdiffusion\u201d).<\/p>\n<p class=\"wp-block-paragraph\" id=\"9020\">Let\u2019s understand how this works, step by step, starting with pre-training.<\/p>\n<h3 
class=\"wp-block-heading\" id=\"be3e\">LLaDA pre-training<\/h3>\n<p class=\"wp-block-paragraph\" id=\"f160\">Remember that we don\u2019t need any \u201clabeled\u201d data during the pre-training phase. The objective is to feed a very large amount of raw text data into the model. For each text sequence, we do the following:<\/p>\n<ol class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">We fix a maximum length (similar to ARMs). Typically, this could be 4096 tokens. For 1% of the training sequences, the length is instead randomly sampled between 1 and 4096 and the sequence is padded, so that the model is also exposed to shorter sequences.<\/li>\n<li class=\"wp-block-list-item\">We randomly choose a \u201cmasking rate\u201d t between 0 and 1 (in the paper, t is sampled uniformly). For example, one could draw t = 0.4.<\/li>\n<li class=\"wp-block-list-item\">We mask each token independently with probability t (here, 0.4). What does \u201cmasking\u201d mean exactly? Well, we simply replace the token\u00a0<strong>with a special token<\/strong>:\u00a0<strong>&lt;MASK&gt;<\/strong>. As with any other token, this token is associated with a particular index and embedding vector that the model can process and interpret during training.<\/li>\n<li class=\"wp-block-list-item\">We then feed our entire sequence into our transformer-based model. This process transforms all the input embedding vectors into new embeddings. We\u00a0<strong>apply the classification head to each of the masked tokens<\/strong>\u00a0to get a prediction for each. 
Mathematically, our loss function averages cross-entropy losses over all the masked tokens in the sequence, as below:<\/li>\n<\/ol>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" data-dominant-color=\"f6f6f3\" data-has-transparency=\"false\" style=\"--dominant-color: #f6f6f3;\" loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"655\" src=\"https:\/\/i0.wp.com\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1__DFxLzcQ9fskIlIymfuLBA-1024x655.webp?resize=1024%2C655&#038;ssl=1\" alt=\"\" class=\"wp-image-598460 not-transparent\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1__DFxLzcQ9fskIlIymfuLBA-1024x655.webp 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1__DFxLzcQ9fskIlIymfuLBA-300x192.webp 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1__DFxLzcQ9fskIlIymfuLBA-768x491.webp 768w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1__DFxLzcQ9fskIlIymfuLBA-1536x982.webp 1536w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1__DFxLzcQ9fskIlIymfuLBA-2048x1309.webp 2048w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\"><figcaption class=\"wp-element-caption\">Loss function used for LLaDA (Image by the author)<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\" id=\"39be\">5. And\u2026 we repeat this procedure for billions or trillions of text sequences.<\/p>\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p class=\"wp-block-paragraph\" id=\"e0e2\"><strong><em>Note, that unlike ARMs, LLaDA can fully utilize bidirectional dependencies in the text: it doesn\u2019t require masking in attention layers anymore. 
However, this can come at an increased computational cost.<\/em><\/strong><\/p>\n<\/blockquote>\n<p class=\"wp-block-paragraph\" id=\"9fb6\">Hopefully, you can see how the training phase itself (the flow of the data into the model) is very similar to that of any other LLM.\u00a0<strong>We simply predict randomly masked tokens instead of predicting what comes next.<\/strong><\/p>\n<h3 class=\"wp-block-heading\" id=\"711c\">LLaDA SFT<\/h3>\n<p class=\"wp-block-paragraph\" id=\"dfcd\">For\u00a0<strong>autoregressive models<\/strong>, SFT is very similar to pre-training, except that we have pairs of\u00a0<em>(prompt, response)\u00a0<\/em>and want to generate the response when given the prompt as input.<\/p>\n<p class=\"wp-block-paragraph\" id=\"fe6d\">This is exactly the\u00a0<strong>same concept for <a href=\"https:\/\/towardsdatascience.com\/tag\/llada\/\" title=\"LlaDa\">LLaDA<\/a><\/strong>! Mimicking the pre-training process, we simply pass the prompt and the response, mask random tokens\u00a0<strong>from the response only<\/strong>, and feed the full sequence into the model, which\u00a0<strong>will predict missing tokens from the response<\/strong>.<\/p>\n<h3 class=\"wp-block-heading\" id=\"e0ed\">The innovation in inference<\/h3>\n<p class=\"wp-block-paragraph\" id=\"545f\">Inference is where LLaDA gets more interesting and truly uses the \u201cdiffusion\u201d paradigm.<\/p>\n<p class=\"wp-block-paragraph\" id=\"c0e2\">Until now, we always randomly masked some text as input and asked the model to predict these tokens. But during inference,\u00a0<strong>we only have access to the prompt\u00a0<\/strong>and we need to generate the entire response. 
You might think (and you would not be wrong) that the model has seen examples where the masking rate was very high (potentially 1) during SFT, and it had to learn, somehow, how to\u00a0<strong>generate a full response from a prompt<\/strong>.<\/p>\n<p class=\"wp-block-paragraph\" id=\"8753\">However, generating the full response at once during inference will likely produce very poor results because the model has to predict every token without seeing any other part of the response. Instead, we need a method to\u00a0<strong>progressively refine predictions<\/strong>, and that\u2019s where the key idea of\u00a0<strong>\u2018remasking\u2019\u00a0<\/strong>comes in.<\/p>\n<p class=\"wp-block-paragraph\" id=\"1c14\">Here is how it works, at each step of the text generation process:<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">Feed the current input to the model (at the first step, this is the prompt followed by\u00a0<strong>&lt;MASK&gt;<\/strong>\u00a0tokens).<\/li>\n<li class=\"wp-block-list-item\">The model generates one embedding for each input token. We get predictions for the\u00a0<strong>&lt;MASK&gt;<\/strong>\u00a0tokens only. And here is the important step:\u00a0<strong>we remask a portion of them<\/strong>. In particular, we only keep the \u201cbest\u201d tokens, i.e. the ones\u00a0<strong>predicted with the highest confidence<\/strong>, and remask the rest.<\/li>\n<li class=\"wp-block-list-item\">We can use this\u00a0<strong>partially unmasked sequence<\/strong>\u00a0as input in the next generation step and repeat until all tokens are unmasked.<\/li>\n<\/ul>\n<p class=\"wp-block-paragraph\" id=\"ea3f\">You can see that, interestingly,\u00a0<strong>we have much more control<\/strong>\u00a0over the generation process compared to ARMs: we could choose to remask 0 tokens (only one generation step), or we could decide to keep only the best token every time (as many steps as tokens in the response). 
Obviously,\u00a0<strong>there is a trade-off here between the quality of the predictions and inference time.<\/strong><\/p>\n<p class=\"wp-block-paragraph\" id=\"fd80\">Let\u2019s illustrate that with a simple example (in that case, I choose to keep the best 2 tokens at every step)<\/p>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" data-dominant-color=\"e8ecda\" data-has-transparency=\"false\" style=\"--dominant-color: #e8ecda;\" loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"565\" src=\"https:\/\/i0.wp.com\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1_Jfxxelqbjz8u-QBLeXZYzA-1024x565.webp?resize=1024%2C565&#038;ssl=1\" alt=\"\" class=\"wp-image-598464 not-transparent\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1_Jfxxelqbjz8u-QBLeXZYzA-1024x565.webp 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1_Jfxxelqbjz8u-QBLeXZYzA-300x166.webp 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1_Jfxxelqbjz8u-QBLeXZYzA-768x424.webp 768w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1_Jfxxelqbjz8u-QBLeXZYzA-1536x848.webp 1536w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1_Jfxxelqbjz8u-QBLeXZYzA-2048x1130.webp 2048w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\"><figcaption class=\"wp-element-caption\">LLaDA generation process example (Image by the author)<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\" id=\"67ba\">Note, in practice, the remasking step would work as follows. 
Instead of remasking a fixed number of tokens,\u00a0<strong>we would remask a proportion s\/t of the predicted tokens at each step, as t goes from 1 down to 0, where s is in [0, t].<\/strong>\u00a0In particular, this means we remask fewer and fewer tokens as generation progresses.<\/p>\n<p class=\"wp-block-paragraph\" id=\"3213\"><em>Example<\/em>: if we want N sampling steps (so N discrete steps from t=1 down to t=1\/N, in decrements of 1\/N), taking s = t - 1\/N is a good choice: it ensures that s=0 at the last step, so nothing is remasked and the response is fully unmasked.<\/p>\n<p class=\"wp-block-paragraph\" id=\"ca92\">The image below summarizes the three stages described above (pre-training, SFT, and inference). \u201cMask predictor\u201d simply denotes the <a href=\"https:\/\/towardsdatascience.com\/tag\/llm\/\" title=\"Llm\">LLM<\/a> (LLaDA), predicting masked tokens.<\/p>\n<figure class=\"wp-block-image alignwide size-large\"><img data-recalc-dims=\"1\" data-dominant-color=\"ecedec\" data-has-transparency=\"false\" style=\"--dominant-color: #ecedec;\" loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"298\" src=\"https:\/\/i0.wp.com\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1_00SyWDItOWMv0bUdKTye9A-1024x298.webp?resize=1024%2C298&#038;ssl=1\" alt=\"\" class=\"wp-image-598465 not-transparent\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1_00SyWDItOWMv0bUdKTye9A-1024x298.webp 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1_00SyWDItOWMv0bUdKTye9A-300x87.webp 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1_00SyWDItOWMv0bUdKTye9A-768x223.webp 768w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1_00SyWDItOWMv0bUdKTye9A.webp 1251w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\"><figcaption class=\"wp-element-caption\">Pre-training (a.), SFT (b.) and inference (c.) using LLaDA. 
(source: [1])<\/figcaption><\/figure>\n<h2 class=\"wp-block-heading\" id=\"a6e5\">Can autoregression and diffusion be combined?<\/h2>\n<p class=\"wp-block-paragraph\" id=\"92f1\">Another clever idea developed in LLaDA is\u00a0<strong>to combine diffusion with traditional autoregressive generation to use the best of both worlds!<\/strong>\u00a0This is called\u00a0<strong>semi-autoregressive diffusion<\/strong>.<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">Divide the generation process into blocks (for instance, 32 tokens in each block).<\/li>\n<li class=\"wp-block-list-item\">The objective is to\u00a0<strong>generate one block at a time<\/strong>\u00a0(like we would generate one token at a time in ARMs).<\/li>\n<li class=\"wp-block-list-item\">\n<strong>For each block, we apply the diffusion logic<\/strong>\u00a0by progressively unmasking tokens to reveal the entire block. Then move on to predicting the next block.<\/li>\n<\/ul>\n<figure class=\"wp-block-image alignwide size-large\"><img data-recalc-dims=\"1\" data-dominant-color=\"e9e7e4\" data-has-transparency=\"false\" style=\"--dominant-color: #e9e7e4;\" loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"347\" src=\"https:\/\/i0.wp.com\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1_zaZOcBVb5G4sqY60J68ulg-1024x347.webp?resize=1024%2C347&#038;ssl=1\" alt=\"\" class=\"wp-image-598466 not-transparent\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1_zaZOcBVb5G4sqY60J68ulg-1024x347.webp 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1_zaZOcBVb5G4sqY60J68ulg-300x102.webp 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1_zaZOcBVb5G4sqY60J68ulg-768x260.webp 768w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1_zaZOcBVb5G4sqY60J68ulg.webp 1142w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\"><figcaption class=\"wp-element-caption\">Semi-autoregressive process (source: 
[1])<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\" id=\"466f\">This is a hybrid approach: we probably lose some of the \u201cbackward\u201d generation and parallelization capabilities of the model, but\u00a0<strong>we better \u201cguide\u201d the model<\/strong>\u00a0towards the final output.<\/p>\n<p class=\"wp-block-paragraph\" id=\"7145\">I think this is a very interesting idea because it depends a lot on a hyperparameter (the number of blocks) that can be tuned. I imagine some tasks might benefit more from the backward generation process, while others might benefit more from the more \u201cguided\u201d generation from left to right (more on that in the Results section).<\/p>\n<h2 class=\"wp-block-heading\" id=\"40eb\">Why \u201cDiffusion\u201d?<\/h2>\n<p class=\"wp-block-paragraph\" id=\"3e91\">I think it\u2019s important to briefly explain where this term actually comes from. It reflects a similarity with\u00a0<strong>image diffusion models (like DALL-E)<\/strong>, which have been very popular for image generation tasks.<\/p>\n<p class=\"wp-block-paragraph\" id=\"12b7\">In image diffusion, noise is gradually added to an image until it\u2019s unrecognizable, and a model learns to reverse this process and reconstruct the image step by step. LLaDA applies this idea to text\u00a0<strong>by masking tokens instead of adding noise<\/strong>, and then progressively\u00a0<strong>unmasking them<\/strong>\u00a0to generate coherent language. 
In the context of image generation, the counterpart of the masking step is the forward \u201cnoising\u201d process, whose intensity over time is set by a \u201cnoise schedule\u201d, and the counterpart of the reverse step (unmasking) is the \u201cdenoising\u201d step.<\/p>\n<figure class=\"wp-block-image size-full\"><img data-recalc-dims=\"1\" data-dominant-color=\"b0aba8\" data-has-transparency=\"false\" style=\"--dominant-color: #b0aba8;\" loading=\"lazy\" decoding=\"async\" width=\"700\" height=\"243\" src=\"https:\/\/i0.wp.com\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/0_y1hRHhA3CaJkadBh.webp?resize=700%2C243&#038;ssl=1\" alt=\"\" class=\"wp-image-598457 not-transparent\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/0_y1hRHhA3CaJkadBh.webp 700w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/0_y1hRHhA3CaJkadBh-300x104.webp 300w\" sizes=\"auto, (max-width: 700px) 100vw, 700px\"><figcaption class=\"wp-element-caption\">How do Diffusion Models work? (source: [2])<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\" id=\"9521\">You can also see LLaDA as a type of\u00a0<strong>discrete<\/strong>\u00a0(non-continuous) diffusion model: we don\u2019t add noise to tokens, but we \u201cdeactivate\u201d some tokens by masking them, and the model learns how to unmask a portion of them.<\/p>\n<h2 class=\"wp-block-heading\" id=\"7639\">Results<\/h2>\n<p class=\"wp-block-paragraph\" id=\"4727\">Let\u2019s go through\u00a0<em>a few\u00a0<\/em>of the interesting results of LLaDA.<\/p>\n<p class=\"wp-block-paragraph\" id=\"8efa\"><em>You can find all the results in the paper. I chose to focus on what I find the most interesting here.<\/em><\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">\n<strong>Training efficiency<\/strong>: LLaDA shows similar performance to ARMs with the same number of parameters, but <strong>uses far fewer tokens<\/strong>\u00a0during training (and no RLHF)! 
For example, the 8B version uses around 2.3T tokens, compared to 15T for LLaMa3.<\/li>\n<li class=\"wp-block-list-item\">\n<strong>Using different block and answer lengths for different tasks<\/strong>: for example, the block length is particularly large for the Math dataset, and the model demonstrates strong performance for this domain. This could suggest that mathematical reasoning may benefit more from the\u00a0<strong>diffusion-based and backward process<\/strong>.<\/li>\n<\/ul>\n<figure class=\"wp-block-image aligncenter size-full\"><img data-recalc-dims=\"1\" data-dominant-color=\"f0f0f0\" data-has-transparency=\"false\" style=\"--dominant-color: #f0f0f0;\" loading=\"lazy\" decoding=\"async\" width=\"591\" height=\"397\" src=\"https:\/\/i0.wp.com\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1_npIPih_7TGBiT6bEGnYfzA.webp?resize=591%2C397&#038;ssl=1\" alt=\"\" class=\"wp-image-598458 not-transparent\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1_npIPih_7TGBiT6bEGnYfzA.webp 591w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1_npIPih_7TGBiT6bEGnYfzA-300x202.webp 300w\" sizes=\"auto, (max-width: 591px) 100vw, 591px\"><figcaption class=\"wp-element-caption\">Source: [1]<\/figcaption><\/figure>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">Interestingly, LLaDA does better on the \u201cReversal poem completion task\u201d. This task requires the model to\u00a0<strong>complete a poem in reverse order<\/strong>, starting from the last lines and working backward. 
As expected, ARMs struggle due to their strict left-to-right generation process.<\/li>\n<\/ul>\n<figure class=\"wp-block-image aligncenter size-full\"><img data-recalc-dims=\"1\" data-dominant-color=\"e6e6e6\" data-has-transparency=\"false\" style=\"--dominant-color: #e6e6e6;\" loading=\"lazy\" decoding=\"async\" width=\"475\" height=\"161\" src=\"https:\/\/i0.wp.com\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1_rO3FqBzTLRoXCElD0LxGfw.webp?resize=475%2C161&#038;ssl=1\" alt=\"\" class=\"wp-image-598459 not-transparent\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1_rO3FqBzTLRoXCElD0LxGfw.webp 475w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1_rO3FqBzTLRoXCElD0LxGfw-300x102.webp 300w\" sizes=\"auto, (max-width: 475px) 100vw, 475px\"><figcaption class=\"wp-element-caption\">Source: [1]<\/figcaption><\/figure>\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p class=\"wp-block-paragraph\" id=\"91a8\">LLaDA is not just an experimental alternative to ARMs: it shows real advantages in efficiency, structured reasoning, and bidirectional text generation.<\/p>\n<\/blockquote>\n<h2 class=\"wp-block-heading\" id=\"80ce\">Conclusion<\/h2>\n<p class=\"wp-block-paragraph\" id=\"b26b\">I think LLaDA is a promising approach to language generation. Its ability to generate multiple tokens in parallel while maintaining global coherence could definitely lead to\u00a0<strong>more efficient training<\/strong>,\u00a0<strong>better reasoning<\/strong>, and\u00a0<strong>improved context understanding<\/strong>\u00a0with fewer computational resources.<\/p>\n<p class=\"wp-block-paragraph\" id=\"ffab\">Beyond efficiency, I think LLaDA also brings a lot of\u00a0<strong>flexibility<\/strong>. 
By adjusting parameters like the number of blocks generated, and the number of generation steps, it can\u00a0<strong>better adapt to different tasks and constraints<\/strong>, making it a versatile tool for various language modeling needs, and allowing\u00a0<strong>more human control<\/strong>. Diffusion models could also play an important role in pro-active AI and agentic systems by being able to reason more holistically.<\/p>\n<p class=\"wp-block-paragraph\" id=\"81ef\">As research into diffusion-based language models advances, LLaDA could become a useful step toward\u00a0<strong>more natural and efficient language models<\/strong>. While it\u2019s still early, I believe this shift from sequential to parallel generation is an interesting direction for AI development.<\/p>\n<p class=\"wp-block-paragraph\" id=\"ca70\">Thanks for reading!<\/p>\n<hr class=\"wp-block-separator has-alpha-channel-opacity is-style-dotted\">\n<p class=\"wp-block-paragraph\" id=\"89f0\">Check out my previous articles:<\/p>\n<figure class=\"wp-block-embed is-type-wp-embed is-provider-towards-data-science wp-block-embed-towards-data-science\">\n<div class=\"wp-block-embed__wrapper\">\n<blockquote class=\"wp-embedded-content\" data-secret=\"0XRYAeiIZR\"><p><a href=\"https:\/\/towardsdatascience.com\/training-large-language-models-from-trpo-to-grpo\/\">Training Large Language Models: From TRPO to\u00a0GRPO<\/a><\/p><\/blockquote>\n<p><iframe loading=\"lazy\" class=\"wp-embedded-content\" sandbox=\"allow-scripts\" security=\"restricted\" title=\"\u201cTraining Large Language Models: From TRPO to\u00a0GRPO\u201d \u2014 Towards Data Science\" src=\"https:\/\/towardsdatascience.com\/training-large-language-models-from-trpo-to-grpo\/embed\/#?secret=h3MTxjqOUF#?secret=0XRYAeiIZR\" data-secret=\"0XRYAeiIZR\" width=\"500\" height=\"282\" frameborder=\"0\" marginwidth=\"0\" marginheight=\"0\" scrolling=\"no\"><\/iframe>\n<\/div>\n<\/figure>\n<figure class=\"wp-block-embed is-type-wp-embed 
is-provider-towards-data-science wp-block-embed-towards-data-science\">\n<div class=\"wp-block-embed__wrapper\">\n<blockquote class=\"wp-embedded-content\" data-secret=\"x4KmLPIqU8\"><p><a href=\"https:\/\/towardsdatascience.com\/the-math-behind-the-curse-of-dimensionality-cf8780307d74\/\">The Math Behind \u201cThe Curse of Dimensionality\u201d<\/a><\/p><\/blockquote>\n<p><iframe loading=\"lazy\" class=\"wp-embedded-content\" sandbox=\"allow-scripts\" security=\"restricted\" title=\"\u201cThe Math Behind \u201cThe Curse of Dimensionality\u201d\u201d \u2014 Towards Data Science\" src=\"https:\/\/towardsdatascience.com\/the-math-behind-the-curse-of-dimensionality-cf8780307d74\/embed\/#?secret=bsMvtXWYqA#?secret=x4KmLPIqU8\" data-secret=\"x4KmLPIqU8\" width=\"500\" height=\"282\" frameborder=\"0\" marginwidth=\"0\" marginheight=\"0\" scrolling=\"no\"><\/iframe>\n<\/div>\n<\/figure>\n<hr class=\"wp-block-separator has-alpha-channel-opacity is-style-dotted\">\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">Feel free to connect on\u00a0<a href=\"https:\/\/www.linkedin.com\/in\/maxime-wolf\/\" target=\"_blank\" rel=\"noreferrer noopener\">LinkedIn<\/a>\n<\/li>\n<li class=\"wp-block-list-item\">Follow me on\u00a0<a href=\"https:\/\/github.com\/maxime7770\" target=\"_blank\" rel=\"noreferrer noopener\">GitHub<\/a>\u00a0for more content<\/li>\n<li class=\"wp-block-list-item\">Visit my website:\u00a0<a href=\"http:\/\/maximewolf.com\/\" target=\"_blank\" rel=\"noreferrer noopener\">maximewolf.com<\/a>\n<\/li>\n<\/ul>\n<hr class=\"wp-block-separator has-alpha-channel-opacity is-style-dotted\">\n<h2 class=\"wp-block-heading\" id=\"cd21\">References:<\/h2>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">[1] Nie, S., Zhu, F., You, Z., Zhang, X., Ou, J., Hu, J., Zhou, J., Lin, Y., Wen, J.-R., &amp; Li, C. 
(2025).\u00a0<strong>Large Language Diffusion Models<\/strong>.\u00a0<em>arXiv preprint arXiv:2502.09992<\/em>.\u00a0<a href=\"https:\/\/arxiv.org\/pdf\/2502.09992\" target=\"_blank\" rel=\"noreferrer noopener\">https:\/\/arxiv.org\/pdf\/2502.09992<\/a>\n<\/li>\n<li class=\"wp-block-list-item\">[2]\u00a0<a href=\"https:\/\/dl.acm.org\/doi\/10.1145\/3626235\" target=\"_blank\" rel=\"noreferrer noopener\">Yang, Ling, et al. \u201cDiffusion models: A comprehensive survey of methods and applications.\u201d ACM Computing Surveys 56.4 (2023): 1\u201339.<\/a>\n<\/li>\n<li class=\"wp-block-list-item\">[3] Alammar, J. (2018, June 27).\u00a0<strong>The Illustrated Transformer<\/strong>.\u00a0<em>Jay Alammar\u2019s Blog<\/em>.\u00a0<a href=\"https:\/\/jalammar.github.io\/illustrated-transformer\/\" target=\"_blank\" rel=\"noreferrer noopener\">https:\/\/jalammar.github.io\/illustrated-transformer\/<\/a>\n<\/li>\n<\/ul>\n<p>The post <a href=\"https:\/\/towardsdatascience.com\/llada-the-diffusion-model-that-could-redefine-language-generation\/\">LLaDA: The Diffusion Model That Could Redefine Language Generation<\/a> appeared first on <a href=\"https:\/\/towardsdatascience.com\/\">Towards Data Science<\/a>.<\/p>\n<\/div>\n<p> \t<BR><br \/>\n <BR><\/BR><br \/>\n    Maxime Wolf<br \/>\n \t<BR><br \/>\n<BR><\/BR><br \/>\n<a href=\"https:\/\/towardsdatascience.com\/llada-the-diffusion-model-that-could-redefine-language-generation\/\">Go to original source<\/a><br \/>\n \t<BR><br \/>\n <BR><\/BR><\/p>\n","protected":false},"excerpt":{"rendered":"<p>LLaDA: The Diffusion Model That Could Redefine Language Generation Introduction What if we could make language models think\u00a0more like humans? Instead of writing one word at a time, what if they could sketch out their thoughts first, and gradually refine them? 
This is exactly what Large Language Diffusion Models (LLaDA) introduces: a different approach to [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[62,67,1663,1866,1867,87,70],"tags":[146,1868,834],"class_list":["post-2098","post","type-post","status-publish","format-standard","hentry","category-aimldsaimlds","category-deep-dives","category-diffusion-models","category-language-generation","category-llada","category-llm","category-machine-learning","tag-language","tag-llada","tag-text"],"_links":{"self":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/2098"}],"collection":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/comments?post=2098"}],"version-history":[{"count":0,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/2098\/revisions"}],"wp:attachment":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/media?parent=2098"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/categories?post=2098"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/tags?post=2098"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}