{"id":3593,"date":"2025-05-06T07:04:47","date_gmt":"2025-05-06T07:04:47","guid":{"rendered":"https:\/\/mailitics.com\/index.php\/2025\/05\/06\/diffusion-models-explained-simply\/"},"modified":"2025-05-06T07:04:47","modified_gmt":"2025-05-06T07:04:47","slug":"diffusion-models-explained-simply","status":"publish","type":"post","link":"https:\/\/mailitics.com\/index.php\/2025\/05\/06\/diffusion-models-explained-simply\/","title":{"rendered":"Diffusion Models, Explained Simply"},"content":{"rendered":"<div>\n<h2 class=\"wp-block-heading\"><mdspan datatext=\"el1746495094742\" class=\"mdspan-comment\">Introduction<\/mdspan><\/h2>\n<p class=\"wp-block-paragraph\"><strong>Generative AI<\/strong>\u00a0is one of the most popular terms we hear today. Recently, there has been a surge in generative AI applications involving text, image, audio, and video generation.<\/p>\n<p class=\"wp-block-paragraph\">When it comes to image creation, <a href=\"https:\/\/towardsdatascience.com\/tag\/diffusion\/\" title=\"Diffusion\">Diffusion<\/a> models have emerged as a state-of-the-art technique for content generation. Although they were first introduced in 2015, they have seen significant advancements and now serve as the core mechanism in well-known models such as DALL-E 2, Midjourney, and Stable Diffusion.<\/p>\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p class=\"wp-block-paragraph\"><em>The goal of this article is to introduce the core idea behind diffusion models. 
This foundational understanding will help in grasping more advanced concepts used in complex diffusion variants and in interpreting the role of hyperparameters when training a custom diffusion model.<\/em><\/p>\n<\/blockquote>\n<h2 class=\"wp-block-heading\">Diffusion<\/h2>\n<h3 class=\"wp-block-heading\">Analogy from\u00a0physics<\/h3>\n<p class=\"wp-block-paragraph\">Let us imagine a transparent glass of water. What happens if we add a small amount of another liquid with a yellow color, for example? The yellow liquid will gradually and uniformly spread throughout the glass, and the resulting mixture will take on a slightly transparent yellow tint.<\/p>\n<figure class=\"wp-block-image\"><img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/05\/13dZyENWnh_7wPrpzUfiq7g.png?ssl=1\" alt=\"\" class=\"wp-image-603368\"><\/figure>\n<p class=\"wp-block-paragraph\">The described process is known as\u00a0<strong>forward diffusion<\/strong>: we altered the environment\u2019s state by adding a small amount of another liquid. However, would it be just as easy to perform\u00a0<strong>reverse diffusion<\/strong>\u200a\u2014\u200ato return the mixture back to its original state? It turns out that it is not. In the best-case scenario, achieving this would require highly sophisticated mechanisms.<\/p>\n<h3 class=\"wp-block-heading\">Applying the analogy to machine\u00a0learning<\/h3>\n<p class=\"wp-block-paragraph\">Diffusion can also be applied to images. Imagine a high-quality photo of a dog. We can easily transform this image by gradually adding random noise. As a result, the pixel values will change, making the dog in the image less visible or even unrecognizable. 
This transformation process is known as\u00a0<strong>forward diffusion<\/strong>.<\/p>\n<figure class=\"wp-block-image\"><img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/05\/14zjLpbKsFPUFbUZyzISnkQ.png?ssl=1\" alt=\"\" class=\"wp-image-603373\"><figcaption class=\"wp-element-caption\">Source:\u00a0<a href=\"https:\/\/arxiv.org\/pdf\/2209.00796\" rel=\"noreferrer noopener\" target=\"_blank\">Diffusion Models: A Comprehensive Survey of Methods and Applications<\/a><\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">We can also consider the inverse operation: given a noisy image, the goal is to reconstruct the original image. This task is much more challenging because\u00a0<strong>there are far fewer highly recognizable image states compared to the vast number of possible noisy variations.<\/strong>\u00a0Using the same physics analogy mentioned earlier, this process is called\u00a0<strong>reverse diffusion<\/strong>.<\/p>\n<h2 class=\"wp-block-heading\">Architecture of diffusion models<\/h2>\n<p class=\"wp-block-paragraph\"><em>To better understand the structure of diffusion models, let us examine both diffusion processes separately.<\/em><\/p>\n<h3 class=\"wp-block-heading\">Forward diffusion<\/h3>\n<p class=\"wp-block-paragraph\">As mentioned earlier, forward diffusion involves progressively adding noise to an image. In practice, however, the process is a bit more nuanced.<\/p>\n<p class=\"wp-block-paragraph\">The most common method involves sampling a random value for each pixel from a\u00a0<strong>Gaussian distribution<\/strong>\u00a0with a mean of 0. This sampled value\u200a\u2014\u200awhich can be either positive or negative\u200a\u2014\u200ais then added to the pixel\u2019s original value. 
Repeating this operation across all pixels results in a noisy version of the original image.<\/p>\n<figure class=\"wp-block-image\"><img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/05\/11CnoNB4cbrWoQLm7m8T-DQ.png?ssl=1\" alt=\"\" class=\"wp-image-603369\"><figcaption class=\"wp-element-caption\">For each pixel in the image, a random value is sampled from a Gaussian distribution and added to the pixel\u2019s\u00a0value.<\/figcaption><\/figure>\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p class=\"wp-block-paragraph\"><em>The chosen Gaussian distribution typically has a relatively small variance, meaning that the sampled values are usually small. As a result, only minor changes are introduced to the image at each step.<\/em><\/p>\n<\/blockquote>\n<p class=\"wp-block-paragraph\">Forward diffusion is an iterative process in which noise is applied to the image multiple times. With each iteration, the resulting image becomes increasingly dissimilar to the original. After hundreds of iterations\u200a\u2014\u200awhich is common in real diffusion models\u200a\u2014\u200athe image eventually becomes indistinguishable from pure noise.<\/p>\n<h3 class=\"wp-block-heading\">Reverse diffusion<\/h3>\n<p class=\"wp-block-paragraph\">Now you might ask:\u00a0<em>what is the purpose of performing all these forward diffusion transformations<\/em>? The answer is that the images generated at each iteration are used to train a neural network.<\/p>\n<p class=\"wp-block-paragraph\">Specifically, suppose we applied 100 sequential noise transformations during forward diffusion. We can then take the image at each step and train the neural network to reconstruct the image from the previous step. 
The difference between the predicted and actual images is calculated using a loss function\u200a\u2014\u200afor example,\u00a0<em>Mean Squared Error (MSE)<\/em>, which measures the average squared pixel-wise difference between the two images.<\/p>\n<figure class=\"wp-block-image\"><img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/05\/1ucgx2YRhgSMRAgSPgdDRMQ.png?ssl=1\" alt=\"\" class=\"wp-image-603370\"><figcaption class=\"wp-element-caption\">The goal of the model is to detect the added noise and reconstruct the previous image. The predicted image is then compared to the actual image to calculate the\u00a0loss.<\/figcaption><\/figure>\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p class=\"wp-block-paragraph\"><em>This example shows a diffusion model reconstructing the original image. At the same time, diffusion models can be trained to predict the noise added to an image. 
In that case, to recover the image from the previous step, it is sufficient to subtract the predicted noise from the current noisy image.<\/em><\/p>\n<\/blockquote>\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p class=\"wp-block-paragraph\"><em>While both of these tasks might seem similar, predicting the added noise is generally a simpler learning task than reconstructing the image.<\/em><\/p>\n<\/blockquote>\n<h2 class=\"wp-block-heading\">Model design<\/h2>\n<p class=\"wp-block-paragraph\">After gaining a basic intuition about the diffusion technique, it is essential to explore several more advanced concepts to better understand diffusion model design.<\/p>\n<h3 class=\"wp-block-heading\">Number of iterations<\/h3>\n<p class=\"wp-block-paragraph\">The number of iterations is one of the key parameters in diffusion models:<\/p>\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p class=\"wp-block-paragraph\"><strong>On one hand, using more iterations means that image pairs at adjacent steps will differ less, making the model\u2019s learning task easier. On the other hand, a higher number of iterations increases computational cost.<\/strong><\/p>\n<\/blockquote>\n<p class=\"wp-block-paragraph\">While fewer iterations can speed up training, the model may fail to learn smooth transitions between steps, resulting in poor performance.<\/p>\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p class=\"wp-block-paragraph\"><em>Typically, the number of iterations is chosen between 50 and 1000.<\/em><\/p>\n<\/blockquote>\n<h3 class=\"wp-block-heading\">Neural network architecture<\/h3>\n<p class=\"wp-block-paragraph\">Most commonly, the U-Net architecture is used as the backbone in diffusion models. 
Here are some of the reasons why:<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">U-Net preserves the input and output image dimensions, ensuring that the image size remains consistent throughout the reverse diffusion process.<\/li>\n<li class=\"wp-block-list-item\">Its bottleneck architecture enables the reconstruction of the entire image after compression into a latent space. Meanwhile, key image features are retained through skip connections.<\/li>\n<li class=\"wp-block-list-item\">Originally designed for biomedical image segmentation, where pixel-level accuracy is crucial, U-Net\u2019s strengths translate well to diffusion tasks that require precise prediction of individual pixel values.<\/li>\n<\/ul>\n<figure class=\"wp-block-image\"><img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/05\/1gmDZ4lg5h1LxS1W2USAPjQ.png?ssl=1\" alt=\"\" class=\"wp-image-603372\"><figcaption class=\"wp-element-caption\">U-Net architecture. Source:\u00a0<a href=\"https:\/\/arxiv.org\/pdf\/1505.04597\" rel=\"noreferrer noopener\" target=\"_blank\">U-Net: Convolutional Networks for Biomedical Image Segmentation<\/a><\/figcaption><\/figure>\n<h3 class=\"wp-block-heading\">Shared network<\/h3>\n<p class=\"wp-block-paragraph\">At first glance, it might seem necessary to train a separate neural network for each iteration in the diffusion process. While this approach is feasible and can lead to high-quality inference results, it is highly inefficient from a computational perspective. 
For example, if the diffusion process consists of a thousand steps, we would need to train a thousand U-Net models\u200a\u2014\u200aan extremely time-consuming and resource-intensive task.<\/p>\n<p class=\"wp-block-paragraph\">However, we can observe that <strong>the task configuration across different iterations is essentially the same<\/strong>: in each case, we need to reconstruct an image of identical dimensions that has been altered with noise of a similar magnitude. This important insight leads to the idea of <strong>using a single, shared neural network across all iterations.<\/strong><\/p>\n<p class=\"wp-block-paragraph\">In practice, this means that we use a single U-Net model with shared weights, trained on image pairs from different diffusion steps. During inference, the noisy image is passed through the same trained U-Net multiple times, gradually refining it until a high-quality image is produced.<\/p>\n<figure class=\"wp-block-image\"><img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/05\/1714trJlstckw931BX6ZstQ.png?ssl=1\" alt=\"\" class=\"wp-image-603371\"><figcaption class=\"wp-element-caption\">A single shared model is used for image prediction tasks across all iterations.<\/figcaption><\/figure>\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p class=\"wp-block-paragraph\"><em>Though the generation quality might slightly deteriorate due to using only a single model, the gain in training speed becomes highly significant.<\/em><\/p>\n<\/blockquote>\n<h2 class=\"wp-block-heading\">Conclusion<\/h2>\n<p class=\"wp-block-paragraph\">In this article, we explored the core concepts of diffusion models, which play a key role in <a href=\"https:\/\/towardsdatascience.com\/tag\/image-generation\/\" title=\"Image Generation\">Image Generation<\/a>. 
There are many variations of these models\u200a\u2014\u200aamong them,\u00a0<strong>Stable Diffusion<\/strong>\u00a0models have become particularly popular. While based on the same fundamental principles, Stable Diffusion also enables the integration of text or other types of input to guide and constrain the generated images.<\/p>\n<h2 class=\"wp-block-heading\">Resources<\/h2>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\"><a href=\"https:\/\/arxiv.org\/pdf\/1505.04597\" target=\"_blank\" rel=\"noreferrer noopener\">U-Net: Convolutional Networks for Biomedical Image Segmentation<\/a><\/li>\n<li class=\"wp-block-list-item\"><a href=\"https:\/\/arxiv.org\/pdf\/2209.00796\" target=\"_blank\" rel=\"noreferrer noopener\">Diffusion Models: A Comprehensive Survey of Methods and Applications<\/a><\/li>\n<\/ul>\n<p class=\"wp-block-paragraph\"><em>All images unless otherwise noted are by the author.<\/em><\/p>\n<p>The post <a href=\"https:\/\/towardsdatascience.com\/diffusion-models-explained-simply\/\">Diffusion Models, Explained Simply<\/a> appeared first on <a href=\"https:\/\/towardsdatascience.com\/\">Towards Data Science<\/a>.<\/p>\n<\/div>\n<p>Vyacheslav Efimov<br \/>\n<a href=\"https:\/\/towardsdatascience.com\/diffusion-models-explained-simply\/\">Go to original source<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Diffusion Models, Explained Simply Introduction Generative AI\u00a0is one of the most popular terms we hear today. Recently, there has been a surge in generative AI applications involving text, image, audio, and video generation. When it comes to image creation, Diffusion models have emerged as a state-of-the-art technique for content generation. 
Although they were first introduced [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[62,69,221,2574,1664,2575,2576],"tags":[454,845,73],"class_list":["post-3593","post","type-post","status-publish","format-standard","hentry","category-aimldsaimlds","category-artificial-intelligence","category-computer-vision","category-diffusion","category-generative-ai","category-image-generation","category-unet","tag-diffusion","tag-image","tag-models"],"_links":{"self":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/3593"}],"collection":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/comments?post=3593"}],"version-history":[{"count":0,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/3593\/revisions"}],"wp:attachment":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/media?parent=3593"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/categories?post=3593"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/tags?post=3593"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}