{"id":1710,"date":"2025-02-07T07:02:19","date_gmt":"2025-02-07T07:02:19","guid":{"rendered":"https:\/\/mailitics.com\/index.php\/2025\/02\/07\/a-visual-guide-to-how-diffusion-models-work\/"},"modified":"2025-02-07T07:02:19","modified_gmt":"2025-02-07T07:02:19","slug":"a-visual-guide-to-how-diffusion-models-work","status":"publish","type":"post","link":"https:\/\/mailitics.com\/index.php\/2025\/02\/07\/a-visual-guide-to-how-diffusion-models-work\/","title":{"rendered":"A Visual Guide to How Diffusion Models\u00a0Work"},"content":{"rendered":"<p>A Visual Guide to How Diffusion Models\u00a0Work<\/p>\n<div>\n<p class=\"wp-block-paragraph\">This article is aimed at those who want to understand exactly how <a href=\"https:\/\/towardsdatascience.com\/tag\/diffusion-models\/\" title=\"Diffusion Models\">Diffusion Models<\/a> work, with no prior knowledge expected. I\u2019ve tried to use illustrations wherever possible to provide visual intuitions on each part of these models. 
I\u2019ve kept mathematical notation and equations to a minimum, and where they are necessary I\u2019ve tried to define and explain them as they occur.<\/p>\n<h3 class=\"wp-block-heading\">Intro<\/h3>\n<p class=\"wp-block-paragraph\">I\u2019ve framed this article around three main questions:<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">What exactly is it that diffusion models learn?<\/li>\n<li class=\"wp-block-list-item\">How and why do diffusion models work?<\/li>\n<li class=\"wp-block-list-item\">Once you\u2019ve trained a model, how do you get useful stuff out of it?<\/li>\n<\/ul>\n<p class=\"wp-block-paragraph\">The examples will be based on the <a href=\"https:\/\/yue-here.com\/posts\/glyffuser\/\" rel=\"noreferrer noopener\" target=\"_blank\">glyffuser<\/a>, a minimal text-to-image diffusion model that I previously <a href=\"https:\/\/yue-here.com\/posts\/glyffuser\/\" rel=\"noreferrer noopener\" target=\"_blank\">implemented and wrote about<\/a>. The architecture of this model is a standard text-to-image denoising diffusion model without any bells or whistles. It was trained to generate pictures of new \u201cChinese\u201d glyphs from English definitions. 
Have a look at the picture below\u200a\u2014\u200aeven if you\u2019re not familiar with Chinese writing, I hope you\u2019ll agree that the generated glyphs look pretty similar to the real ones!<\/p>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" data-dominant-color=\"d0d0d0\" data-has-transparency=\"false\" style=\"--dominant-color: #d0d0d0;\" loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"557\" src=\"https:\/\/i0.wp.com\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1_vJlBMS83Bb0b3JeEbKLo3A-1024x557.png?resize=1024%2C557&#038;ssl=1\" alt=\"\" class=\"wp-image-597461 not-transparent\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1_vJlBMS83Bb0b3JeEbKLo3A-1024x557.png 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1_vJlBMS83Bb0b3JeEbKLo3A-300x163.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1_vJlBMS83Bb0b3JeEbKLo3A-768x417.png 768w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1_vJlBMS83Bb0b3JeEbKLo3A.png 1034w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\"><figcaption class=\"wp-element-caption\">Random examples of glyffuser training data (left) and generated data (right).<\/figcaption><\/figure>\n<h3 class=\"wp-block-heading\">What exactly is it that diffusion models\u00a0learn?<\/h3>\n<p class=\"wp-block-paragraph\"><a href=\"https:\/\/towardsdatascience.com\/tag\/generative-ai\/\" title=\"Generative Ai\">Generative AI<\/a> models are often said to take a big pile of data and \u201clearn\u201d it. For text-to-image diffusion models, the data takes the form of pairs of images and descriptive text. But what exactly is it that we want the model to learn? 
First, let\u2019s forget about the text for a moment and concentrate on what we are trying to generate: the images.<\/p>\n<h4 class=\"wp-block-heading\">Probability distributions<\/h4>\n<p class=\"wp-block-paragraph\">Broadly, we can say that we want a generative AI model to learn the <em>underlying probability distribution<\/em> of the data. What does this mean? Consider the one-dimensional normal (Gaussian) distribution below, commonly written <strong>\ud835\udca9<\/strong>(<em>\u03bc<\/em>,<em>\u03c3<\/em><strong>\u00b2<\/strong>) and <em>parameterized<\/em> with mean <em>\u03bc <\/em>= 0 and variance <em>\u03c3<\/em><strong>\u00b2<\/strong> = 1. The black curve below shows the probability density function. We can <em>sample<\/em> from it: drawing values such that over a large number of samples, the set of values reflects the underlying distribution. These days, we can simply write something like <code>x = random.gauss(0, 1)<\/code> in Python to sample from the standard normal distribution, although the computational sampling process itself is non-trivial!<\/p>\n<figure class=\"wp-block-image size-full\"><img data-recalc-dims=\"1\" data-dominant-color=\"f3e4d1\" data-has-transparency=\"true\" style=\"--dominant-color: #f3e4d1;\" loading=\"lazy\" decoding=\"async\" width=\"583\" height=\"217\" src=\"https:\/\/i0.wp.com\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1_xmmfBzgyuFwo2y9Wz1tn0Q.png?resize=583%2C217&#038;ssl=1\" alt=\"\" class=\"wp-image-597462 has-transparency\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1_xmmfBzgyuFwo2y9Wz1tn0Q.png 583w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1_xmmfBzgyuFwo2y9Wz1tn0Q-300x112.png 300w\" sizes=\"auto, (max-width: 583px) 100vw, 583px\"><figcaption class=\"wp-element-caption\">Values sampled from an underlying distribution (here, the standard normal <strong>\ud835\udca9<\/strong>(0,1)) can then be used to estimate the parameters of that 
distribution.<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">We could think of a set of numbers sampled from the above normal distribution as a simple dataset, like that shown as the orange histogram above. In this particular case, we can calculate the parameters of the underlying distribution using <em>maximum likelihood estimation<\/em>, i.e. by working out the mean and variance. The normal distribution estimated from the samples is shown by the dotted line above. To take some liberties with terminology, you might consider this as a simple example of \u201clearning\u201d an underlying probability distribution. We can also say that here we <em>explicitly<\/em> learnt the distribution, in contrast with the <em>implicit<\/em> methods that diffusion models use.<\/p>\n<p class=\"wp-block-paragraph\">Conceptually, this is all that generative AI is doing\u200a\u2014\u200alearning a distribution, then sampling from that distribution!<\/p>\n<h4 class=\"wp-block-heading\">Data representations<\/h4>\n<p class=\"wp-block-paragraph\">What, then, does the underlying probability distribution of a more complex dataset look like, such as that of the image dataset we want to use to train our diffusion model?<\/p>\n<p class=\"wp-block-paragraph\">First, we need to know what the <em>representation<\/em> of the data is. Generally, a machine learning (ML) model requires data inputs with a consistent representation, i.e. format. For the example above, it was simply numbers (scalars). For images, this representation is commonly a fixed-length vector.<\/p>\n<p class=\"wp-block-paragraph\">The image dataset used for the glyffuser model is ~21,000 pictures of Chinese glyphs. The images are all the same size, 128 \u00d7 128 = 16384 pixels, and greyscale (single-channel color). 
Thus an obvious choice for the representation is a vector <strong>x<\/strong> of length 16384, where each element corresponds to the color of one pixel: <strong>x <\/strong>= (<em>x<\/em><strong>\u2081<\/strong>,<em>x<\/em>\u2082,\u2026,<em>x<\/em><strong>\u2081\u2086\u2083\u2088\u2084<\/strong>). We can call the domain of all possible images for our dataset \u201cpixel space\u201d.<\/p>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" data-dominant-color=\"bebebe\" data-has-transparency=\"false\" style=\"--dominant-color: #bebebe;\" loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"1024\" src=\"https:\/\/i0.wp.com\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1_OUpCAiPGlP0ZPSD4SEHAlg-1024x1024.png?resize=1024%2C1024&#038;ssl=1\" alt=\"\" class=\"wp-image-597463 not-transparent\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1_OUpCAiPGlP0ZPSD4SEHAlg-1024x1024.png 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1_OUpCAiPGlP0ZPSD4SEHAlg-300x300.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1_OUpCAiPGlP0ZPSD4SEHAlg-150x150.png 150w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1_OUpCAiPGlP0ZPSD4SEHAlg-768x768.png 768w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1_OUpCAiPGlP0ZPSD4SEHAlg.png 1252w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\"><figcaption class=\"wp-element-caption\">An example glyph with pixel values labelled (downsampled to 32 \u00d7 32 pixels for readability).<\/figcaption><\/figure>\n<h3 class=\"wp-block-heading\">Dataset visualization<\/h3>\n<p class=\"wp-block-paragraph\">We make the assumption that our individual data samples, <em>x<\/em>, are actually sampled from an underlying probability distribution, <em>q<\/em>(<em>x<\/em>), in pixel space, much as the samples from our first example were sampled from an underlying normal distribution in 1-dimensional space. 
Note: the notation <em>x <\/em>\u223c <em>q<\/em>(<em>x<\/em>) is commonly used to mean: \u201cthe random variable <em>x<\/em> is sampled from the probability distribution <em>q<\/em>(<em>x<\/em>).\u201d<\/p>\n<p class=\"wp-block-paragraph\">This distribution is clearly much more complex than a Gaussian and cannot be easily parameterized\u200a\u2014\u200awe need to learn it with an ML model, which we\u2019ll discuss later. First, let\u2019s try to visualize the distribution to gain a better intuition.<\/p>\n<p class=\"wp-block-paragraph\">As humans find it difficult to see in more than 3 dimensions, we need to reduce the dimensionality of our data. A small digression on why this works: the <a href=\"https:\/\/en.wikipedia.org\/wiki\/Manifold_hypothesis\" rel=\"noreferrer noopener\" target=\"_blank\">manifold hypothesis<\/a> posits that natural datasets lie on lower-dimensional manifolds embedded in a higher-dimensional space\u200a\u2014\u200athink of a line embedded in a 2-D plane, or a plane embedded in 3-D space. We can use a dimensionality reduction technique such as <a href=\"https:\/\/umap-learn.readthedocs.io\/en\/latest\/\" rel=\"noreferrer noopener\" target=\"_blank\">UMAP<\/a> to project our dataset from 16384 to 2 dimensions. The 2-D projection retains a lot of structure, consistent with the idea that our data lie on a lower-dimensional manifold embedded in pixel space. In our UMAP, we see two large clusters corresponding to characters in which the components are arranged either horizontally (e.g. \u660e) or vertically (e.g. \u8349). 
An interactive version of the plot below with popups on each datapoint is linked <a href=\"https:\/\/yue-here.com\/posts\/diffusion\/#dataset-visualization\" rel=\"noreferrer noopener\" target=\"_blank\">here<\/a>.<\/p>\n<figure class=\"wp-block-image size-full\"><img data-recalc-dims=\"1\" data-dominant-color=\"f2f4fc\" data-has-transparency=\"true\" style=\"--dominant-color: #f2f4fc;\" loading=\"lazy\" decoding=\"async\" width=\"720\" height=\"450\" src=\"https:\/\/i0.wp.com\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1_oWvKkHUWoy1alT50ugt8Lg.png?resize=720%2C450&#038;ssl=1\" alt=\"\" class=\"wp-image-597464 has-transparency\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1_oWvKkHUWoy1alT50ugt8Lg.png 720w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1_oWvKkHUWoy1alT50ugt8Lg-300x188.png 300w\" sizes=\"auto, (max-width: 720px) 100vw, 720px\"><figcaption class=\"wp-element-caption\">\u00a0<a href=\"https:\/\/yue-here.com\/posts\/diffusion\/#dataset-visualization\" target=\"_blank\" rel=\"noreferrer noopener\">Click here for an interactive version of this plot.<\/a><\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">Let\u2019s now use this low-dimensional UMAP dataset as a visual shorthand for our high-dimensional dataset. Remember, we assume that these individual points have been sampled from a continuous underlying probability distribution <em>q<\/em>(<em>x<\/em>). To get a sense of what this distribution might look like, we can apply a KDE (kernel density estimation) over the UMAP dataset. 
(Note: this is just an approximation for visualization purposes.)<\/p>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" data-dominant-color=\"d9dae6\" data-has-transparency=\"true\" style=\"--dominant-color: #d9dae6;\" loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"429\" src=\"https:\/\/i0.wp.com\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1_dYvWE6ExaKSeo_YWU-v97w-1024x429.png?resize=1024%2C429&#038;ssl=1\" alt=\"\" class=\"wp-image-597465 has-transparency\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1_dYvWE6ExaKSeo_YWU-v97w-1024x429.png 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1_dYvWE6ExaKSeo_YWU-v97w-300x126.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1_dYvWE6ExaKSeo_YWU-v97w-768x322.png 768w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1_dYvWE6ExaKSeo_YWU-v97w-1536x643.png 1536w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1_dYvWE6ExaKSeo_YWU-v97w.png 1600w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\"><\/figure>\n<p class=\"wp-block-paragraph\">This gives a sense of what <em>q<\/em>(<em>x<\/em>) should look like: clusters of glyphs correspond to high-probability regions of the distribution. The true <em>q<\/em>(<em>x<\/em>) lies in 16384 dimensions\u200a\u2014\u200athis is the distribution we want to learn with our diffusion model.<\/p>\n<p class=\"wp-block-paragraph\">We showed that for a simple distribution such as the 1-D Gaussian, we could calculate the parameters (mean and variance) from our data. However, for complex distributions such as images, we need to call on ML methods. 
Moreover, what we will find is that for diffusion models in practice, rather than parameterizing the distribution directly, they learn it <em>implicitly<\/em> through the process of learning how to transform noise into data over many steps.<\/p>\n<h4 class=\"wp-block-heading\">Takeaway<\/h4>\n<p class=\"wp-block-paragraph\">The aim of generative AI such as diffusion models is to learn the complex probability distributions underlying their training data and then sample from these distributions.<\/p>\n<h3 class=\"wp-block-heading\">How and why do diffusion models\u00a0work?<\/h3>\n<p class=\"wp-block-paragraph\">Diffusion models have recently come into the spotlight as a particularly effective method for learning these probability distributions. They generate convincing images by starting from pure noise and gradually refining it. To whet your interest, have a look at the animation below that shows the denoising process generating 16 samples.<\/p>\n<figure class=\"wp-block-image size-full\"><img data-recalc-dims=\"1\" data-dominant-color=\"6f6f6f\" data-has-transparency=\"false\" style=\"--dominant-color: #6f6f6f;\" loading=\"lazy\" decoding=\"async\" width=\"522\" height=\"552\" src=\"https:\/\/i0.wp.com\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1_V07XdbZNdDA0IL-IfGHgPQ.gif?resize=522%2C552&#038;ssl=1\" alt=\"\" class=\"wp-image-597466 not-transparent\"><\/figure>\n<p class=\"wp-block-paragraph\">In this section we\u2019ll only talk about the mechanics of how these models work but if you\u2019re interested in how they arose from the broader context of generative models, have a look at the <a href=\"https:\/\/yue-here.com\/posts\/diffusion\/#further-reading\" rel=\"noreferrer noopener\" target=\"_blank\">further reading<\/a> section below.<\/p>\n<h4 class=\"wp-block-heading\">What is\u00a0\u201cnoise\u201d?<\/h4>\n<p class=\"wp-block-paragraph\">Let\u2019s first precisely define noise, since the term is thrown around a lot in the context of diffusion. 
In particular, we are talking about Gaussian noise: consider the samples we talked about in the section about <a href=\"https:\/\/yue-here.com\/posts\/diffusion\/#probability-distributions\" rel=\"noreferrer noopener\" target=\"_blank\">probability distributions<\/a>. You could think of each sample as an image of a single pixel of noise. An image that is \u201cpure Gaussian noise\u201d, then, is one in which each pixel value is sampled from an independent standard Gaussian distribution, <strong>\ud835\udca9<\/strong>(0,1). For a pure noise image in the domain of our glyph dataset, this would be noise drawn from 16384 separate Gaussian distributions. You can see this in the previous animation. One thing to keep in mind is that we can choose the means of these noise distributions, i.e. <em>center<\/em> them, on specific values\u200a\u2014\u200athe pixel values of an image, for instance.<\/p>\n<p class=\"wp-block-paragraph\">For convenience, you\u2019ll often find the noise distributions for image datasets written as a single multivariate distribution <strong>\ud835\udca9<\/strong>(0,<strong><em>I<\/em><\/strong>) where <strong><em>I<\/em><\/strong> is the identity matrix, a covariance matrix with all diagonal entries equal to 1 and zeroes elsewhere. This is simply a compact notation for a set of multiple independent Gaussians\u200a\u2014\u200ai.e. there are no correlations between the noise on different pixels. In the basic implementations of diffusion models, only uncorrelated (a.k.a. \u201cisotropic\u201d) noise is used. 
<a href=\"https:\/\/distill.pub\/2019\/visual-exploration-gaussian-processes\/\" rel=\"noreferrer noopener\" target=\"_blank\">This article<\/a> contains an excellent interactive introduction on multivariate Gaussians.<\/p>\n<h4 class=\"wp-block-heading\">Diffusion process\u00a0overview<\/h4>\n<p class=\"wp-block-paragraph\">Below is an adaptation of the somewhat-famous diagram from <a href=\"https:\/\/arxiv.org\/abs\/2006.11239\" rel=\"noreferrer noopener\" target=\"_blank\">Ho <em>et al<\/em>.<\/a>\u2019s seminal paper \u201c<em>Denoising Diffusion Probabilistic Models<\/em>\u201d which gives an overview of the whole diffusion process:<\/p>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" data-dominant-color=\"ececec\" data-has-transparency=\"true\" style=\"--dominant-color: #ececec;\" loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"381\" src=\"https:\/\/i0.wp.com\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1_ACv-2kbERa3pZHO1p8OXVQ-1024x381.png?resize=1024%2C381&#038;ssl=1\" alt=\"\" class=\"wp-image-597467 has-transparency\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1_ACv-2kbERa3pZHO1p8OXVQ-1024x381.png 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1_ACv-2kbERa3pZHO1p8OXVQ-300x112.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1_ACv-2kbERa3pZHO1p8OXVQ-768x286.png 768w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1_ACv-2kbERa3pZHO1p8OXVQ-1536x571.png 1536w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1_ACv-2kbERa3pZHO1p8OXVQ.png 1600w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\"><figcaption class=\"wp-element-caption\">Diagram of the diffusion process adapted from <a href=\"https:\/\/arxiv.org\/abs\/2006.11239\" target=\"_blank\" rel=\"noreferrer noopener\">Ho <em>et al<\/em>. 2020<\/a>. 
The glyph \u9502, meaning \u201clithium\u201d, is used as a representative sample from the dataset.<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">I found that there was a lot to unpack in this diagram and simply understanding what each component meant was very helpful, so let\u2019s go through it and define everything step by step.<\/p>\n<p class=\"wp-block-paragraph\">We previously used <em>x <\/em>\u223c <em>q<\/em>(<em>x<\/em>) to refer to our data. Here, we\u2019ve added a subscript, <em>x<\/em>\u209c, to denote the timestep <em>t<\/em>, which indicates how many steps of \u201cnoising\u201d have taken place. We refer to the samples noised to a given timestep as <em>x<\/em>\u209c \u223c <em>q<\/em>(<em>x<\/em>\u209c). <em>x<\/em>\u2080 is clean data and <em>x<\/em>\u209c (<em>t<\/em> = <em>T<\/em>) \u223c <strong>\ud835\udca9<\/strong>(0,<strong><em>I<\/em><\/strong>) is pure noise.<\/p>\n<p class=\"wp-block-paragraph\">We define a <em>forward diffusion<\/em> process whereby we corrupt samples with noise. This process is described by the distribution <em>q<\/em>(<em>x<\/em>\u209c|<em>x<\/em>\u209c\u208b\u2081). If we could access the hypothetical reverse process <em>q<\/em>(<em>x<\/em>\u209c\u208b\u2081|<em>x<\/em>\u209c), we could generate samples from noise. We cannot access it directly, because doing so would require knowing <em>x<\/em>\u2080, so we use ML to learn the parameters, <em>\u03b8<\/em>, of a model of this process, \ud835\udc5d<em>\u03b8<\/em>(\ud835\udc65\u209c\u208b\u2081\u2223\ud835\udc65\u209c). 
(That should be <em>p<\/em> subscript <em>\u03b8 <\/em>but Medium cannot render it.)<\/p>\n<p class=\"wp-block-paragraph\">In the following sections we go into detail on how the forward and reverse diffusion processes work.<\/p>\n<h4 class=\"wp-block-heading\">Forward diffusion, or \u201cnoising\u201d<\/h4>\n<p class=\"wp-block-paragraph\">Used as a verb, \u201cnoising\u201d an image refers to applying a transformation that moves it towards pure noise by scaling down its pixel values toward 0 while adding proportional Gaussian noise. Mathematically, this transformation is a multivariate Gaussian distribution centered on the (slightly scaled-down) pixel values of the preceding image.<\/p>\n<p class=\"wp-block-paragraph\">In the forward diffusion process, this noising distribution is written as <em>q<\/em>(<em>x<\/em>\u209c|<em>x<\/em>\u209c\u208b\u2081), where the vertical bar symbol \u201c|\u201d is read as \u201cgiven\u201d or \u201cconditional on\u201d, to indicate that the pixel means are passed forward from <em>q<\/em>(<em>x<\/em>\u209c\u208b\u2081). At <em>t<\/em> = <em>T<\/em>, where <em>T<\/em> is a large number (commonly 1000), we aim to end up with images of pure noise (which, somewhat confusingly, is also a Gaussian distribution, as discussed <a href=\"https:\/\/yue-here.com\/posts\/diffusion\/#what-is-noise\" rel=\"noreferrer noopener\" target=\"_blank\">previously<\/a>).<\/p>\n<p class=\"wp-block-paragraph\">The <em>marginal<\/em> distributions <em>q<\/em>(<em>x<\/em>\u209c) represent the distributions that have accumulated the effects of all the previous noising steps (<em>marginalization<\/em> refers to integration over all possible conditions, which recovers the unconditioned distribution).<\/p>\n<p class=\"wp-block-paragraph\">Since the conditional distributions are Gaussian, what about their variances? They are determined by a <em>variance schedule<\/em> that maps timesteps to variance values. 
Initially, an empirically determined schedule of linearly increasing values from 0.0001 to 0.02 over 1000 steps was presented in <a href=\"https:\/\/arxiv.org\/abs\/2006.11239\" rel=\"noreferrer noopener\" target=\"_blank\">Ho <em>et al<\/em>.<\/a> Later research by <a href=\"https:\/\/arxiv.org\/pdf\/2102.09672\" rel=\"noreferrer noopener\" target=\"_blank\">Nichol &amp; Dhariwal<\/a> suggested an improved cosine schedule. They state that a schedule is most effective when the rate of information destruction through noising is relatively even per step throughout the whole noising process.<\/p>\n<h4 class=\"wp-block-heading\">Forward diffusion intuition<\/h4>\n<p class=\"wp-block-paragraph\">As we encounter Gaussian distributions both as pure noise <em>q<\/em>(<em>x<\/em>\u209c, <em>t<\/em> = <em>T<\/em>) and as the noising distribution <em>q<\/em>(<em>x<\/em>\u209c|<em>x<\/em>\u209c\u208b\u2081), I\u2019ll try to draw the distinction by giving a visual intuition of the distribution for a single noising step, <em>q<\/em>(<em>x<\/em>\u2081\u2223<em>x<\/em>\u2080), for some arbitrary, structured 2-dimensional data:<\/p>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" data-dominant-color=\"f6f7f8\" data-has-transparency=\"true\" style=\"--dominant-color: #f6f7f8;\" loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"508\" src=\"https:\/\/i0.wp.com\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1_PcgCmi3HU44EC_osMkEuMA-1024x508.png?resize=1024%2C508&#038;ssl=1\" alt=\"\" class=\"wp-image-597468 has-transparency\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1_PcgCmi3HU44EC_osMkEuMA-1024x508.png 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1_PcgCmi3HU44EC_osMkEuMA-300x149.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1_PcgCmi3HU44EC_osMkEuMA-768x381.png 768w, 
https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1_PcgCmi3HU44EC_osMkEuMA.png 1189w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\"><figcaption class=\"wp-element-caption\">Each noising step <em>q<\/em>(<em>x<\/em>\u209c|<em>x<\/em>\u209c\u208b\u2081) is a Gaussian distribution conditioned on the previous step.<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">The distribution <em>q<\/em>(<em>x<\/em>\u2081\u2223<em>x<\/em>\u2080) is Gaussian, centered around each point in <em>x<\/em>\u2080, shown in blue. Several example points <em>x<\/em>\u2080\u207d\u2071\u207e are picked to illustrate this, with <em>q<\/em>(<em>x<\/em>\u2081\u2223<em>x<\/em>\u2080 = <em>x<\/em>\u2080\u207d\u2071\u207e) shown in orange.<\/p>\n<p class=\"wp-block-paragraph\">In practice, the main usage of these distributions is to generate specific instances of noised samples for training (discussed further below). We can calculate the parameters of the noising distributions at any timestep <em>t<\/em> directly from the variance schedule, as the chain of Gaussians is itself also Gaussian. This is very convenient, as we don\u2019t need to perform noising sequentially\u2014for any given starting data <em>x<\/em>\u2080\u207d\u2071\u207e, we can calculate the noised sample <em>x<\/em>\u209c\u207d\u2071\u207e by sampling from <em>q<\/em>(<em>x<\/em>\u209c\u2223<em>x<\/em>\u2080 = <em>x<\/em>\u2080\u207d\u2071\u207e) directly.<\/p>\n<h4 class=\"wp-block-heading\">Forward diffusion visualization<\/h4>\n<p class=\"wp-block-paragraph\">Let\u2019s now return to our glyph dataset (once again using the UMAP visualization as a visual shorthand). The top row of the figure below shows our dataset sampled from distributions noised to various timesteps: <em>x<\/em>\u209c \u223c <em>q<\/em>(<em>x<\/em>\u209c). As we increase the number of noising steps, you can see that the dataset begins to resemble pure Gaussian noise. 
The bottom row visualizes the underlying probability distribution <em>q<\/em>(<em>x<\/em>\u209c).<\/p>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" data-dominant-color=\"e2e7ee\" data-has-transparency=\"true\" style=\"--dominant-color: #e2e7ee;\" loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"394\" src=\"https:\/\/i0.wp.com\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1_Hx4xD8VSNvfKx6nkr-xxtQ-1024x394.png?resize=1024%2C394&#038;ssl=1\" alt=\"\" class=\"wp-image-597469 has-transparency\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1_Hx4xD8VSNvfKx6nkr-xxtQ-1024x394.png 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1_Hx4xD8VSNvfKx6nkr-xxtQ-300x115.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1_Hx4xD8VSNvfKx6nkr-xxtQ-768x295.png 768w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1_Hx4xD8VSNvfKx6nkr-xxtQ-1536x590.png 1536w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1_Hx4xD8VSNvfKx6nkr-xxtQ.png 1600w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\"><figcaption class=\"wp-element-caption\">The dataset <em>x<\/em>\u209c (above) sampled from its probability distribution <em>q<\/em>(<em>x<\/em>\u209c) (below) at different noising timesteps.<\/figcaption><\/figure>\n<h4 class=\"wp-block-heading\">Reverse diffusion overview<\/h4>\n<p class=\"wp-block-paragraph\">It follows that if we knew the reverse distributions <em>q<\/em>(<em>x<\/em>\u209c\u208b\u2081\u2223<em>x<\/em>\u209c), we could repeatedly subtract a small amount of noise, starting from a pure noise sample <em>x<\/em>\u209c at <em>t <\/em>=<em> T<\/em> to arrive at a data sample <em>x<\/em>\u2080 \u223c <em>q<\/em>(<em>x<\/em>\u2080). In practice, however, we cannot access these distributions without knowing <em>x<\/em>\u2080 beforehand. 
Intuitively, it\u2019s easy to make a known image much noisier, but given a very noisy image, it\u2019s much harder to guess what the original image was.<\/p>\n<p class=\"wp-block-paragraph\">So what are we to do? Since we have a large amount of data, we can train an ML model to accurately guess the original image that any given noisy image came from. Specifically, we learn the parameters <em>\u03b8<\/em> of an ML model that approximates the reverse noising distributions, <em>p\u03b8<\/em>(<em>x<\/em>\u209c\u208b\u2081 \u2223 <em>x<\/em>\u209c) for <em>t<\/em> = 1,\u00a0\u2026, <em>T<\/em>. In practice, this is embodied in a single <em>noise prediction model<\/em> trained over many different samples and timesteps. This allows it to denoise any given input, as shown in the figure below.<\/p>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" data-dominant-color=\"e7d6cc\" data-has-transparency=\"false\" style=\"--dominant-color: #e7d6cc;\" loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"221\" src=\"https:\/\/i0.wp.com\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1_5jbkLcVGGY-GcBN44WRlYQ-1024x221.png?resize=1024%2C221&#038;ssl=1\" alt=\"\" class=\"wp-image-597470 not-transparent\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1_5jbkLcVGGY-GcBN44WRlYQ-1024x221.png 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1_5jbkLcVGGY-GcBN44WRlYQ-300x65.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1_5jbkLcVGGY-GcBN44WRlYQ-768x166.png 768w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1_5jbkLcVGGY-GcBN44WRlYQ-1536x332.png 1536w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1_5jbkLcVGGY-GcBN44WRlYQ.png 1600w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\"><figcaption class=\"wp-element-caption\">The ML model predicts added noise at any given timestep t.<\/figcaption><\/figure>\n<p 
class=\"wp-block-paragraph\">Next, let\u2019s go over how this noise prediction model is implemented and trained in practice.<\/p>\n<h4 class=\"wp-block-heading\">How the model is implemented<\/h4>\n<p class=\"wp-block-paragraph\">First, we define the ML model\u200a\u2014\u200agenerally a deep neural network of some sort\u200a\u2014\u200athat will act as our noise prediction model. This is what does the heavy lifting! In practice, any ML model that inputs and outputs data of the correct size can be used; the <a href=\"https:\/\/arxiv.org\/abs\/1505.04597\" rel=\"noreferrer noopener\" target=\"_blank\">U-net<\/a>, an architecture particularly suited to learning images, is what we use here and is frequently chosen in practice. More recent models also use <a href=\"https:\/\/arxiv.org\/pdf\/2212.09748\" rel=\"noreferrer noopener\" target=\"_blank\"><em>vision transformers<\/em><\/a>.<\/p>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" data-dominant-color=\"ddd6d3\" data-has-transparency=\"true\" style=\"--dominant-color: #ddd6d3;\" loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"312\" src=\"https:\/\/i0.wp.com\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1_PP6y1_sfgU-r8h4UdfHnOA-1024x312.png?resize=1024%2C312&#038;ssl=1\" alt=\"\" class=\"wp-image-597471 has-transparency\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1_PP6y1_sfgU-r8h4UdfHnOA-1024x312.png 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1_PP6y1_sfgU-r8h4UdfHnOA-300x91.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1_PP6y1_sfgU-r8h4UdfHnOA-768x234.png 768w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1_PP6y1_sfgU-r8h4UdfHnOA-1536x468.png 1536w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1_PP6y1_sfgU-r8h4UdfHnOA.png 1600w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\"><figcaption class=\"wp-element-caption\">We use the U-net 
architecture (<a href=\"https:\/\/arxiv.org\/abs\/1505.04597\" target=\"_blank\" rel=\"noreferrer noopener\">Ronneberger <em>et al<\/em>. 2015<\/a>) for our ML noise prediction model. We train the model by minimizing the difference between predicted and actual noise.<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">Then we run the training loop depicted in the figure above:<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">We take a random image from our dataset and noise it to a random timestep <em>t<\/em>. (In practice, we speed things up by doing many examples in parallel!)<\/li>\n<li class=\"wp-block-list-item\">We feed the noised image into the ML model and train it to predict the (known to us) noise in the image. We also perform <em>timestep conditioning<\/em> by feeding the model a <em>timestep embedding<\/em>, a high-dimensional unique representation of the timestep, so that the model can distinguish between timesteps. This can be a vector the same size as our image directly added to the input (see <a href=\"https:\/\/www.assemblyai.com\/blog\/how-imagen-actually-works\/\" target=\"_blank\" rel=\"noreferrer noopener\">here<\/a> for a discussion of how this is implemented).<\/li>\n<li class=\"wp-block-list-item\">The model \u201clearns\u201d by minimizing the value of a <em>loss function<\/em>, some measure of the difference between the predicted and actual noise. The mean square error (the mean of the squares of the pixel-wise difference between the predicted and actual noise) is used in our case.<\/li>\n<li class=\"wp-block-list-item\">Repeat until the model is well trained.<\/li>\n<\/ul>\n<p class=\"wp-block-paragraph\">Note: A neural network is essentially a function with a huge number of parameters (on the order of 10\u2076 for the glyffuser). Neural network ML models are trained by iteratively updating their parameters using <em>backpropagation<\/em> to minimize a given loss function over many training data examples. 
<a href=\"https:\/\/www.3blue1brown.com\/topics\/neural-networks\" rel=\"noreferrer noopener\" target=\"_blank\">This<\/a> is an excellent introduction. These parameters effectively store the network\u2019s \u201cknowledge\u201d.<\/p>\n<p class=\"wp-block-paragraph\">A noise prediction model trained in this way eventually sees many different combinations of timesteps and data examples. The glyffuser, for example, was trained over 100 <em>epochs<\/em> (runs through the whole data set), so it saw around 2 million data samples. Through this process, the model implicitly learns the reverse diffusion distributions over the entire dataset at all different timesteps. This allows the model to sample the underlying distribution <em>q<\/em>(<em>x<\/em>\u2080) by stepwise denoising starting from pure noise. Put another way, given an image noised to any given level, the model can predict how to reduce the noise based on its guess of what the original image was. By doing this repeatedly, updating its guess of the original image each time, the model can transform any noise to a sample that lies in a high-probability region of the underlying data distribution.<\/p>\n<h4 class=\"wp-block-heading\">Reverse diffusion in\u00a0practice<\/h4>\n<figure class=\"wp-block-image size-full\"><img data-recalc-dims=\"1\" data-dominant-color=\"6f6f6f\" data-has-transparency=\"false\" style=\"--dominant-color: #6f6f6f;\" loading=\"lazy\" decoding=\"async\" width=\"522\" height=\"552\" src=\"https:\/\/i0.wp.com\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1_V07XdbZNdDA0IL-IfGHgPQ-1.gif?resize=522%2C552&#038;ssl=1\" alt=\"\" class=\"wp-image-597472 not-transparent\"><\/figure>\n<p class=\"wp-block-paragraph\">We can now revisit this video of the glyffuser denoising process. Recall that a large number of steps from sample to noise (e.g. <em>T<\/em> = 1000) is used during training to make the noise-to-sample trajectory very easy for the model to learn, as changes between steps will be small. 
Does that mean we need to run 1000 denoising steps every time we want to generate a sample?<\/p>\n<p class=\"wp-block-paragraph\">Luckily, this is not the case. Essentially, we can run the single-step noise prediction but then rescale it to any given step, although the result may be poor if the gap is too large! This allows us to approximate the full sampling trajectory with fewer steps. The video above uses 120 steps, for instance (most implementations will allow the user to set the number of sampling steps).<\/p>\n<p class=\"wp-block-paragraph\">Recall that predicting the noise at a given step is equivalent to predicting the original image <em>x<\/em>\u2080, and that any noised image can be written in closed form using only the variance schedule, <em>x<\/em>\u2080, and the noise. Thus, from the prediction at any denoising step, we can calculate <em>x<\/em>\u209c\u208b\u2096 for a chosen step size <em>k<\/em>. The closer the steps are, the better the approximation will be.<\/p>\n<p class=\"wp-block-paragraph\">Too few steps, however, and the results become worse as the steps become too large for the model to effectively approximate the denoising trajectory. If we only use 5 sampling steps, for example, the sampled characters don\u2019t look very convincing at all:<\/p>\n<figure class=\"wp-block-image size-full\"><img data-recalc-dims=\"1\" data-dominant-color=\"787978\" data-has-transparency=\"false\" style=\"--dominant-color: #787978;\" loading=\"lazy\" decoding=\"async\" width=\"522\" height=\"552\" src=\"https:\/\/i0.wp.com\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1_TmF7SZlxFhNliilElvqOtw.gif?resize=522%2C552&#038;ssl=1\" alt=\"\" class=\"wp-image-597473 not-transparent\"><\/figure>\n<p class=\"wp-block-paragraph\">There is a whole literature on more advanced sampling methods beyond what we\u2019ve discussed so far, allowing effective sampling with far fewer steps. 
These often reframe the sampling as a differential equation to be solved deterministically, giving an eerie quality to the sampling videos\u200a\u2014\u200aI\u2019ve included one at the <a href=\"https:\/\/yue-here.com\/posts\/diffusion\/#fun-extras\" rel=\"noreferrer noopener\" target=\"_blank\">end<\/a> if you\u2019re interested. In production-level models, these are usually preferred over the simple method discussed here, but the basic principle of deducing the noise-to-sample trajectory is the same. A full discussion is beyond the scope of this article but see e.g. <a href=\"https:\/\/arxiv.org\/abs\/2206.00364\" rel=\"noreferrer noopener\" target=\"_blank\">this paper<\/a> and its corresponding <a href=\"https:\/\/huggingface.co\/docs\/diffusers\/en\/api\/schedulers\/overview\" rel=\"noreferrer noopener\" target=\"_blank\">implementation<\/a> in the Hugging Face <code>diffusers<\/code> library for more information.<\/p>\n<h4 class=\"wp-block-heading\">Alternative intuition from score\u00a0function<\/h4>\n<p class=\"wp-block-paragraph\">To me, it was still not 100% clear why training the model on noise prediction generalises so well. 
I found that an alternative interpretation of diffusion models known as \u201cscore-based modeling\u201d filled some of the gaps in intuition (for more information, refer to Yang Song\u2019s <a href=\"https:\/\/yang-song.net\/blog\/2021\/score\/\" rel=\"noreferrer noopener\" target=\"_blank\">definitive article<\/a> on the topic).<\/p>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" data-dominant-color=\"e6e8ed\" data-has-transparency=\"true\" style=\"--dominant-color: #e6e8ed;\" loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"558\" src=\"https:\/\/i0.wp.com\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1_958_vRbvMlDvEHgMo4jGfA-1024x558.png?resize=1024%2C558&#038;ssl=1\" alt=\"\" class=\"wp-image-597474 has-transparency\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1_958_vRbvMlDvEHgMo4jGfA-1024x558.png 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1_958_vRbvMlDvEHgMo4jGfA-300x164.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1_958_vRbvMlDvEHgMo4jGfA-768x419.png 768w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1_958_vRbvMlDvEHgMo4jGfA-1536x837.png 1536w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1_958_vRbvMlDvEHgMo4jGfA.png 1600w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\"><figcaption class=\"wp-element-caption\">The dataset <em>x<\/em>\u209c sampled from its probability distribution <em>q<\/em>(<em>x<\/em>\u209c) at different noising timesteps; below, we add the score function \u2207\u2093 log <em>q<\/em>(<em>x<\/em>\u209c).<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">I try to give a visual intuition in the bottom row of the figure above: essentially, learning the noise in our diffusion model is <a href=\"https:\/\/calvinyluo.com\/2022\/08\/26\/diffusion-tutorial.html\" rel=\"noreferrer noopener\" target=\"_blank\">equivalent<\/a> (up to a scaling factor) to learning the 
<em>score function<\/em>, which is the gradient of the log of the probability distribution: \u2207\u2093 log <em>q<\/em>(<em>x<\/em>). As a gradient, the score function represents a vector field with vectors pointing in the direction of increasing probability density. Subtracting the noise at each step is then equivalent to following the directions in this vector field towards regions of high probability density.<\/p>\n<p class=\"wp-block-paragraph\">As long as there is some signal, the score function effectively guides sampling, but in regions of low probability it tends towards zero as there is little to no gradient to follow. Using many steps to cover different noise levels allows us to avoid this, as we smear out the gradient field at high noise levels, allowing sampling to converge even if we start from low probability density regions of the distribution. The figure shows that as the noise level is increased, more of the domain is covered by the score function vector field.<\/p>\n<h4 class=\"wp-block-heading\">Summary<\/h4>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">The aim of diffusion models is to learn the underlying probability distribution of a dataset and then be able to sample from it. This requires forward and reverse diffusion (noising) processes.<\/li>\n<li class=\"wp-block-list-item\">The forward noising process takes samples from our dataset and gradually adds Gaussian noise (pushes them off the data manifold). This forward process is computationally efficient because any level of noise can be added in closed form in a single step.<\/li>\n<li class=\"wp-block-list-item\">The reverse noising process is challenging because we need to predict how to remove the noise at each step without knowing the original data point in advance. 
We train an ML model to do this by giving it many examples of data noised at different timesteps.<\/li>\n<li class=\"wp-block-list-item\">Using very small steps in the forward noising process makes it easier for the model to learn to reverse these steps, as the changes are small.<\/li>\n<li class=\"wp-block-list-item\">By applying the reverse noising process iteratively, the model refines noisy samples step by step, eventually producing a realistic data point (one that lies on the data manifold).<\/li>\n<\/ul>\n<h4 class=\"wp-block-heading\">Takeaway<\/h4>\n<p class=\"wp-block-paragraph\">Diffusion models are a powerful framework for learning complex data distributions. The distributions are learnt implicitly by modelling a sequential denoising process. This process can then be used to generate samples similar to those in the training distribution.<\/p>\n<h3 class=\"wp-block-heading\">Once you\u2019ve trained a model, how do you get useful stuff out of\u00a0it?<\/h3>\n<p class=\"wp-block-paragraph\">Earlier uses of generative AI such as \u201c<a href=\"https:\/\/thispersondoesnotexist.com\/\" rel=\"noreferrer noopener\" target=\"_blank\">This Person Does Not Exist<\/a>\u201d (<em>ca<\/em>. 2019) made waves simply because it was the first time most people had seen AI-generated photorealistic human faces. A generative adversarial network or \u201cGAN\u201d was used in that case, but the principle remains the same: the model implicitly learnt an underlying data distribution\u200a\u2014\u200ain that case, human faces\u200a\u2014\u200athen sampled from it. So far, our glyffuser model does a similar thing: it samples randomly from the distribution of Chinese glyphs.<\/p>\n<p class=\"wp-block-paragraph\">The question then arises: can we do something more useful than just sample randomly? You\u2019ve likely already encountered text-to-image models such as Dall-E. 
They are able to incorporate extra meaning from text prompts into the diffusion process\u200a\u2014\u200athis is known as <em>conditioning<\/em>. Likewise, diffusion models for scientific applications like protein (e.g. <a href=\"https:\/\/github.com\/generatebio\/chroma\" rel=\"noreferrer noopener\" target=\"_blank\">Chroma<\/a>, <a href=\"https:\/\/github.com\/RosettaCommons\/RFdiffusion\" rel=\"noreferrer noopener\" target=\"_blank\">RFdiffusion<\/a>, <a href=\"https:\/\/github.com\/google-deepmind\/alphafold3\" rel=\"noreferrer noopener\" target=\"_blank\">AlphaFold3<\/a>) or inorganic crystal structure generation (e.g. <a href=\"https:\/\/arxiv.org\/abs\/2312.03687\" rel=\"noreferrer noopener\" target=\"_blank\">MatterGen<\/a>) become much more useful if they can be conditioned to generate samples with desirable properties such as a specific symmetry, bulk modulus, or band gap.<\/p>\n<h4 class=\"wp-block-heading\">Conditional distributions<\/h4>\n<p class=\"wp-block-paragraph\">We can consider conditioning as a way to guide the diffusion sampling process towards particular regions of our probability distribution. We mentioned conditional distributions <a href=\"https:\/\/yue-here.com\/posts\/diffusion\/#forward-diffusion-intuition\" rel=\"noreferrer noopener\" target=\"_blank\">in the context of forward diffusion<\/a>. 
Below we show how conditioning can be thought of as reshaping a base distribution.<\/p>\n<figure class=\"wp-block-image size-full\"><img data-recalc-dims=\"1\" data-dominant-color=\"f1f1f1\" data-has-transparency=\"true\" style=\"--dominant-color: #f1f1f1;\" loading=\"lazy\" decoding=\"async\" width=\"606\" height=\"587\" src=\"https:\/\/i0.wp.com\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1_A3Xif4trZJMzr0MY8c7hng.png?resize=606%2C587&#038;ssl=1\" alt=\"\" class=\"wp-image-597475 has-transparency\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1_A3Xif4trZJMzr0MY8c7hng.png 606w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1_A3Xif4trZJMzr0MY8c7hng-300x291.png 300w\" sizes=\"auto, (max-width: 606px) 100vw, 606px\"><figcaption class=\"wp-element-caption\">A simple example of a joint probability distribution <em>p<\/em>(<em>x<\/em>, <em>y<\/em>), shown as a contour map, along with its two marginal 1-D probability distributions, <em>p<\/em>(<em>x<\/em>) and <em>p<\/em>(<em>y<\/em>). The highest points of <em>p<\/em>(<em>x<\/em>, <em>y<\/em>) are at (<em>x<\/em>\u2081, <em>y<\/em>\u2081) and (<em>x<\/em>\u2082, <em>y<\/em>\u2082). The conditional distributions <em>p<\/em>(<em>x<\/em>\u2223<em>y<\/em> = <em>y<\/em>\u2081) and <em>p<\/em>(<em>x<\/em>\u2223<em>y<\/em> = <em>y<\/em>\u2082) are shown overlaid on the main plot.<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">Consider the figure above. Think of <em>p<\/em>(<em>x<\/em>) as a distribution we want to sample from (i.e., the images) and <em>p<\/em>(<em>y<\/em>) as conditioning information (i.e., the text dataset). These are the marginal distributions of a joint distribution <em>p<\/em>(<em>x<\/em>, <em>y<\/em>). 
Integrating <em>p<\/em>(<em>x<\/em>, <em>y<\/em>) over <em>y<\/em> recovers <em>p<\/em>(<em>x<\/em>), and vice versa.<\/p>\n<p class=\"wp-block-paragraph\">Sampling from <em>p<\/em>(<em>x<\/em>), we are equally likely to get <em>x<\/em>\u2081 or <em>x<\/em>\u2082. However, we can condition on <em>y<\/em> = <em>y<\/em>\u2081 to obtain <em>p<\/em>(<em>x<\/em>\u2223<em>y<\/em> = <em>y<\/em>\u2081). You can think of this as taking a slice through <em>p<\/em>(<em>x<\/em>, <em>y<\/em>) at a given value of <em>y<\/em>. In this conditioned distribution, we are much more likely to sample at <em>x<\/em>\u2081 than <em>x<\/em>\u2082.<\/p>\n<p class=\"wp-block-paragraph\">In practice, in order to condition on a text dataset, we need to convert the text into a numerical form. We can do this using <em>large language model (LLM) embeddings<\/em> that can be injected into the noise prediction model during training.<\/p>\n<h4 class=\"wp-block-heading\">Embedding text with an\u00a0LLM<\/h4>\n<p class=\"wp-block-paragraph\">In the glyffuser, our conditioning information is in the form of <a href=\"https:\/\/github.com\/unicode-org\/unihan-database\/blob\/main\/kDefinition.txt\" rel=\"noreferrer noopener\" target=\"_blank\">English text definitions<\/a>. We have two requirements: 1) ML models prefer fixed-length vectors as input. 2) The numerical representation of our text must understand context\u200a\u2014\u200aif we have the words \u201clithium\u201d and \u201celement\u201d nearby, the meaning of \u201celement\u201d should be understood as \u201cchemical element\u201d rather than \u201cheating element\u201d. Both of these requirements can be met by using a pre-trained LLM.<\/p>\n<p class=\"wp-block-paragraph\">The diagram below shows how an LLM converts text into fixed-length vectors. The text is first <em>tokenized<\/em> (LLMs break text into <em>tokens<\/em>, small chunks of characters, as their basic unit of interaction). 
Each token is converted into a base <em>embedding<\/em>, which is a fixed-length vector of the size of the LLM input. These vectors are then passed through the pre-trained LLM (here we use the <em>encoder<\/em> portion of Google\u2019s T5 model), where they are imbued with additional contextual meaning. We end up with an array of <em>n<\/em> vectors of the same length <em>d<\/em>, i.e. an (<em>n, d<\/em>)-sized tensor.<\/p>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" data-dominant-color=\"b49693\" data-has-transparency=\"true\" style=\"--dominant-color: #b49693;\" loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"576\" src=\"https:\/\/i0.wp.com\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1_DVNtmX-AmzCQ6mRB0Uu4xA-1024x576.png?resize=1024%2C576&#038;ssl=1\" alt=\"\" class=\"wp-image-597476 has-transparency\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1_DVNtmX-AmzCQ6mRB0Uu4xA-1024x576.png 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1_DVNtmX-AmzCQ6mRB0Uu4xA-300x169.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1_DVNtmX-AmzCQ6mRB0Uu4xA-768x432.png 768w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1_DVNtmX-AmzCQ6mRB0Uu4xA-1536x864.png 1536w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1_DVNtmX-AmzCQ6mRB0Uu4xA.png 1600w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\"><figcaption class=\"wp-element-caption\">We can convert text to a numerical embedding imbued with contextual meaning using a pre-trained LLM.<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">Note: in some models, notably Dall-E, additional image-text alignment is performed using <a href=\"https:\/\/arxiv.org\/abs\/2112.10741\" rel=\"noreferrer noopener\" target=\"_blank\"><em>contrastive pretraining<\/em><\/a>. 
<a href=\"https:\/\/arxiv.org\/abs\/2205.11487\" rel=\"noreferrer noopener\" target=\"_blank\">Imagen<\/a> seems to show that we can get away without doing this.<\/p>\n<h4 class=\"wp-block-heading\">Training the diffusion model with text conditioning<\/h4>\n<p class=\"wp-block-paragraph\">The exact method by which this embedding is injected into the model can vary. In Google\u2019s <a href=\"https:\/\/arxiv.org\/pdf\/2205.11487\" rel=\"noreferrer noopener\" target=\"_blank\">Imagen<\/a> model, for example, the embedding tensor is pooled (combined into a single vector in the embedding dimension) and added into the data as it passes through the noise prediction model; it is also included in a different way using <em>cross-attention<\/em> (a method of learning contextual information between sequences of tokens, most famously used in the <em>transformer<\/em> models that form the basis of LLMs like ChatGPT).<\/p>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" data-dominant-color=\"e3d9d2\" data-has-transparency=\"true\" style=\"--dominant-color: #e3d9d2;\" loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"436\" src=\"https:\/\/i0.wp.com\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1_OMSAlCGtyvjWJW77-QjTDw-1024x436.png?resize=1024%2C436&#038;ssl=1\" alt=\"\" class=\"wp-image-597477 has-transparency\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1_OMSAlCGtyvjWJW77-QjTDw-1024x436.png 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1_OMSAlCGtyvjWJW77-QjTDw-300x128.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1_OMSAlCGtyvjWJW77-QjTDw-768x327.png 768w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1_OMSAlCGtyvjWJW77-QjTDw-1536x655.png 1536w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1_OMSAlCGtyvjWJW77-QjTDw.png 1600w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\"><figcaption 
class=\"wp-element-caption\">Conditioning information can be added <em>via<\/em> multiple different methods but the training loss remains the same.<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">In the glyffuser, we only use cross-attention to introduce this conditioning information. While a significant architectural change is required to introduce this additional information into the model, the loss function for our noise prediction model remains exactly the same.<\/p>\n<h3 class=\"wp-block-heading\">Testing the conditioned diffusion model<\/h3>\n<p class=\"wp-block-paragraph\">Let\u2019s do a simple test of the fully trained conditioned diffusion model. In the figure below, we try to denoise in a single step with the text prompt \u201cGold\u201d. As touched upon in our <a href=\"https:\/\/yue-here.com\/posts\/diffusion\/#dataset-visualization\" rel=\"noreferrer noopener\" target=\"_blank\">interactive UMAP<\/a>, Chinese characters often contain components known as <em>radicals<\/em> which can convey sound (phonetic radicals) or meaning (semantic radicals). 
A common semantic radical is derived from the character meaning \u201cgold\u201d, \u201c\u91d1\u201d, and is used in characters that are in some broad sense associated with gold or metals.<\/p>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" data-dominant-color=\"f0dacb\" data-has-transparency=\"false\" style=\"--dominant-color: #f0dacb;\" loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"500\" src=\"https:\/\/i0.wp.com\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/0_5LZhzCHqsoyTBRCS-1024x500.png?resize=1024%2C500&#038;ssl=1\" alt=\"\" class=\"wp-image-597478 not-transparent\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/0_5LZhzCHqsoyTBRCS-1024x500.png 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/0_5LZhzCHqsoyTBRCS-300x146.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/0_5LZhzCHqsoyTBRCS-768x375.png 768w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/0_5LZhzCHqsoyTBRCS-1536x750.png 1536w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/0_5LZhzCHqsoyTBRCS.png 1600w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\"><figcaption class=\"wp-element-caption\">Even with a single sampling step, conditioning guides denoising towards the relevant regions of the probability distribution.<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">The figure shows that even though a single step is insufficient to approximate the denoising trajectory very well, we have moved into a region of our probability distribution with the \u201c\u91d1\u201d radical. This indicates that the text prompt is effectively guiding our sampling towards a region of the glyph probability distribution related to the meaning of the prompt. The animation below shows a 120 step denoising sequence for the same prompt, \u201cGold\u201d. 
You can see that every generated glyph has either the \u91d2 or \u9485 radical (the same radical in traditional and simplified Chinese, respectively).<\/p>\n<figure class=\"wp-block-image size-full\"><img data-recalc-dims=\"1\" data-dominant-color=\"6c6d6c\" data-has-transparency=\"false\" style=\"--dominant-color: #6c6d6c;\" loading=\"lazy\" decoding=\"async\" width=\"522\" height=\"284\" src=\"https:\/\/i0.wp.com\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1_iQQMTOAG7dwM0KmYCg3cCg-1.gif?resize=522%2C284&#038;ssl=1\" alt=\"\" class=\"wp-image-597479 not-transparent\"><\/figure>\n<p class=\"wp-block-paragraph\"><strong>Takeaway<\/strong><\/p>\n<p class=\"wp-block-paragraph\">Conditioning enables us to sample meaningful outputs from diffusion models.<\/p>\n<h3 class=\"wp-block-heading\">Further remarks<\/h3>\n<p class=\"wp-block-paragraph\">I found that with the help of tutorials and existing libraries, it was possible to implement a working diffusion model despite not having a full understanding of what was going on under the hood. I think this is a good way to start learning and highly recommend Hugging Face\u2019s <a href=\"https:\/\/huggingface.co\/docs\/diffusers\/tutorials\/basic_training\" rel=\"noreferrer noopener\" target=\"_blank\">tutorial<\/a> on training a simple diffusion model using their <code>diffusers<\/code> Python library (which now includes my small <a href=\"https:\/\/github.com\/huggingface\/diffusers\/pull\/8223\" rel=\"noreferrer noopener\" target=\"_blank\">bugfix<\/a>!).<\/p>\n<p class=\"wp-block-paragraph\">I\u2019ve omitted some topics that are crucial to how production-grade diffusion models function, but are unnecessary for core understanding. One is the question of how to generate high resolution images. In our example, we did everything in pixel space, but this becomes very computationally expensive for large images. The general approach is to perform diffusion in a smaller space, then upscale it in a separate step. 
Methods include latent diffusion (used in Stable Diffusion) and cascaded super-resolution models (used in Imagen). Another topic is classifier-free guidance, a very elegant method for boosting the conditioning effect to give much better prompt adherence. I show the implementation in my previous post on the <a href=\"https:\/\/yue-here.com\/posts\/glyffuser\/\" rel=\"noreferrer noopener\" target=\"_blank\">glyffuser<\/a> and highly recommend <a href=\"https:\/\/sander.ai\/2022\/05\/26\/guidance.html\" rel=\"noreferrer noopener\" target=\"_blank\">this article<\/a> if you want to learn more.<\/p>\n<h3 class=\"wp-block-heading\">Further reading<\/h3>\n<p class=\"wp-block-paragraph\">A non-exhaustive list of materials I found very helpful:<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">Jonathan Ho\u2019s paper, <a href=\"https:\/\/arxiv.org\/abs\/2006.11239\" target=\"_blank\" rel=\"noreferrer noopener\"><em>Denoising Diffusion Probabilistic Models<\/em><\/a>\n<\/li>\n<li class=\"wp-block-list-item\">Yang Song\u2019s article on score-based models, <a href=\"https:\/\/yang-song.net\/blog\/2021\/score\/\" target=\"_blank\" rel=\"noreferrer noopener\"><em>Generative Modeling by Estimating Gradients of the Data Distribution<\/em><\/a>\n<\/li>\n<li class=\"wp-block-list-item\">Calvin Luo\u2019s article <a href=\"https:\/\/calvinyluo.com\/2022\/08\/26\/diffusion-tutorial.html\" target=\"_blank\" rel=\"noreferrer noopener\"><em>Understanding Diffusion Models: A Unified Perspective<\/em><\/a>\n<\/li>\n<li class=\"wp-block-list-item\">Lilian Weng\u2019s blog post, <a href=\"https:\/\/lilianweng.github.io\/posts\/2021-07-11-diffusion-models\/\" target=\"_blank\" rel=\"noreferrer noopener\"><em>What are diffusion models?<\/em><\/a>\n<\/li>\n<li class=\"wp-block-list-item\">Jeremy Howard\u2019s course <a href=\"https:\/\/course.fast.ai\/Lessons\/part2.html\" target=\"_blank\" rel=\"noreferrer noopener\"><em>From Deep Learning Foundations to Stable 
Diffusion<\/em><\/a>\n<\/li>\n<li class=\"wp-block-list-item\">Ryan O\u2019Connor\u2019s tutorial <a href=\"https:\/\/www.assemblyai.com\/blog\/minimagen-build-your-own-imagen-text-to-image-model\/\" target=\"_blank\" rel=\"noreferrer noopener\"><em>MinImagen\u200a\u2014\u200aBuild Your Own Imagen Text-to-Image Model<\/em><\/a>\n<\/li>\n<li class=\"wp-block-list-item\">Jonathan Kernes\u2019 article <a href=\"https:\/\/towardsdatascience.com\/diffusion-models-91b75430ec2\" target=\"_blank\" rel=\"noreferrer noopener\"><em>Diffusion Models<\/em><\/a>\n<\/li>\n<li class=\"wp-block-list-item\">Sander Dieleman\u2019s <a href=\"https:\/\/sander.ai\/2023\/07\/20\/perspectives.html\" target=\"_blank\" rel=\"noreferrer noopener\"><em>Perspectives on diffusion<\/em><\/a> and <a href=\"https:\/\/sander.ai\/2022\/05\/26\/guidance.html\" target=\"_blank\" rel=\"noreferrer noopener\"><em>Guidance: a cheat code for diffusion models<\/em><\/a>\n<\/li>\n<li class=\"wp-block-list-item\">Stefano Ermon\u2019s Stanford CS236 course <a href=\"https:\/\/deepgenerativemodels.github.io\/\" target=\"_blank\" rel=\"noreferrer noopener\"><em>Deep Generative Models<\/em><\/a>\n<\/li>\n<\/ul>\n<h3 class=\"wp-block-heading\">Fun extras<\/h3>\n<figure class=\"wp-block-image size-full\"><img data-recalc-dims=\"1\" data-dominant-color=\"474747\" data-has-transparency=\"false\" style=\"--dominant-color: #474747;\" loading=\"lazy\" decoding=\"async\" width=\"522\" height=\"552\" src=\"https:\/\/i0.wp.com\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1_kZw2iHYLPt119QUVzcMQhA.gif?resize=522%2C552&#038;ssl=1\" alt=\"\" class=\"wp-image-597480 not-transparent\"><\/figure>\n<p class=\"wp-block-paragraph\">Diffusion sampling using the <a href=\"https:\/\/huggingface.co\/docs\/diffusers\/main\/en\/api\/schedulers\/dpm_sde\/\" target=\"_blank\" rel=\"noreferrer noopener\"><code>DPMSolverSDEScheduler<\/code><\/a> developed by <a href=\"https:\/\/github.com\/crowsonkb\/\" 
target=\"_blank\" rel=\"noreferrer noopener\">Katherine Crowson<\/a> and implemented in Hugging Face <code>diffusers<\/code>\u2014note the smooth transition from noise to data.<\/p>\n<p>The post <a href=\"https:\/\/towardsdatascience.com\/a-visual-guide-to-how-diffusion-models-work\/\">A Visual Guide to How Diffusion Models\u00a0Work<\/a> appeared first on <a href=\"https:\/\/towardsdatascience.com\/\">Towards Data Science<\/a>.<\/p>\n<\/div>\n<p> \t<BR><br \/>\n <BR><\/BR><br \/>\n    Yue Wu<br \/>\n \t<BR><br \/>\n<BR><\/BR><br \/>\n<a href=\"https:\/\/towardsdatascience.com\/a-visual-guide-to-how-diffusion-models-work\/\">Go to original source<\/a><br \/>\n \t<BR><br \/>\n <BR><\/BR><\/p>\n","protected":false},"excerpt":{"rendered":"<p>A Visual Guide to How Diffusion Models\u00a0Work This article is aimed at those who want to understand exactly how Diffusion Models work, with no prior knowledge expected. I\u2019ve tried to use illustrations wherever possible to provide visual intuitions on each part of these models. 
I\u2019ve kept mathematical notation and equations to a minimum, and where [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[62,1662,67,1663,1664,70,1665],"tags":[454,103,73],"class_list":["post-1710","post","type-post","status-publish","format-standard","hentry","category-aimldsaimlds","category-artificial-inteligence","category-deep-dives","category-diffusion-models","category-generative-ai","category-machine-learning","category-text-to-image-generation","tag-diffusion","tag-model","tag-models"],"_links":{"self":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/1710"}],"collection":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/comments?post=1710"}],"version-history":[{"count":0,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/1710\/revisions"}],"wp:attachment":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/media?parent=1710"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/categories?post=1710"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/tags?post=1710"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}