An In-Depth Guide To Denoising Diffusion Probabilistic Models - From Theory To Implementation
MARCH 6, 2023
Diffusion probabilistic models are an exciting new area of research showing great promise in image generation. Diffusion-based generative models were first introduced in 2015 and popularized in 2020, when Ho et al. published the paper “Denoising Diffusion Probabilistic Models” (DDPMs). DDPMs are responsible for making diffusion models practical. In this article, we will highlight the key concepts and techniques behind DDPMs and train a DDPM from scratch on a “flowers” dataset for unconditional image generation.
In DDPMs, the authors changed the formulation and model training procedures, which helped improve and achieve “image fidelity” rivaling GANs, and established the validity of these new generative algorithms.
The best approach to completely understanding “Denoising Diffusion Probabilistic Models” is by going over both theory (+ some math) and the
underlying code. With that in mind, let’s explore the learning path where:
We’ll first explain what generative models are and why they are needed.
We’ll discuss, from a theoretical standpoint, the approach used in diffusion-based generative models.
We’ll explore all the math necessary to understand denoising diffusion probabilistic models.
Finally, we’ll discuss the training and inference used in DDPMs for image generation and code it from scratch in PyTorch.
We need to create and train generative models because the set of all possible images that can be represented by, say, just (256x256x3) images is
enormous. An image must have the right pixel value combinations to represent something meaningful (something we can understand).
(https://learnopencv.com/wp-content/uploads/2023/02/denoising-diffusion-probabilistic-
models_sunflower.jpg)
For example, for the above image to represent a “Sunflower”, the pixels in the image need to be in the right configuration (they need to have the right
values). And the space where such images exist is just a fraction of the entire set of images that can be represented by a (256x256x3) image space.
Now, if we knew how to get/sample a point from this subspace, we wouldn’t need to build “generative models.” However, at this point in time, we don’t.
😓
The probability distribution function or, more precisely, the probability density function (PDF) that captures/models this (data) subspace remains unknown and is most likely too complex to make sense of.
This is why we need generative models: to figure out the underlying likelihood function our data satisfies.
PS: A PDF is a “probability function” representing the density (likelihood) of a continuous random variable – which, in this case, means a function
representing the likelihood of an image lying between a specific range of values defined by the function’s parameters.
PPS: Every PDF has a set of parameters that determine the shape and probabilities of the distribution. The shape of the distribution changes as the parameter values change. For example, in the case of a normal distribution, we have the mean µ (mu) and the variance σ² (sigma squared) that control the distribution’s center point and spread.
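For reference, the 1-D normal (Gaussian) PDF, written out in terms of these two parameters, is:

$$p(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$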
(https://learnopencv.com/wp-
content/uploads/2023/02/denoising-diffusion-probabilistic-models_gaussian_distribution_example.png)
In this section, we’ll explain diffusion-based generative models from a logical and theoretical perspective. Next, we’ll review all the math required to understand and implement Denoising Diffusion Probabilistic Models from scratch.
Diffusion models are a class of generative models inspired by an idea in Non-Equilibrium Statistical Physics, which states:
“We can gradually convert one distribution into another using a Markov chain”
Diffusion generative models are composed of two opposite processes, i.e., the Forward and Reverse Diffusion processes.
1. In the “Forward Diffusion” process, we slowly and iteratively add noise to (corrupt) the images in our training set such that they “move out or move
away” from their existing subspace.
2. What we are doing here is converting the unknown and complex distribution that our training set belongs to into one that is easy for us to sample
a (data) point from and understand.
3. At the end of the forward process, the images become entirely unrecognizable. The complex data distribution is wholly transformed into a
(chosen) simple distribution. Each image gets mapped to a space outside the data subspace.
(https://learnopencv.com/wp-content/uploads/2023/02/denoising-diffusion-
probabilistic-models_forward_process_changing_distribution.png)
Source: https://ayandas.me/blog-tut/2021/12/04/diffusion-prob-models.html
By decomposing the image formation process into a sequential application of denoising autoencoders, diffusion models (DMs) achieve
state-of-the-art synthesis results on image data and beyond.
(https://learnopencv.com/wp-content/uploads/2023/02/denoising-diffusion-probabilistic-
models_moving_from_simple_to_data_space-1.png)
1. In the “Reverse Diffusion process,” the idea is to reverse the forward diffusion process.
2. We slowly and iteratively try to reverse the corruption performed on images in the forward process.
3. The reverse process starts where the forward process ends.
4. The benefit of starting from a simple space is that we know how to get/sample a point from this simple distribution (think of it as any point
outside the data subspace).
5. And our goal here is to figure out how to return to the data subspace.
6. However, the problem is that we can take infinite paths starting from a point in this “simple” space, but only a fraction of them will take us to the
“data” subspace.
7. In diffusion probabilistic models, this is done by referring to the small iterative steps taken during the forward diffusion process.
8. The PDF that satisfies the corrupted images in the forward process differs slightly at each step.
9. Hence, in the reverse process, we use a deep-learning model at each step to predict the PDF parameters of the forward process.
10. And once we train the model, we can start from any point in the simple space and use the model to iteratively take steps to lead us back to the data subspace.
11. In reverse diffusion, we iteratively perform the “denoising” in small steps, starting from a noisy image.
12. This approach for training and generating new samples is much more stable than GANs and better than previous approaches like variational
autoencoders (VAE) and normalizing flows.
(https://learnopencv.com/wp-
content/uploads/2023/01/diffusion-models-unconditional_image_generation-1.gif)
Since their introduction in 2020, DDPMs have been the foundation for cutting-edge image generation systems, including DALL-E 2, Imagen, Stable Diffusion, and Midjourney.
With the huge number of AI art generation tools available today, it is difficult to find the right one for a particular use case. In our recent article, we explored all the different AI art generation tools (/ai-art-generation-tools/) so that you can make an informed choice to generate the best art.
In this section, we’ll cover all the required math while making sure it’s also easy to follow.
Let’s begin…
(https://learnopencv.com/wp-
content/uploads/2023/01/diffusion-models-forwardbackward_process_ddpm.png)
1. $q(x_t \mid x_{t-1})$ –
1. This term is also known as the forward diffusion kernel (FDK).
2. It defines the PDF of an image at timestep t in the forward diffusion process, x_t, given the image x_{t-1}.
3. It denotes the “transition function” applied at each step in the forward diffusion process.
2. $p_\theta(x_{t-1} \mid x_t)$ –
1. Similar to the forward process, it is known as the reverse diffusion kernel (RDK).
2. It stands for the PDF of x_{t-1} given x_t, as parameterized by θ. The θ means that the parameters of the distribution of the reverse process are learned using a neural network.
3. It’s the “transition function” applied at each step in the reverse diffusion process.
(https://learnopencv.com/wp-content/uploads/2023/02/denoising-diffusion-
probabilistic-models_forward_diffusion_equations_1.png)
1. We begin by taking an image from our dataset: x0. Mathematically it’s stated as sampling a data point from the original (but unknown) data
distribution: x0 ~ q(x0).
2. The PDF of the forward process is the product of the individual transition distributions from timestep 1 → T.
3. The forward diffusion process is fixed and known.
4. All the intermediate noisy images starting from timestep 1 to T are also called “latents.” The dimension of the latents is the same as that of the original image.
5. The PDF used to define the FDK is a “Normal/Gaussian distribution” (eqn. 2).
6. At each timestep t, the parameters that define the distribution of image x_t are set as:
Mean: $\sqrt{1-\beta_t}\, x_{t-1}$
Covariance: $\beta_t \mathbf{I}$
7. The term β (beta) is known as the “diffusion rate” and is precalculated using a “variance scheduler”. The term I is an identity matrix. Therefore, the distribution at each time step is called an Isotropic Gaussian.
8. The original image is corrupted at each time step by adding a small amount of gaussian noise (ɛ). The amount of noise added is regulated by the
scheduler.
9. By choosing a sufficiently large number of timesteps and defining a well-behaved schedule of β_t, the repeated application of the FDK gradually converts the data distribution into (nearly) an isotropic Gaussian distribution.
(https://learnopencv.com/wp-
content/uploads/2023/02/denoising-diffusion-probabilistic-models-overall_forward_diffusion_process-1.png)
How do we get image x_t from x_{t-1}, and how is noise added at each time step?
This can be easily understood by using the reparameterization trick in variational autoencoders (/variational-autoencoder-in-
tensorflow/#reparameterization).
Referring to the second equation, we can easily sample image xt from a normal distribution as:
(https://learnopencv.com/wp-content/uploads/2023/02/denoising-diffusion-probabilistic-
models_forward_diffusion_equations_2.png)
1. Here, the epsilon ε is the “noise” term that is randomly sampled from the standard Gaussian distribution; it is first scaled and then added to the (scaled) x_{t-1}.
2. In this way, starting from x_0, the original image is iteratively corrupted from t = 1…T.
In practice, the authors of DDPMs use a “linear variance scheduler,” define β in the range [0.0001, 0.02], and set the total timesteps T = 1000.
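To make the reparameterized update concrete, here is a minimal sketch of a single forward-diffusion step with the linear schedule described above. The function and variable names here are just for illustration; the post’s actual implementation is the SimpleDiffusion class shown later.

import torch

T = 1000                               # total diffusion timesteps, as in the paper
beta = torch.linspace(1e-4, 0.02, T)   # linear variance schedule beta_1 ... beta_T

def forward_step(x_prev, t):
    """One forward step: x_t = sqrt(1 - beta_t) * x_{t-1} + sqrt(beta_t) * eps."""
    eps = torch.randn_like(x_prev)     # eps ~ N(0, I)
    return torch.sqrt(1.0 - beta[t]) * x_prev + torch.sqrt(beta[t]) * eps

x = torch.randn(1, 3, 32, 32)          # stand-in for a normalized image x_0
for t in range(T):                     # iteratively corrupt x_0 -> x_T
    x = forward_step(x, t)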
“Diffusion models scale down the data with each forward process step (by a factor) so that variance does not grow when adding
noise.“
(https://learnopencv.com/wp-
content/uploads/2023/02/denoising-diffusion-probabilistic-models_forward_diffusion_variance_scheduler.png)
Whenever we need a latent sample x at timestep t, we have to perform t-1 steps in the Markov chain.
(https://learnopencv.com/wp-content/uploads/2023/02/denoising-diffusion-probabilistic-models_forward_diffusion_problem-1.png)
To fix this, the authors of the DDPM reformulated the kernel to directly go from timestep 0 (i.e., from the original image) to timestep t in the process.
(https://learnopencv.com/wp-content/uploads/2023/02/denoising-diffusion-probabilistic-
models_forward_diffusion_problem_fix.png)
(https://learnopencv.com/wp-content/uploads/2023/02/denoising-diffusion-probabilistic-
models_forward_diffusion_equations_3.png)
And then, by substituting the β’s with α’s and using the addition property (https://en.wikipedia.org/wiki/Sum_of_normally_distributed_random_variables) of Gaussian distributions, the forward diffusion process can be rewritten in terms of α as:
(https://learnopencv.com/wp-content/uploads/2023/02/denoising-diffusion-probabilistic-
models_forward_diffusion_equations_4-1.png)
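Written out, this closed-form kernel (with $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$) is:

$$q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \sqrt{\bar{\alpha}_t}\, x_0,\ (1-\bar{\alpha}_t)\,\mathbf{I}\right), \qquad x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon, \quad \epsilon \sim \mathcal{N}(0, \mathbf{I})$$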
🚀 Using the above formulation, we can sample at any arbitrary timestep t in the Markov chain.
(https://learnopencv.com/wp-content/uploads/2023/02/denoising-diffusion-probabilistic-models-
follow_the_forward_path_in_reverse.jpg)
Czech Hiking Markers System. Following the path to take in the return journey.
“In the reverse diffusion process, the task is to learn a finite-time (within T timesteps) reversal of the forward diffusion process.”
This basically means that we have to “undo” the forward process, i.e., iteratively remove the noise added in the forward process. This is done using a neural network model.
In the forward process, the transition function q was defined using a Gaussian, so what function should be used for the reverse process p? What should the neural network learn?
1. In 1949, W. Feller showed that, for gaussian (and binomial) distributions, the diffusion process’s reversal has the same functional form as the
forward process.
2. This means that similar to the FDK, which is defined as a normal distribution, we can use the same functional form (a gaussian distribution) to
define the reverse diffusion kernel.
3. The reverse process is also a Markov chain where a neural network predicts the parameters for the reverse diffusion kernel at each timestep.
4. During training, the learned estimates (of the parameters) should be close to the parameters of the FDK’s posterior at each timestep. We’ll talk
more about FDK’s posterior in the next section.
5. We want this because if we follow the forward trajectory in reverse, we may return to the original data distribution.
6. In doing so, we would also learn how to generate new samples that closely match the underlying data distribution, starting from a pure gaussian
noise (we do not have access to the forward process during inference).
1. The Markov chain for the reverse diffusion starts from where the forward process ends, i.e., at timestep T, where the data distribution has been
converted into (nearly an) isotropic gaussian distribution.
2. The PDF of the reverse diffusion process is an “integral” over all the possible pathways we can take to arrive at a data sample (in the same
distribution as the original) starting from pure noise xT.
(https://learnopencv.com/wp-
content/uploads/2023/02/denoising-diffusion-probabilistic-models_reverse_diffusion_equations_1.png)
(https://learnopencv.com/wp-content/uploads/2023/02/denoising-diffusion-probabilistic-models-forward_and_backward_equations.png)
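In symbols, the reverse-process definitions shown in the images above can be written as:

$$p_\theta(x_{0:T}) = p(x_T)\prod_{t=1}^{T} p_\theta(x_{t-1}\mid x_t), \qquad p_\theta(x_{t-1}\mid x_t) = \mathcal{N}\!\left(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\right), \qquad p(x_T) = \mathcal{N}(x_T;\ 0, \mathbf{I})$$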
Training Objective & Loss Function Used In Denoising Diffusion Probabilistic Models
The training objective of diffusion-based generative models amounts to “maximizing the log-likelihood of the sample generated at the end of the reverse process (x_0) belonging to the original data distribution.”
We have defined the transition functions in diffusion models as Gaussians. Maximizing the log-likelihood of a Gaussian distribution means finding the parameters of the distribution (μ, σ²) that maximize the “likelihood” of the (generated) data belonging to the same data distribution as the original data.
To train our neural network, we define the loss function (L) as the negative of the objective function. So a high value of p_θ(x_0) means a low loss, and vice versa.
(https://learnopencv.com/wp-content/uploads/2023/02/denoising-diffusion-probabilistic-models-loss_equations_1.png)
Turns out, this is intractable because we need to integrate over a very high dimensional (pixel) space for continuous values over T timesteps.
Instead, the authors take inspiration from VAEs and reformulate the training objective using a variational lower bound (VLB), also known as “Evidence
lower bound” (ELBO), which is this scary-looking equation 👻:
(https://learnopencv.com/wp-
content/uploads/2023/02/denoising-diffusion-probabilistic-models-loss_equations_2.png)
After some simplification, the DDPM authors arrive at the final L_vlb (variational lower bound) loss term:
(https://learnopencv.com/wp-
content/uploads/2023/02/denoising-diffusion-probabilistic-models-loss_equations_3.png)
We can break the above L_vlb loss term into individual timestep terms as follows:
(https://learnopencv.com/wp-content/uploads/2023/02/denoising-diffusion-probabilistic-models-
loss_equations_4.png)
You may notice that this loss function is huge! But the DDPM authors simplify it further by ignoring some of its terms.
This leaves L_{t-1} as the only loss term: a KL divergence between the “posterior” of the forward process (conditioned on x_t and the initial sample x_0) and the parameterized reverse diffusion process. Both terms are Gaussian distributions as well.
(https://learnopencv.com/wp-content/uploads/2023/02/denoising-diffusion-
probabilistic-models-loss_equations_5.png)
The job of our deep-learning model during training is to approximate/estimate the parameters of this (gaussian) posterior such that the KL divergence
is as minimal as possible.
(https://learnopencv.com/wp-
content/uploads/2023/02/denoising-diffusion-probabilistic-models-forward_posterior_reverse_equations.png)
(https://learnopencv.com/wp-
content/uploads/2023/02/denoising-diffusion-probabilistic-models-loss_equations_6.png)
To further simplify the model’s task, the authors decided to fix the variance to a constant β_t.
Now, the model only needs to learn to predict the mean of the distribution, and the reverse diffusion kernel gets modified to $p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\left(x_{t-1};\ \mu_\theta(x_t, t),\ \beta_t \mathbf{I}\right)$.
As we have kept the variance constant, minimizing the KL divergence is as simple as minimizing the difference (or distance) between the means (μ) of the two Gaussian distributions q and p, which can be done as follows:
(https://learnopencv.com/wp-content/uploads/2023/02/denoising-diffusion-probabilistic-
models-loss_equations_8.png)
By using the third option, and after some simplification, the mean μ_θ can be expressed as:
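$$\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\theta(x_t, t)\right)$$

(Here $\epsilon_\theta(x_t, t)$ is the noise predicted by the network; this is the standard expression from the DDPM paper.)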
At training and inference time, we know the β’s, α’s, and x_t. So our model only needs to predict the noise at each timestep. The simplified (after ignoring some weighting terms) loss function used in Denoising Diffusion Probabilistic Models is as follows:
(https://learnopencv.com/wp-content/uploads/2023/02/denoising-diffusion-
probabilistic-models-loss_equations_9.png)
Which is basically:
(https://learnopencv.com/wp-content/uploads/2023/02/denoising-diffusion-probabilistic-models-
loss_equations_10.png)
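In symbols, the simplified objective from the DDPM paper reads:

$$L_{\text{simple}} = \mathbb{E}_{t,\, x_0,\, \epsilon}\left[\left\|\, \epsilon - \epsilon_\theta\!\left(\sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon,\; t\right)\right\|^2\right]$$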
This is the final loss function we use to train DDPMs, which is just a “Mean Squared Error” between the noise added in the forward process and the
noise predicted by the model. This is the most impactful contribution of the paper Denoising Diffusion Probabilistic Models.
It’s awesome because, beginning from those scary-looking ELBO terms, we ended up with the simplest loss function in the entire machine learning
domain.
Many variations from the original GANs were created, such as:
1. Conditional GAN (cGAN) (/conditional-gan-cgan-in-pytorch-and-tensorflow/): Controlling the class/category of the generated images.
2. Deep Convolutional GAN (DCGAN) (/deep-convolutional-gan-in-pytorch-and-tensorflow/): An architecture that significantly improves the quality of GANs using convolutional layers.
3. Image-to-Image translation with Pix2Pix (/paired-image-to-image-translation-pix2pix/): Converting images from one domain to another by learning
a mapping between the input and output.
Note: code for regularly used helper functions is not added to the post.
💡 You can access the entire codebase for this and all our other posts by simply subscribing to the blog, and we’ll send you the download link.
To easily follow along with this tutorial, please download the code by clicking on the button below. It’s FREE!
First and foremost, we’ll define configuration classes that will hold the hyperparameters for loading the dataset, creating log directories, and training
the model.
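The exact configuration classes ship with the downloadable code; here is a minimal sketch of what they might look like. The attribute names below match how they are used later in the post, while the specific values and the log-directory fields are illustrative assumptions.

from dataclasses import dataclass
import torch

@dataclass
class BaseConfig:
    DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
    DATASET = "Flowers"          # "MNIST", "Cifar-10", "Cifar-100", or "Flowers"
    # Hypothetical log/checkpoint roots; the downloadable code may name these differently.
    root_log_dir = "Logs_Checkpoints/Inference"
    root_checkpoint_dir = "Logs_Checkpoints/checkpoints"

@dataclass
class TrainingConfig:
    TIMESTEPS = 1000             # number of diffusion timesteps T
    IMG_SHAPE = (3, 32, 32)      # (channels, height, width) after resizing
    NUM_EPOCHS = 800
    BATCH_SIZE = 32
    LR = 2e-4
    NUM_WORKERS = 2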
The flowers dataset can be downloaded from here: Flowers Recognition | Kaggle (https://www.kaggle.com/datasets/alxmamaev/flowers-recognition)
When using Kaggle kernels, it’s as simple as just clicking on the “Add Data” component and selecting the dataset.
1. get_dataset(...): Returns the dataset class object that will be passed to the DataLoader. Three preprocessing transforms and one augmentation are applied to every image in the dataset.
1. Preprocessing:
1. Convert pixel values from the range [0, 255] → [0.0, 1.0]
2. Resize Images to shape (32x32).
3. Change pixel values from the range [0.0, 1.0] → [-1.0, 1.0]. This is done by the DDPM authors so that the input image roughly
has the same range of values as a standard gaussian.
2. Augmentation:
1. A random horizontal flip, as used in the original implementation. In case you are using the MNIST dataset, be sure to comment out
this line.
2. inverse_transform(...): This function is used for inverting the transforms applied during the loading step, reverting the image to the range [0.0, 255.0].
import torchvision
import torchvision.transforms as TF
import torchvision.datasets as datasets
from torch.utils.data import Dataset, DataLoader


def get_dataset(dataset_name='MNIST'):
    transforms = torchvision.transforms.Compose(
        [
            torchvision.transforms.ToTensor(),
            torchvision.transforms.Resize((32, 32),
                                           interpolation=torchvision.transforms.InterpolationMode.BICUBIC,
                                           antialias=True),
            torchvision.transforms.RandomHorizontalFlip(),
            # torchvision.transforms.Normalize(MEAN, STD),
            torchvision.transforms.Lambda(lambda t: (t * 2) - 1)  # Scale between [-1, 1]
        ]
    )

    if dataset_name.upper() == "MNIST":
        dataset = datasets.MNIST(root="data", train=True, download=True, transform=transforms)
    elif dataset_name == "Cifar-10":
        dataset = datasets.CIFAR10(root="data", train=True, download=True, transform=transforms)
    elif dataset_name == "Cifar-100":
        dataset = datasets.CIFAR100(root="data", train=True, download=True, transform=transforms)
    elif dataset_name == "Flowers":
        dataset = datasets.ImageFolder(root="/kaggle/input/flowers-recognition/flowers", transform=transforms)

    return dataset


def inverse_transform(tensors):
    """Convert tensors from [-1., 1.] to [0., 255.]"""
    return ((tensors.clamp(-1, 1) + 1.0) / 2.0) * 255.0
def get_dataloader(dataset_name='MNIST',
                   batch_size=32,
                   pin_memory=False,
                   shuffle=True,
                   num_workers=0,
                   device="cpu"
                   ):
    dataset = get_dataset(dataset_name=dataset_name)
    dataloader = DataLoader(dataset, batch_size=batch_size,
                            pin_memory=pin_memory,
                            num_workers=num_workers,
                            shuffle=shuffle
                            )
    # Used for moving a batch of data to the user-specified machine: cpu or gpu
    device_dataloader = DeviceDataLoader(dataloader, device)
    return device_dataloader
Visualizing Dataset
First, we’ll create the “dataloader” object by calling the get_dataloader(...) function.
loader = get_dataloader(
    dataset_name=BaseConfig.DATASET,
    batch_size=128,
    device="cpu",
)
Then we can simply use torchvision’s make_grid(...) function to plot a grid of flower images.
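A short snippet along those lines (the plotting details here are a sketch and may differ slightly from the original notebook):

import matplotlib.pyplot as plt
from torchvision.utils import make_grid

# Grab one batch, undo the [-1, 1] scaling, and arrange the images into a grid.
b_image, _ = next(iter(loader))
b_image = inverse_transform(b_image).cpu() / 255.0   # back to [0, 1] for display
grid_img = make_grid(b_image[:64], nrow=8, padding=2)

plt.figure(figsize=(8, 8))
plt.imshow(grid_img.permute(1, 2, 0))  # CHW -> HWC
plt.axis("off")
plt.show()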
(https://learnopencv.com/wp-content/uploads/2023/02/denoising-
diffusion-probabilistic-models_flowers_dataset_image-1.png)
Flowers Dataset
Starting from the usual UNet architecture, the authors replaced the original double convolution at each level with “Residual blocks” as used in ResNet models. The model comprises the following components:
1. Encoder blocks
2. Bottleneck blocks
3. Decoder blocks
4. Self attention modules (https://learnopencv.com/attention-mechanism-in-transformer-neural-networks/)
5. Sinusoidal time embeddings
Architectural Details:
1. There are four levels in the encoder and decoder path with bottleneck blocks between them.
2. Each encoder stage comprises two residual blocks with convolutional downsampling except the last level.
3. Each corresponding decoder stage comprises three residual blocks and uses 2× nearest-neighbor upsampling followed by convolutions to upsample the input from the previous level.
4. Each stage in the encoder path is connected to the decoder path with the help of skip connections.
5. The model uses “Self-Attention” modules at a single feature map resolution.
6. Every residual block in the model gets the inputs from the previous layer (and others in the decoder path) and the embedding of the current
timestep. The timestep embedding informs the model of the input’s current position in the Markov chain.
(https://learnopencv.com/wp-
content/uploads/2023/02/denoising-diffusion-probabilistic-models_UNet_model_architecture.png)
In this article, we are working with an image size of (32×32). There are only two minor changes between our model and the original model used in the paper.
Please note that we are not adding the model code to the post: the code for the UNet plus these modifications is quite easy to follow, but with all its different components it becomes too big to include here.
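As an example of one of the listed components, here is a minimal sketch of a sinusoidal timestep-embedding module. The class name and exact details are illustrative; the full UNet implementation is in the downloadable code.

import math
import torch
import torch.nn as nn

class SinusoidalTimeEmbeddings(nn.Module):
    """Transformer-style sinusoidal embeddings for the diffusion timestep t."""
    def __init__(self, dim: int):
        super().__init__()
        self.dim = dim

    def forward(self, t: torch.Tensor) -> torch.Tensor:
        half_dim = self.dim // 2
        freqs = torch.exp(
            -math.log(10000) * torch.arange(half_dim, device=t.device) / (half_dim - 1)
        )
        args = t[:, None].float() * freqs[None, :]
        return torch.cat([args.sin(), args.cos()], dim=-1)  # shape: (batch, dim)

# Example: embed a batch of 16 random timesteps into 128-dim vectors.
# emb = SinusoidalTimeEmbeddings(128)(torch.randint(0, 1000, (16,)))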
Diffusion Class
In this section, we are creating a class called SimpleDiffusion. This class contains:
1. Scheduler constants required for performing the forward and reverse diffusion process.
2. A method to define the linear variance scheduler used in DDPMs.
3. A method that performs a single step using the updated forward diffusion kernel.
class SimpleDiffusion:
    def __init__(
        self,
        num_diffusion_timesteps=1000,
        img_shape=(3, 64, 64),
        device="cpu",
    ):
        self.num_diffusion_timesteps = num_diffusion_timesteps
        self.img_shape = img_shape
        self.device = device
        self.initialize()

    def initialize(self):
        # BETAs & ALPHAs required at different places in the Algorithm.
        self.beta = self.get_betas()
        self.alpha = 1 - self.beta

        self.sqrt_beta = torch.sqrt(self.beta)
        self.alpha_cumulative = torch.cumprod(self.alpha, dim=0)
        self.sqrt_alpha_cumulative = torch.sqrt(self.alpha_cumulative)
        self.one_by_sqrt_alpha = 1. / torch.sqrt(self.alpha)
        self.sqrt_one_minus_alpha_cumulative = torch.sqrt(1 - self.alpha_cumulative)

    def get_betas(self):
        """linear schedule, proposed in original ddpm paper"""
        scale = 1000 / self.num_diffusion_timesteps
        beta_start = scale * 1e-4
        beta_end = scale * 0.02
        return torch.linspace(
            beta_start,
            beta_end,
            self.num_diffusion_timesteps,
            dtype=torch.float32,
            device=self.device,
        )
The forward_diffusion(...) function takes in a batch of images and corresponding timesteps and adds noise/corrupts the input images using the
updated forward diffusion kernel equation.
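The method itself is not shown in the class listing above, so here is a minimal sketch of what it looks like, written directly in terms of the constants computed in initialize(). The real implementation in the downloadable code uses a small gather/reshape helper, so details may differ.

# (method of SimpleDiffusion)
def forward_diffusion(self, x0: torch.Tensor, timesteps: torch.Tensor):
    """Sample x_t ~ q(x_t | x_0) in a single shot and also return the noise used."""
    eps = torch.randn_like(x0)  # ground-truth noise, eps ~ N(0, I)

    # Gather the per-sample constants and reshape for broadcasting over (B, C, H, W).
    mean = self.sqrt_alpha_cumulative[timesteps].view(-1, 1, 1, 1) * x0
    std_dev = self.sqrt_one_minus_alpha_cumulative[timesteps].view(-1, 1, 1, 1)

    sample = mean + std_dev * eps  # noisy image x_t
    return sample, eps             # the noise is the training target for the model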
sd = SimpleDiffusion(num_diffusion_timesteps=TrainingConfig.TIMESTEPS, device="cpu")

loader = iter(  # converting the dataloader into an iterator for now.
    get_dataloader(
        dataset_name=BaseConfig.DATASET,
        batch_size=6,
        device="cpu",
    )
)
Next, we perform the forward process for some specific timesteps and store the noisy versions of the original images.
x0s, _ = next(loader)

noisy_images = []
specific_timesteps = [0, 10, 50, 100, 150, 200, 250, 300, 400, 600, 800, 999]

for timestep in specific_timesteps:
    timestep = torch.as_tensor(timestep, dtype=torch.long)

    xts, _ = sd.forward_diffusion(x0s, timestep)
    xts = inverse_transform(xts) / 255.0
    xts = make_grid(xts, nrow=1, padding=1)

    noisy_images.append(xts)
The original image gets increasingly corrupted as timesteps increase. At the end of the forward process, we are left with noise.
(https://learnopencv.com/wp-
content/uploads/2023/02/denoising-diffusion-probabilistic-models_DDPM_trainig_inference_algorithm.png)
The first function defined here is train_one_epoch(...). This function performs “one epoch of training,” i.e., it trains the model by iterating once over the entire dataset, and will be called in our final training loop.
We also use Mixed-Precision training to train the model faster and save GPU memory. The code is pretty straightforward and almost a one-to-one
conversion from the algorithm.
# Algorithm 1: Training

def train_one_epoch(model, loader, sd, optimizer, scaler, loss_fn, epoch=800,
                    base_config=BaseConfig(), training_config=TrainingConfig()):

    loss_record = MeanMetric()
    model.train()

    with tqdm(total=len(loader), dynamic_ncols=True) as tq:
        tq.set_description(f"Train :: Epoch: {epoch}/{training_config.NUM_EPOCHS}")

        for x0s, _ in loader:  # line 1, 2
            tq.update(1)

            ts = torch.randint(low=1, high=training_config.TIMESTEPS, size=(x0s.shape[0],), device=base_config.DEVICE)  # line 3
            xts, gt_noise = sd.forward_diffusion(x0s, ts)  # line 4

            with amp.autocast():
                pred_noise = model(xts, ts)
                loss = loss_fn(gt_noise, pred_noise)  # line 5

            optimizer.zero_grad(set_to_none=True)
            scaler.scale(loss).backward()

            # scaler.unscale_(optimizer)
            # torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

            scaler.step(optimizer)
            scaler.update()

            loss_value = loss.detach().item()
            loss_record.update(loss_value)

            tq.set_postfix_str(s=f"Loss: {loss_value:.4f}")

        mean_loss = loss_record.compute().item()

        tq.set_postfix_str(s=f"Epoch Loss: {mean_loss:.4f}")

    return mean_loss
The next function we define is reverse_diffusion(...) which is responsible for performing inference i.e., generating images using the reverse diffusion
process. The function takes in a trained model and the diffusion class and can either generate a video showcasing the entire diffusion process or just
the final generated image.
# Algorithm 2: Sampling

@torch.no_grad()
def reverse_diffusion(model, sd, timesteps=1000, img_shape=(3, 64, 64),
                      num_images=5, nrow=8, device="cpu", **kwargs):

    x = torch.randn((num_images, *img_shape), device=device)
    model.eval()

    if kwargs.get("generate_video", False):
        outs = []

    for time_step in tqdm(iterable=reversed(range(1, timesteps)),
                          total=timesteps - 1, dynamic_ncols=False,
                          desc="Sampling :: ", position=0):

        ts = torch.ones(num_images, dtype=torch.long, device=device) * time_step
        z = torch.randn_like(x) if time_step > 1 else torch.zeros_like(x)

        predicted_noise = model(x, ts)

        beta_t = get(sd.beta, ts)
        one_by_sqrt_alpha_t = get(sd.one_by_sqrt_alpha, ts)
        sqrt_one_minus_alpha_cumulative_t = get(sd.sqrt_one_minus_alpha_cumulative, ts)

        x = (
            one_by_sqrt_alpha_t
            * (x - (beta_t / sqrt_one_minus_alpha_cumulative_t) * predicted_noise)
            + torch.sqrt(beta_t) * z
        )

        if kwargs.get("generate_video", False):
            x_inv = inverse_transform(x).type(torch.uint8)
            grid = make_grid(x_inv, nrow=nrow, pad_value=255.0).to("cpu")
            ndarr = torch.permute(grid, (1, 2, 0)).numpy()[:, :, ::-1]
            outs.append(ndarr)

    if kwargs.get("generate_video", False):  # Generate and save a video of the entire reverse process.
        frames2vid(outs, kwargs['save_path'])
        display(Image.fromarray(outs[-1][:, :, ::-1]))  # Display the image at the final timestep of the reverse process.
        return None

    else:  # Display and save the image at the final timestep of the reverse process.
        x = inverse_transform(x).type(torch.uint8)
        grid = make_grid(x, nrow=nrow, pad_value=255.0).to("cpu")
        pil_image = TF.functional.to_pil_image(grid)
        pil_image.save(kwargs['save_path'], format=kwargs['save_path'][-3:].upper())
        display(pil_image)
        return None
@dataclass
class ModelConfig:
    BASE_CH = 64  # 64, 128, 256, 256
    BASE_CH_MULT = (1, 2, 4, 4)  # 32, 16, 8, 8
    APPLY_ATTENTION = (False, True, True, False)
    DROPOUT_RATE = 0.1
    TIME_EMB_MULT = 4  # 128

model = UNet(
    input_channels          = TrainingConfig.IMG_SHAPE[0],
    output_channels         = TrainingConfig.IMG_SHAPE[0],
    base_channels           = ModelConfig.BASE_CH,
    base_channels_multiples = ModelConfig.BASE_CH_MULT,
    apply_attention         = ModelConfig.APPLY_ATTENTION,
    dropout_rate            = ModelConfig.DROPOUT_RATE,
    time_multiple           = ModelConfig.TIME_EMB_MULT,
)
model.to(BaseConfig.DEVICE)

optimizer = torch.optim.AdamW(model.parameters(), lr=TrainingConfig.LR)  # Original → Adam

dataloader = get_dataloader(
    dataset_name = BaseConfig.DATASET,
    batch_size   = TrainingConfig.BATCH_SIZE,
    device       = BaseConfig.DEVICE,
    pin_memory   = True,
    num_workers  = TrainingConfig.NUM_WORKERS,
)

loss_fn = nn.MSELoss()

sd = SimpleDiffusion(
    num_diffusion_timesteps = TrainingConfig.TIMESTEPS,
    img_shape               = TrainingConfig.IMG_SHAPE,
    device                  = BaseConfig.DEVICE,
)

scaler = amp.GradScaler()  # For mixed-precision training.
Then we’ll initialize the logging and checkpoint directories to save intermediate sampling results and model parameters.
total_epochs = TrainingConfig.NUM_EPOCHS + 1
log_dir, checkpoint_dir = setup_log_directory(config=BaseConfig())
generate_video = False
ext = ".mp4" if generate_video else ".png"
Finally, we can write our training loop. As we have divided all our code into simple, easy-to-debug functions and classes, all we have to do now is call them in the epochs training loop. Specifically, we need to call the “training” and “sampling” functions defined in the previous sections in a loop.
for epoch in range(1, total_epochs):
    torch.cuda.empty_cache()
    gc.collect()

    # Algorithm 1: Training
    train_one_epoch(model, dataloader, sd, optimizer, scaler, loss_fn, epoch=epoch)

    if epoch % 20 == 0:
        save_path = os.path.join(log_dir, f"{epoch}{ext}")

        # Algorithm 2: Sampling
        reverse_diffusion(model, sd, timesteps=TrainingConfig.TIMESTEPS,
                          num_images=32, generate_video=generate_video, save_path=save_path,
                          img_shape=TrainingConfig.IMG_SHAPE, device=BaseConfig.DEVICE, nrow=4,
                          )

        # clear_output()
        checkpoint_dict = {
            "opt": optimizer.state_dict(),
            "scaler": scaler.state_dict(),
            "model": model.state_dict()
        }
        torch.save(checkpoint_dict, os.path.join(checkpoint_dir, "ckpt.pt"))
        del checkpoint_dict
If all goes well, the training procedure should start and print the training logs similar to:
(https://learnopencv.com/wp-content/uploads/2023/02/denoising-
diffusion-probabilistic-models_example_training_log.png)
The inference code is simply a call to the reverse_diffusion(...) function using the trained model.
generate_video = False  # Set to True to generate a video of the entire reverse diffusion process, or False to save only the final image.

ext = ".mp4" if generate_video else ".png"
filename = f"{datetime.now().strftime('%Y%m%d-%H%M%S')}{ext}"

save_path = os.path.join(log_dir, filename)

reverse_diffusion(
    model,
    sd,
    num_images=256,
    generate_video=generate_video,
    save_path=save_path,
    timesteps=1000,
    img_shape=TrainingConfig.IMG_SHAPE,
    device=BaseConfig.DEVICE,
    nrow=32,
)
print(save_path)
(https://learnopencv.com/wp-content/uploads/2023/02/denoising-diffusion-
probabilistic-models_example_inference_image_1.png)
(https://learnopencv.com/wp-content/uploads/2023/02/denoising-diffusion-
probabilistic-models_example_inference_image_2.png)
Summary
In conclusion, diffusion models represent a rapidly growing field with a wealth of exciting possibilities for the future. As research in this area continues
to evolve, we can expect even more advanced techniques and applications to emerge. I encourage readers to share their thoughts and questions about
this topic and to engage in conversations about the future of diffusion models.
1. We began by providing an intuitive answer to the fundamental question of why we need generative models.
2. Then we continued the discussion to explain diffusion-based generative models from a logical and theoretical perspective.
3. After building the theoretical base, we introduced all the necessary mathematical equations derived for DDPMs one by one while also maintaining
the flow so that it’s easy to grasp.
4. Finally, we concluded by explaining all the different pieces of code required for training DDPMs from scratch and performing inference. We also
demonstrated the results we got from our experiments.
References
1. What are Diffusion Models? (https://lilianweng.github.io/posts/2021-07-11-diffusion-models/)
2. DDPMs from scratch (https://magic-with-latents.github.io/latent/ddpms-series.html)
3. Diffusion Models | Paper Explanation | Math Explained (https://www.youtube.com/watch?v=HoKDTa5jHvg)
We would love to hear from you. Please feel free to ask questions in the comment section; we are more than happy to converse with you.
🌟Happy learning!