Good morning, AI enthusiasts! In this iteration, we will cover the process used in all state-of-the-art image generation models: Diffusion! Get ready to gain a better understanding of how diffusion models are used to create fascinating images…
🤔 Why is this relevant?
Understanding the diffusion process is essential because it forms the foundation of a powerful class of generative models, enabling highly realistic image synthesis, denoising, and data augmentation. It’s even used in the recent AlphaFold 3 model for protein generation!
Let’s explore it at three levels of complexity, from simple to expert. Let us know which one you stopped at!
🌱 ELI5
A diffusion model is like using an eraser on a sketch: it starts with a clear picture, adds random smudges (noise), and then learns to erase them to recover the original image, practicing on millions of sketches.
👨‍💻 Nerd
The diffusion process begins with a clear image and gradually introduces noise—random visual static that obscures the original content. As the model is trained, it learns to reverse this process: it systematically removes the noise step-by-step. By doing this, the model becomes adept at discerning important features from the noise, effectively learning to restore the original image from a seemingly chaotic state. This enhances the model’s ability to generate realistic images and intricate details by understanding how to reconstruct clarity from disorder.
🧙 Technical
In diffusion models, an image or other data is gradually transformed from a structured state to a purely random state through a process called the forward diffusion or noising process, described by a series of latent variables x_1, …, x_T. Each step adds a small amount of Gaussian noise,

q(x_t | x_{t−1}) = N(x_t; √(1 − β_t) · x_{t−1}, β_t · I),

where β_t is the variance schedule, increasing the entropy of the data.
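The noising process can be sketched in a few lines of NumPy. This is a toy illustration: the linear `betas` schedule and the 32×32 all-zeros "image" are assumptions for the sketch, not values from any particular model.

```python
import numpy as np

def forward_diffusion(x0, betas, rng):
    """Gradually noise a clean sample x0, one step at a time.
    Each step applies q(x_t | x_{t-1}) = N(sqrt(1 - beta_t) * x_{t-1}, beta_t * I)."""
    trajectory = [x0]
    x = x0
    for beta in betas:
        noise = rng.standard_normal(x.shape)
        x = np.sqrt(1.0 - beta) * x + np.sqrt(beta) * noise
        trajectory.append(x)
    return trajectory

betas = np.linspace(1e-4, 0.02, 1000)   # hypothetical linear variance schedule
x0 = np.zeros((32, 32))                 # stand-in for a clean image
traj = forward_diffusion(x0, betas, np.random.default_rng(0))
# After enough steps the sample is indistinguishable from pure Gaussian noise.
```

Notice that no learning happens here: the forward process is a fixed recipe for destroying structure, which is exactly what makes it easy to generate unlimited training pairs of (noisy, slightly-less-noisy) images.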
The model then learns to reverse this process during training, effectively learning how to denoise the data. This reverse process can be written as

x_{t−1} = f_θ(x_t, t),

where f is a learned function parameterized by theta (θ), and x_t is the state of the data at step t.
The denoising process relies on predicting the noise that was added at each step and subtracting it out, progressively reconstructing the data from pure noise to its original structured form. The model's ability to reverse the noising process demonstrates its understanding of the data's underlying structure.
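A single denoising step can be sketched as follows, using the standard DDPM-style update (a sketch under assumptions: the `predict_noise` callable stands in for the learned network f_θ, and the "oracle" used in the toy check, which simply returns the exact noise it added, is purely illustrative).

```python
import numpy as np

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 1000)   # hypothetical linear schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)         # cumulative product, written alpha-bar_t

def reverse_step(x_t, t, predict_noise):
    """One denoising step: predict the noise in x_t, subtract it out, rescale."""
    eps_hat = predict_noise(x_t, t)     # the model's estimate of the added noise
    mean = (x_t - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
    if t == 0:
        return mean                     # final step: return the clean estimate
    # Earlier steps re-inject a little noise, as in ancestral sampling.
    return mean + np.sqrt(betas[t]) * rng.standard_normal(x_t.shape)

# Toy check with an "oracle" that returns the exact noise that was added:
x0 = np.ones((4, 4))
t = 500
eps = rng.standard_normal(x0.shape)
x_t = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
x_prev = reverse_step(x_t, t, lambda x, step: eps)
```

Running this step repeatedly from t = T − 1 down to 0, starting from pure noise, is what turns a trained diffusion model into an image generator.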
Denoising of a Noisy Image: This sequence shows the original structured image, its noisy version, and a single step of denoising. The denoising step employs a learned model to estimate and subtract the added Gaussian noise, effectively clarifying and restoring the image closer to its original form.
👍 Pros vs. prior work
High-Quality Outputs: Generates exceptionally high-quality and diverse images.
Robust to Noise: Effectively handles and removes noise, enhancing data clarity.
Flexible Data Adaptation: Adapts well to different types of data for generation.
Advanced Sampling Control: Allows detailed control over the generation process through conditioning.
⚠️ Limitations
High Computational Cost: Requires significant computational resources due to iterative processes.
Long Training Times: Typically involves longer training periods compared to other generative models.
Complex Implementation: More complex algorithms and training procedures.
Sensitivity to Hyperparameters: Performance heavily depends on precise tuning of hyperparameters.
We hope this iteration brings you the “Aaahh! That’s how it works!” revelation and allows you to discuss it further with your peers. If you need to learn more about diffusion, we invite you to read the resource that helped us build this iteration: High-Resolution Image Synthesis with Latent Diffusion Models.