Basic Principles of Diffusion Models
This document introduces the basic principles of Diffusion models to help you understand how the training framework is constructed. To make these complex mathematical theories easier for readers to understand, we have reconstructed the theoretical framework of Diffusion models, abandoning complex stochastic differential equations and presenting them in a more concise and understandable form.
Introduction
Diffusion models generate clear images or video content through iterative denoising. We start by explaining the generation process of a data sample x_0. Intuitively, in a complete round of denoising, we start from random Gaussian noise x_T and iteratively obtain x_{T-1}, x_{T-2}, x_{T-3}, \cdots, gradually reducing the noise content at each step until we finally obtain the noise-free data sample x_0.
This process is intuitive, but to understand the details, we need to answer several questions:
- How is the noise content at each step defined?
- How is the iterative denoising computation performed?
- How to train such Diffusion models?
- What is the architecture of modern Diffusion models?
- How does this project encapsulate and implement model training?
How is the noise content at each step defined?
In the theoretical system of Diffusion models, the noise content is determined by a series of parameters \sigma_T, \sigma_{T-1}, \sigma_{T-2}, \cdots, \sigma_0. Where:
- \sigma_T=1, corresponding to x_T as pure Gaussian noise
- \sigma_T>\sigma_{T-1}>\sigma_{T-2}>\cdots>\sigma_0, the noise content gradually decreases during iteration
- \sigma_0=0, corresponding to x_0 as a data sample without any noise
As for the intermediate values \sigma_{T-1}, \sigma_{T-2}, \cdots, \sigma_1, they are not fixed and only need to satisfy the decreasing condition.
At an intermediate step, we can directly synthesize noisy data samples x_t=(1-\sigma_t)x_0+\sigma_t x_T.
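The interpolation above can be sketched in a few lines of NumPy. This is only an illustration: the linear schedule in `make_sigmas` is an assumption (the text only requires the values to decrease from 1 to 0), and the function names are hypothetical.

```python
import numpy as np

def make_sigmas(num_steps):
    """Assumed linear schedule: sigma_0 = 0 < sigma_1 < ... < sigma_T = 1."""
    return np.linspace(0.0, 1.0, num_steps + 1)

def add_noise(x0, xT, sigma_t):
    """Synthesize x_t = (1 - sigma_t) * x0 + sigma_t * xT."""
    return (1.0 - sigma_t) * x0 + sigma_t * xT

rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 4))   # stand-in for a noise-free data sample
xT = rng.standard_normal((4, 4))   # pure Gaussian noise
sigmas = make_sigmas(10)           # T = 10

x_mid = add_noise(x0, xT, sigmas[5])  # a half-noisy sample
```

With this schedule, `sigmas[0]` reproduces `x0` and `sigmas[10]` reproduces `xT`, matching the two boundary conditions listed above.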
How is the iterative denoising computation performed?
Before understanding the iterative denoising computation, we need to clarify what the input and output of the denoising model are. We abstract the model as a symbol \hat \epsilon, whose input typically consists of three parts:
- Time step t: the model needs to understand which stage of the denoising process it is currently in
- Noisy data sample x_t: the model needs to understand what data to denoise
- Guidance condition c: the model needs to understand what kind of data sample to generate through denoising
Among these, the guidance condition c is a newly introduced parameter that is input by the user. It can be text describing the image content or a sketch outlining the image structure.
The model's output \hat \epsilon(x_t,c,t) approximately equals x_T-x_0, which is the direction of the entire diffusion process (the reverse process of denoising).
Next, we analyze the computation occurring in one iteration. At time step t, after the model computes an approximation of x_T-x_0, we calculate the next x_{t-1}:
\begin{aligned}
x_{t-1}&=x_t + (\sigma_{t-1} - \sigma_t) \cdot \hat \epsilon(x_t,c,t)\\
&\approx x_t + (\sigma_{t-1} - \sigma_t) \cdot (x_T-x_0)\\
&=(1-\sigma_t)x_0+\sigma_t x_T + (\sigma_{t-1} - \sigma_t) \cdot (x_T-x_0)\\
&=(1-\sigma_{t-1})x_0+\sigma_{t-1}x_T
\end{aligned}
The result matches the definition of the noisy data sample at time step t-1.
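The iteration can be checked numerically with a small sketch. The `oracle` function below is a hypothetical stand-in for a perfectly trained model: it returns exactly x_T-x_0, which a real model can only approximate. The linear schedule is also an assumption.

```python
import numpy as np

def sample(model, xT, c, sigmas):
    """Apply x_{t-1} = x_t + (sigma_{t-1} - sigma_t) * model(x_t, c, t) for t = T, ..., 1."""
    x = xT
    for t in range(len(sigmas) - 1, 0, -1):
        x = x + (sigmas[t - 1] - sigmas[t]) * model(x, c, t)
    return x

rng = np.random.default_rng(0)
x0_true = rng.standard_normal(8)     # the data sample we hope to recover
xT = rng.standard_normal(8)          # starting noise
sigmas = np.linspace(0.0, 1.0, 21)   # assumed schedule, T = 20

def oracle(x_t, c, t):
    # Idealized model output: exactly x_T - x_0 (illustration only).
    return xT - x0_true

x0_generated = sample(oracle, xT, c=None, sigmas=sigmas)
```

Because the step sizes sum to \sigma_0-\sigma_T=-1, the loop moves from x_T to x_0 regardless of how the intermediate values are spaced.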
(This part might be a bit difficult to understand. Don't worry; it's recommended to skip this part on first reading without affecting the rest of the document.)
After completing this derivation, let's consider a question: why should the model's output approximately equal x_T-x_0? Can it be set to other values?
Actually, Diffusion models rely on two definitions to form a complete theory. From the above formulas, we can extract these two definitions and derive the iterative formula:
- Data definition: x_t=(1-\sigma_t)x_0+\sigma_t x_T
- Model definition: \hat \epsilon(x_t,c,t)=x_T-x_0
- Derived iterative formula: x_{t-1}=x_t + (\sigma_{t-1} - \sigma_t) \cdot \hat \epsilon(x_t,c,t)
These three formulas are self-consistent. For example, in the previous derivation, substituting the data definition and model definition into the iterative formula yields an x_{t-1} that matches the data definition.
These two definitions are built on Flow Matching theory, but Diffusion models can also be implemented with other definitions. For example, early models based on DDPM (Denoising Diffusion Probabilistic Models) use the following two definitions and derived iterative formula:
- Data definition: x_t=\sqrt{\alpha_t}x_0+\sqrt{1-\alpha_t}x_T
- Model definition: \hat \epsilon(x_t,c,t)=x_T
- Derived iterative formula: x_{t-1}=\sqrt{\alpha_{t-1}}\left(\frac{x_t-\sqrt{1-\alpha_t}\hat \epsilon(x_t,c,t)}{\sqrt{\alpha_t}}\right)+\sqrt{1-\alpha_{t-1}}\hat \epsilon(x_t,c,t)
More generally, we describe the derivation process of the iterative formula using matrices. For any data definition and model definition:
- Data definition: x_t=C_t(x_0,x_T)^T
- Model definition: \hat \epsilon(x_t,c,t)=C_t^{[\epsilon]}(x_0,x_T)^T
- Derived iterative formula: x_{t-1}=C_{t-1}(C_t,C_t^{[\epsilon]})^{-T}(x_t,\hat \epsilon(x_t,c,t))^T
Where C_t and C_t^{[\epsilon]} are 1\times 2 coefficient matrices, (C_t,C_t^{[\epsilon]})^T denotes the 2\times 2 matrix obtained by stacking them as rows, and (C_t,C_t^{[\epsilon]})^{-T} denotes its inverse. It's not difficult to see that when constructing the two definitions, the matrix (C_t,C_t^{[\epsilon]})^T must be invertible.
Although Flow Matching and DDPM have been widely verified by numerous pre-trained models, this doesn't mean they are optimal solutions. We encourage developers to design new Diffusion model theories for better training results.
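The matrix form of the derivation can be verified numerically. The sketch below uses a hypothetical helper, `iteration_coefficients`, that stacks the two coefficient rows, inverts the resulting 2\times 2 matrix, and returns the pair (a, b) such that x_{t-1}=a\,x_t+b\,\hat \epsilon. The specific \sigma and \alpha values are arbitrary choices for illustration.

```python
import numpy as np

def iteration_coefficients(c_t, c_eps_t, c_prev):
    """Return (a, b) with x_{t-1} = a * x_t + b * eps_hat."""
    stacked = np.stack([c_t, c_eps_t])       # 2x2 matrix, must be invertible
    return c_prev @ np.linalg.inv(stacked)

# Flow Matching: x_t = (1 - sigma_t) x_0 + sigma_t x_T, eps_hat = x_T - x_0
sigma_t, sigma_prev = 0.8, 0.6
fm = iteration_coefficients(
    np.array([1 - sigma_t, sigma_t]),
    np.array([-1.0, 1.0]),
    np.array([1 - sigma_prev, sigma_prev]),
)

# DDPM: x_t = sqrt(alpha_t) x_0 + sqrt(1 - alpha_t) x_T, eps_hat = x_T
alpha_t, alpha_prev = 0.5, 0.7
ddpm = iteration_coefficients(
    np.array([np.sqrt(alpha_t), np.sqrt(1 - alpha_t)]),
    np.array([0.0, 1.0]),
    np.array([np.sqrt(alpha_prev), np.sqrt(1 - alpha_prev)]),
)
```

For Flow Matching the result is (1, \sigma_{t-1}-\sigma_t), and for DDPM it is the pair of coefficients multiplying x_t and \hat \epsilon in the DDPM iterative formula above.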
How to train such Diffusion models?
After understanding the iterative denoising process, we next consider how to train such Diffusion models.
The training process differs from the generation process. If we retained multi-step iterations during training, the gradient would need to backpropagate through every step, making time and memory costs prohibitive. To improve computational efficiency, we randomly select a single time step t for each training sample.
The following is pseudocode for the training process:
- Obtain data sample x_0 and guidance condition c from the dataset
- Randomly sample time step t\in(0,T]
- Randomly sample Gaussian noise x_T\sim \mathcal N(0,I)
- Compute the noisy data sample x_t=(1-\sigma_t)x_0+\sigma_t x_T
- Compute the model output \hat \epsilon(x_t,c,t)
- Compute the loss function \mathcal L=||\hat \epsilon(x_t,c,t)-(x_T-x_0)||_2^2
- Backpropagate gradients and update model parameters
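The forward part of one training step can be sketched as follows. This is a minimal NumPy illustration, not this project's implementation: `training_step_loss` is a hypothetical name, the model is any callable taking (x_t, c, t), the linear schedule is an assumption, and the gradient step is left to whatever automatic differentiation framework is actually used.

```python
import numpy as np

def training_step_loss(model, x0, c, sigmas, rng):
    """Compute the loss for one sample at one randomly chosen time step."""
    T = len(sigmas) - 1
    t = int(rng.integers(1, T + 1))                 # t in (0, T], i.e. 1..T
    xT = rng.standard_normal(x0.shape)              # x_T ~ N(0, I)
    x_t = (1.0 - sigmas[t]) * x0 + sigmas[t] * xT   # noisy data sample
    target = xT - x0                                # what the model should predict
    prediction = model(x_t, c, t)
    loss = float(np.sum((prediction - target) ** 2))  # squared L2 norm
    return loss, target, t

rng = np.random.default_rng(0)
sigmas = np.linspace(0.0, 1.0, 11)                  # assumed schedule, T = 10
x0 = rng.standard_normal(16)

def untrained_model(x_t, c, t):
    # Placeholder model that always predicts zero.
    return np.zeros_like(x_t)

loss, target, t = training_step_loss(untrained_model, x0, None, sigmas, rng)
```

Only one model call is made per sample, so the cost of a training step does not depend on the number of denoising steps T.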
What is the architecture of modern Diffusion models?
From theory to practice, more details need to be filled in. Modern Diffusion model architectures have matured, with mainstream architectures following the "three-stage" architecture proposed by Latent Diffusion, including data encoder-decoder, guidance condition encoder, and denoising model.
Data Encoder-Decoder
In the previous text, we consistently referred to x_0 as a "data sample" rather than an image or video because modern Diffusion models typically don't process images or videos directly. Instead, they use an encoder-decoder model, usually a VAE (Variational Autoencoder), to encode images or videos into embedding tensors, obtaining x_0.
After data is encoded by the encoder and then decoded by the decoder, the reconstructed content is approximately consistent with the original, with minor errors. So why process on the encoded Embedding tensor instead of directly on images or videos? The main reasons are twofold:
- Encoding compresses the data simultaneously, reducing computational load during processing.
- Encoded data distribution is more similar to Gaussian distribution, making it easier for denoising models to model the data.
During generation, the encoder part doesn't participate in computation. After iteration completes, the decoder part decodes x_0 to obtain clear images or videos. During training, the decoder part doesn't participate in computation; only the encoder is used to compute x_0.
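The division of labor between encoder and decoder can be illustrated with a toy stand-in. The functions below are not a VAE: `encode` is simple 2x2 average pooling and `decode` is nearest-neighbour upsampling, chosen only to show compression, approximate reconstruction, and which half is used in which phase. All names are hypothetical.

```python
import numpy as np

def encode(image):
    """Toy encoder: 2x2 average pooling, (H, W) -> (H/2, W/2)."""
    h, w = image.shape
    return image.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def decode(latent):
    """Toy decoder: nearest-neighbour upsampling, (h, w) -> (2h, 2w)."""
    return np.repeat(np.repeat(latent, 2, axis=0), 2, axis=1)

def prepare_training_sample(image):
    # Training: only the encoder runs, producing x_0.
    return encode(image)

def finish_generation(x0):
    # Generation: only the decoder runs, after the last denoising step.
    return decode(x0)

image = np.arange(16, dtype=float).reshape(4, 4)
x0 = prepare_training_sample(image)      # 4x fewer values than the image
reconstruction = finish_generation(x0)   # same shape as the image, approximate content
```

In this toy case the latent has a quarter of the values of the input, and the reconstruction is close to but not identical to the original.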
Guidance Condition Encoder
User-input guidance conditions c can be complex and diverse, requiring specialized encoder models to process them into Embedding tensors. According to the type of guidance condition, we classify guidance condition encoders into the following categories:
- Text type, such as CLIP, Qwen-VL
- Image type, such as ControlNet, IP-Adapter
- Video type, such as VAE
The model \hat \epsilon mentioned in the previous text refers to the entirety of all guidance condition encoders and the denoising model. We list guidance condition encoders separately because these models are typically frozen during Diffusion training, and their output values are independent of time step t, allowing guidance condition encoder computations to be performed offline.
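Because the encoder output does not depend on t, it can be computed once and reused. The sketch below shows one way to do that with a simple cache. Both `CachedConditionEncoder` and `toy_text_encoder` are hypothetical: the latter is a byte histogram standing in for a real frozen text encoder.

```python
import numpy as np

class CachedConditionEncoder:
    """Wrap a frozen encoder so each distinct condition is encoded only once."""

    def __init__(self, encoder):
        self.encoder = encoder
        self.cache = {}
        self.calls = 0   # number of real encoder invocations

    def __call__(self, condition):
        if condition not in self.cache:
            self.cache[condition] = self.encoder(condition)
            self.calls += 1
        return self.cache[condition]

def toy_text_encoder(text):
    # Stand-in for a frozen text encoder: a 256-bin histogram of UTF-8 bytes.
    data = np.frombuffer(text.encode("utf-8"), dtype=np.uint8)
    return np.bincount(data, minlength=256).astype(float)

encode_condition = CachedConditionEncoder(toy_text_encoder)
prompt = "a photo of a cat"
for t in range(20, 0, -1):
    # The same embedding is reused at every time step.
    c_embedding = encode_condition(prompt)
```

The same idea extends to training: embeddings for the whole dataset can be precomputed and stored before the denoising model is trained.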
Denoising Model
The denoising model is the true essence of Diffusion models, with diverse model structures such as UNet and DiT. Model developers can freely innovate on these structures.
How does this project encapsulate and implement model training?
Please read the next document: Standard Supervised Training