DiffSynth-Studio 2.0 major update

2026-03-18 22:08:13 +00:00 · 2025-12-04 16:33:07 +08:00
parent afd101f345
commit 72af7122b3
758 changed files with 26462 additions and 2221398 deletions
--- a/docs/en/Pipeline_Usage/Environment_Variables.md
+++ b/docs/en/Pipeline_Usage/Environment_Variables.md
@@ -0,0 +1,39 @@
+# Environment Variables
+
+Some `DiffSynth-Studio` settings can be controlled through environment variables.
+
+In `Python` code, you can set environment variables using `os.environ`. Please note that environment variables must be set before `import diffsynth`.
+
+```python
+import os
+os.environ["DIFFSYNTH_MODEL_BASE_PATH"] = "./path_to_my_models"
+import diffsynth
+```
+
+On Linux operating systems, you can also temporarily set environment variables from the command line:
+
+```shell
+DIFFSYNTH_MODEL_BASE_PATH="./path_to_my_models" python xxx.py
+```
+
+Below are the environment variables supported by `DiffSynth-Studio`.
+
+## `DIFFSYNTH_SKIP_DOWNLOAD`
+
+Whether to skip model downloads. Can be set to `True`, `true`, `False`, `false`. If `skip_download` is not set in `ModelConfig`, this environment variable will determine whether to skip model downloads.
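+
+The precedence described above can be sketched as follows. This is only an illustration, not the library's code: `resolve_skip_download` is a hypothetical helper, and treating an unset variable as `False` is an assumption.
+
+```python
+import os
+
+
+def resolve_skip_download(config_value=None, env=os.environ):
+    """Hypothetical helper: an explicit ModelConfig value wins over the environment."""
+    if config_value is not None:
+        return bool(config_value)
+    # Assumption: an unset variable means "do not skip downloads".
+    raw = env.get("DIFFSYNTH_SKIP_DOWNLOAD", "False")
+    # Accepts True/true/False/false, as listed above.
+    return raw.strip().lower() == "true"
+
+
+print(resolve_skip_download(None, {"DIFFSYNTH_SKIP_DOWNLOAD": "true"}))
+print(resolve_skip_download(False, {"DIFFSYNTH_SKIP_DOWNLOAD": "True"}))
+```
+
+The second call shows an explicit `skip_download=False` overriding the environment variable.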
+
+## `DIFFSYNTH_MODEL_BASE_PATH`
+
+Model download root directory. Can be set to any local path. If `local_model_path` is not set in `ModelConfig`, model files will be downloaded to the path pointed to by this environment variable. If neither is set, model files will be downloaded to `./models`.
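+
+A minimal sketch of the lookup order described above, assuming a hypothetical helper named `resolve_model_root` (the library's internal function may look different):
+
+```python
+import os
+
+
+def resolve_model_root(local_model_path=None, env=os.environ):
+    """Hypothetical helper: ModelConfig value, then environment variable, then ./models."""
+    if local_model_path is not None:
+        return local_model_path
+    return env.get("DIFFSYNTH_MODEL_BASE_PATH", "./models")
+
+
+print(resolve_model_root(None, {}))
+print(resolve_model_root(None, {"DIFFSYNTH_MODEL_BASE_PATH": "/data/models"}))
+print(resolve_model_root("./my_models", {"DIFFSYNTH_MODEL_BASE_PATH": "/data/models"}))
+```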
+
+## `DIFFSYNTH_ATTENTION_IMPLEMENTATION`
+
+Attention mechanism implementation method. Can be set to `flash_attention_3`, `flash_attention_2`, `sage_attention`, `xformers`, or `torch`. See [`./core/attention.md`](/docs/en/API_Reference/core/attention.md) for details.
+
+## `DIFFSYNTH_DISK_MAP_BUFFER_SIZE`
+
+Buffer size used in disk mapping. The default is `1000000000` (one billion). Larger values use more memory but give faster speeds.
+
+## `DIFFSYNTH_DOWNLOAD_SOURCE`
+
+Remote model download source. Can be set to `modelscope` or `huggingface` to control the source of model downloads. Default value is `modelscope`.
--- a/docs/en/Pipeline_Usage/Model_Inference.md
+++ b/docs/en/Pipeline_Usage/Model_Inference.md
@@ -0,0 +1,105 @@
+# Model Inference
+
+This document uses the Qwen-Image model as an example to introduce how to use `DiffSynth-Studio` for model inference.
+
+## Loading Models
+
+Models are loaded through `from_pretrained`:
+
+```python
+from diffsynth.pipelines.qwen_image import QwenImagePipeline, ModelConfig
+import torch
+
+pipe = QwenImagePipeline.from_pretrained(
+    torch_dtype=torch.bfloat16,
+    device="cuda",
+    model_configs=[
+        ModelConfig(model_id="Qwen/Qwen-Image", origin_file_pattern="transformer/diffusion_pytorch_model*.safetensors"),
+        ModelConfig(model_id="Qwen/Qwen-Image", origin_file_pattern="text_encoder/model*.safetensors"),
+        ModelConfig(model_id="Qwen/Qwen-Image", origin_file_pattern="vae/diffusion_pytorch_model.safetensors"),
+    ],
+    tokenizer_config=ModelConfig(model_id="Qwen/Qwen-Image", origin_file_pattern="tokenizer/"),
+)
+```
+
+Here, `torch_dtype` and `device` are the computation precision and computation device (not the model's own precision and device). Model paths in `model_configs` can be specified in several ways, described below. For how models are loaded internally in this project, please refer to [`diffsynth.core.loader`](/docs/en/API_Reference/core/loader.md).
+
+<details>
+
+<summary>Download and load models from remote sources</summary>
+
+> `DiffSynth-Studio` downloads and loads models from [ModelScope](https://www.modelscope.cn/) by default. You need to fill in `model_id` and `origin_file_pattern`, for example:
+> 
+> ```python
+> ModelConfig(model_id="Qwen/Qwen-Image", origin_file_pattern="transformer/diffusion_pytorch_model*.safetensors"),
+> ```
+> 
+> Model files are downloaded to the `./models` path by default, which can be modified through [environment variable DIFFSYNTH_MODEL_BASE_PATH](/docs/en/Pipeline_Usage/Environment_Variables.md#diffsynth_model_base_path).
+
+</details>
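+
+To get a feel for which files an `origin_file_pattern` selects, the sketch below applies glob-style matching to a made-up file list. Both the file list and the assumption that patterns follow `fnmatch` semantics are illustrative; the library's own matching logic is not shown in this document.
+
+```python
+from fnmatch import fnmatchcase
+
+# Hypothetical repository listing, for illustration only.
+repo_files = [
+    "transformer/diffusion_pytorch_model-00001-of-00009.safetensors",
+    "transformer/config.json",
+    "text_encoder/model-00001-of-00004.safetensors",
+    "vae/diffusion_pytorch_model.safetensors",
+]
+
+pattern = "transformer/diffusion_pytorch_model*.safetensors"
+# Keep only the files whose path matches the glob pattern.
+matched = [name for name in repo_files if fnmatchcase(name, pattern)]
+print(matched)
+```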
+
+<details>
+
+<summary>Load models from local file paths</summary>
+
+> Fill in `path`, for example:
+> 
+> ```python
+> ModelConfig(path="models/xxx.safetensors")
+> ```
+> 
+> For models loaded from multiple files, use a list, for example:
+> 
+> ```python
+> ModelConfig(path=[
+>     "models/Qwen/Qwen-Image/text_encoder/model-00001-of-00004.safetensors",
+>     "models/Qwen/Qwen-Image/text_encoder/model-00002-of-00004.safetensors",
+>     "models/Qwen/Qwen-Image/text_encoder/model-00003-of-00004.safetensors",
+>     "models/Qwen/Qwen-Image/text_encoder/model-00004-of-00004.safetensors",
+> ])
+> ```
+
+</details>
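+
+Writing out every shard path by hand is error-prone. A small helper can generate the list instead; `shard_paths` is a hypothetical name, and the `-0000N-of-0000M` naming is taken from the example above.
+
+```python
+def shard_paths(directory, stem, num_shards):
+    """Build the list of sharded .safetensors paths (illustrative helper)."""
+    return [
+        f"{directory}/{stem}-{index:05d}-of-{num_shards:05d}.safetensors"
+        for index in range(1, num_shards + 1)
+    ]
+
+
+paths = shard_paths("models/Qwen/Qwen-Image/text_encoder", "model", 4)
+for path in paths:
+    print(path)
+```
+
+The resulting list can then be passed as `ModelConfig(path=paths)`.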
+
+By default, even after models have been downloaded, the program will still query remotely for missing files. To completely disable remote requests, set [environment variable DIFFSYNTH_SKIP_DOWNLOAD](/docs/en/Pipeline_Usage/Environment_Variables.md#diffsynth_skip_download) to `True`.
+
+```python
+import os
+os.environ["DIFFSYNTH_SKIP_DOWNLOAD"] = "True"
+import diffsynth
+```
+
+To download models from [HuggingFace](https://huggingface.co/), set [environment variable DIFFSYNTH_DOWNLOAD_SOURCE](/docs/en/Pipeline_Usage/Environment_Variables.md#diffsynth_download_source) to `huggingface`.
+
+```python
+import os
+os.environ["DIFFSYNTH_DOWNLOAD_SOURCE"] = "huggingface"
+import diffsynth
+```
+
+## Starting Inference
+
+Input a prompt to start the inference process and generate an image.
+
+```python
+from diffsynth.pipelines.qwen_image import QwenImagePipeline, ModelConfig
+import torch
+
+pipe = QwenImagePipeline.from_pretrained(
+    torch_dtype=torch.bfloat16,
+    device="cuda",
+    model_configs=[
+        ModelConfig(model_id="Qwen/Qwen-Image", origin_file_pattern="transformer/diffusion_pytorch_model*.safetensors"),
+        ModelConfig(model_id="Qwen/Qwen-Image", origin_file_pattern="text_encoder/model*.safetensors"),
+        ModelConfig(model_id="Qwen/Qwen-Image", origin_file_pattern="vae/diffusion_pytorch_model.safetensors"),
+    ],
+    tokenizer_config=ModelConfig(model_id="Qwen/Qwen-Image", origin_file_pattern="tokenizer/"),
+)
+prompt = "Exquisite portrait, underwater girl, blue dress flowing, hair floating, translucent light, bubbles surrounding, peaceful face, intricate details, dreamy and ethereal."
+image = pipe(prompt, seed=0, num_inference_steps=40)
+image.save("image.jpg")
+```
+
+Each model `Pipeline` has different input parameters. Please refer to the documentation for each model.
+
+If the model parameters are too large, causing insufficient VRAM, please enable [VRAM management](/docs/en/Pipeline_Usage/VRAM_management.md).
--- a/docs/en/Pipeline_Usage/Model_Training.md
+++ b/docs/en/Pipeline_Usage/Model_Training.md
@@ -0,0 +1,247 @@
+# Model Training
+
+This document introduces how to use `DiffSynth-Studio` for model training.
+
+## Script Parameters
+
+Training scripts typically include the following parameters:
+
+* Dataset base configuration
+    * `--dataset_base_path`: Root directory of the dataset.
+    * `--dataset_metadata_path`: Metadata file path of the dataset.
+    * `--dataset_repeat`: Number of times the dataset is repeated in each epoch.
+    * `--dataset_num_workers`: Number of processes for each Dataloader.
+    * `--data_file_keys`: Field names that need to be loaded from metadata, usually image or video file paths, separated by `,`.
+* Model loading configuration
+    * `--model_paths`: Paths of models to be loaded. JSON format.
+    * `--model_id_with_origin_paths`: Model IDs with original paths, for example `"Qwen/Qwen-Image:transformer/diffusion_pytorch_model*.safetensors"`. Separated by commas.
+    * `--extra_inputs`: Extra input parameters required by the model Pipeline, separated by `,`. For example, training the image editing model Qwen-Image-Edit requires the extra parameter `edit_image`.
+    * `--fp8_models`: Models loaded in FP8 format, consistent with the format of `--model_paths` or `--model_id_with_origin_paths`. Currently only supports models whose parameters are not updated by gradients (no gradient backpropagation, or gradients only update their LoRA).
+* Training base configuration
+    * `--learning_rate`: Learning rate.
+    * `--num_epochs`: Number of epochs.
+    * `--trainable_models`: Trainable models, for example `dit`, `vae`, `text_encoder`.
+    * `--find_unused_parameters`: Whether there are unused parameters in DDP training. Some models contain redundant parameters that do not participate in gradient calculation, and this setting needs to be enabled to avoid errors in multi-GPU training.
+    * `--weight_decay`: Weight decay size. See [torch.optim.AdamW](https://docs.pytorch.org/docs/stable/generated/torch.optim.AdamW.html) for details.
+    * `--task`: Training task, default is `sft`. Some models support more training modes. Please refer to the documentation for each specific model.
+* Output configuration
+    * `--output_path`: Model save path.
+    * `--remove_prefix_in_ckpt`: Remove prefixes in the state dict of model files.
+    * `--save_steps`: Interval of training steps for saving models. If this parameter is left blank, the model will be saved once per epoch.
+* LoRA configuration
+    * `--lora_base_model`: Which model LoRA is added to.
+    * `--lora_target_modules`: Which layers LoRA is added to.
+    * `--lora_rank`: Rank of LoRA.
+    * `--lora_checkpoint`: Path of LoRA checkpoint. If this path is provided, LoRA will be loaded from this checkpoint.
+    * `--preset_lora_path`: Preset LoRA checkpoint path. If this path is provided, this LoRA will be loaded in the form of being merged into the base model. This parameter is used for LoRA differential training.
+    * `--preset_lora_model`: Model that preset LoRA is merged into, for example `dit`.
+* Gradient configuration
+    * `--use_gradient_checkpointing`: Whether to enable gradient checkpointing.
+    * `--use_gradient_checkpointing_offload`: Whether to offload gradient checkpointing to CPU memory.
+    * `--gradient_accumulation_steps`: Number of gradient accumulation steps.
+* Image dimension configuration (applicable to image generation models and video generation models)
+    * `--height`: Height of images or videos. Leave `height` and `width` blank to enable dynamic resolution.
+    * `--width`: Width of images or videos. Leave `height` and `width` blank to enable dynamic resolution.
+    * `--max_pixels`: Maximum pixel area of images or video frames. When dynamic resolution is enabled, images with resolution larger than this value will be scaled down, and images with resolution smaller than this value will remain unchanged.
+
+Some models' training scripts also contain additional parameters. See the documentation for each model for details.
+
+## Preparing Datasets
+
+`DiffSynth-Studio` adopts a universal dataset format. The dataset contains a series of data files (images, videos, etc.) and annotated metadata files. We recommend organizing dataset files as follows:
+
+```
+data/example_image_dataset/
+├── metadata.csv
+├── image_1.jpg
+└── image_2.jpg
+```
+
+Where `image_1.jpg`, `image_2.jpg` are training image data, and `metadata.csv` is the metadata list, for example:
+
+```
+image,prompt
+image_1.jpg,"a dog"
+image_2.jpg,"a cat"
+```
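+
+As a rough illustration of how such a metadata file maps to training samples, the sketch below parses the CSV with the standard library and resolves file columns against the dataset root. `load_metadata` is a hypothetical helper, not the loader in `diffsynth.core.data`; the `file_keys` argument mirrors the idea behind `--data_file_keys`.
+
+```python
+import csv
+import io
+import posixpath
+
+metadata_text = 'image,prompt\nimage_1.jpg,"a dog"\nimage_2.jpg,"a cat"\n'
+
+
+def load_metadata(stream, base_path, file_keys=("image",)):
+    """Read metadata rows and turn file columns into paths under base_path."""
+    rows = []
+    for row in csv.DictReader(stream):
+        for key in file_keys:
+            row[key] = posixpath.join(base_path, row[key])
+        rows.append(row)
+    return rows
+
+
+rows = load_metadata(io.StringIO(metadata_text), "data/example_image_dataset")
+for row in rows:
+    print(row["image"], "->", row["prompt"])
+```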
+
+We have built sample datasets for your testing. To understand how the universal dataset architecture is implemented, please refer to [`diffsynth.core.data`](/docs/en/API_Reference/core/data.md).
+
+<details>
+
+<summary>Sample Image Dataset</summary>
+
+> ```shell
+> modelscope download --dataset DiffSynth-Studio/example_image_dataset --local_dir ./data/example_image_dataset
+> ```
+> 
+> Applicable to training of image generation models such as Qwen-Image and FLUX.
+
+</details>
+
+<details>
+
+<summary>Sample Video Dataset</summary>
+
+> ```shell
+> modelscope download --dataset DiffSynth-Studio/example_video_dataset --local_dir ./data/example_video_dataset
+> ```
+> 
+> Applicable to training of video generation models such as Wan.
+
+</details>
+
+## Loading Models
+
+Similar to [model loading during inference](/docs/en/Pipeline_Usage/Model_Inference.md#loading-models), we support two ways to configure model paths: downloading from remote sources and loading from local files. The two methods can be mixed.
+
+<details>
+
+<summary>Download and load models from remote sources</summary>
+
+> If we load models during inference through the following settings:
+> 
+> ```python
+> model_configs=[
+>     ModelConfig(model_id="Qwen/Qwen-Image", origin_file_pattern="transformer/diffusion_pytorch_model*.safetensors"),
+>     ModelConfig(model_id="Qwen/Qwen-Image", origin_file_pattern="text_encoder/model*.safetensors"),
+>     ModelConfig(model_id="Qwen/Qwen-Image", origin_file_pattern="vae/diffusion_pytorch_model.safetensors"),
+> ]
+> ```
+> 
+> Then during training, fill in the following parameters to load the corresponding models:
+> 
+> ```shell
+> --model_id_with_origin_paths "Qwen/Qwen-Image:transformer/diffusion_pytorch_model*.safetensors,Qwen/Qwen-Image:text_encoder/model*.safetensors,Qwen/Qwen-Image:vae/diffusion_pytorch_model.safetensors"
+> ```
+> 
+> Model files are downloaded to the `./models` path by default, which can be modified through [environment variable DIFFSYNTH_MODEL_BASE_PATH](/docs/en/Pipeline_Usage/Environment_Variables.md#diffsynth_model_base_path).
+> 
+> By default, even after models have been downloaded, the program will still query remotely for missing files. To completely disable remote requests, set [environment variable DIFFSYNTH_SKIP_DOWNLOAD](/docs/en/Pipeline_Usage/Environment_Variables.md#diffsynth_skip_download) to `True`.
+
+</details>
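+
+The correspondence between the command-line string and the `ModelConfig` entries can be sketched as below. The parsing function is hypothetical and assumes that entries are separated by `,` and that the first `:` separates the model ID from the file pattern.
+
+```python
+def parse_model_id_with_origin_paths(value):
+    """Split 'model_id:pattern,model_id:pattern' into (model_id, pattern) pairs."""
+    pairs = []
+    for item in value.split(","):
+        model_id, pattern = item.split(":", 1)
+        pairs.append((model_id, pattern))
+    return pairs
+
+
+argument = (
+    "Qwen/Qwen-Image:transformer/diffusion_pytorch_model*.safetensors,"
+    "Qwen/Qwen-Image:text_encoder/model*.safetensors,"
+    "Qwen/Qwen-Image:vae/diffusion_pytorch_model.safetensors"
+)
+pairs = parse_model_id_with_origin_paths(argument)
+for model_id, pattern in pairs:
+    print(model_id, pattern)
+```
+
+Each pair corresponds to one `ModelConfig(model_id=..., origin_file_pattern=...)` entry.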
+
+<details>
+
+<summary>Load models from local file paths</summary>
+
+> If loading models from local files during inference, for example:
+> 
+> ```python
+> model_configs=[
+>     ModelConfig([
+>         "models/Qwen/Qwen-Image/transformer/diffusion_pytorch_model-00001-of-00009.safetensors",
+>         "models/Qwen/Qwen-Image/transformer/diffusion_pytorch_model-00002-of-00009.safetensors",
+>         "models/Qwen/Qwen-Image/transformer/diffusion_pytorch_model-00003-of-00009.safetensors",
+>         "models/Qwen/Qwen-Image/transformer/diffusion_pytorch_model-00004-of-00009.safetensors",
+>         "models/Qwen/Qwen-Image/transformer/diffusion_pytorch_model-00005-of-00009.safetensors",
+>         "models/Qwen/Qwen-Image/transformer/diffusion_pytorch_model-00006-of-00009.safetensors",
+>         "models/Qwen/Qwen-Image/transformer/diffusion_pytorch_model-00007-of-00009.safetensors",
+>         "models/Qwen/Qwen-Image/transformer/diffusion_pytorch_model-00008-of-00009.safetensors",
+>         "models/Qwen/Qwen-Image/transformer/diffusion_pytorch_model-00009-of-00009.safetensors"
+>     ]),
+>     ModelConfig([
+>         "models/Qwen/Qwen-Image/text_encoder/model-00001-of-00004.safetensors",
+>         "models/Qwen/Qwen-Image/text_encoder/model-00002-of-00004.safetensors",
+>         "models/Qwen/Qwen-Image/text_encoder/model-00003-of-00004.safetensors",
+>         "models/Qwen/Qwen-Image/text_encoder/model-00004-of-00004.safetensors"
+>     ]),
+>     ModelConfig("models/Qwen/Qwen-Image/vae/diffusion_pytorch_model.safetensors")
+> ]
+> ```
+> 
+> Then during training, set to:
+> 
+> ```shell
+> --model_paths '[
+>     [
+>         "models/Qwen/Qwen-Image/transformer/diffusion_pytorch_model-00001-of-00009.safetensors",
+>         "models/Qwen/Qwen-Image/transformer/diffusion_pytorch_model-00002-of-00009.safetensors",
+>         "models/Qwen/Qwen-Image/transformer/diffusion_pytorch_model-00003-of-00009.safetensors",
+>         "models/Qwen/Qwen-Image/transformer/diffusion_pytorch_model-00004-of-00009.safetensors",
+>         "models/Qwen/Qwen-Image/transformer/diffusion_pytorch_model-00005-of-00009.safetensors",
+>         "models/Qwen/Qwen-Image/transformer/diffusion_pytorch_model-00006-of-00009.safetensors",
+>         "models/Qwen/Qwen-Image/transformer/diffusion_pytorch_model-00007-of-00009.safetensors",
+>         "models/Qwen/Qwen-Image/transformer/diffusion_pytorch_model-00008-of-00009.safetensors",
+>         "models/Qwen/Qwen-Image/transformer/diffusion_pytorch_model-00009-of-00009.safetensors"
+>     ],
+>     [
+>         "models/Qwen/Qwen-Image/text_encoder/model-00001-of-00004.safetensors",
+>         "models/Qwen/Qwen-Image/text_encoder/model-00002-of-00004.safetensors",
+>         "models/Qwen/Qwen-Image/text_encoder/model-00003-of-00004.safetensors",
+>         "models/Qwen/Qwen-Image/text_encoder/model-00004-of-00004.safetensors"
+>     ],
+>     "models/Qwen/Qwen-Image/vae/diffusion_pytorch_model.safetensors"
+> ]' \
+> ```
+> 
+> Note that `--model_paths` must be valid JSON. In particular, trailing commas are not allowed; otherwise the value cannot be parsed.
+
+</details>
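+
+One way to avoid JSON mistakes such as trailing commas is to generate the `--model_paths` value with `json.dumps` rather than typing it by hand. This is a suggestion only; the shortened path list below is illustrative.
+
+```python
+import json
+import shlex
+
+model_paths = [
+    [
+        f"models/Qwen/Qwen-Image/text_encoder/model-{index:05d}-of-00004.safetensors"
+        for index in range(1, 5)
+    ],
+    "models/Qwen/Qwen-Image/vae/diffusion_pytorch_model.safetensors",
+]
+
+# json.dumps always emits valid JSON, so no trailing commas can slip in.
+argument = json.dumps(model_paths)
+# shlex.quote wraps the value so that it survives shell parsing.
+print("--model_paths", shlex.quote(argument))
+```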
+
+## Setting Trainable Modules
+
+The training framework supports training of any model. Taking Qwen-Image as an example, to fully train the DiT model, set to:
+
+```shell
+--trainable_models "dit"
+```
+
+To train LoRA of the DiT model, set to:
+
+```shell
+--lora_base_model dit --lora_target_modules "to_q,to_k,to_v" --lora_rank 32
+```
+
+We hope to leave enough room for technical exploration, so the framework supports training any number of modules simultaneously. For example, to train the text encoder, controlnet, and LoRA of the DiT simultaneously:
+
+```shell
+--trainable_models "text_encoder,controlnet" --lora_base_model dit --lora_target_modules "to_q,to_k,to_v" --lora_rank 32
+```
+
+Additionally, since the training script loads multiple modules (text encoder, dit, vae, etc.), prefixes need to be removed when saving model files. For example, when fully training the DiT part or training the LoRA model of the DiT part, please set `--remove_prefix_in_ckpt pipe.dit.`. If multiple modules are trained simultaneously, developers need to write code to split the state dict in the model file after training is completed.
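+
+For the multi-module case, splitting the saved state dict could look like the sketch below. It uses plain Python values in place of tensors, and both the helper name and the `pipe.text_encoder.` prefix are assumptions made by analogy with `pipe.dit.`.
+
+```python
+def extract_module_state_dict(state_dict, prefix):
+    """Keep only keys starting with prefix and strip the prefix (illustrative)."""
+    return {
+        key[len(prefix):]: value
+        for key, value in state_dict.items()
+        if key.startswith(prefix)
+    }
+
+
+# Stand-in for a checkpoint that contains two trained modules.
+state_dict = {
+    "pipe.dit.blocks.0.weight": 1,
+    "pipe.dit.blocks.0.bias": 2,
+    "pipe.text_encoder.embed.weight": 3,
+}
+split = {
+    name: extract_module_state_dict(state_dict, f"pipe.{name}.")
+    for name in ("dit", "text_encoder")
+}
+print(split)
+```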
+
+## Starting the Training Program
+
+The training framework is built on [`accelerate`](https://huggingface.co/docs/accelerate/index). Training commands are written in the following format:
+
+```shell
+accelerate launch xxx/train.py \
+  --xxx yyy \
+  --xxxx yyyy
+```
+
+We have written preset training scripts for each model. See the documentation for each model for details.
+
+By default, `accelerate` will train according to the configuration in `~/.cache/huggingface/accelerate/default_config.yaml`. Use `accelerate config` to configure interactively in the terminal, including multi-GPU training, [`DeepSpeed`](https://www.deepspeed.ai/), etc.
+
+We provide recommended `accelerate` configuration files for some models, which can be set through `--config_file`. For example, full training of the Qwen-Image model:
+
+```shell
+accelerate launch --config_file examples/qwen_image/model_training/full/accelerate_config_zero2offload.yaml examples/qwen_image/model_training/train.py \
+  --dataset_base_path data/example_image_dataset \
+  --dataset_metadata_path data/example_image_dataset/metadata.csv \
+  --max_pixels 1048576 \
+  --dataset_repeat 50 \
+  --model_id_with_origin_paths "Qwen/Qwen-Image:transformer/diffusion_pytorch_model*.safetensors,Qwen/Qwen-Image:text_encoder/model*.safetensors,Qwen/Qwen-Image:vae/diffusion_pytorch_model.safetensors" \
+  --learning_rate 1e-5 \
+  --num_epochs 2 \
+  --remove_prefix_in_ckpt "pipe.dit." \
+  --output_path "./models/train/Qwen-Image_full" \
+  --trainable_models "dit" \
+  --use_gradient_checkpointing \
+  --find_unused_parameters
+```
+
+## Training Considerations
+
+* In addition to the `csv` format, dataset metadata also supports `json` and `jsonl` formats. For how to choose the best metadata format, see [`diffsynth.core.data`](/docs/en/API_Reference/core/data.md#metadata).
+* Training effectiveness is usually strongly correlated with training steps and weakly correlated with epoch count. Therefore, we recommend using the `--save_steps` parameter to save model files at training step intervals.
+* When the number of samples multiplied by `dataset_repeat` exceeds $10^9$, we observed that data loading becomes significantly slower. This appears to be a `PyTorch` bug, and we are not sure whether newer versions of `PyTorch` have fixed it.
+* For learning rate `--learning_rate`, it is recommended to set to `1e-4` in LoRA training and `1e-5` in full training.
+* The training framework does not support batch size > 1. The reasons are complex. See [Q&A: Why doesn't the training framework support batch size > 1?](/docs/en/QA.md#why-doesnt-the-training-framework-support-batch-size--1)
+* Some models contain redundant parameters, for example the text encoding part of the last layer of Qwen-Image's DiT. When training these models, set `--find_unused_parameters` to avoid errors in multi-GPU training. For compatibility with community models, we do not intend to remove these redundant parameters.
+* The loss function value of Diffusion models has little relationship with actual effects. Therefore, we do not record loss function values during training. We recommend setting `--num_epochs` to a sufficiently large value, testing while training, and manually closing the training program after the effect converges.
+* `--use_gradient_checkpointing` is usually enabled unless GPU VRAM is sufficient; `--use_gradient_checkpointing_offload` is enabled as needed. See [`diffsynth.core.gradient`](/docs/en/API_Reference/core/gradient.md) for details.
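+
+When choosing a value for `--save_steps`, it helps to estimate how many iterations one epoch contains. The sketch below is a back-of-the-envelope estimate that assumes batch size 1 and an even split of samples across processes; whether `--save_steps` counts exactly these iterations is not stated here, so treat the result as approximate.
+
+```python
+import math
+
+
+def iterations_per_epoch(num_samples, dataset_repeat, num_processes=1):
+    """Approximate per-process iterations in one epoch at batch size 1."""
+    return math.ceil(num_samples * dataset_repeat / num_processes)
+
+
+# The sample dataset layout above has 2 images; with --dataset_repeat 50:
+print(iterations_per_epoch(2, 50))
+print(iterations_per_epoch(2, 50, num_processes=8))
+```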
--- a/docs/en/Pipeline_Usage/Setup.md
+++ b/docs/en/Pipeline_Usage/Setup.md
@@ -0,0 +1,21 @@
+# Installing Dependencies
+
+Install from source (recommended):
+
+```shell
+git clone https://github.com/modelscope/DiffSynth-Studio.git
+cd DiffSynth-Studio
+pip install -e .
+```
+
+Install from PyPI (there may be delays in version updates; for latest features, install from source):
+
+```shell
+pip install diffsynth
+```
+
+If you encounter issues during installation, they may be caused by upstream dependency packages. Please refer to the documentation for these packages:
+
+* [torch](https://pytorch.org/get-started/locally/)
+* [sentencepiece](https://github.com/google/sentencepiece)
+* [cmake](https://cmake.org)
--- a/docs/en/Pipeline_Usage/VRAM_management.md
+++ b/docs/en/Pipeline_Usage/VRAM_management.md
@@ -0,0 +1,206 @@
+# VRAM Management
+
+VRAM management is a distinctive feature of `DiffSynth-Studio` that enables GPUs with low VRAM to run inference with large parameter models. This document uses Qwen-Image as an example to introduce how to use the VRAM management solution.
+
+## Basic Inference
+
+The following code does not enable any VRAM management, occupying 56G VRAM as a reference.
+
+```python
+from diffsynth.pipelines.qwen_image import QwenImagePipeline, ModelConfig
+import torch
+
+pipe = QwenImagePipeline.from_pretrained(
+    torch_dtype=torch.bfloat16,
+    device="cuda",
+    model_configs=[
+        ModelConfig(model_id="Qwen/Qwen-Image", origin_file_pattern="transformer/diffusion_pytorch_model*.safetensors"),
+        ModelConfig(model_id="Qwen/Qwen-Image", origin_file_pattern="text_encoder/model*.safetensors"),
+        ModelConfig(model_id="Qwen/Qwen-Image", origin_file_pattern="vae/diffusion_pytorch_model.safetensors"),
+    ],
+    tokenizer_config=ModelConfig(model_id="Qwen/Qwen-Image", origin_file_pattern="tokenizer/"),
+)
+prompt = "Exquisite portrait, underwater girl, blue dress flowing, hair floating, translucent light, bubbles surrounding, peaceful face, intricate details, dreamy and ethereal."
+image = pipe(prompt, seed=0, num_inference_steps=40)
+image.save("image.jpg")
+```
+
+## CPU Offload
+
+Since the model `Pipeline` consists of multiple components that are not called simultaneously, we can move some components to CPU memory (referred to simply as "memory" below) when they are not needed for computation, reducing VRAM usage. The following code implements this logic, occupying 40G VRAM.
+
+```python
+from diffsynth.pipelines.qwen_image import QwenImagePipeline, ModelConfig
+import torch
+
+vram_config = {
+    "offload_dtype": torch.bfloat16,
+    "offload_device": "cpu",
+    "onload_dtype": torch.bfloat16,
+    "onload_device": "cuda",
+    "preparing_dtype": torch.bfloat16,
+    "preparing_device": "cuda",
+    "computation_dtype": torch.bfloat16,
+    "computation_device": "cuda",
+}
+pipe = QwenImagePipeline.from_pretrained(
+    torch_dtype=torch.bfloat16,
+    device="cuda",
+    model_configs=[
+        ModelConfig(model_id="Qwen/Qwen-Image", origin_file_pattern="transformer/diffusion_pytorch_model*.safetensors", **vram_config),
+        ModelConfig(model_id="Qwen/Qwen-Image", origin_file_pattern="text_encoder/model*.safetensors", **vram_config),
+        ModelConfig(model_id="Qwen/Qwen-Image", origin_file_pattern="vae/diffusion_pytorch_model.safetensors", **vram_config),
+    ],
+    tokenizer_config=ModelConfig(model_id="Qwen/Qwen-Image", origin_file_pattern="tokenizer/"),
+)
+prompt = "Exquisite portrait, underwater girl, blue dress flowing, hair floating, translucent light, bubbles surrounding, peaceful face, intricate details, dreamy and ethereal."
+image = pipe(prompt, seed=0, num_inference_steps=40)
+image.save("image.jpg")
+```
+
+## FP8 Quantization
+
+Building upon CPU Offload, we further enable FP8 quantization to reduce VRAM requirements. The following code allows model parameters to be stored in VRAM with FP8 precision and temporarily converted to BF16 precision for computation during inference, occupying 21G VRAM. However, this quantization scheme has minor image quality degradation issues.
+
+```python
+from diffsynth.pipelines.qwen_image import QwenImagePipeline, ModelConfig
+import torch
+
+vram_config = {
+    "offload_dtype": torch.float8_e4m3fn,
+    "offload_device": "cpu",
+    "onload_dtype": torch.float8_e4m3fn,
+    "onload_device": "cuda",
+    "preparing_dtype": torch.float8_e4m3fn,
+    "preparing_device": "cuda",
+    "computation_dtype": torch.bfloat16,
+    "computation_device": "cuda",
+}
+pipe = QwenImagePipeline.from_pretrained(
+    torch_dtype=torch.bfloat16,
+    device="cuda",
+    model_configs=[
+        ModelConfig(model_id="Qwen/Qwen-Image", origin_file_pattern="transformer/diffusion_pytorch_model*.safetensors", **vram_config),
+        ModelConfig(model_id="Qwen/Qwen-Image", origin_file_pattern="text_encoder/model*.safetensors", **vram_config),
+        ModelConfig(model_id="Qwen/Qwen-Image", origin_file_pattern="vae/diffusion_pytorch_model.safetensors", **vram_config),
+    ],
+    tokenizer_config=ModelConfig(model_id="Qwen/Qwen-Image", origin_file_pattern="tokenizer/"),
+)
+prompt = "Exquisite portrait, underwater girl, blue dress flowing, hair floating, translucent light, bubbles surrounding, peaceful face, intricate details, dreamy and ethereal."
+image = pipe(prompt, seed=0, num_inference_steps=40)
+image.save("image.jpg")
+```
+
+> Q: Why temporarily convert to BF16 precision during inference instead of computing with FP8 precision?
+> 
+> A: Native FP8 computation is only supported on recent GPU architectures such as Hopper (for example, H20) and has significant computational errors, so we currently do not enable FP8 precision computation. The current FP8 quantization only reduces VRAM usage and does not improve computation speed.
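+
+The storage saving from FP8 follows directly from the bytes used per parameter: 2 for BF16 and 1 for FP8. The sketch below works through the arithmetic for a made-up parameter count; it covers weights only, not activations, so it does not reproduce the measured VRAM figures above.
+
+```python
+BYTES_PER_PARAM = {"bfloat16": 2, "float8_e4m3fn": 1}
+
+
+def weight_gib(num_params, dtype):
+    """Memory needed to store the weights alone, in GiB."""
+    return num_params * BYTES_PER_PARAM[dtype] / (1024 ** 3)
+
+
+# Hypothetical model size chosen so the numbers come out round.
+hypothetical_params = 10 * 1024 ** 3
+print(weight_gib(hypothetical_params, "bfloat16"))
+print(weight_gib(hypothetical_params, "float8_e4m3fn"))
+```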
+
+## Dynamic VRAM Management
+
+In CPU Offload, we control model components. In fact, we support Layer-level Offload, splitting a model into multiple Layers, keeping some resident in VRAM and storing others in memory for on-demand transfer to VRAM for computation. This feature requires model developers to provide detailed VRAM management solutions for each model. Related configurations are in `diffsynth/configs/vram_management_module_maps.py`.
+
+By adding the `vram_limit` parameter to the `Pipeline`, the framework can automatically sense the remaining VRAM of the device and decide how to split the model between VRAM and memory. The smaller the `vram_limit`, the less VRAM is occupied, but the slower the inference.
+* When `vram_limit=None`, the default state, the framework assumes unlimited VRAM and dynamic VRAM management is disabled
+* When `vram_limit=10`, the framework will limit the model after VRAM usage exceeds 10G, moving the excess parts to memory storage
+* When `vram_limit=0`, the framework will do its best to reduce VRAM usage, storing all model parameters in memory and transferring them to VRAM for computation only when necessary
+
+When VRAM is insufficient to run model inference, the framework will attempt to exceed the `vram_limit` restriction to keep the model inference running. Therefore, the VRAM management framework cannot always guarantee that VRAM usage will be less than `vram_limit`. We recommend setting it to slightly less than the actual available VRAM. For example, when GPU VRAM is 16G, set it to `vram_limit=15.5`. In `PyTorch`, you can use `torch.cuda.mem_get_info("cuda")[1] / (1024 ** 3)` to get the GPU's VRAM.
+
+```python
+from diffsynth.pipelines.qwen_image import QwenImagePipeline, ModelConfig
+import torch
+
+vram_config = {
+    "offload_dtype": torch.float8_e4m3fn,
+    "offload_device": "cpu",
+    "onload_dtype": torch.float8_e4m3fn,
+    "onload_device": "cpu",
+    "preparing_dtype": torch.float8_e4m3fn,
+    "preparing_device": "cuda",
+    "computation_dtype": torch.bfloat16,
+    "computation_device": "cuda",
+}
+pipe = QwenImagePipeline.from_pretrained(
+    torch_dtype=torch.bfloat16,
+    device="cuda",
+    model_configs=[
+        ModelConfig(model_id="Qwen/Qwen-Image", origin_file_pattern="transformer/diffusion_pytorch_model*.safetensors", **vram_config),
+        ModelConfig(model_id="Qwen/Qwen-Image", origin_file_pattern="text_encoder/model*.safetensors", **vram_config),
+        ModelConfig(model_id="Qwen/Qwen-Image", origin_file_pattern="vae/diffusion_pytorch_model.safetensors", **vram_config),
+    ],
+    tokenizer_config=ModelConfig(model_id="Qwen/Qwen-Image", origin_file_pattern="tokenizer/"),
+    vram_limit=torch.cuda.mem_get_info("cuda")[1] / (1024 ** 3) - 0.5,
+)
+prompt = "Exquisite portrait, underwater girl, blue dress flowing, hair floating, translucent light, bubbles surrounding, peaceful face, intricate details, dreamy and ethereal."
+image = pipe(prompt, seed=0, num_inference_steps=40)
+image.save("image.jpg")
+```
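+
+The headroom rule used for `vram_limit` above can be isolated into a small function. `suggest_vram_limit` is a hypothetical helper, and the 0.5 GiB default headroom simply mirrors the example.
+
+```python
+def suggest_vram_limit(total_bytes, headroom_gib=0.5):
+    """Convert total VRAM in bytes to a vram_limit in GiB, minus some headroom."""
+    total_gib = total_bytes / (1024 ** 3)
+    return max(total_gib - headroom_gib, 0.0)
+
+
+# A 16 GiB card yields the 15.5 suggested in the text.
+print(suggest_vram_limit(16 * 1024 ** 3))
+```
+
+On a real device, `total_bytes` would come from `torch.cuda.mem_get_info("cuda")[1]`.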
+
+## Disk Offload
+
+In more extreme cases, when memory is also insufficient to store the entire model, the Disk Offload feature allows lazy loading of model parameters, meaning each Layer of the model only reads the corresponding parameters from disk when the forward function is called. When enabling this feature, we recommend using high-speed SSD drives.
+
+Disk Offload is a very special VRAM management solution that only supports `.safetensors` format files, not `.bin`, `.pth`, `.ckpt`, or other binary files, and does not support [state dict converter](/docs/en/Developer_Guide/Integrating_Your_Model.md#step-2-model-file-format-conversion) with Tensor reshape.
+
+```python
+from diffsynth.pipelines.qwen_image import QwenImagePipeline, ModelConfig
+import torch
+
+vram_config = {
+    "offload_dtype": "disk",
+    "offload_device": "disk",
+    "onload_dtype": "disk",
+    "onload_device": "disk",
+    "preparing_dtype": torch.float8_e4m3fn,
+    "preparing_device": "cuda",
+    "computation_dtype": torch.bfloat16,
+    "computation_device": "cuda",
+}
+pipe = QwenImagePipeline.from_pretrained(
+    torch_dtype=torch.bfloat16,
+    device="cuda",
+    model_configs=[
+        ModelConfig(model_id="Qwen/Qwen-Image", origin_file_pattern="transformer/diffusion_pytorch_model*.safetensors", **vram_config),
+        ModelConfig(model_id="Qwen/Qwen-Image", origin_file_pattern="text_encoder/model*.safetensors", **vram_config),
+        ModelConfig(model_id="Qwen/Qwen-Image", origin_file_pattern="vae/diffusion_pytorch_model.safetensors", **vram_config),
+    ],
+    tokenizer_config=ModelConfig(model_id="Qwen/Qwen-Image", origin_file_pattern="tokenizer/"),
+    vram_limit=10,
+)
+prompt = "Exquisite portrait, underwater girl, blue dress flowing, hair floating, translucent light, bubbles surrounding, peaceful face, intricate details, dreamy and ethereal."
+image = pipe(prompt, seed=0, num_inference_steps=40)
+image.save("image.jpg")
+```
+
+## More Usage Methods
+
+Information in `vram_config` can be filled in manually, for example, Disk Offload without FP8 quantization:
+
+```python
+vram_config = {
+    "offload_dtype": "disk",
+    "offload_device": "disk",
+    "onload_dtype": "disk",
+    "onload_device": "disk",
+    "preparing_dtype": torch.bfloat16,
+    "preparing_device": "cuda",
+    "computation_dtype": torch.bfloat16,
+    "computation_device": "cuda",
+}
+```
+
+Specifically, the VRAM management module divides model Layers into the following four states:
+
+* Offload: The model will not be called in the near future. This state is switched by the `Pipeline`
+* Onload: The model may be called at any moment. This state is switched by the `Pipeline`
+* Preparing: Intermediate state between Onload and Computation, used as temporary storage when VRAM allows. This state is controlled by the VRAM management mechanism and is entered if and only if `vram_limit` is unlimited, or `vram_limit` is set and there is spare VRAM
+* Computation: The model is being computed. This state is controlled by the VRAM management mechanism and is temporarily entered only during `forward`
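+
+The four states and the condition for entering Preparing can be modelled as in the sketch below. This is a reading of the description above, not the framework's implementation; `LayerState`, `config_keys`, and `may_enter_preparing` are hypothetical names.
+
+```python
+from enum import Enum
+
+
+class LayerState(Enum):
+    OFFLOAD = "offload"
+    ONLOAD = "onload"
+    PREPARING = "preparing"
+    COMPUTATION = "computation"
+
+
+def config_keys(state):
+    """Names of the vram_config entries that describe a given state."""
+    return (f"{state.value}_dtype", f"{state.value}_device")
+
+
+def may_enter_preparing(vram_limit, used_vram_gib):
+    """Preparing is allowed with no limit, or while usage is below the limit."""
+    if vram_limit is None:
+        return True
+    return used_vram_gib < vram_limit
+
+
+print(config_keys(LayerState.ONLOAD))
+print(may_enter_preparing(10, 12))
+```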
+
+If you are a model developer and want to control the VRAM management granularity of a specific model, please refer to [../Developer_Guide/Enabling_VRAM_management.md](/docs/en/Developer_Guide/Enabling_VRAM_management.md).
+
+## Best Practices
+
+* Sufficient VRAM -> Use [Basic Inference](#basic-inference)
+* Insufficient VRAM
+    * Sufficient memory -> Use [Dynamic VRAM Management](#dynamic-vram-management)
+    * Insufficient memory -> Use [Disk Offload](#disk-offload)