DiffSynth-Studio 2.0 major update

This commit is contained in:
root
2025-12-04 16:33:07 +08:00
parent afd101f345
commit 72af7122b3
758 changed files with 26462 additions and 2221398 deletions

# `diffsynth.core.attention`: Attention Mechanism Implementation
`diffsynth.core.attention` provides a routing layer for attention implementations, automatically selecting an efficient implementation based on the packages available in the `Python` environment and on [environment variables](/docs/en/Pipeline_Usage/Environment_Variables.md#diffsynth_attention_implementation).
## Attention Mechanism
The attention mechanism is a model structure proposed in the paper ["Attention Is All You Need"](https://arxiv.org/abs/1706.03762). In the original paper, the attention mechanism is implemented according to the following formula:
$$
\text{Attention}(Q, K, V) = \text{Softmax}\left(
\frac{QK^T}{\sqrt{d_k}}
\right)
V.
$$
In `PyTorch`, it can be implemented with the following code:
```python
import torch

def attention(query, key, value):
    scale_factor = 1 / query.size(-1) ** 0.5
    attn_weight = query @ key.transpose(-2, -1) * scale_factor
    attn_weight = torch.softmax(attn_weight, dim=-1)
    return attn_weight @ value

query = torch.rand(32, 8, 128, 64, dtype=torch.bfloat16, device="cuda")
key = torch.rand(32, 8, 128, 64, dtype=torch.bfloat16, device="cuda")
value = torch.rand(32, 8, 128, 64, dtype=torch.bfloat16, device="cuda")
output_1 = attention(query, key, value)
```
The dimensions of `query`, `key`, and `value` are $(b, n, s, d)$:
* $b$: Batch size
* $n$: Number of attention heads
* $s$: Sequence length
* $d$: Dimension of each attention head
This computation contains no trainable parameters. In modern Transformer architectures, the inputs and outputs of this computation pass through Linear layers, but the "attention mechanism" discussed in this document refers only to the computation in the code above, excluding those layers.
## More Efficient Implementations
Note that the attention score in the attention mechanism ( $\text{Softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)$ in the formula, `attn_weight` in the code) has shape $(b, n, s, s)$, where the sequence length $s$ is typically very large, so the time and space complexity of the computation are quadratic in $s$. Taking image generation models as an example: when the width and height of an image double, the sequence length quadruples, and the compute and memory requirements of the attention score grow by a factor of 16. To avoid these high computational costs, more efficient attention implementations are needed, including:
* Flash Attention 3: [GitHub](https://github.com/Dao-AILab/flash-attention), [Paper](https://arxiv.org/abs/2407.08608)
* Flash Attention 2: [GitHub](https://github.com/Dao-AILab/flash-attention), [Paper](https://arxiv.org/abs/2307.08691)
* Sage Attention: [GitHub](https://github.com/thu-ml/SageAttention), [Paper](https://arxiv.org/abs/2505.11594)
* xFormers: [GitHub](https://github.com/facebookresearch/xformers), [Documentation](https://facebookresearch.github.io/xformers/components/ops.html#module-xformers.ops)
* PyTorch: [GitHub](https://github.com/pytorch/pytorch), [Documentation](https://docs.pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html)
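The quadratic growth described above can be checked with a quick back-of-envelope calculation (the head count and sequence lengths below are illustrative):

```python
def attn_score_elements(b, n, s):
    # The attention score tensor has shape (b, n, s, s).
    return b * n * s * s

# Doubling an image's width and height quadruples the sequence length s,
# which multiplies the attention-score size by 4^2 = 16.
small = attn_score_elements(1, 24, 1024)
large = attn_score_elements(1, 24, 4096)  # s quadrupled
print(large // small)  # 16
```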
To call attention implementations other than `PyTorch`'s, follow the installation instructions on the corresponding GitHub pages. `DiffSynth-Studio` automatically routes to the corresponding implementation based on the packages available in the Python environment; the choice can also be controlled through [environment variables](/docs/en/Pipeline_Usage/Environment_Variables.md#diffsynth_attention_implementation).
```python
from diffsynth.core.attention import attention_forward
import torch

def attention(query, key, value):
    scale_factor = 1 / query.size(-1) ** 0.5
    attn_weight = query @ key.transpose(-2, -1) * scale_factor
    attn_weight = torch.softmax(attn_weight, dim=-1)
    return attn_weight @ value

query = torch.rand(32, 8, 128, 64, dtype=torch.bfloat16, device="cuda")
key = torch.rand(32, 8, 128, 64, dtype=torch.bfloat16, device="cuda")
value = torch.rand(32, 8, 128, 64, dtype=torch.bfloat16, device="cuda")
output_1 = attention(query, key, value)
output_2 = attention_forward(query, key, value)
print((output_1 - output_2).abs().mean())
```
Please note that the accelerated implementations introduce numerical errors, but in most cases the error is negligible.
## Developer Guide
When integrating new models into `DiffSynth-Studio`, developers can decide whether to call `attention_forward` from `diffsynth.core.attention`, but we encourage models to call this module whenever possible, so that new attention implementations take effect on these models directly.
## Best Practices
**In most cases, we recommend using the native `PyTorch` implementation directly, without installing any additional packages.** Although the other attention implementations provide acceleration, the speedup is relatively limited, and in a few cases compatibility and precision issues may arise.
In addition, efficient attention implementations are gradually being integrated into `PyTorch` itself; `scaled_dot_product_attention` in `PyTorch` 2.9.0 already integrates Flash Attention 2. We still provide this interface in `DiffSynth-Studio` so that aggressive acceleration schemes can move quickly toward application, even though they still need time to be verified for stability.
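As a sketch of this recommendation, the naive implementation from earlier can be checked against PyTorch's built-in `scaled_dot_product_attention` (run here on CPU in `float32` for simplicity; the tensor shapes are illustrative):

```python
import torch
import torch.nn.functional as F

def naive_attention(query, key, value):
    scale_factor = 1 / query.size(-1) ** 0.5
    attn_weight = torch.softmax(query @ key.transpose(-2, -1) * scale_factor, dim=-1)
    return attn_weight @ value

q = torch.rand(2, 8, 128, 64)
k = torch.rand(2, 8, 128, 64)
v = torch.rand(2, 8, 128, 64)

out_sdpa = F.scaled_dot_product_attention(q, k, v)
out_naive = naive_attention(q, k, v)
# The two outputs agree up to small floating-point differences.
print(torch.allclose(out_sdpa, out_naive, atol=1e-5))
```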

# `diffsynth.core.data`: Data Processing Operators and Universal Dataset
## Data Processing Operators
### Available Data Processing Operators
`diffsynth.core.data` provides a series of data processing operators, including:
* Data format conversion operators
    * `ToInt`: Convert to `int` format
    * `ToFloat`: Convert to `float` format
    * `ToStr`: Convert to `str` format
    * `ToList`: Convert to `list` format, wrapping the data in a list
    * `ToAbsolutePath`: Convert relative paths to absolute paths
* File loading operators
    * `LoadImage`: Read image files
    * `LoadVideo`: Read video files
    * `LoadAudio`: Read audio files
    * `LoadGIF`: Read GIF files
    * `LoadTorchPickle`: Read binary files saved by [`torch.save`](https://docs.pytorch.org/docs/stable/generated/torch.save.html) (warning: loading pickled binary files can execute arbitrary code; use with caution!)
* Media file processing operators
    * `ImageCropAndResize`: Crop and resize images
* Meta operators
    * `SequencialProcess`: Apply an operator to each element of a sequence
    * `RouteByExtensionName`: Route to specific operators by file extension
    * `RouteByType`: Route to specific operators by data type
### Operator Usage
Data operators are connected with the `>>` symbol to form data processing pipelines, for example:
```python
from diffsynth.core.data.operators import *
data = "image.jpg"
data_pipeline = ToAbsolutePath(base_path="/data") >> LoadImage() >> ImageCropAndResize(max_pixels=512*512)
data = data_pipeline(data)
```
After passing through each operator, the data is processed in sequence:
* `ToAbsolutePath(base_path="/data")`: `"/data/image.jpg"`
* `LoadImage()`: `<PIL.Image.Image image mode=RGB size=1024x1024 at 0x7F8E7AAEFC10>`
* `ImageCropAndResize(max_pixels=512*512)`: `<PIL.Image.Image image mode=RGB size=512x512 at 0x7F8E7A936F20>`
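As a conceptual illustration of how `>>` chaining could work (this is not the actual implementation in `diffsynth.core.data`, and `Strip` and `Upper` are toy operators), composition can be sketched with `__rshift__`:

```python
class Operator:
    def __rshift__(self, other):
        # a >> b builds a pipeline that applies a, then b
        return Pipeline(self, other)

class Pipeline(Operator):
    def __init__(self, first, second):
        self.first, self.second = first, second

    def __call__(self, data):
        return self.second(self.first(data))

class Strip(Operator):  # toy operator for illustration
    def __call__(self, data):
        return data.strip()

class Upper(Operator):  # toy operator for illustration
    def __call__(self, data):
        return data.upper()

pipeline = Strip() >> Upper()
print(pipeline("  image.jpg  "))  # IMAGE.JPG
```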
We can compose functionally complete data pipelines, for example, the default video data operator for the universal dataset is:
```python
RouteByType(operator_map=[
    (str, ToAbsolutePath(base_path) >> RouteByExtensionName(operator_map=[
        (("jpg", "jpeg", "png", "webp"), LoadImage() >> ImageCropAndResize(height, width, max_pixels, height_division_factor, width_division_factor) >> ToList()),
        (("gif",), LoadGIF(
            num_frames, time_division_factor, time_division_remainder,
            frame_processor=ImageCropAndResize(height, width, max_pixels, height_division_factor, width_division_factor),
        )),
        (("mp4", "avi", "mov", "wmv", "mkv", "flv", "webm"), LoadVideo(
            num_frames, time_division_factor, time_division_remainder,
            frame_processor=ImageCropAndResize(height, width, max_pixels, height_division_factor, width_division_factor),
        )),
    ])),
])
```
It includes the following logic:
* If the data is of type `str`
    * If it is a `"jpg"`, `"jpeg"`, `"png"`, or `"webp"` file
        * Load the image
        * Crop and resize it to a specific resolution
        * Pack it into a list, treating it as a single-frame video
    * If it is a `"gif"` file
        * Load the GIF content
        * Crop and resize each frame to a specific resolution
    * If it is a `"mp4"`, `"avi"`, `"mov"`, `"wmv"`, `"mkv"`, `"flv"`, or `"webm"` file
        * Load the video content
        * Crop and resize each frame to a specific resolution
* If the data is not of type `str`, an error is raised
## Universal Dataset
`diffsynth.core.data` provides a unified dataset implementation. The dataset requires the following parameters:
* `base_path`: Root directory. If the dataset contains relative paths to image files, this field needs to be filled in to load the files pointed to by these paths
* `metadata_path`: Path to the metadata file, which records the metadata of all samples; supports `csv`, `json`, and `jsonl` formats
* `repeat`: Number of times the data is repeated, defaults to 1; this parameter affects the number of training steps per epoch
* `data_file_keys`: Data field names that need to be loaded, for example `(image, edit_image)`
* `main_data_operator`: Main loading operator, needs to assemble the data processing pipeline through data processing operators
* `special_operator_map`: Special operator mapping, operator mappings built for fields that require special processing
### Metadata
The dataset's `metadata_path` points to a metadata file, supporting `csv`, `json`, `jsonl` formats. The following provides examples:
* `csv` format: High readability, does not support list data, small memory footprint
```csv
image,prompt
image_1.jpg,"a dog"
image_2.jpg,"a cat"
```
* `json` format: High readability, supports list data, large memory footprint
```json
[
    {
        "image": "image_1.jpg",
        "prompt": "a dog"
    },
    {
        "image": "image_2.jpg",
        "prompt": "a cat"
    }
]
```
* `jsonl` format: Low readability, supports list data, small memory footprint
```jsonl
{"image": "image_1.jpg", "prompt": "a dog"}
{"image": "image_2.jpg", "prompt": "a cat"}
```
How should you choose a metadata format?
* If the data volume is large (tens of millions of entries), `json` is unsuitable because parsing a `json` file requires substantial additional memory; use the `csv` or `jsonl` format
* If the dataset contains list data, for example edit models that take multiple images as input, `csv` is unsuitable because it cannot store lists; use the `json` or `jsonl` format
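As a small illustration, `jsonl` metadata can be generated with the standard library alone: one JSON object per line, which can later be parsed line by line with low memory use (the file name below is illustrative):

```python
import json

entries = [
    {"image": "image_1.jpg", "prompt": "a dog"},
    {"image": "image_2.jpg", "prompt": "a cat"},
]

# Write one JSON object per line.
with open("metadata.jsonl", "w", encoding="utf-8") as f:
    for entry in entries:
        f.write(json.dumps(entry) + "\n")

# Read back line by line, without loading the whole file into one object.
with open("metadata.jsonl", encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]
print(loaded[0]["prompt"])  # a dog
```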
### Data Loading Logic
When no additional settings are made, the dataset outputs the raw metadata by default; image and video file paths are output as strings. To load these files, you need to set `data_file_keys`, `main_data_operator`, and `special_operator_map`.
In the data processing flow, processing is done according to the following logic:
* If the field is in `special_operator_map`, call the corresponding operator in `special_operator_map` for processing
* If the field is not in `special_operator_map`
* If the field is in `data_file_keys`, call the `main_data_operator` operator for processing
* If the field is not in `data_file_keys`, no processing is done
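The routing logic above can be sketched as follows (a conceptual illustration with toy operators, not the actual dataset implementation):

```python
def process_sample(sample, data_file_keys, main_data_operator, special_operator_map):
    processed = {}
    for key, value in sample.items():
        if key in special_operator_map:
            # Fields with a dedicated operator are handled by it.
            processed[key] = special_operator_map[key](value)
        elif key in data_file_keys:
            # Fields marked as data files go through the main operator.
            processed[key] = main_data_operator(value)
        else:
            # Everything else is passed through unchanged.
            processed[key] = value
    return processed

sample = {"image": "image_1.jpg", "prompt": "a dog"}
result = process_sample(
    sample,
    data_file_keys=("image",),
    main_data_operator=lambda path: "/data/" + path,  # stand-in for a real pipeline
    special_operator_map={},
)
print(result["image"])  # /data/image_1.jpg
```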
`special_operator_map` can be used to implement special data processing. For example, in the model [Wan-AI/Wan2.2-Animate-14B](https://www.modelscope.cn/models/Wan-AI/Wan2.2-Animate-14B), the input character face video `animate_face_video` is processed at a fixed resolution that differs from the output video, so this field is handled by a dedicated operator:
```python
special_operator_map={
    "animate_face_video": ToAbsolutePath(args.dataset_base_path) >> LoadVideo(args.num_frames, 4, 1, frame_processor=ImageCropAndResize(512, 512, None, 16, 16)),
}
```
### Other Notes
When the data volume is small, you can increase `repeat` appropriately to extend the duration of a single epoch, avoiding the considerable overhead of frequent model saving.
When data volume × `repeat` exceeds $10^9$, we have observed that dataset iteration becomes significantly slower. This appears to be a `PyTorch` bug; we are not sure whether newer versions of `PyTorch` have fixed it.

# `diffsynth.core.gradient`: Gradient Checkpointing and Offload
`diffsynth.core.gradient` provides encapsulated gradient checkpointing and its Offload version for model training.
## Gradient Checkpointing
Gradient checkpointing is a technique used to reduce memory usage during training. We provide an example to help you understand this technique. Here is a simple model structure:
```python
import torch

class ToyModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.activation = torch.nn.Sigmoid()

    def forward(self, x):
        return self.activation(x)

model = ToyModel()
x = torch.randn((2, 3))
y = model(x)
```
In this model structure, the input parameter $x$ passes through the Sigmoid activation function to obtain the output value $y=\frac{1}{1+e^{-x}}$.
During the training process, assuming our loss function value is $\mathcal L$, when backpropagating gradients, we obtain $\frac{\partial \mathcal L}{\partial y}$. At this point, we need to calculate $\frac{\partial \mathcal L}{\partial x}$. It's not difficult to find that $\frac{\partial y}{\partial x}=y(1-y)$, and thus $\frac{\partial \mathcal L}{\partial x}=\frac{\partial \mathcal L}{\partial y}\frac{\partial y}{\partial x}=\frac{\partial \mathcal L}{\partial y}y(1-y)$. If we save the value of $y$ during the model's forward propagation and directly compute $y(1-y)$ during gradient backpropagation, this will avoid complex exp computations, speeding up the calculation. However, this requires additional memory to store the intermediate variable $y$.
When gradient checkpointing is disabled, the training framework stores all intermediate variables that assist gradient computation by default, achieving optimal computation speed. When gradient checkpointing is enabled, intermediate variables are not stored (the input parameter $x$ still is), reducing memory usage; during backpropagation, these variables must be recomputed, which slows down the calculation.
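For reference, PyTorch exposes the same technique through `torch.utils.checkpoint`; a minimal sketch with the Sigmoid activation from above:

```python
import torch
from torch.utils.checkpoint import checkpoint

layer = torch.nn.Sigmoid()
x = torch.randn(2, 3, requires_grad=True)

# The forward result is identical; with checkpointing, intermediate
# activations are recomputed during backward instead of being stored.
y_plain = layer(x)
y_ckpt = checkpoint(layer, x, use_reentrant=False)
print(torch.allclose(y_plain, y_ckpt))
```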
## Enabling Gradient Checkpointing and Its Offload
`gradient_checkpoint_forward` in `diffsynth.core.gradient` implements gradient checkpointing and its Offload. Refer to the following code for calling:
```python
import torch
from diffsynth.core.gradient import gradient_checkpoint_forward

class ToyModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.activation = torch.nn.Sigmoid()

    def forward(self, x):
        return self.activation(x)

model = ToyModel()
x = torch.randn((2, 3))
y = gradient_checkpoint_forward(
    model,
    use_gradient_checkpointing=True,
    use_gradient_checkpointing_offload=False,
    x=x,
)
```
* When `use_gradient_checkpointing=False` and `use_gradient_checkpointing_offload=False`, the computation process is exactly the same as the original computation, not affecting the model's inference and training. You can directly integrate it into your code.
* When `use_gradient_checkpointing=True` and `use_gradient_checkpointing_offload=False`, gradient checkpointing is enabled.
* When `use_gradient_checkpointing_offload=True`, gradient checkpointing is enabled and the inputs of every gradient checkpoint are offloaded to CPU memory, further reducing VRAM usage while slowing down computation further.
## Best Practices
> Q: Where should gradient checkpointing be enabled?
>
> A: When enabling gradient checkpointing for the entire model, computational efficiency and memory usage are not optimal. We need to set fine-grained gradient checkpoints, but we don't want to add too much complicated code to the framework. Therefore, we recommend implementing it in the `model_fn` of `Pipeline`, for example, `model_fn_qwen_image` in `diffsynth/pipelines/qwen_image.py`, enabling gradient checkpointing at the Block level without modifying any code in the model structure.
> Q: When should gradient checkpointing be enabled?
>
> A: As model parameters become increasingly large, gradient checkpointing has become a necessary training technique. Gradient checkpointing usually needs to be enabled. Gradient checkpointing Offload should only be enabled in models where activation values occupy excessive memory (such as video generation models).

# `diffsynth.core.loader`: Model Download and Loading
This document introduces the model download and loading functionalities in `diffsynth.core.loader`.
## ModelConfig
`ModelConfig` in `diffsynth.core.loader` is used to annotate model download sources, local paths, VRAM management configurations, and other information.
### Downloading and Loading Models from Remote Sources
Taking the model [DiffSynth-Studio/Qwen-Image-Blockwise-ControlNet-Canny](https://www.modelscope.cn/models/DiffSynth-Studio/Qwen-Image-Blockwise-ControlNet-Canny) as an example, after filling in `model_id` and `origin_file_pattern` in `ModelConfig`, the model can be automatically downloaded. By default, it downloads to the `./models` path, which can be modified through the [environment variable DIFFSYNTH_MODEL_BASE_PATH](/docs/en/Pipeline_Usage/Environment_Variables.md#diffsynth_model_base_path).
By default, even if the model has already been downloaded, the program will still query the remote for any missing files. To completely disable remote requests, set the [environment variable DIFFSYNTH_SKIP_DOWNLOAD](/docs/en/Pipeline_Usage/Environment_Variables.md#diffsynth_skip_download) to `True`.
```python
from diffsynth.core import ModelConfig
config = ModelConfig(
    model_id="DiffSynth-Studio/Qwen-Image-Blockwise-ControlNet-Canny",
    origin_file_pattern="model.safetensors",
)
# Download models
config.download_if_necessary()
print(config.path)
```
After calling `download_if_necessary`, the model is automatically downloaded and the resulting path is stored in `config.path`.
### Loading Models from Local Paths
If loading models from local paths, you need to fill in `path`:
```python
from diffsynth.core import ModelConfig
config = ModelConfig(path="models/DiffSynth-Studio/Qwen-Image-Blockwise-ControlNet-Canny/model.safetensors")
```
If the model contains multiple shard files, input them in list form:
```python
from diffsynth.core import ModelConfig
config = ModelConfig(path=[
    "models/Qwen/Qwen-Image/text_encoder/model-00001-of-00004.safetensors",
    "models/Qwen/Qwen-Image/text_encoder/model-00002-of-00004.safetensors",
    "models/Qwen/Qwen-Image/text_encoder/model-00003-of-00004.safetensors",
    "models/Qwen/Qwen-Image/text_encoder/model-00004-of-00004.safetensors",
])
```
### VRAM Management Configuration
`ModelConfig` also contains VRAM management configuration information. See [VRAM Management](/docs/en/Pipeline_Usage/VRAM_management.md#more-usage-methods) for details.
## Model File Loading
`diffsynth.core.loader` provides a unified `load_state_dict` for loading state dicts from model files.
Loading a single model file:
```python
from diffsynth.core import load_state_dict
state_dict = load_state_dict("models/DiffSynth-Studio/Qwen-Image-Blockwise-ControlNet-Canny/model.safetensors")
```
Loading multiple model files (merged into one state dict):
```python
from diffsynth.core import load_state_dict
state_dict = load_state_dict([
    "models/Qwen/Qwen-Image/text_encoder/model-00001-of-00004.safetensors",
    "models/Qwen/Qwen-Image/text_encoder/model-00002-of-00004.safetensors",
    "models/Qwen/Qwen-Image/text_encoder/model-00003-of-00004.safetensors",
    "models/Qwen/Qwen-Image/text_encoder/model-00004-of-00004.safetensors",
])
```
## Model Hash
Model hash is used to determine the model type. The hash value can be obtained through `hash_model_file`:
```python
from diffsynth.core import hash_model_file
print(hash_model_file("models/DiffSynth-Studio/Qwen-Image-Blockwise-ControlNet-Canny/model.safetensors"))
```
The hash of multiple model files can also be calculated; this is equivalent to computing the model hash after merging their state dicts:
```python
from diffsynth.core import hash_model_file
print(hash_model_file([
    "models/Qwen/Qwen-Image/text_encoder/model-00001-of-00004.safetensors",
    "models/Qwen/Qwen-Image/text_encoder/model-00002-of-00004.safetensors",
    "models/Qwen/Qwen-Image/text_encoder/model-00003-of-00004.safetensors",
    "models/Qwen/Qwen-Image/text_encoder/model-00004-of-00004.safetensors",
]))
```
The model hash depends only on the keys and tensor shapes in the state dict of the model file; it is unrelated to the parameter values, file save time, and other information. For files in the `.safetensors` format, `hash_model_file` completes almost instantly without reading the model parameters. However, for `.bin`, `.pth`, `.ckpt`, and other binary files, all model parameters must be read, so **we do not recommend continuing to use these file formats.**
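To illustrate the idea of a structure-only hash (this is not the exact algorithm used by `hash_model_file`, only a sketch of the concept), one can hash the keys and shapes while ignoring parameter values:

```python
import hashlib

def structural_hash(key_to_shape):
    # Depends only on keys and shapes, not on parameter values.
    items = sorted(key_to_shape.items())
    return hashlib.sha256(repr(items).encode("utf-8")).hexdigest()[:16]

# Two "models" with identical structure hash to the same value,
# regardless of what the parameters contain.
a = {"img_in.weight": (1536, 64), "img_in.bias": (1536,)}
b = {"img_in.weight": (1536, 64), "img_in.bias": (1536,)}
print(structural_hash(a) == structural_hash(b))  # True
```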
By [writing a model config](/docs/en/Developer_Guide/Integrating_Your_Model.md#step-3-writing-model-config) and filling the model hash and other information into `diffsynth/configs/model_configs.py`, developers can let `DiffSynth-Studio` automatically identify the model type and load it.
## Model Loading
`load_model` is the external entry point for loading models in `diffsynth.core.loader`. It performs the following steps:
* Calls [skip_model_initialization](/docs/en/API_Reference/core/vram.md#skipping-model-parameter-initialization) to skip model parameter initialization
* If [Disk Offload](/docs/en/Pipeline_Usage/VRAM_management.md#disk-offload) is enabled, calls [DiskMap](/docs/en/API_Reference/core/vram.md#state-dict-disk-mapping) for lazy loading; otherwise, calls [load_state_dict](#model-file-loading) to load the model parameters
* If necessary, calls the [state dict converter](/docs/en/Developer_Guide/Integrating_Your_Model.md#step-2-model-file-format-conversion) for model format conversion
* Finally, calls `model.eval()` to switch the model to inference mode
Here is a usage example with Disk Offload enabled:
```python
from diffsynth.core import load_model, enable_vram_management, AutoWrappedLinear, AutoWrappedModule
from diffsynth.models.qwen_image_dit import QwenImageDiT, RMSNorm
import torch

prefix = "models/Qwen/Qwen-Image/transformer/diffusion_pytorch_model"
model_path = [prefix + f"-0000{i}-of-00009.safetensors" for i in range(1, 10)]
model = load_model(
    QwenImageDiT,
    model_path,
    module_map={
        torch.nn.Linear: AutoWrappedLinear,
        RMSNorm: AutoWrappedModule,
    },
    vram_config={
        "offload_dtype": "disk",
        "offload_device": "disk",
        "onload_dtype": "disk",
        "onload_device": "disk",
        "preparing_dtype": torch.bfloat16,
        "preparing_device": "cuda",
        "computation_dtype": torch.bfloat16,
        "computation_device": "cuda",
    },
    vram_limit=0,
)
```

# `diffsynth.core.vram`: VRAM Management
This document introduces the underlying VRAM management functionalities in `diffsynth.core.vram`. If you wish to use these functionalities in other codebases, you can refer to this document.
## Skipping Model Parameter Initialization
When a model is created in `PyTorch`, its parameters occupy VRAM or memory and are initialized by default, but these values are overwritten as soon as the pretrained weights are loaded, making the initialization redundant. `PyTorch` does not provide an interface to skip this redundant computation, so we provide `skip_model_initialization` in `diffsynth.core.vram` to skip model parameter initialization.
Default model loading approach:
```python
from diffsynth.core import load_state_dict
from diffsynth.models.qwen_image_controlnet import QwenImageBlockWiseControlNet
model = QwenImageBlockWiseControlNet() # Slow
path = "models/DiffSynth-Studio/Qwen-Image-Blockwise-ControlNet-Canny/model.safetensors"
state_dict = load_state_dict(path, device="cpu")
model.load_state_dict(state_dict, assign=True)
```
Model loading approach that skips parameter initialization:
```python
from diffsynth.core import load_state_dict, skip_model_initialization
from diffsynth.models.qwen_image_controlnet import QwenImageBlockWiseControlNet
with skip_model_initialization():
    model = QwenImageBlockWiseControlNet() # Fast
path = "models/DiffSynth-Studio/Qwen-Image-Blockwise-ControlNet-Canny/model.safetensors"
state_dict = load_state_dict(path, device="cpu")
model.load_state_dict(state_dict, assign=True)
```
In `DiffSynth-Studio`, all pretrained models follow this loading logic. After developers [integrate a model](/docs/en/Developer_Guide/Integrating_Your_Model.md), it can be loaded quickly in the same way.
## State Dict Disk Mapping
For pretrained weight files of a model, if we only need to read a set of parameters rather than all parameters, State Dict Disk Mapping can accelerate this process. We provide `DiskMap` in `diffsynth.core.vram` for on-demand loading of model parameters.
Default weight loading approach:
```python
from diffsynth.core import load_state_dict
path = "models/DiffSynth-Studio/Qwen-Image-Blockwise-ControlNet-Canny/model.safetensors"
state_dict = load_state_dict(path, device="cpu") # Slow
print(state_dict["img_in.weight"])
```
Using `DiskMap` to load only specific parameters:
```python
from diffsynth.core import DiskMap
path = "models/DiffSynth-Studio/Qwen-Image-Blockwise-ControlNet-Canny/model.safetensors"
state_dict = DiskMap(path, device="cpu") # Fast
print(state_dict["img_in.weight"])
```
`DiskMap` is the basic component of Disk Offload in `DiffSynth-Studio`. After developers [configure fine-grained VRAM management schemes](/docs/en/Developer_Guide/Enabling_VRAM_management.md), they can directly enable Disk Offload.
`DiskMap` is implemented using characteristics of the `.safetensors` file format. When using `.bin`, `.pth`, `.ckpt`, and other binary files, model parameters are loaded in full, so Disk Offload does not support these formats. **We do not recommend continuing to use these file formats.**
## Replaceable Modules for VRAM Management
When `DiffSynth-Studio`'s VRAM management is enabled, the modules inside the model are replaced with replaceable modules from `diffsynth.core.vram.layers`. For usage, see [Fine-grained VRAM Management Scheme](/docs/en/Developer_Guide/Enabling_VRAM_management.md#writing-fine-grained-vram-management-schemes).