Merge pull request #1169 from Feng0w0/sample_add

Docs:Supplement NPU training script samples and documentation instruction
This commit is contained in:
Zhongjie Duan
2026-01-12 10:08:38 +08:00
committed by GitHub
13 changed files with 286 additions and 9 deletions

View File

@@ -13,7 +13,7 @@ All sample code provided by this project supports NVIDIA GPUs by default, requir
AMD provides PyTorch packages based on ROCm, so most models can run without code changes. A small number of models may not be compatible due to their reliance on CUDA-specific instructions.
## Ascend NPU
### Inference
When using Ascend NPU, you need to replace `"cuda"` with `"npu"` in your code.
For example, here is the inference code for **Wan2.1-T2V-1.3B**, modified for Ascend NPU:
@@ -22,6 +22,7 @@ For example, here is the inference code for **Wan2.1-T2V-1.3B**, modified for As
import torch
from diffsynth.utils.data import save_video, VideoData
from diffsynth.pipelines.wan_video import WanVideoPipeline, ModelConfig
from diffsynth.core.device.npu_compatible_device import get_device_name
vram_config = {
"offload_dtype": "disk",
@@ -46,7 +47,7 @@ pipe = WanVideoPipeline.from_pretrained(
],
tokenizer_config=ModelConfig(model_id="Wan-AI/Wan2.1-T2V-1.3B", origin_file_pattern="google/umt5-xxl/"),
- vram_limit=torch.cuda.mem_get_info("cuda")[1] / (1024 ** 3) - 2,
+ vram_limit=torch.npu.mem_get_info("npu:0")[1] / (1024 ** 3) - 2,
+ vram_limit=torch.npu.mem_get_info(get_device_name())[1] / (1024 ** 3) - 2,
)
video = pipe(
@@ -56,3 +57,28 @@ video = pipe(
)
save_video(video, "video.mp4", fps=15, quality=5)
```
### Training
NPU startup script samples have been added for each type of model,the scripts are stored in the `examples/xxx/special/npu_scripts`, for example `examples/wanvideo/model_training/special/npu_scripts/Wan2.2-T2V-A14B-NPU.sh`.
In the NPU training scripts, NPU specific environment variables that can optimize performance have been added, and relevant parameters have been enabled for specific models.
#### Environment variables
```shell
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
```
`expandable_segments:<value>`: Enable the memory pool expansion segment function, which is the virtual memory feature.
```shell
export CPU_AFFINITY_CONF=1
```
Set 0 or not set: indicates not enabling the binding function
1: Indicates enabling coarse-grained kernel binding
2: Indicates enabling fine-grained kernel binding
#### Parameters for specific models
| Model | Parameter | Note |
|----------------|---------------------------|-------------------|
| Wan 14B series | --initialize_model_on_cpu | The 14B model needs to be initialized on the CPU |

View File

@@ -13,7 +13,7 @@
AMD 提供了基于 ROCm 的 torch 包,所以大多数模型无需修改代码即可运行,少数模型由于依赖特定的 cuda 指令无法运行。
## Ascend NPU
### 推理
使用 Ascend NPU 时,需把代码中的 `"cuda"` 改为 `"npu"`
例如Wan2.1-T2V-1.3B 的推理代码:
@@ -22,6 +22,7 @@ AMD 提供了基于 ROCm 的 torch 包,所以大多数模型无需修改代码
import torch
from diffsynth.utils.data import save_video, VideoData
from diffsynth.pipelines.wan_video import WanVideoPipeline, ModelConfig
from diffsynth.core.device.npu_compatible_device import get_device_name
vram_config = {
"offload_dtype": "disk",
@@ -33,7 +34,7 @@ vram_config = {
+ "preparing_device": "npu",
"computation_dtype": torch.bfloat16,
- "computation_device": "cuda",
+ "preparing_device": "npu",
+ "computation_device": "npu",
}
pipe = WanVideoPipeline.from_pretrained(
torch_dtype=torch.bfloat16,
@@ -46,7 +47,7 @@ pipe = WanVideoPipeline.from_pretrained(
],
tokenizer_config=ModelConfig(model_id="Wan-AI/Wan2.1-T2V-1.3B", origin_file_pattern="google/umt5-xxl/"),
- vram_limit=torch.cuda.mem_get_info("cuda")[1] / (1024 ** 3) - 2,
+ vram_limit=torch.npu.mem_get_info("npu:0")[1] / (1024 ** 3) - 2,
+ vram_limit=torch.npu.mem_get_info(get_device_name())[1] / (1024 ** 3) - 2,
)
video = pipe(
@@ -56,3 +57,28 @@ video = pipe(
)
save_video(video, "video.mp4", fps=15, quality=5)
```
### 训练
当前已为每类模型添加NPU的启动脚本样例脚本存放在`examples/xxx/special/npu_scripts`目录下,例如 `examples/wanvideo/model_training/special/npu_scripts/Wan2.2-T2V-A14B-NPU.sh`
在NPU训练脚本中添加了可以优化性能的NPU特有环境变量并针对特定模型开启了相关参数。
#### 环境变量
```shell
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
```
`expandable_segments:<value>`: 使能内存池扩展段功能,即虚拟内存特征。
```shell
export CPU_AFFINITY_CONF=1
```
设置0或未设置: 表示不启用绑核功能
1: 表示开启粗粒度绑核
2: 表示开启细粒度绑核
#### 特定模型需要开启的参数
| 模型 | 参数 | 备注 |
|-----------|------|-------------------|
| Wan 14B系列 | --initialize_model_on_cpu | 14B模型需要在cpu上进行初始化 |