support z-image-omni-base

2026-04-08 08:58:20 +00:00 · 2026-01-05 14:03:15 +08:00
128 changed files with 369 additions and 3538 deletions
--- a/.github/workflows/publish.yaml
+++ b/.github/workflows/publish.yaml
@@ -22,7 +22,7 @@ jobs:
      - name: Install wheel
        run: pip install wheel==0.44.0 && pip install -r requirements.txt
      - name: Build DiffSynth
-        run: python -m build
+        run: python setup.py sdist bdist_wheel
      - name: Publish package to PyPI
        run: |
          pip install twine
--- a/README.md
+++ b/README.md
@@ -33,12 +33,6 @@ We believe that a well-developed open-source code framework can lower the thresh

 > Currently, the development personnel of this project are limited, with most of the work handled by [Artiprocher](https://github.com/Artiprocher). Therefore, the progress of new feature development will be relatively slow, and the speed of responding to and resolving issues is limited. We apologize for this and ask developers to understand.

- **January 19, 2026**: Added support for [FLUX.2-klein-4B](https://modelscope.cn/models/black-forest-labs/FLUX.2-klein-4B) and [FLUX.2-klein-9B](https://modelscope.cn/models/black-forest-labs/FLUX.2-klein-9B) models, including training and inference capabilities. [Documentation](/docs/en/Model_Details/FLUX2.md) and [example code](/examples/flux2/) are now available.
-
- **January 12, 2026**: We trained and open-sourced a text-guided image layer separation model ([Model Link](https://modelscope.cn/models/DiffSynth-Studio/Qwen-Image-Layered-Control)). Given an input image and a textual description, the model isolates the image layer corresponding to the described content. For more details, please refer to our blog post ([Chinese version](https://modelscope.cn/learn/4938), [English version](https://huggingface.co/blog/kelseye/qwen-image-layered-control)).
-
- **December 24, 2025**: Based on Qwen-Image-Edit-2511, we trained an In-Context Editing LoRA model ([Model Link](https://modelscope.cn/models/DiffSynth-Studio/Qwen-Image-Edit-2511-ICEdit-LoRA)). This model takes three images as input (Image A, Image B, and Image C), and automatically analyzes the transformation from Image A to Image B, then applies the same transformation to Image C to generate Image D. For more details, please refer to our blog post ([Chinese version](https://mp.weixin.qq.com/s/41aEiN3lXKGCJs1-we4Q2g), [English version](https://huggingface.co/blog/kelseye/qwen-image-edit-2511-icedit-lora)).
-
 - **December 9, 2025** We release a wild model based on DiffSynth-Studio 2.0: [Qwen-Image-i2L](https://www.modelscope.cn/models/DiffSynth-Studio/Qwen-Image-i2L) (Image-to-LoRA). This model takes an image as input and outputs a LoRA. Although this version still has significant room for improvement in terms of generalization, detail preservation, and other aspects, we are open-sourcing these models to inspire more innovative research. For more details, please refer to our [blog](https://huggingface.co/blog/kelseye/qwen-image-i2l).

 - **December 4, 2025** DiffSynth-Studio 2.0 released! Many new features online
@@ -321,13 +315,9 @@ image.save("image.jpg")

 Example code for FLUX.2 is available at: [/examples/flux2/](/examples/flux2/)

-| Model ID | Inference | Low-VRAM Inference | Full Training | Full Training Validation | LoRA Training | LoRA Training Validation |
-|-|-|-|-|-|-|-|
-|[black-forest-labs/FLUX.2-dev](https://www.modelscope.cn/models/black-forest-labs/FLUX.2-dev)|[code](/examples/flux2/model_inference/FLUX.2-dev.py)|[code](/examples/flux2/model_inference_low_vram/FLUX.2-dev.py)|-|-|[code](/examples/flux2/model_training/lora/FLUX.2-dev.sh)|[code](/examples/flux2/model_training/validate_lora/FLUX.2-dev.py)|
-|[black-forest-labs/FLUX.2-klein-4B](https://www.modelscope.cn/models/black-forest-labs/FLUX.2-klein-4B)|[code](/examples/flux2/model_inference/FLUX.2-klein-4B.py)|[code](/examples/flux2/model_inference_low_vram/FLUX.2-klein-4B.py)|[code](/examples/flux2/model_training/full/FLUX.2-klein-4B.sh)|[code](/examples/flux2/model_training/validate_full/FLUX.2-klein-4B.py)|[code](/examples/flux2/model_training/lora/FLUX.2-klein-4B.sh)|[code](/examples/flux2/model_training/validate_lora/FLUX.2-klein-4B.py)|
-|[black-forest-labs/FLUX.2-klein-9B](https://www.modelscope.cn/models/black-forest-labs/FLUX.2-klein-9B)|[code](/examples/flux2/model_inference/FLUX.2-klein-9B.py)|[code](/examples/flux2/model_inference_low_vram/FLUX.2-klein-9B.py)|[code](/examples/flux2/model_training/full/FLUX.2-klein-9B.sh)|[code](/examples/flux2/model_training/validate_full/FLUX.2-klein-9B.py)|[code](/examples/flux2/model_training/lora/FLUX.2-klein-9B.sh)|[code](/examples/flux2/model_training/validate_lora/FLUX.2-klein-9B.py)|
-|[black-forest-labs/FLUX.2-klein-base-4B](https://www.modelscope.cn/models/black-forest-labs/FLUX.2-klein-base-4B)|[code](/examples/flux2/model_inference/FLUX.2-klein-base-4B.py)|[code](/examples/flux2/model_inference_low_vram/FLUX.2-klein-base-4B.py)|[code](/examples/flux2/model_training/full/FLUX.2-klein-base-4B.sh)|[code](/examples/flux2/model_training/validate_full/FLUX.2-klein-base-4B.py)|[code](/examples/flux2/model_training/lora/FLUX.2-klein-base-4B.sh)|[code](/examples/flux2/model_training/validate_lora/FLUX.2-klein-base-4B.py)|
-|[black-forest-labs/FLUX.2-klein-base-9B](https://www.modelscope.cn/models/black-forest-labs/FLUX.2-klein-base-9B)|[code](/examples/flux2/model_inference/FLUX.2-klein-base-9B.py)|[code](/examples/flux2/model_inference_low_vram/FLUX.2-klein-base-9B.py)|[code](/examples/flux2/model_training/full/FLUX.2-klein-base-9B.sh)|[code](/examples/flux2/model_training/validate_full/FLUX.2-klein-base-9B.py)|[code](/examples/flux2/model_training/lora/FLUX.2-klein-base-9B.sh)|[code](/examples/flux2/model_training/validate_lora/FLUX.2-klein-base-9B.py)|
+| Model ID | Inference | Low-VRAM Inference | LoRA Training | LoRA Training Validation |
+|-|-|-|-|-|
+|[black-forest-labs/FLUX.2-dev](https://www.modelscope.cn/models/black-forest-labs/FLUX.2-dev)|[code](/examples/flux2/model_inference/FLUX.2-dev.py)|[code](/examples/flux2/model_inference_low_vram/FLUX.2-dev.py)|[code](/examples/flux2/model_training/lora/FLUX.2-dev.sh)|[code](/examples/flux2/model_training/validate_lora/FLUX.2-dev.py)|

 </details>

@@ -411,7 +401,6 @@ Example code for Qwen-Image is available at: [/examples/qwen_image/](/examples/q
 |[Qwen/Qwen-Image-Edit-2509](https://www.modelscope.cn/models/Qwen/Qwen-Image-Edit-2509)|[code](/examples/qwen_image/model_inference/Qwen-Image-Edit-2509.py)|[code](/examples/qwen_image/model_inference_low_vram/Qwen-Image-Edit-2509.py)|[code](/examples/qwen_image/model_training/full/Qwen-Image-Edit-2509.sh)|[code](/examples/qwen_image/model_training/validate_full/Qwen-Image-Edit-2509.py)|[code](/examples/qwen_image/model_training/lora/Qwen-Image-Edit-2509.sh)|[code](/examples/qwen_image/model_training/validate_lora/Qwen-Image-Edit-2509.py)|
 |[Qwen/Qwen-Image-Edit-2511](https://www.modelscope.cn/models/Qwen/Qwen-Image-Edit-2511)|[code](/examples/qwen_image/model_inference/Qwen-Image-Edit-2511.py)|[code](/examples/qwen_image/model_inference_low_vram/Qwen-Image-Edit-2511.py)|[code](/examples/qwen_image/model_training/full/Qwen-Image-Edit-2511.sh)|[code](/examples/qwen_image/model_training/validate_full/Qwen-Image-Edit-2511.py)|[code](/examples/qwen_image/model_training/lora/Qwen-Image-Edit-2511.sh)|[code](/examples/qwen_image/model_training/validate_lora/Qwen-Image-Edit-2511.py)|
 |[Qwen/Qwen-Image-Layered](https://www.modelscope.cn/models/Qwen/Qwen-Image-Layered)|[code](/examples/qwen_image/model_inference/Qwen-Image-Layered.py)|[code](/examples/qwen_image/model_inference_low_vram/Qwen-Image-Layered.py)|[code](/examples/qwen_image/model_training/full/Qwen-Image-Layered.sh)|[code](/examples/qwen_image/model_training/validate_full/Qwen-Image-Layered.py)|[code](/examples/qwen_image/model_training/lora/Qwen-Image-Layered.sh)|[code](/examples/qwen_image/model_training/validate_lora/Qwen-Image-Layered.py)|
-|[DiffSynth-Studio/Qwen-Image-Layered-Control](https://www.modelscope.cn/models/DiffSynth-Studio/Qwen-Image-Layered-Control)|[code](/examples/qwen_image/model_inference/Qwen-Image-Layered-Control.py)|[code](/examples/qwen_image/model_inference_low_vram/Qwen-Image-Layered-Control.py)|[code](/examples/qwen_image/model_training/full/Qwen-Image-Layered-Control.sh)|[code](/examples/qwen_image/model_training/validate_full/Qwen-Image-Layered-Control.py)|[code](/examples/qwen_image/model_training/lora/Qwen-Image-Layered-Control.sh)|[code](/examples/qwen_image/model_training/validate_lora/Qwen-Image-Layered-Control.py)|
 |[DiffSynth-Studio/Qwen-Image-EliGen](https://www.modelscope.cn/models/DiffSynth-Studio/Qwen-Image-EliGen)|[code](/examples/qwen_image/model_inference/Qwen-Image-EliGen.py)|[code](/examples/qwen_image/model_inference_low_vram/Qwen-Image-EliGen.py)|-|-|[code](/examples/qwen_image/model_training/lora/Qwen-Image-EliGen.sh)|[code](/examples/qwen_image/model_training/validate_lora/Qwen-Image-EliGen.py)|
 |[DiffSynth-Studio/Qwen-Image-EliGen-V2](https://www.modelscope.cn/models/DiffSynth-Studio/Qwen-Image-EliGen-V2)|[code](/examples/qwen_image/model_inference/Qwen-Image-EliGen-V2.py)|[code](/examples/qwen_image/model_inference_low_vram/Qwen-Image-EliGen-V2.py)|-|-|[code](/examples/qwen_image/model_training/lora/Qwen-Image-EliGen.sh)|[code](/examples/qwen_image/model_training/validate_lora/Qwen-Image-EliGen.py)|
 |[DiffSynth-Studio/Qwen-Image-EliGen-Poster](https://www.modelscope.cn/models/DiffSynth-Studio/Qwen-Image-EliGen-Poster)|[code](/examples/qwen_image/model_inference/Qwen-Image-EliGen-Poster.py)|[code](/examples/qwen_image/model_inference_low_vram/Qwen-Image-EliGen-Poster.py)|-|-|[code](/examples/qwen_image/model_training/lora/Qwen-Image-EliGen-Poster.sh)|[code](/examples/qwen_image/model_training/validate_lora/Qwen-Image-EliGen-Poster.py)|
@@ -780,3 +769,4 @@ https://github.com/Artiprocher/DiffSynth-Studio/assets/35051019/b54c05c5-d747-47
 https://github.com/Artiprocher/DiffSynth-Studio/assets/35051019/59fb2f7b-8de0-4481-b79f-0c3a7361a1ea

 </details>
+
--- a/README_zh.md
+++ b/README_zh.md
@@ -33,12 +33,6 @@ DiffSynth 目前包括两个开源项目：

 > 目前本项目的开发人员有限，大部分工作由 [Artiprocher](https://github.com/Artiprocher) 负责，因此新功能的开发进展会比较缓慢，issue 的回复和解决速度有限，我们对此感到非常抱歉，请各位开发者理解。

- **2026年1月19日** 新增对 [FLUX.2-klein-4B](https://modelscope.cn/models/black-forest-labs/FLUX.2-klein-4B) 和 [FLUX.2-klein-9B](https://modelscope.cn/models/black-forest-labs/FLUX.2-klein-9B) 模型的支持，包括完整的训练和推理功能。[文档](/docs/zh/Model_Details/FLUX2.md)和[示例代码](/examples/flux2/)现已可用。
-
- **2026年1月12日** 我们训练并开源了一个文本引导的图层拆分模型（[模型链接](https://modelscope.cn/models/DiffSynth-Studio/Qwen-Image-Layered-Control)），这一模型输入一张图与一段文本描述，模型会将图像中与文本描述相关的图层拆分出来。更多细节请阅读我们的 blog（[中文版](https://modelscope.cn/learn/4938)、[英文版](https://huggingface.co/blog/kelseye/qwen-image-layered-control)）。
-
- **2025年12月24日** 我们基于 Qwen-Image-Edit-2511 训练了一个 In-Context Editing LoRA 模型（[模型链接](https://modelscope.cn/models/DiffSynth-Studio/Qwen-Image-Edit-2511-ICEdit-LoRA)），这个模型可以输入三张图：图A、图B、图C，模型会自行分析图A到图B的变化，并将这样的变化应用到图C，生成图D。更多细节请阅读我们的 blog（[中文版](https://mp.weixin.qq.com/s/41aEiN3lXKGCJs1-we4Q2g)、[英文版](https://huggingface.co/blog/kelseye/qwen-image-edit-2511-icedit-lora)）。
-
 - **2025年12月9日** 我们基于 DiffSynth-Studio 2.0 训练了一个疯狂的模型：[Qwen-Image-i2L](https://www.modelscope.cn/models/DiffSynth-Studio/Qwen-Image-i2L)（Image to LoRA）。这一模型以图像为输入，以 LoRA 为输出。尽管这个版本的模型在泛化能力、细节保持能力等方面还有很大改进空间，我们将这些模型开源，以启发更多创新性的研究工作。更多细节，请参考我们的 [blog](https://huggingface.co/blog/kelseye/qwen-image-i2l)。

 - **2025年12月4日** DiffSynth-Studio 2.0 发布！众多新功能上线
@@ -321,13 +315,9 @@ image.save("image.jpg")

 FLUX.2 的示例代码位于：[/examples/flux2/](/examples/flux2/)

-|模型 ID|推理|低显存推理|全量训练|全量训练后验证|LoRA 训练|LoRA 训练后验证|
-|-|-|-|-|-|-|-|
-|[black-forest-labs/FLUX.2-dev](https://www.modelscope.cn/models/black-forest-labs/FLUX.2-dev)|[code](/examples/flux2/model_inference/FLUX.2-dev.py)|[code](/examples/flux2/model_inference_low_vram/FLUX.2-dev.py)|-|-|[code](/examples/flux2/model_training/lora/FLUX.2-dev.sh)|[code](/examples/flux2/model_training/validate_lora/FLUX.2-dev.py)|
-|[black-forest-labs/FLUX.2-klein-4B](https://www.modelscope.cn/models/black-forest-labs/FLUX.2-klein-4B)|[code](/examples/flux2/model_inference/FLUX.2-klein-4B.py)|[code](/examples/flux2/model_inference_low_vram/FLUX.2-klein-4B.py)|[code](/examples/flux2/model_training/full/FLUX.2-klein-4B.sh)|[code](/examples/flux2/model_training/validate_full/FLUX.2-klein-4B.py)|[code](/examples/flux2/model_training/lora/FLUX.2-klein-4B.sh)|[code](/examples/flux2/model_training/validate_lora/FLUX.2-klein-4B.py)|
-|[black-forest-labs/FLUX.2-klein-9B](https://www.modelscope.cn/models/black-forest-labs/FLUX.2-klein-9B)|[code](/examples/flux2/model_inference/FLUX.2-klein-9B.py)|[code](/examples/flux2/model_inference_low_vram/FLUX.2-klein-9B.py)|[code](/examples/flux2/model_training/full/FLUX.2-klein-9B.sh)|[code](/examples/flux2/model_training/validate_full/FLUX.2-klein-9B.py)|[code](/examples/flux2/model_training/lora/FLUX.2-klein-9B.sh)|[code](/examples/flux2/model_training/validate_lora/FLUX.2-klein-9B.py)|
-|[black-forest-labs/FLUX.2-klein-base-4B](https://www.modelscope.cn/models/black-forest-labs/FLUX.2-klein-base-4B)|[code](/examples/flux2/model_inference/FLUX.2-klein-base-4B.py)|[code](/examples/flux2/model_inference_low_vram/FLUX.2-klein-base-4B.py)|[code](/examples/flux2/model_training/full/FLUX.2-klein-base-4B.sh)|[code](/examples/flux2/model_training/validate_full/FLUX.2-klein-base-4B.py)|[code](/examples/flux2/model_training/lora/FLUX.2-klein-base-4B.sh)|[code](/examples/flux2/model_training/validate_lora/FLUX.2-klein-base-4B.py)|
-|[black-forest-labs/FLUX.2-klein-base-9B](https://www.modelscope.cn/models/black-forest-labs/FLUX.2-klein-base-9B)|[code](/examples/flux2/model_inference/FLUX.2-klein-base-9B.py)|[code](/examples/flux2/model_inference_low_vram/FLUX.2-klein-base-9B.py)|[code](/examples/flux2/model_training/full/FLUX.2-klein-base-9B.sh)|[code](/examples/flux2/model_training/validate_full/FLUX.2-klein-base-9B.py)|[code](/examples/flux2/model_training/lora/FLUX.2-klein-base-9B.sh)|[code](/examples/flux2/model_training/validate_lora/FLUX.2-klein-base-9B.py)|
+|模型 ID|推理|低显存推理|LoRA 训练|LoRA 训练后验证|
+|-|-|-|-|-|
+|[black-forest-labs/FLUX.2-dev](https://www.modelscope.cn/models/black-forest-labs/FLUX.2-dev)|[code](/examples/flux2/model_inference/FLUX.2-dev.py)|[code](/examples/flux2/model_inference_low_vram/FLUX.2-dev.py)|[code](/examples/flux2/model_training/lora/FLUX.2-dev.sh)|[code](/examples/flux2/model_training/validate_lora/FLUX.2-dev.py)|

 </details>

@@ -411,7 +401,6 @@ Qwen-Image 的示例代码位于：[/examples/qwen_image/](/examples/qwen_image/
 |[Qwen/Qwen-Image-Edit-2509](https://www.modelscope.cn/models/Qwen/Qwen-Image-Edit-2509)|[code](/examples/qwen_image/model_inference/Qwen-Image-Edit-2509.py)|[code](/examples/qwen_image/model_inference_low_vram/Qwen-Image-Edit-2509.py)|[code](/examples/qwen_image/model_training/full/Qwen-Image-Edit-2509.sh)|[code](/examples/qwen_image/model_training/validate_full/Qwen-Image-Edit-2509.py)|[code](/examples/qwen_image/model_training/lora/Qwen-Image-Edit-2509.sh)|[code](/examples/qwen_image/model_training/validate_lora/Qwen-Image-Edit-2509.py)|
 |[Qwen/Qwen-Image-Edit-2511](https://www.modelscope.cn/models/Qwen/Qwen-Image-Edit-2511)|[code](/examples/qwen_image/model_inference/Qwen-Image-Edit-2511.py)|[code](/examples/qwen_image/model_inference_low_vram/Qwen-Image-Edit-2511.py)|[code](/examples/qwen_image/model_training/full/Qwen-Image-Edit-2511.sh)|[code](/examples/qwen_image/model_training/validate_full/Qwen-Image-Edit-2511.py)|[code](/examples/qwen_image/model_training/lora/Qwen-Image-Edit-2511.sh)|[code](/examples/qwen_image/model_training/validate_lora/Qwen-Image-Edit-2511.py)|
 |[Qwen/Qwen-Image-Layered](https://www.modelscope.cn/models/Qwen/Qwen-Image-Layered)|[code](/examples/qwen_image/model_inference/Qwen-Image-Layered.py)|[code](/examples/qwen_image/model_inference_low_vram/Qwen-Image-Layered.py)|[code](/examples/qwen_image/model_training/full/Qwen-Image-Layered.sh)|[code](/examples/qwen_image/model_training/validate_full/Qwen-Image-Layered.py)|[code](/examples/qwen_image/model_training/lora/Qwen-Image-Layered.sh)|[code](/examples/qwen_image/model_training/validate_lora/Qwen-Image-Layered.py)|
-|[DiffSynth-Studio/Qwen-Image-Layered-Control](https://www.modelscope.cn/models/DiffSynth-Studio/Qwen-Image-Layered-Control)|[code](/examples/qwen_image/model_inference/Qwen-Image-Layered-Control.py)|[code](/examples/qwen_image/model_inference_low_vram/Qwen-Image-Layered-Control.py)|[code](/examples/qwen_image/model_training/full/Qwen-Image-Layered-Control.sh)|[code](/examples/qwen_image/model_training/validate_full/Qwen-Image-Layered-Control.py)|[code](/examples/qwen_image/model_training/lora/Qwen-Image-Layered-Control.sh)|[code](/examples/qwen_image/model_training/validate_lora/Qwen-Image-Layered-Control.py)|
 |[DiffSynth-Studio/Qwen-Image-EliGen](https://www.modelscope.cn/models/DiffSynth-Studio/Qwen-Image-EliGen)|[code](/examples/qwen_image/model_inference/Qwen-Image-EliGen.py)|[code](/examples/qwen_image/model_inference_low_vram/Qwen-Image-EliGen.py)|-|-|[code](/examples/qwen_image/model_training/lora/Qwen-Image-EliGen.sh)|[code](/examples/qwen_image/model_training/validate_lora/Qwen-Image-EliGen.py)|
 |[DiffSynth-Studio/Qwen-Image-EliGen-V2](https://www.modelscope.cn/models/DiffSynth-Studio/Qwen-Image-EliGen-V2)|[code](/examples/qwen_image/model_inference/Qwen-Image-EliGen-V2.py)|[code](/examples/qwen_image/model_inference_low_vram/Qwen-Image-EliGen-V2.py)|-|-|[code](/examples/qwen_image/model_training/lora/Qwen-Image-EliGen.sh)|[code](/examples/qwen_image/model_training/validate_lora/Qwen-Image-EliGen.py)|
 |[DiffSynth-Studio/Qwen-Image-EliGen-Poster](https://www.modelscope.cn/models/DiffSynth-Studio/Qwen-Image-EliGen-Poster)|[code](/examples/qwen_image/model_inference/Qwen-Image-EliGen-Poster.py)|[code](/examples/qwen_image/model_inference_low_vram/Qwen-Image-EliGen-Poster.py)|-|-|[code](/examples/qwen_image/model_training/lora/Qwen-Image-EliGen-Poster.sh)|[code](/examples/qwen_image/model_training/validate_lora/Qwen-Image-EliGen-Poster.py)|
--- a/diffsynth/configs/model_configs.py
+++ b/diffsynth/configs/model_configs.py
@@ -317,13 +317,6 @@ flux_series = [
        "model_class": "diffsynth.models.flux_dit.FluxDiT",
        "state_dict_converter": "diffsynth.utils.state_dict_converters.flux_dit.FluxDiTStateDictConverter",
    },
-    {
-        # Supported due to historical reasons.
-        "model_hash": "605c56eab23e9e2af863ad8f0813a25d",
-        "model_name": "flux_dit",
-        "model_class": "diffsynth.models.flux_dit.FluxDiT",
-        "state_dict_converter": "diffsynth.utils.state_dict_converters.flux_dit.FluxDiTStateDictConverterFromDiffusers",
-    },
    {
        # Example: ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="text_encoder/model.safetensors")
        "model_hash": "94eefa3dac9cec93cb1ebaf1747d7b78",
@@ -481,13 +474,6 @@ flux_series = [
        "state_dict_converter": "diffsynth.utils.state_dict_converters.flux_dit.FluxDiTStateDictConverter",
        "extra_kwargs": {"disable_guidance_embedder": True},
    },
-    {
-        # Example: ModelConfig(model_id="MAILAND/majicflus_v1", origin_file_pattern="majicflus_v134.safetensors")
-        "model_hash": "3394f306c4cbf04334b712bf5aaed95f",
-        "model_name": "flux_dit",
-        "model_class": "diffsynth.models.flux_dit.FluxDiT",
-        "state_dict_converter": "diffsynth.utils.state_dict_converters.flux_dit.FluxDiTStateDictConverter",
-    },
 ]

 flux2_series = [
@@ -510,28 +496,6 @@ flux2_series = [
        "model_name": "flux2_vae",
        "model_class": "diffsynth.models.flux2_vae.Flux2VAE",
    },
-    {
-        # Example: ModelConfig(model_id="black-forest-labs/FLUX.2-klein-4B", origin_file_pattern="transformer/*.safetensors")
-        "model_hash": "3bde7b817fec8143028b6825a63180df",
-        "model_name": "flux2_dit",
-        "model_class": "diffsynth.models.flux2_dit.Flux2DiT",
-        "extra_kwargs": {"guidance_embeds": False, "joint_attention_dim": 7680, "num_attention_heads": 24, "num_layers": 5, "num_single_layers": 20}
-    },
-    {
-        # Example: ModelConfig(model_id="black-forest-labs/FLUX.2-klein-9B", origin_file_pattern="text_encoder/*.safetensors")
-        "model_hash": "9195f3ea256fcd0ae6d929c203470754",
-        "model_name": "z_image_text_encoder",
-        "model_class": "diffsynth.models.z_image_text_encoder.ZImageTextEncoder",
-        "extra_kwargs": {"model_size": "8B"},
-        "state_dict_converter": "diffsynth.utils.state_dict_converters.z_image_text_encoder.ZImageTextEncoderStateDictConverter",
-    },
-    {
-        # Example: ModelConfig(model_id="black-forest-labs/FLUX.2-klein-9B", origin_file_pattern="transformer/*.safetensors")
-        "model_hash": "39c6fc48f07bebecedbbaa971ff466c8",
-        "model_name": "flux2_dit",
-        "model_class": "diffsynth.models.flux2_dit.Flux2DiT",
-        "extra_kwargs": {"guidance_embeds": False, "joint_attention_dim": 12288, "num_attention_heads": 32, "num_layers": 8, "num_single_layers": 24}
-    },
 ]

 z_image_series = [
@@ -576,19 +540,6 @@ z_image_series = [
        "model_name": "siglip_vision_model_428m",
        "model_class": "diffsynth.models.siglip2_image_encoder.Siglip2ImageEncoder428M",
    },
-    {
-        # Example: ModelConfig(model_id="PAI/Z-Image-Turbo-Fun-Controlnet-Union-2.1", origin_file_pattern="Z-Image-Turbo-Fun-Controlnet-Union-2.1-8steps.safetensors")
-        "model_hash": "1677708d40029ab380a95f6c731a57d7",
-        "model_name": "z_image_controlnet",
-        "model_class": "diffsynth.models.z_image_controlnet.ZImageControlNet",
-    },
-    {
-        # Example: ???
-        "model_hash": "9510cb8cd1dd34ee0e4f111c24905510",
-        "model_name": "z_image_image2lora_style",
-        "model_class": "diffsynth.models.z_image_image2lora.ZImageImage2LoRAModel",
-        "extra_kwargs": {"compress_dim": 128},
-    },
 ]

 MODEL_CONFIGS = qwen_image_series + wan_series + flux_series + flux2_series + z_image_series
--- a/diffsynth/configs/vram_management_module_maps.py
+++ b/diffsynth/configs/vram_management_module_maps.py
@@ -195,19 +195,4 @@ VRAM_MANAGEMENT_MODULE_MAPS = {
        "torch.nn.Linear": "diffsynth.core.vram.layers.AutoWrappedLinear",
        "diffsynth.models.z_image_dit.RMSNorm": "diffsynth.core.vram.layers.AutoWrappedModule",
    },
-    "diffsynth.models.z_image_controlnet.ZImageControlNet": {
-        "torch.nn.Linear": "diffsynth.core.vram.layers.AutoWrappedLinear",
-        "diffsynth.models.z_image_dit.RMSNorm": "diffsynth.core.vram.layers.AutoWrappedModule",
-    },
-    "diffsynth.models.z_image_image2lora.ZImageImage2LoRAModel": {
-        "torch.nn.Linear": "diffsynth.core.vram.layers.AutoWrappedLinear",
-    },
-    "diffsynth.models.siglip2_image_encoder.Siglip2ImageEncoder428M": {
-        "transformers.models.siglip2.modeling_siglip2.Siglip2VisionEmbeddings": "diffsynth.core.vram.layers.AutoWrappedModule",
-        "transformers.models.siglip2.modeling_siglip2.Siglip2MultiheadAttentionPoolingHead": "diffsynth.core.vram.layers.AutoWrappedModule",
-        "torch.nn.Conv2d": "diffsynth.core.vram.layers.AutoWrappedModule",
-        "torch.nn.Embedding": "diffsynth.core.vram.layers.AutoWrappedModule",
-        "torch.nn.LayerNorm": "diffsynth.core.vram.layers.AutoWrappedModule",
-        "torch.nn.Linear": "diffsynth.core.vram.layers.AutoWrappedLinear",
-    },
 }
--- a/diffsynth/core/data/unified_dataset.py
+++ b/diffsynth/core/data/unified_dataset.py
@@ -10,7 +10,6 @@ class UnifiedDataset(torch.utils.data.Dataset):
        data_file_keys=tuple(),
        main_data_operator=lambda x: x,
        special_operator_map=None,
-        max_data_items=None,
    ):
        self.base_path = base_path
        self.metadata_path = metadata_path
@@ -19,7 +18,6 @@ class UnifiedDataset(torch.utils.data.Dataset):
        self.main_data_operator = main_data_operator
        self.cached_data_operator = LoadTorchPickle()
        self.special_operator_map = {} if special_operator_map is None else special_operator_map
-        self.max_data_items = max_data_items
        self.data = []
        self.cached_data = []
        self.load_from_cache = metadata_path is None
@@ -99,9 +97,7 @@ class UnifiedDataset(torch.utils.data.Dataset):
        return data

    def __len__(self):
-        if self.max_data_items is not None:
-            return self.max_data_items
-        elif self.load_from_cache:
+        if self.load_from_cache:
            return len(self.cached_data) * self.repeat
        else:
            return len(self.data) * self.repeat
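
Note on the hunk above: with `max_data_items` gone, the reported length depends only on the active source list and `repeat`. A minimal sketch of that rule follows, assuming a hypothetical `TinyDataset` stand-in (not the real `UnifiedDataset`, whose constructor takes more arguments than are modeled here):

```python
class TinyDataset:
    """Hypothetical stand-in for UnifiedDataset; only the length logic is modeled."""

    def __init__(self, data, cached_data, repeat=1, load_from_cache=False):
        self.data = data
        self.cached_data = cached_data
        self.repeat = repeat
        self.load_from_cache = load_from_cache

    def __len__(self):
        # Post-change behavior: no max_data_items override,
        # so the length is always (items in the active list) * repeat.
        source = self.cached_data if self.load_from_cache else self.data
        return len(source) * self.repeat


raw = TinyDataset(data=["a", "b", "c"], cached_data=[], repeat=4)
cached = TinyDataset(data=[], cached_data=["x", "y"], repeat=4, load_from_cache=True)
print(len(raw), len(cached))
```

Callers that relied on `max_data_items` to cap an epoch would presumably need to adjust `repeat` or trim the metadata instead.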
--- a/diffsynth/core/device/init.py
+++ b/diffsynth/core/device/init.py
@@ -1,2 +1 @@
-from .npu_compatible_device import parse_device_type, parse_nccl_backend, get_available_device_type, get_device_name
-from .npu_compatible_device import IS_NPU_AVAILABLE, IS_CUDA_AVAILABLE
+from .npu_compatible_device import parse_device_type, parse_nccl_backend, get_available_device_type
--- a/diffsynth/core/loader/config.py
+++ b/diffsynth/core/loader/config.py
@@ -97,7 +97,6 @@ class ModelConfig:
        self.reset_local_model_path()
        if self.require_downloading():
            self.download()
-        if self.path is None:
            if self.origin_file_pattern is None or self.origin_file_pattern == "":
                self.path = os.path.join(self.local_model_path, self.model_id)
            else:
--- a/diffsynth/core/loader/model.py
+++ b/diffsynth/core/loader/model.py
@@ -3,13 +3,14 @@ from ..vram.disk_map import DiskMap
 from ..vram.layers import enable_vram_management
 from .file import load_state_dict
 import torch
-from transformers.integrations import is_deepspeed_zero3_enabled
-from transformers.utils import ContextManagers


-def load_model(model_class, path, config=None, torch_dtype=torch.bfloat16, device="cpu", state_dict_converter=None, use_disk_map=False, module_map=None, vram_config=None, vram_limit=None, state_dict=None):
+def load_model(model_class, path, config=None, torch_dtype=torch.bfloat16, device="cpu", state_dict_converter=None, use_disk_map=False, module_map=None, vram_config=None, vram_limit=None):
    config = {} if config is None else config
-    with ContextManagers(get_init_context(torch_dtype=torch_dtype, device=device)):
+    # Why do we use `skip_model_initialization`?
+    # It skips the random initialization of model parameters,
+    # thereby speeding up model loading and avoiding excessive memory usage.
+    with skip_model_initialization():
        model = model_class(**config)
    # What is `module_map`?
    # This is a module mapping table for VRAM management.
@@ -45,14 +46,7 @@ def load_model(model_class, path, config=None, torch_dtype=torch.bfloat16, devic
            state_dict = state_dict_converter(state_dict)
        else:
            state_dict = {i: state_dict[i] for i in state_dict}
-        # Why does DeepSpeed ZeRO Stage 3 need to be handled separately?
-        # Because at this stage, model parameters are partitioned across multiple GPUs.
-        # Loading them directly could lead to excessive GPU memory consumption.
-        if is_deepspeed_zero3_enabled():
-            from transformers.integrations.deepspeed import _load_state_dict_into_zero3_model
-            _load_state_dict_into_zero3_model(model, state_dict)
-        else:
-            model.load_state_dict(state_dict, assign=True)
+        model.load_state_dict(state_dict, assign=True)
        # Why do we call `to()`?
        # Because some models override the behavior of `to()`,
        # especially those from libraries like Transformers.
@@ -83,20 +77,3 @@ def load_model_with_disk_offload(model_class, path, config=None, torch_dtype=tor
    }
    enable_vram_management(model, module_map, vram_config=vram_config, disk_map=disk_map, vram_limit=80)
    return model
-
-
-def get_init_context(torch_dtype, device):
-    if is_deepspeed_zero3_enabled():
-        from transformers.modeling_utils import set_zero3_state
-        import deepspeed
-        # Why do we use "deepspeed.zero.Init"?
-        # Weight segmentation of the model can be performed on the CPU side
-        # and loading the segmented weights onto the computing card
-        init_contexts = [deepspeed.zero.Init(remote_device=device, dtype=torch_dtype), set_zero3_state()]
-    else:
-        # Why do we use `skip_model_initialization`?
-        # It skips the random initialization of model parameters,
-        # thereby speeding up model loading and avoiding excessive memory usage.
-        init_contexts = [skip_model_initialization()]
-
-    return init_contexts
--- a/diffsynth/core/vram/layers.py
+++ b/diffsynth/core/vram/layers.py
@@ -2,7 +2,7 @@ import torch, copy
 from typing import Union
 from .initialization import skip_model_initialization
 from .disk_map import DiskMap
-from ..device import parse_device_type, get_device_name, IS_NPU_AVAILABLE
+from ..device import parse_device_type


 class AutoTorchModule(torch.nn.Module):
@@ -63,7 +63,7 @@ class AutoTorchModule(torch.nn.Module):
        return r

    def check_free_vram(self):
-        device = self.computation_device if not IS_NPU_AVAILABLE else get_device_name()
+        device = self.computation_device if self.computation_device != "npu" else "npu:0"
        gpu_mem_state = getattr(torch, self.computation_device_type).mem_get_info(device)
        used_memory = (gpu_mem_state[1] - gpu_mem_state[0]) / (1024**3)
        return used_memory < self.vram_limit
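
Note on `check_free_vram`: the arithmetic in the hunk above can be sketched without a GPU. The helper names below (`used_vram_gib`, `has_headroom`, `resolve_npu_device`) are hypothetical and only mirror the expressions shown in the diff; the `(free_bytes, total_bytes)` ordering is the one returned by `torch.cuda.mem_get_info`, and it is an assumption here that the NPU backend follows the same ordering.

```python
GIB = 1024 ** 3


def used_vram_gib(mem_info):
    """Return used memory in GiB from a (free_bytes, total_bytes) pair."""
    free_bytes, total_bytes = mem_info
    # Same as (gpu_mem_state[1] - gpu_mem_state[0]) / (1024**3) in the diff.
    return (total_bytes - free_bytes) / GIB


def has_headroom(mem_info, vram_limit_gib):
    """True while used memory is still below the configured limit."""
    return used_vram_gib(mem_info) < vram_limit_gib


def resolve_npu_device(device):
    # Mirrors the post-change expression: a bare "npu" becomes "npu:0",
    # every other device string passes through unchanged.
    return device if device != "npu" else "npu:0"


state = (6 * GIB, 24 * GIB)  # 6 GiB free out of 24 GiB total
print(used_vram_gib(state), has_headroom(state, 20), resolve_npu_device("npu"))
```

One design consequence worth noting: the string comparison always maps a bare `"npu"` to index 0, whereas the removed `get_device_name()` call may have selected a different device.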
--- a/diffsynth/diffusion/base_pipeline.py
+++ b/diffsynth/diffusion/base_pipeline.py
@@ -4,11 +4,9 @@ import numpy as np
 from einops import repeat, reduce
 from typing import Union
 from ..core import AutoTorchModule, AutoWrappedLinear, load_state_dict, ModelConfig, parse_device_type
-from ..core.device.npu_compatible_device import get_device_type
 from ..utils.lora import GeneralLoRALoader
 from ..models.model_loader import ModelPool
 from ..utils.controlnet import ControlNetInput
-from ..core.device import get_device_name, IS_NPU_AVAILABLE


 class PipelineUnit:
@@ -62,7 +60,7 @@ class BasePipeline(torch.nn.Module):

    def __init__(
        self,
-        device=get_device_type(), torch_dtype=torch.float16,
+        device="cuda", torch_dtype=torch.float16,
        height_division_factor=64, width_division_factor=64,
        time_division_factor=None, time_division_remainder=None,
    ):
@@ -179,7 +177,7 @@ class BasePipeline(torch.nn.Module):

        
    def get_vram(self):
-        device = self.device if not IS_NPU_AVAILABLE else get_device_name()
+        device = self.device if self.device != "npu" else "npu:0"
        return getattr(torch, self.device_type).mem_get_info(device)[1] / (1024 ** 3)
    
    def get_module(self, model, name):
@@ -237,7 +235,6 @@ class BasePipeline(torch.nn.Module):
        alpha=1,
        hotload=None,
        state_dict=None,
-        verbose=1,
    ):
        if state_dict is None:
            if isinstance(lora_config, str):
@@ -264,13 +261,12 @@ class BasePipeline(torch.nn.Module):
                        updated_num += 1
                        module.lora_A_weights.append(lora[lora_a_name] * alpha)
                        module.lora_B_weights.append(lora[lora_b_name])
-            if verbose >= 1:
-                print(f"{updated_num} tensors are patched by LoRA. You can use `pipe.clear_lora()` to clear all LoRA layers.")
+            print(f"{updated_num} tensors are patched by LoRA. You can use `pipe.clear_lora()` to clear all LoRA layers.")
        else:
            lora_loader.fuse_lora_to_base_model(module, lora, alpha=alpha)
            
            
-    def clear_lora(self, verbose=1):
+    def clear_lora(self):
        cleared_num = 0
        for name, module in self.named_modules():
            if isinstance(module, AutoWrappedLinear):
@@ -280,8 +276,7 @@ class BasePipeline(torch.nn.Module):
                    module.lora_A_weights.clear()
                if hasattr(module, "lora_B_weights"):
                    module.lora_B_weights.clear()
-        if verbose >= 1:
-            print(f"{cleared_num} LoRA layers are cleared.")
+        print(f"{cleared_num} LoRA layers are cleared.")
        
    
    def download_and_load_models(self, model_configs: list[ModelConfig] = [], vram_limit: float = None):
@@ -309,13 +304,8 @@ class BasePipeline(torch.nn.Module):
    
    
    def cfg_guided_model_fn(self, model_fn, cfg_scale, inputs_shared, inputs_posi, inputs_nega, **inputs_others):
-        if inputs_shared.get("positive_only_lora", None) is not None:
-            self.clear_lora(verbose=0)
-            self.load_lora(self.dit, state_dict=inputs_shared["positive_only_lora"], verbose=0)
        noise_pred_posi = model_fn(**inputs_posi, **inputs_shared, **inputs_others)
        if cfg_scale != 1.0:
-            if inputs_shared.get("positive_only_lora", None) is not None:
-                self.clear_lora(verbose=0)
            noise_pred_nega = model_fn(**inputs_nega, **inputs_shared, **inputs_others)
            noise_pred = noise_pred_nega + cfg_scale * (noise_pred_posi - noise_pred_nega)
        else:
--- a/diffsynth/diffusion/flow_match.py
+++ b/diffsynth/diffusion/flow_match.py
@@ -89,18 +89,13 @@ class FlowMatchScheduler():
        return float(mu)
    
    @staticmethod
-    def set_timesteps_flux2(num_inference_steps=100, denoising_strength=1.0, dynamic_shift_len=None):
+    def set_timesteps_flux2(num_inference_steps=100, denoising_strength=1.0, dynamic_shift_len=1024//16*1024//16):
        sigma_min = 1 / num_inference_steps
        sigma_max = 1.0
        num_train_timesteps = 1000
        sigma_start = sigma_min + (sigma_max - sigma_min) * denoising_strength
        sigmas = torch.linspace(sigma_start, sigma_min, num_inference_steps)
-        if dynamic_shift_len is None:
-            # If you ask me why I set mu=0.8,
-            # I can only say that it yields better training results.
-            mu = 0.8
-        else:
-            mu = FlowMatchScheduler.compute_empirical_mu(dynamic_shift_len, num_inference_steps)
+        mu = FlowMatchScheduler.compute_empirical_mu(dynamic_shift_len, num_inference_steps)
        sigmas = math.exp(mu) / (math.exp(mu) + (1 / sigmas - 1))
        timesteps = sigmas * num_train_timesteps
        return sigmas, timesteps
--- a/diffsynth/diffusion/logger.py
+++ b/diffsynth/diffusion/logger.py
@@ -10,7 +10,7 @@ class ModelLogger:
        self.num_steps = 0


-    def on_step_end(self, accelerator: Accelerator, model: torch.nn.Module, save_steps=None, **kwargs):
+    def on_step_end(self, accelerator: Accelerator, model: torch.nn.Module, save_steps=None):
        self.num_steps += 1
        if save_steps is not None and self.num_steps % save_steps == 0:
            self.save_model(accelerator, model, f"step-{self.num_steps}.safetensors")
@@ -18,8 +18,8 @@ class ModelLogger:

    def on_epoch_end(self, accelerator: Accelerator, model: torch.nn.Module, epoch_id):
        accelerator.wait_for_everyone()
-        state_dict = accelerator.get_state_dict(model)
        if accelerator.is_main_process:
+            state_dict = accelerator.get_state_dict(model)
            state_dict = accelerator.unwrap_model(model).export_trainable_state_dict(state_dict, remove_prefix=self.remove_prefix_in_ckpt)
            state_dict = self.state_dict_converter(state_dict)
            os.makedirs(self.output_path, exist_ok=True)
@@ -34,8 +34,8 @@ class ModelLogger:

    def save_model(self, accelerator: Accelerator, model: torch.nn.Module, file_name):
        accelerator.wait_for_everyone()
-        state_dict = accelerator.get_state_dict(model)
        if accelerator.is_main_process:
+            state_dict = accelerator.get_state_dict(model)
            state_dict = accelerator.unwrap_model(model).export_trainable_state_dict(state_dict, remove_prefix=self.remove_prefix_in_ckpt)
            state_dict = self.state_dict_converter(state_dict)
            os.makedirs(self.output_path, exist_ok=True)
--- a/diffsynth/diffusion/runner.py
+++ b/diffsynth/diffusion/runner.py
@@ -27,7 +27,7 @@ def launch_training_task(
    optimizer = torch.optim.AdamW(model.trainable_modules(), lr=learning_rate, weight_decay=weight_decay)
    scheduler = torch.optim.lr_scheduler.ConstantLR(optimizer)
    dataloader = torch.utils.data.DataLoader(dataset, shuffle=True, collate_fn=lambda x: x[0], num_workers=num_workers)
-    model.to(device=accelerator.device)
+    
    model, optimizer, dataloader, scheduler = accelerator.prepare(model, optimizer, dataloader, scheduler)
    
    for epoch_id in range(num_epochs):
@@ -40,7 +40,7 @@ def launch_training_task(
                    loss = model(data)
                accelerator.backward(loss)
                optimizer.step()
-                model_logger.on_step_end(accelerator, model, save_steps, loss=loss)
+                model_logger.on_step_end(accelerator, model, save_steps)
                scheduler.step()
        if save_steps is None:
            model_logger.on_epoch_end(accelerator, model, epoch_id)
@@ -59,7 +59,6 @@ def launch_data_process_task(
        num_workers = args.dataset_num_workers
        
    dataloader = torch.utils.data.DataLoader(dataset, shuffle=False, collate_fn=lambda x: x[0], num_workers=num_workers)
-    model.to(device=accelerator.device)
    model, dataloader = accelerator.prepare(model, dataloader)
    
    for data_id, data in enumerate(tqdm(dataloader)):
--- a/diffsynth/diffusion/training_module.py
+++ b/diffsynth/diffusion/training_module.py
@@ -1,4 +1,4 @@
-import torch, json, os
+import torch, json
 from ..core import ModelConfig, load_state_dict
 from ..utils.controlnet import ControlNetInput
 from peft import LoraConfig, inject_adapter_in_model
@@ -127,67 +127,16 @@ class DiffusionTrainingModule(torch.nn.Module):
        if model_id_with_origin_paths is not None:
            model_id_with_origin_paths = model_id_with_origin_paths.split(",")
            for model_id_with_origin_path in model_id_with_origin_paths:
+                model_id, origin_file_pattern = model_id_with_origin_path.split(":")
                vram_config = self.parse_vram_config(
                    fp8=model_id_with_origin_path in fp8_models,
                    offload=model_id_with_origin_path in offload_models,
                    device=device
                )
-                config = self.parse_path_or_model_id(model_id_with_origin_path)
-                model_configs.append(ModelConfig(model_id=config.model_id, origin_file_pattern=config.origin_file_pattern, **vram_config))
+                model_configs.append(ModelConfig(model_id=model_id, origin_file_pattern=origin_file_pattern, **vram_config))
        return model_configs
    
-
-    def parse_path_or_model_id(self, model_id_with_origin_path, default_value=None):
-        if model_id_with_origin_path is None:
-            return default_value
-        elif os.path.exists(model_id_with_origin_path):
-            return ModelConfig(path=model_id_with_origin_path)
-        else:
-            if ":" not in model_id_with_origin_path:
-                raise ValueError(f"Failed to parse model config: {model_id_with_origin_path}. This is neither a valid path nor in the format of `model_id/origin_file_pattern`.")
-            split_id = model_id_with_origin_path.rfind(":")
-            model_id = model_id_with_origin_path[:split_id]
-            origin_file_pattern = model_id_with_origin_path[split_id + 1:]
-            return ModelConfig(model_id=model_id, origin_file_pattern=origin_file_pattern)
-
-
-    def auto_detect_lora_target_modules(
-        self,
-        model: torch.nn.Module,
-        search_for_linear=False,
-        linear_detector=lambda x: min(x.weight.shape) >= 512,
-        block_list_detector=lambda x: isinstance(x, torch.nn.ModuleList) and len(x) > 1,
-        name_prefix="",
-    ):
-        lora_target_modules = []
-        if search_for_linear:
-            for name, module in model.named_modules():
-                module_name = name_prefix + ["", "."][name_prefix != ""] + name
-                if isinstance(module, torch.nn.Linear) and linear_detector(module):
-                    lora_target_modules.append(module_name)
-        else:
-            for name, module in model.named_children():
-                module_name = name_prefix + ["", "."][name_prefix != ""] + name
-                lora_target_modules += self.auto_detect_lora_target_modules(
-                    module,
-                    search_for_linear=block_list_detector(module),
-                    linear_detector=linear_detector,
-                    block_list_detector=block_list_detector,
-                    name_prefix=module_name,
-                )
-        return lora_target_modules
    
-
-    def parse_lora_target_modules(self, model, lora_target_modules):
-        if lora_target_modules == "":
-            print("No LoRA target modules specified. The framework will automatically search for them.")
-            lora_target_modules = self.auto_detect_lora_target_modules(model)
-            print(f"LoRA will be patched at {lora_target_modules}.")
-        else:
-            lora_target_modules = lora_target_modules.split(",")
-        return lora_target_modules
-
-
    def switch_pipe_to_training_mode(
        self,
        pipe,
@@ -217,7 +166,7 @@ class DiffusionTrainingModule(torch.nn.Module):
                return
            model = self.add_lora_to_model(
                getattr(pipe, lora_base_model),
-                target_modules=self.parse_lora_target_modules(getattr(pipe, lora_base_model), lora_target_modules),
+                target_modules=lora_target_modules.split(","),
                lora_rank=lora_rank,
                upcast_dtype=pipe.torch_dtype,
            )
--- a/diffsynth/models/dinov3_image_encoder.py
+++ b/diffsynth/models/dinov3_image_encoder.py
@@ -2,8 +2,6 @@ from transformers import DINOv3ViTModel, DINOv3ViTImageProcessorFast
 from transformers.models.dinov3_vit.modeling_dinov3_vit import DINOv3ViTConfig
 import torch

-from ..core.device.npu_compatible_device import get_device_type
-

 class DINOv3ImageEncoder(DINOv3ViTModel):
    def __init__(self):
@@ -72,7 +70,7 @@ class DINOv3ImageEncoder(DINOv3ViTModel):
            }
        )
        
-    def forward(self, image, torch_dtype=torch.bfloat16, device=get_device_type()):
+    def forward(self, image, torch_dtype=torch.bfloat16, device="cuda"):
        inputs = self.processor(images=image, return_tensors="pt")
        pixel_values = inputs["pixel_values"].to(dtype=torch_dtype, device=device)
        bool_masked_pos = None
--- a/diffsynth/models/flux2_dit.py
+++ b/diffsynth/models/flux2_dit.py
@@ -823,13 +823,7 @@ class Flux2PosEmbed(nn.Module):


 class Flux2TimestepGuidanceEmbeddings(nn.Module):
-    def __init__(
-        self,
-        in_channels: int = 256,
-        embedding_dim: int = 6144,
-        bias: bool = False,
-        guidance_embeds: bool = True,
-    ):
+    def __init__(self, in_channels: int = 256, embedding_dim: int = 6144, bias: bool = False):
        super().__init__()

        self.time_proj = Timesteps(num_channels=in_channels, flip_sin_to_cos=True, downscale_freq_shift=0)
@@ -837,24 +831,20 @@ class Flux2TimestepGuidanceEmbeddings(nn.Module):
            in_channels=in_channels, time_embed_dim=embedding_dim, sample_proj_bias=bias
        )

-        if guidance_embeds:
-            self.guidance_embedder = TimestepEmbedding(
-                in_channels=in_channels, time_embed_dim=embedding_dim, sample_proj_bias=bias
-            )
-        else:
-            self.guidance_embedder = None
+        self.guidance_embedder = TimestepEmbedding(
+            in_channels=in_channels, time_embed_dim=embedding_dim, sample_proj_bias=bias
+        )

    def forward(self, timestep: torch.Tensor, guidance: torch.Tensor) -> torch.Tensor:
        timesteps_proj = self.time_proj(timestep)
        timesteps_emb = self.timestep_embedder(timesteps_proj.to(timestep.dtype))  # (N, D)

-        if guidance is not None and self.guidance_embedder is not None:
-            guidance_proj = self.time_proj(guidance)
-            guidance_emb = self.guidance_embedder(guidance_proj.to(guidance.dtype))  # (N, D)
-            time_guidance_emb = timesteps_emb + guidance_emb
-            return time_guidance_emb
-        else:
-            return timesteps_emb
+        guidance_proj = self.time_proj(guidance)
+        guidance_emb = self.guidance_embedder(guidance_proj.to(guidance.dtype))  # (N, D)
+
+        time_guidance_emb = timesteps_emb + guidance_emb
+
+        return time_guidance_emb


 class Flux2Modulation(nn.Module):
@@ -892,7 +882,6 @@ class Flux2DiT(torch.nn.Module):
        axes_dims_rope: Tuple[int, ...] = (32, 32, 32, 32),
        rope_theta: int = 2000,
        eps: float = 1e-6,
-        guidance_embeds: bool = True,
    ):
        super().__init__()
        self.out_channels = out_channels or in_channels
@@ -903,10 +892,7 @@ class Flux2DiT(torch.nn.Module):

        # 2. Combined timestep + guidance embedding
        self.time_guidance_embed = Flux2TimestepGuidanceEmbeddings(
-            in_channels=timestep_guidance_channels,
-            embedding_dim=self.inner_dim,
-            bias=False,
-            guidance_embeds=guidance_embeds,
+            in_channels=timestep_guidance_channels, embedding_dim=self.inner_dim, bias=False
        )

        # 3. Modulation (double stream and single stream blocks share modulation parameters, resp.)
@@ -967,9 +953,34 @@ class Flux2DiT(torch.nn.Module):
        txt_ids: torch.Tensor = None,
        guidance: torch.Tensor = None,
        joint_attention_kwargs: Optional[Dict[str, Any]] = None,
+        return_dict: bool = True,
        use_gradient_checkpointing=False,
        use_gradient_checkpointing_offload=False,
-    ):
+    ) -> Union[torch.Tensor]:
+        """
+        The [`FluxTransformer2DModel`] forward method.
+
+        Args:
+            hidden_states (`torch.Tensor` of shape `(batch_size, image_sequence_length, in_channels)`):
+                Input `hidden_states`.
+            encoder_hidden_states (`torch.Tensor` of shape `(batch_size, text_sequence_length, joint_attention_dim)`):
+                Conditional embeddings (embeddings computed from the input conditions such as prompts) to use.
+            timestep ( `torch.LongTensor`):
+                Used to indicate denoising step.
+            block_controlnet_hidden_states: (`list` of `torch.Tensor`):
+                A list of tensors that if specified are added to the residuals of transformer blocks.
+            joint_attention_kwargs (`dict`, *optional*):
+                A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under
+                `self.processor` in
+                [diffusers.models.attention_processor](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_processor.py).
+            return_dict (`bool`, *optional*, defaults to `True`):
+                Whether or not to return a [`~models.transformer_2d.Transformer2DModelOutput`] instead of a plain
+                tuple.
+
+        Returns:
+            If `return_dict` is True, an [`~models.transformer_2d.Transformer2DModelOutput`] is returned, otherwise a
+            `tuple` where the first element is the sample tensor.
+        """
        # 0. Handle input arguments
        if joint_attention_kwargs is not None:
            joint_attention_kwargs = joint_attention_kwargs.copy()
@@ -981,9 +992,7 @@ class Flux2DiT(torch.nn.Module):

        # 1. Calculate timestep embedding and modulation parameters
        timestep = timestep.to(hidden_states.dtype) * 1000
-
-        if guidance is not None:
-            guidance = guidance.to(hidden_states.dtype) * 1000
+        guidance = guidance.to(hidden_states.dtype) * 1000

        temb = self.time_guidance_embed(timestep, guidance)

--- a/diffsynth/models/longcat_video_dit.py
+++ b/diffsynth/models/longcat_video_dit.py
@@ -9,7 +9,6 @@ import numpy as np
 import torch.nn.functional as F
 from einops import rearrange, repeat
 from .wan_video_dit import flash_attention
-from ..core.device.npu_compatible_device import get_device_type
 from ..core.gradient import gradient_checkpoint_forward


@@ -374,7 +373,7 @@ class FinalLayer_FP32(nn.Module):
        B, N, C = x.shape
        T, _, _ = latent_shape

-        with amp.autocast(get_device_type(), dtype=torch.float32):
+        with amp.autocast('cuda', dtype=torch.float32):
            shift, scale = self.adaLN_modulation(t).unsqueeze(2).chunk(2, dim=-1) # [B, T, 1, C]
            x = modulate_fp32(self.norm_final, x.view(B, T, -1, C), shift, scale).view(B, N, C)
            x = self.linear(x)
@@ -584,7 +583,7 @@ class LongCatSingleStreamBlock(nn.Module):
        T, _, _ = latent_shape # S != T*H*W in case of CP split on H*W.

        # compute modulation params in fp32
-        with amp.autocast(device_type=get_device_type(), dtype=torch.float32):
+        with amp.autocast(device_type='cuda', dtype=torch.float32):
            shift_msa, scale_msa, gate_msa, \
            shift_mlp, scale_mlp, gate_mlp = \
                self.adaLN_modulation(t).unsqueeze(2).chunk(6, dim=-1) # [B, T, 1, C]
@@ -603,7 +602,7 @@ class LongCatSingleStreamBlock(nn.Module):
        else:
            x_s = attn_outputs

-        with amp.autocast(device_type=get_device_type(), dtype=torch.float32):
+        with amp.autocast(device_type='cuda', dtype=torch.float32):
            x = x + (gate_msa * x_s.view(B, -1, N//T, C)).view(B, -1, C) # [B, N, C]
        x = x.to(x_dtype)

@@ -616,7 +615,7 @@ class LongCatSingleStreamBlock(nn.Module):
        # ffn with modulation
        x_m = modulate_fp32(self.mod_norm_ffn, x.view(B, -1, N//T, C), shift_mlp, scale_mlp).view(B, -1, C)
        x_s = self.ffn(x_m)
-        with amp.autocast(device_type=get_device_type(), dtype=torch.float32):
+        with amp.autocast(device_type='cuda', dtype=torch.float32):
            x = x + (gate_mlp * x_s.view(B, -1, N//T, C)).view(B, -1, C) # [B, N, C]
        x = x.to(x_dtype)

@@ -798,7 +797,7 @@ class LongCatVideoTransformer3DModel(torch.nn.Module):

        hidden_states = self.x_embedder(hidden_states)  # [B, N, C]

-        with amp.autocast(device_type=get_device_type(), dtype=torch.float32):
+        with amp.autocast(device_type='cuda', dtype=torch.float32):
            t = self.t_embedder(timestep.float().flatten(), dtype=torch.float32).reshape(B, N_t, -1)  # [B, T, C_t]

        encoder_hidden_states = self.y_embedder(encoder_hidden_states)  # [B, 1, N_token, C]
--- a/diffsynth/models/nexus_gen_ar_model.py
+++ b/diffsynth/models/nexus_gen_ar_model.py
@@ -583,7 +583,7 @@ class Qwen2_5_VLForConditionalGeneration(Qwen2_5_VLPreTrainedModel, GenerationMi
            is_compileable = model_kwargs["past_key_values"].is_compileable and self._supports_static_cache
            is_compileable = is_compileable and not self.generation_config.disable_compile
            if is_compileable and (
-                self.device.type in ["cuda", "npu"] or generation_config.compile_config._compile_all_devices
+                self.device.type == "cuda" or generation_config.compile_config._compile_all_devices
            ):
                os.environ["TOKENIZERS_PARALLELISM"] = "0"
                model_forward = self.get_compiled_call(generation_config.compile_config)
--- a/diffsynth/models/siglip2_image_encoder.py
+++ b/diffsynth/models/siglip2_image_encoder.py
@@ -2,8 +2,6 @@ from transformers.models.siglip.modeling_siglip import SiglipVisionTransformer,
 from transformers import SiglipImageProcessor, Siglip2VisionModel, Siglip2VisionConfig, Siglip2ImageProcessorFast
 import torch

-from diffsynth.core.device.npu_compatible_device import get_device_type
-

 class Siglip2ImageEncoder(SiglipVisionTransformer):
    def __init__(self):
@@ -49,7 +47,7 @@ class Siglip2ImageEncoder(SiglipVisionTransformer):
            }
        )
        
-    def forward(self, image, torch_dtype=torch.bfloat16, device=get_device_type()):
+    def forward(self, image, torch_dtype=torch.bfloat16, device="cuda"):
        pixel_values = self.processor(images=[image], return_tensors="pt")["pixel_values"]
        pixel_values = pixel_values.to(device=device, dtype=torch_dtype)
        output_attentions = False
@@ -92,10 +90,12 @@ class Siglip2ImageEncoder428M(Siglip2VisionModel):
        super().__init__(config)
        self.processor = Siglip2ImageProcessorFast(
            **{
+                "crop_size": None,
                "data_format": "channels_first",
                "default_to_square": True,
                "device": None,
                "disable_grouping": None,
+                "do_center_crop": None,
                "do_convert_rgb": None,
                "do_normalize": True,
                "do_pad": None,
@@ -120,6 +120,7 @@ class Siglip2ImageEncoder428M(Siglip2VisionModel):
                "resample": 2,
                "rescale_factor": 0.00392156862745098,
                "return_tensors": None,
+                "size": None
            }
        )
        
--- a/diffsynth/models/step1x_text_encoder.py
+++ b/diffsynth/models/step1x_text_encoder.py
@@ -1,11 +1,10 @@
 import torch
 from typing import Optional, Union
 from .qwen_image_text_encoder import QwenImageTextEncoder
-from ..core.device.npu_compatible_device import get_device_type, get_torch_device


 class Step1xEditEmbedder(torch.nn.Module):
-    def __init__(self, model: QwenImageTextEncoder, processor, max_length=640, dtype=torch.bfloat16, device=get_device_type()):
+    def __init__(self, model: QwenImageTextEncoder, processor, max_length=640, dtype=torch.bfloat16, device="cuda"):
        super().__init__()
        self.max_length = max_length
        self.dtype = dtype
@@ -78,13 +77,13 @@ User Prompt:'''
            self.max_length,
            self.model.config.hidden_size,
            dtype=torch.bfloat16,
-            device=get_torch_device().current_device(),
+            device=torch.cuda.current_device(),
        )
        masks = torch.zeros(
            len(text_list),
            self.max_length,
            dtype=torch.long,
-            device=get_torch_device().current_device(),
+            device=torch.cuda.current_device(),
        )

        def split_string(s):
@@ -159,7 +158,7 @@ User Prompt:'''
                else:
                    token_list.append(token_each)

-            new_txt_ids = torch.cat(token_list, dim=1).to(get_device_type())
+            new_txt_ids = torch.cat(token_list, dim=1).to("cuda")

            new_txt_ids = new_txt_ids.to(old_inputs_ids.device)

@@ -168,15 +167,15 @@ User Prompt:'''
            inputs.input_ids = (
                torch.cat([old_inputs_ids[0, :idx1], new_txt_ids[0, idx2:]], dim=0)
                .unsqueeze(0)
-                .to(get_device_type())
+                .to("cuda")
            )
-            inputs.attention_mask = (inputs.input_ids > 0).long().to(get_device_type())
+            inputs.attention_mask = (inputs.input_ids > 0).long().to("cuda")
            outputs = self.model_forward(
                self.model,
                input_ids=inputs.input_ids,
                attention_mask=inputs.attention_mask,
-                pixel_values=inputs.pixel_values.to(get_device_type()),
-                image_grid_thw=inputs.image_grid_thw.to(get_device_type()),
+                pixel_values=inputs.pixel_values.to("cuda"),
+                image_grid_thw=inputs.image_grid_thw.to("cuda"),
                output_hidden_states=True,
            )

@@ -189,7 +188,7 @@ User Prompt:'''
            masks[idx, : min(self.max_length, emb.shape[1] - 217)] = torch.ones(
                (min(self.max_length, emb.shape[1] - 217)),
                dtype=torch.long,
-                device=get_torch_device().current_device(),
+                device=torch.cuda.current_device(),
            )

        return embs, masks
--- a/diffsynth/models/wan_video_dit.py
+++ b/diffsynth/models/wan_video_dit.py
@@ -5,8 +5,6 @@ import math
 from typing import Tuple, Optional
 from einops import rearrange
 from .wan_video_camera_controller import SimpleAdapter
-from ..core.gradient import gradient_checkpoint_forward
-
 try:
    import flash_attn_interface
    FLASH_ATTN_3_AVAILABLE = True
@@ -94,7 +92,6 @@ def rope_apply(x, freqs, num_heads):
    x = rearrange(x, "b s (n d) -> b s n d", n=num_heads)
    x_out = torch.view_as_complex(x.to(torch.float64).reshape(
        x.shape[0], x.shape[1], x.shape[2], -1, 2))
-    freqs = freqs.to(torch.complex64) if freqs.device == "npu" else freqs
    x_out = torch.view_as_real(x_out * freqs).flatten(2)
    return x_out.to(x.dtype)

@@ -380,15 +377,27 @@ class WanModel(torch.nn.Module):
            self.freqs[1][:h].view(1, h, 1, -1).expand(f, h, w, -1),
            self.freqs[2][:w].view(1, 1, w, -1).expand(f, h, w, -1)
        ], dim=-1).reshape(f * h * w, 1, -1).to(x.device)
+        
+        def create_custom_forward(module):
+            def custom_forward(*inputs):
+                return module(*inputs)
+            return custom_forward

        for block in self.blocks:
-            if self.training:
-                x = gradient_checkpoint_forward(
-                    block,
-                    use_gradient_checkpointing,
-                    use_gradient_checkpointing_offload,
-                    x, context, t_mod, freqs
-                )
+            if self.training and use_gradient_checkpointing:
+                if use_gradient_checkpointing_offload:
+                    with torch.autograd.graph.save_on_cpu():
+                        x = torch.utils.checkpoint.checkpoint(
+                            create_custom_forward(block),
+                            x, context, t_mod, freqs,
+                            use_reentrant=False,
+                        )
+                else:
+                    x = torch.utils.checkpoint.checkpoint(
+                        create_custom_forward(block),
+                        x, context, t_mod, freqs,
+                        use_reentrant=False,
+                    )
            else:
                x = block(x, context, t_mod, freqs)

--- a/diffsynth/models/wan_video_dit_s2v.py
+++ b/diffsynth/models/wan_video_dit_s2v.py
@@ -4,7 +4,6 @@ import torch.nn as nn
 import torch.nn.functional as F
 from typing import Tuple
 from .wan_video_dit import rearrange, precompute_freqs_cis_3d, DiTBlock, Head, CrossAttention, modulate, sinusoidal_embedding_1d
-from ..core.gradient import gradient_checkpoint_forward


 def torch_dfs(model: nn.Module, parent_name='root'):
@@ -546,19 +545,46 @@ class WanS2VModel(torch.nn.Module):
        t = self.time_embedding(sinusoidal_embedding_1d(self.freq_dim, timestep))
        t_mod = self.time_projection(t).unflatten(1, (6, self.dim)).unsqueeze(2).transpose(0, 2)

+        def create_custom_forward(module):
+            def custom_forward(*inputs):
+                return module(*inputs)
+            return custom_forward
+
        for block_id, block in enumerate(self.blocks):
-            x = gradient_checkpoint_forward(
-                block,
-                use_gradient_checkpointing,
-                use_gradient_checkpointing_offload,
-                x, context, t_mod, seq_len_x, pre_compute_freqs[0]
-            )
-            x = gradient_checkpoint_forward(
-                lambda x: self.after_transformer_block(block_id, x, audio_emb_global, merged_audio_emb, seq_len_x),
-                use_gradient_checkpointing,
-                use_gradient_checkpointing_offload,
-                x
-            )
+            if use_gradient_checkpointing_offload:
+                with torch.autograd.graph.save_on_cpu():
+                    x = torch.utils.checkpoint.checkpoint(
+                        create_custom_forward(block),
+                        x,
+                        context,
+                        t_mod,
+                        seq_len_x,
+                        pre_compute_freqs[0],
+                        use_reentrant=False,
+                    )
+                    x = torch.utils.checkpoint.checkpoint(
+                        create_custom_forward(lambda x: self.after_transformer_block(block_id, x, audio_emb_global, merged_audio_emb, seq_len_x)),
+                        x,
+                        use_reentrant=False,
+                    )
+            elif use_gradient_checkpointing:
+                x = torch.utils.checkpoint.checkpoint(
+                    create_custom_forward(block),
+                    x,
+                    context,
+                    t_mod,
+                    seq_len_x,
+                    pre_compute_freqs[0],
+                    use_reentrant=False,
+                )
+                x = torch.utils.checkpoint.checkpoint(
+                    create_custom_forward(lambda x: self.after_transformer_block(block_id, x, audio_emb_global, merged_audio_emb, seq_len_x)),
+                    x,
+                    use_reentrant=False,
+                )
+            else:
+                x = block(x, context, t_mod, seq_len_x, pre_compute_freqs[0])
+                x = self.after_transformer_block(block_id, x, audio_emb_global, merged_audio_emb, seq_len_x)

        x = x[:, :seq_len_x]
        x = self.head(x, t[:-1])
--- a/diffsynth/models/wan_video_vace.py
+++ b/diffsynth/models/wan_video_vace.py
@@ -1,6 +1,6 @@
 import torch
 from .wan_video_dit import DiTBlock
-from ..core.gradient import gradient_checkpoint_forward
+

 class VaceWanAttentionBlock(DiTBlock):
    def __init__(self, has_image_input, dim, num_heads, ffn_dim, eps=1e-6, block_id=0):
@@ -62,13 +62,26 @@ class VaceWanModel(torch.nn.Module):
                      dim=1) for u in c
        ])
        
+        def create_custom_forward(module):
+            def custom_forward(*inputs):
+                return module(*inputs)
+            return custom_forward
+        
        for block in self.vace_blocks:
-            c = gradient_checkpoint_forward(
-                block,
-                use_gradient_checkpointing,
-                use_gradient_checkpointing_offload,
-                c, x, context, t_mod, freqs
-            )
-            
+            if use_gradient_checkpointing_offload:
+                with torch.autograd.graph.save_on_cpu():
+                    c = torch.utils.checkpoint.checkpoint(
+                        create_custom_forward(block),
+                        c, x, context, t_mod, freqs,
+                        use_reentrant=False,
+                    )
+            elif use_gradient_checkpointing:
+                c = torch.utils.checkpoint.checkpoint(
+                    create_custom_forward(block),
+                    c, x, context, t_mod, freqs,
+                    use_reentrant=False,
+                )
+            else:
+                c = block(c, x, context, t_mod, freqs)
        hints = torch.unbind(c)[:-1]
        return hints
--- a/diffsynth/models/wan_video_vae.py
+++ b/diffsynth/models/wan_video_vae.py
@@ -171,7 +171,7 @@ class Resample(nn.Module):
                        torch.cat([feat_cache[idx][:, :, -1:, :, :], x], 2))
                    feat_cache[idx] = cache_x
                    feat_idx[0] += 1
-        return x, feat_cache, feat_idx
+        return x

    def init_weight(self, conv):
        conv_weight = conv.weight
@@ -298,7 +298,7 @@ class ResidualBlock(nn.Module):
                feat_idx[0] += 1
            else:
                x = layer(x)
-        return x + h, feat_cache, feat_idx
+        return x + h


 class AttentionBlock(nn.Module):
@@ -471,7 +471,7 @@ class Down_ResidualBlock(nn.Module):
        for module in self.downsamples:
            x = module(x, feat_cache, feat_idx)

-        return x + self.avg_shortcut(x_copy), feat_cache, feat_idx
+        return x + self.avg_shortcut(x_copy)


 class Up_ResidualBlock(nn.Module):
@@ -511,7 +511,7 @@ class Up_ResidualBlock(nn.Module):
            x_shortcut = self.avg_shortcut(x, first_chunk)
            return x_main + x_shortcut
        else:
-            return x_main, feat_cache, feat_idx
+            return x_main


 class Encoder3d(nn.Module):
@@ -586,14 +586,14 @@ class Encoder3d(nn.Module):
        ## downsamples
        for layer in self.downsamples:
            if feat_cache is not None:
-                x, feat_cache, feat_idx = layer(x, feat_cache, feat_idx)
+                x = layer(x, feat_cache, feat_idx)
            else:
                x = layer(x)

        ## middle
        for layer in self.middle:
            if check_is_instance(layer, ResidualBlock) and feat_cache is not None:
-                x, feat_cache, feat_idx = layer(x, feat_cache, feat_idx)
+                x = layer(x, feat_cache, feat_idx)
            else:
                x = layer(x)

@@ -614,7 +614,7 @@ class Encoder3d(nn.Module):
                feat_idx[0] += 1
            else:
                x = layer(x)
-        return x, feat_cache, feat_idx
+        return x


 class Encoder3d_38(nn.Module):
@@ -698,14 +698,14 @@ class Encoder3d_38(nn.Module):
        ## downsamples
        for layer in self.downsamples:
            if feat_cache is not None:
-                x, feat_cache, feat_idx = layer(x, feat_cache, feat_idx)
+                x = layer(x, feat_cache, feat_idx)
            else:
                x = layer(x)

        ## middle
        for layer in self.middle:
            if isinstance(layer, ResidualBlock) and feat_cache is not None:
-                x, feat_cache, feat_idx = layer(x, feat_cache, feat_idx)
+                x = layer(x, feat_cache, feat_idx)
            else:
                x = layer(x)

@@ -730,7 +730,7 @@ class Encoder3d_38(nn.Module):
            else:
                x = layer(x)

-        return x, feat_cache, feat_idx
+        return x


 class Decoder3d(nn.Module):
@@ -807,14 +807,14 @@ class Decoder3d(nn.Module):
        ## middle
        for layer in self.middle:
            if check_is_instance(layer, ResidualBlock) and feat_cache is not None:
-                x, feat_cache, feat_idx = layer(x, feat_cache, feat_idx)
+                x = layer(x, feat_cache, feat_idx)
            else:
                x = layer(x)

        ## upsamples
        for layer in self.upsamples:
            if feat_cache is not None:
-                x, feat_cache, feat_idx = layer(x, feat_cache, feat_idx)
+                x = layer(x, feat_cache, feat_idx)
            else:
                x = layer(x)

@@ -835,7 +835,7 @@ class Decoder3d(nn.Module):
                feat_idx[0] += 1
            else:
                x = layer(x)
-        return x, feat_cache, feat_idx
+        return x



@@ -906,14 +906,14 @@ class Decoder3d_38(nn.Module):

        for layer in self.middle:
            if check_is_instance(layer, ResidualBlock) and feat_cache is not None:
-                x, feat_cache, feat_idx = layer(x, feat_cache, feat_idx)
+                x = layer(x, feat_cache, feat_idx)
            else:
                x = layer(x)

        ## upsamples
        for layer in self.upsamples:
            if feat_cache is not None:
-                x, feat_cache, feat_idx = layer(x, feat_cache, feat_idx, first_chunk)
+                x = layer(x, feat_cache, feat_idx, first_chunk)
            else:
                x = layer(x)

@@ -937,7 +937,7 @@ class Decoder3d_38(nn.Module):
                feat_idx[0] += 1
            else:
                x = layer(x)
-        return x, feat_cache, feat_idx
+        return x


 def count_conv3d(model):
@@ -990,11 +990,11 @@ class VideoVAE_(nn.Module):
        for i in range(iter_):
            self._enc_conv_idx = [0]
            if i == 0:
-                out, self._enc_feat_map, self._enc_conv_idx = self.encoder(x[:, :, :1, :, :],
+                out = self.encoder(x[:, :, :1, :, :],
                                   feat_cache=self._enc_feat_map,
                                   feat_idx=self._enc_conv_idx)
            else:
-                out_, self._enc_feat_map, self._enc_conv_idx = self.encoder(x[:, :, 1 + 4 * (i - 1):1 + 4 * i, :, :],
+                out_ = self.encoder(x[:, :, 1 + 4 * (i - 1):1 + 4 * i, :, :],
                                    feat_cache=self._enc_feat_map,
                                    feat_idx=self._enc_conv_idx)
                out = torch.cat([out, out_], 2)
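
Note on the `wan_video_vae.py` hunks above: every layer now returns only `x`, and callers no longer rebind `feat_cache` and `feat_idx`. This works as long as both are mutated in place (`feat_cache[idx] = ...`, `feat_idx[0] += 1`), as the context lines suggest. A minimal pure-Python sketch of that pattern, using hypothetical stand-in functions rather than the real modules:

```python
def toy_layer(x, feat_cache, feat_idx):
    """Stand-in for a cached layer: fills its cache slot and advances the index."""
    idx = feat_idx[0]
    feat_cache[idx] = x   # mutates the caller's list in place
    feat_idx[0] += 1      # mutates the caller's one-element counter in place
    return x + 1          # only the activation is returned


def toy_encoder(x, feat_cache, feat_idx):
    """Stand-in for an encoder loop that threads the shared cache through layers."""
    for _ in range(len(feat_cache)):
        x = toy_layer(x, feat_cache, feat_idx)
    return x


if __name__ == "__main__":
    cache = [None] * 3
    idx = [0]
    out = toy_encoder(10, cache, idx)
    print(out, cache, idx)  # 13 [10, 11, 12] [3]
```

The caller still sees the updated cache and counter because lists are passed by reference; a layer that rebound `feat_cache` to a new list would silently break this.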
--- a/diffsynth/models/z_image_controlnet.py
+++ b/diffsynth/models/z_image_controlnet.py
@@ -1,154 +0,0 @@
-from .z_image_dit import ZImageTransformerBlock
-from ..core.gradient import gradient_checkpoint_forward
-from torch.nn.utils.rnn import pad_sequence
-import torch
-from torch import nn
-
-
-class ZImageControlTransformerBlock(ZImageTransformerBlock):
-    def __init__(
-        self, 
-        layer_id: int = 1000,
-        dim: int = 3840,
-        n_heads: int = 30,
-        n_kv_heads: int = 30,
-        norm_eps: float = 1e-5,
-        qk_norm: bool = True,
-        modulation = True,
-        block_id = 0
-    ):
-        super().__init__(layer_id, dim, n_heads, n_kv_heads, norm_eps, qk_norm, modulation)
-        self.block_id = block_id
-        if block_id == 0:
-            self.before_proj = nn.Linear(self.dim, self.dim)
-        self.after_proj = nn.Linear(self.dim, self.dim)
-
-    def forward(self, c, x, **kwargs):
-        if self.block_id == 0:
-            c = self.before_proj(c) + x
-            all_c = []
-        else:
-            all_c = list(torch.unbind(c))
-            c = all_c.pop(-1)
-
-        c = super().forward(c, **kwargs)
-        c_skip = self.after_proj(c)
-        all_c += [c_skip, c]
-        c = torch.stack(all_c)
-        return c
-
-
-class ZImageControlNet(torch.nn.Module):
-    def __init__(
-        self,
-        control_layers_places=(0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28),
-        control_in_dim=33,
-        dim=3840,
-        n_refiner_layers=2,
-    ):
-        super().__init__()
-        self.control_layers = nn.ModuleList([ZImageControlTransformerBlock(layer_id=i, block_id=i) for i in control_layers_places])
-        self.control_all_x_embedder = nn.ModuleDict({"2-1": nn.Linear(1 * 2 * 2 * control_in_dim, dim, bias=True)})
-        self.control_noise_refiner = nn.ModuleList([ZImageControlTransformerBlock(block_id=layer_id) for layer_id in range(n_refiner_layers)])
-        self.control_layers_mapping = {0: 0, 2: 1, 4: 2, 6: 3, 8: 4, 10: 5, 12: 6, 14: 7, 16: 8, 18: 9, 20: 10, 22: 11, 24: 12, 26: 13, 28: 14}
-
-    def forward_layers(
-        self,
-        x,
-        cap_feats,
-        control_context,
-        control_context_item_seqlens,
-        kwargs,
-        use_gradient_checkpointing=False,
-        use_gradient_checkpointing_offload=False,
-    ):
-        bsz = len(control_context)
-        # unified
-        cap_item_seqlens = [len(_) for _ in cap_feats]
-        control_context_unified = []
-        for i in range(bsz):
-            control_context_len = control_context_item_seqlens[i]
-            cap_len = cap_item_seqlens[i]
-            control_context_unified.append(torch.cat([control_context[i][:control_context_len], cap_feats[i][:cap_len]]))
-        c = pad_sequence(control_context_unified, batch_first=True, padding_value=0.0)
-
-        # arguments
-        new_kwargs = dict(x=x)
-        new_kwargs.update(kwargs)
-        
-        for layer in self.control_layers:
-            c = gradient_checkpoint_forward(
-                layer,
-                use_gradient_checkpointing=use_gradient_checkpointing,
-                use_gradient_checkpointing_offload=use_gradient_checkpointing_offload,
-                c=c, **new_kwargs
-            )
- 
-        hints = torch.unbind(c)[:-1]
-        return hints
-    
-    def forward_refiner(
-        self,
-        dit,
-        x,
-        cap_feats,
-        control_context,
-        kwargs,
-        t=None,
-        patch_size=2,
-        f_patch_size=1,
-        use_gradient_checkpointing=False,
-        use_gradient_checkpointing_offload=False,
-    ):
-        # embeddings
-        bsz = len(control_context)
-        device = control_context[0].device
-        (
-            control_context,
-            control_context_size,
-            control_context_pos_ids,
-            control_context_inner_pad_mask,
-        ) = dit.patchify_controlnet(control_context, patch_size, f_patch_size, cap_feats[0].size(0))
-
-        # control_context embed & refine
-        control_context_item_seqlens = [len(_) for _ in control_context]
-        assert all(_ % 2 == 0 for _ in control_context_item_seqlens)
-        control_context_max_item_seqlen = max(control_context_item_seqlens)
-
-        control_context = torch.cat(control_context, dim=0)
-        control_context = self.control_all_x_embedder[f"{patch_size}-{f_patch_size}"](control_context)
-
-        # Match t_embedder output dtype to control_context for layerwise casting compatibility
-        adaln_input = t.type_as(control_context)
-        control_context[torch.cat(control_context_inner_pad_mask)] = dit.x_pad_token.to(dtype=control_context.dtype, device=control_context.device)
-        control_context = list(control_context.split(control_context_item_seqlens, dim=0))
-        control_context_freqs_cis = list(dit.rope_embedder(torch.cat(control_context_pos_ids, dim=0)).split(control_context_item_seqlens, dim=0))
-
-        control_context = pad_sequence(control_context, batch_first=True, padding_value=0.0)
-        control_context_freqs_cis = pad_sequence(control_context_freqs_cis, batch_first=True, padding_value=0.0)
-        control_context_attn_mask = torch.zeros((bsz, control_context_max_item_seqlen), dtype=torch.bool, device=device)
-        for i, seq_len in enumerate(control_context_item_seqlens):
-            control_context_attn_mask[i, :seq_len] = 1
-        c = control_context
-
-        # arguments
-        new_kwargs = dict(
-            x=x, 
-            attn_mask=control_context_attn_mask,
-            freqs_cis=control_context_freqs_cis, 
-            adaln_input=adaln_input,
-        )
-        new_kwargs.update(kwargs)
-        
-        for layer in self.control_noise_refiner:
-            c = gradient_checkpoint_forward(
-                layer,
-                use_gradient_checkpointing=use_gradient_checkpointing,
-                use_gradient_checkpointing_offload=use_gradient_checkpointing_offload,
-                c=c, **new_kwargs
-            )
- 
-        hints = torch.unbind(c)[:-1]
-        control_context = torch.unbind(c)[-1]
-
-        return hints, control_context, control_context_item_seqlens
--- a/diffsynth/models/z_image_dit.py
+++ b/diffsynth/models/z_image_dit.py
@@ -6,9 +6,8 @@ import torch.nn as nn
 import torch.nn.functional as F
 from torch.nn.utils.rnn import pad_sequence

-from .general_modules import RMSNorm
+from torch.nn import RMSNorm
 from ..core.attention import attention_forward
-from ..core.device.npu_compatible_device import IS_NPU_AVAILABLE, get_device_type
 from ..core.gradient import gradient_checkpoint_forward


@@ -40,7 +39,7 @@ class TimestepEmbedder(nn.Module):

    @staticmethod
    def timestep_embedding(t, dim, max_period=10000):
-        with torch.amp.autocast(get_device_type(), enabled=False):
+        with torch.amp.autocast("cuda", enabled=False):
            half = dim // 2
            freqs = torch.exp(
                -math.log(max_period) * torch.arange(start=0, end=half, dtype=torch.float32, device=t.device) / half
@@ -105,7 +104,7 @@ class Attention(torch.nn.Module):

        # Apply RoPE
        def apply_rotary_emb(x_in: torch.Tensor, freqs_cis: torch.Tensor) -> torch.Tensor:
-            with torch.amp.autocast(get_device_type(), enabled=False):
+            with torch.amp.autocast("cuda", enabled=False):
                x = torch.view_as_complex(x_in.float().reshape(*x_in.shape[:-1], -1, 2))
                freqs_cis = freqs_cis.unsqueeze(2)
                x_out = torch.view_as_real(x * freqs_cis).flatten(3)
@@ -316,10 +315,7 @@ class RopeEmbedder:
        result = []
        for i in range(len(self.axes_dims)):
            index = ids[:, i]
-            if IS_NPU_AVAILABLE:
-                result.append(torch.index_select(self.freqs_cis[i], 0, index))
-            else:
-                result.append(self.freqs_cis[i][index])
+            result.append(self.freqs_cis[i][index])
        return torch.cat(result, dim=-1)


@@ -613,72 +609,6 @@ class ZImageDiT(nn.Module):
    #         all_img_pad_mask,
    #         all_cap_pad_mask,
    #     )
-
-    def patchify_controlnet(
-        self,
-        all_image: List[torch.Tensor],
-        patch_size: int = 2,
-        f_patch_size: int = 1,
-        cap_padding_len: int = None,
-    ):
-        pH = pW = patch_size
-        pF = f_patch_size
-        device = all_image[0].device
-
-        all_image_out = []
-        all_image_size = []
-        all_image_pos_ids = []
-        all_image_pad_mask = []
-
-        for i, image in enumerate(all_image):
-            ### Process Image
-            C, F, H, W = image.size()
-            all_image_size.append((F, H, W))
-            F_tokens, H_tokens, W_tokens = F // pF, H // pH, W // pW
-
-            image = image.view(C, F_tokens, pF, H_tokens, pH, W_tokens, pW)
-            # "c f pf h ph w pw -> (f h w) (pf ph pw c)"
-            image = image.permute(1, 3, 5, 2, 4, 6, 0).reshape(F_tokens * H_tokens * W_tokens, pF * pH * pW * C)
-
-            image_ori_len = len(image)
-            image_padding_len = (-image_ori_len) % SEQ_MULTI_OF
-
-            image_ori_pos_ids = self.create_coordinate_grid(
-                size=(F_tokens, H_tokens, W_tokens),
-                start=(cap_padding_len + 1, 0, 0),
-                device=device,
-            ).flatten(0, 2)
-            image_padding_pos_ids = (
-                self.create_coordinate_grid(
-                    size=(1, 1, 1),
-                    start=(0, 0, 0),
-                    device=device,
-                )
-                .flatten(0, 2)
-                .repeat(image_padding_len, 1)
-            )
-            image_padded_pos_ids = torch.cat([image_ori_pos_ids, image_padding_pos_ids], dim=0)
-            all_image_pos_ids.append(image_padded_pos_ids)
-            # pad mask
-            all_image_pad_mask.append(
-                torch.cat(
-                    [
-                        torch.zeros((image_ori_len,), dtype=torch.bool, device=device),
-                        torch.ones((image_padding_len,), dtype=torch.bool, device=device),
-                    ],
-                    dim=0,
-                )
-            )
-            # padded feature
-            image_padded_feat = torch.cat([image, image[-1:].repeat(image_padding_len, 1)], dim=0)
-            all_image_out.append(image_padded_feat)
-
-        return (
-            all_image_out,
-            all_image_size,
-            all_image_pos_ids,
-            all_image_pad_mask,
-        )
    
    def _prepare_sequence(
        self,
@@ -696,7 +626,7 @@ class ZImageDiT(nn.Module):

        # Pad token
        feats_cat = torch.cat(feats, dim=0)
-        feats_cat[torch.cat(inner_pad_mask)] = pad_token.to(dtype=feats_cat.dtype, device=feats_cat.device)
+        feats_cat[torch.cat(inner_pad_mask)] = pad_token
        feats = list(feats_cat.split(item_seqlens, dim=0))

        # RoPE
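
Note on the removed `patchify_controlnet`: it pads each token sequence up to the next multiple of `SEQ_MULTI_OF` using `(-image_ori_len) % SEQ_MULTI_OF` and builds a boolean mask that is true only on the padded tail. The sketch below reproduces that arithmetic with plain lists. The value 32 is an assumption for illustration only; the real constant is defined elsewhere in `z_image_dit.py` and is not shown in this diff.

```python
SEQ_MULTI_OF = 32  # assumed value for illustration, not taken from the diff


def padding_len(seq_len, multiple=SEQ_MULTI_OF):
    """Number of pad tokens needed to reach the next multiple (0 if already aligned)."""
    return (-seq_len) % multiple


def pad_mask(seq_len, multiple=SEQ_MULTI_OF):
    """False for original tokens, True for padded positions."""
    return [False] * seq_len + [True] * padding_len(seq_len, multiple)


if __name__ == "__main__":
    for n in (1, 64, 100):
        mask = pad_mask(n)
        print(n, padding_len(n), len(mask))
```

The negated-modulo form avoids a separate branch for lengths that are already aligned.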
--- a/diffsynth/models/z_image_image2lora.py
+++ b/diffsynth/models/z_image_image2lora.py
@@ -1,189 +0,0 @@
-import torch
-from .qwen_image_image2lora import ImageEmbeddingToLoraMatrix, SequencialMLP
-
-
-class LoRATrainerBlock(torch.nn.Module):
-    def __init__(self, lora_patterns, in_dim=1536+4096, compress_dim=128, rank=4, block_id=0, use_residual=True, residual_length=64+7, residual_dim=3584, residual_mid_dim=1024, prefix="transformer_blocks"):
-        super().__init__()
-        self.prefix = prefix
-        self.lora_patterns = lora_patterns
-        self.block_id = block_id
-        self.layers = []
-        for name, lora_a_dim, lora_b_dim in self.lora_patterns:
-            self.layers.append(ImageEmbeddingToLoraMatrix(in_dim, compress_dim, lora_a_dim, lora_b_dim, rank))
-        self.layers = torch.nn.ModuleList(self.layers)
-        if use_residual:
-            self.proj_residual = SequencialMLP(residual_length, residual_dim, residual_mid_dim, compress_dim)
-        else:
-            self.proj_residual = None
-    
-    def forward(self, x, residual=None):
-        lora = {}
-        if self.proj_residual is not None: residual = self.proj_residual(residual)
-        for lora_pattern, layer in zip(self.lora_patterns, self.layers):
-            name = lora_pattern[0]
-            lora_a, lora_b = layer(x, residual=residual)
-            lora[f"{self.prefix}.{self.block_id}.{name}.lora_A.default.weight"] = lora_a
-            lora[f"{self.prefix}.{self.block_id}.{name}.lora_B.default.weight"] = lora_b
-        return lora
-
-
-class ZImageImage2LoRAComponent(torch.nn.Module):
-    def __init__(self, lora_patterns, prefix, num_blocks=60, use_residual=True, compress_dim=128, rank=4, residual_length=64+7, residual_mid_dim=1024):
-        super().__init__()
-        self.lora_patterns = lora_patterns
-        self.num_blocks = num_blocks
-        self.blocks = []
-        for lora_patterns in self.lora_patterns:
-            for block_id in range(self.num_blocks):
-                self.blocks.append(LoRATrainerBlock(lora_patterns, block_id=block_id, use_residual=use_residual, compress_dim=compress_dim, rank=rank, residual_length=residual_length, residual_mid_dim=residual_mid_dim, prefix=prefix))
-        self.blocks = torch.nn.ModuleList(self.blocks)
-        self.residual_scale = 0.05
-        self.use_residual = use_residual
-        
-    def forward(self, x, residual=None):
-        if residual is not None:
-            if self.use_residual:
-                residual = residual * self.residual_scale
-            else:
-                residual = None
-        lora = {}
-        for block in self.blocks:
-            lora.update(block(x, residual))
-        return lora
-
-
-class ZImageImage2LoRAModel(torch.nn.Module):
-    def __init__(self, use_residual=False, compress_dim=64, rank=4, residual_length=64+7, residual_mid_dim=1024):
-        super().__init__()
-        lora_patterns = [
-            [
-                ("attention.to_q", 3840, 3840),
-                ("attention.to_k", 3840, 3840),
-                ("attention.to_v", 3840, 3840),
-                ("attention.to_out.0", 3840, 3840),
-            ],
-            [
-                ("feed_forward.w1", 3840, 10240),
-                ("feed_forward.w2", 10240, 3840),
-                ("feed_forward.w3", 3840, 10240),
-            ],
-        ]
-        config = {
-            "lora_patterns": lora_patterns,
-            "use_residual": use_residual,
-            "compress_dim": compress_dim,
-            "rank": rank,
-            "residual_length": residual_length,
-            "residual_mid_dim": residual_mid_dim,
-        }
-        self.layers_lora = ZImageImage2LoRAComponent(
-            prefix="layers",
-            num_blocks=30,
-            **config,
-        )
-        self.context_refiner_lora = ZImageImage2LoRAComponent(
-            prefix="context_refiner",
-            num_blocks=2,
-            **config,
-        )
-        self.noise_refiner_lora = ZImageImage2LoRAComponent(
-            prefix="noise_refiner",
-            num_blocks=2,
-            **config,
-        )
-        
-    def forward(self, x, residual=None):
-        lora = {}
-        lora.update(self.layers_lora(x, residual=residual))
-        lora.update(self.context_refiner_lora(x, residual=residual))
-        lora.update(self.noise_refiner_lora(x, residual=residual))
-        return lora
-
-    def initialize_weights(self):
-        state_dict = self.state_dict()
-        for name in state_dict:
-            if ".proj_a." in name:
-                state_dict[name] = state_dict[name] * 0.3
-            elif ".proj_b.proj_out." in name:
-                state_dict[name] = state_dict[name] * 0
-            elif ".proj_residual.proj_out." in name:
-                state_dict[name] = state_dict[name] * 0.3
-        self.load_state_dict(state_dict)
-
-
-class ImageEmb2LoRAWeightCompressed(torch.nn.Module):
-    def __init__(self, in_dim, out_dim, emb_dim, rank):
-        super().__init__()
-        self.lora_a = torch.nn.Parameter(torch.randn((rank, in_dim)))
-        self.lora_b = torch.nn.Parameter(torch.randn((out_dim, rank)))
-        self.proj = torch.nn.Linear(emb_dim, rank * rank, bias=True)
-        self.rank = rank
-    
-    def forward(self, x):
-        x = self.proj(x).view(self.rank, self.rank)
-        lora_a = x @ self.lora_a
-        lora_b = self.lora_b
-        return lora_a, lora_b
-
-
-class ZImageImage2LoRAModelCompressed(torch.nn.Module):
-    def __init__(self, emb_dim=1536+4096, rank=32):
-        super().__init__()
-        target_layers = [
-            ("attention.to_q", 3840, 3840),
-            ("attention.to_k", 3840, 3840),
-            ("attention.to_v", 3840, 3840),
-            ("attention.to_out.0", 3840, 3840),
-            ("feed_forward.w1", 3840, 10240),
-            ("feed_forward.w2", 10240, 3840),
-            ("feed_forward.w3", 3840, 10240),
-        ]
-        self.lora_patterns = [
-            {
-                "prefix": "layers",
-                "num_layers": 30,
-                "target_layers": target_layers,
-            },
-            {
-                "prefix": "context_refiner",
-                "num_layers": 2,
-                "target_layers": target_layers,
-            },
-            {
-                "prefix": "noise_refiner",
-                "num_layers": 2,
-                "target_layers": target_layers,
-            },
-        ]
-        module_dict = {}
-        for lora_pattern in self.lora_patterns:
-            prefix, num_layers, target_layers = lora_pattern["prefix"], lora_pattern["num_layers"], lora_pattern["target_layers"]
-            for layer_id in range(num_layers):
-                for layer_name, in_dim, out_dim in target_layers:
-                    name = f"{prefix}.{layer_id}.{layer_name}".replace(".", "___")
-                    model = ImageEmb2LoRAWeightCompressed(in_dim, out_dim, emb_dim, rank)
-                    module_dict[name] = model
-        self.module_dict = torch.nn.ModuleDict(module_dict)
-
-    def forward(self, x, residual=None):
-        lora = {}
-        for name, module in self.module_dict.items():
-            name = name.replace("___", ".")
-            name_a, name_b = f"{name}.lora_A.default.weight", f"{name}.lora_B.default.weight"
-            lora_a, lora_b = module(x)
-            lora[name_a] = lora_a
-            lora[name_b] = lora_b
-        return lora
-
-    def initialize_weights(self):
-        state_dict = self.state_dict()
-        for name in state_dict:
-            if "lora_b" in name:
-                state_dict[name] = state_dict[name] * 0
-            elif "lora_a" in name:
-                state_dict[name] = state_dict[name] * 0.2
-            elif "proj.weight" in name:
-                print(name)
-                state_dict[name] = state_dict[name] * 0.2
-        self.load_state_dict(state_dict)
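
Note on the removed `ZImageImage2LoRAModelCompressed`: it stores one submodule per target layer in a `torch.nn.ModuleDict`, whose keys may not contain `.`, so dotted parameter paths are escaped with `___` on the way in and restored in `forward`. A pure-Python sketch of that round trip follows; the helper names are hypothetical, and the prefixes, layer counts, and target names are copied from the removed file.

```python
SEP = "___"

TARGETS = [
    "attention.to_q", "attention.to_k", "attention.to_v", "attention.to_out.0",
    "feed_forward.w1", "feed_forward.w2", "feed_forward.w3",
]
GROUPS = [("layers", 30), ("context_refiner", 2), ("noise_refiner", 2)]


def escape_key(name):
    """Turn a dotted parameter path into a key without dots."""
    return name.replace(".", SEP)


def unescape_key(key):
    """Restore the dotted parameter path from an escaped key."""
    return key.replace(SEP, ".")


def build_keys():
    """Enumerate escaped keys in the same nesting order as the removed constructor."""
    return [
        escape_key(f"{prefix}.{layer_id}.{target}")
        for prefix, num_layers in GROUPS
        for layer_id in range(num_layers)
        for target in TARGETS
    ]


if __name__ == "__main__":
    keys = build_keys()
    print(len(keys), keys[3], unescape_key(keys[3]))
```

The round trip is only lossless because none of the original names contain `___` themselves.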
--- a/diffsynth/models/z_image_text_encoder.py
+++ b/diffsynth/models/z_image_text_encoder.py
@@ -3,71 +3,38 @@ import torch


 class ZImageTextEncoder(torch.nn.Module):
-    def __init__(self, model_size="4B"):
+    def __init__(self):
        super().__init__()
-        config_dict = {
-            "4B": Qwen3Config(**{
-                "architectures": [
-                    "Qwen3ForCausalLM"
-                ],
-                "attention_bias": False,
-                "attention_dropout": 0.0,
-                "bos_token_id": 151643,
-                "eos_token_id": 151645,
-                "head_dim": 128,
-                "hidden_act": "silu",
-                "hidden_size": 2560,
-                "initializer_range": 0.02,
-                "intermediate_size": 9728,
-                "max_position_embeddings": 40960,
-                "max_window_layers": 36,
-                "model_type": "qwen3",
-                "num_attention_heads": 32,
-                "num_hidden_layers": 36,
-                "num_key_value_heads": 8,
-                "rms_norm_eps": 1e-06,
-                "rope_scaling": None,
-                "rope_theta": 1000000,
-                "sliding_window": None,
-                "tie_word_embeddings": True,
-                "torch_dtype": "bfloat16",
-                "transformers_version": "4.51.0",
-                "use_cache": True,
-                "use_sliding_window": False,
-                "vocab_size": 151936
-            }),
-            "8B": Qwen3Config(**{
-                "architectures": [
-                    "Qwen3ForCausalLM"
-                ],
-                "attention_bias": False,
-                "attention_dropout": 0.0,
-                "bos_token_id": 151643,
-                "dtype": "bfloat16",
-                "eos_token_id": 151645,
-                "head_dim": 128,
-                "hidden_act": "silu",
-                "hidden_size": 4096,
-                "initializer_range": 0.02,
-                "intermediate_size": 12288,
-                "max_position_embeddings": 40960,
-                "max_window_layers": 36,
-                "model_type": "qwen3",
-                "num_attention_heads": 32,
-                "num_hidden_layers": 36,
-                "num_key_value_heads": 8,
-                "rms_norm_eps": 1e-06,
-                "rope_scaling": None,
-                "rope_theta": 1000000,
-                "sliding_window": None,
-                "tie_word_embeddings": False,
-                "transformers_version": "4.56.1",
-                "use_cache": True,
-                "use_sliding_window": False,
-                "vocab_size": 151936
-            })
-        }
-        config = config_dict[model_size]
+        config = Qwen3Config(**{
+            "architectures": [
+                "Qwen3ForCausalLM"
+            ],
+            "attention_bias": False,
+            "attention_dropout": 0.0,
+            "bos_token_id": 151643,
+            "eos_token_id": 151645,
+            "head_dim": 128,
+            "hidden_act": "silu",
+            "hidden_size": 2560,
+            "initializer_range": 0.02,
+            "intermediate_size": 9728,
+            "max_position_embeddings": 40960,
+            "max_window_layers": 36,
+            "model_type": "qwen3",
+            "num_attention_heads": 32,
+            "num_hidden_layers": 36,
+            "num_key_value_heads": 8,
+            "rms_norm_eps": 1e-06,
+            "rope_scaling": None,
+            "rope_theta": 1000000,
+            "sliding_window": None,
+            "tie_word_embeddings": True,
+            "torch_dtype": "bfloat16",
+            "transformers_version": "4.51.0",
+            "use_cache": True,
+            "use_sliding_window": False,
+            "vocab_size": 151936
+        })
        self.model = Qwen3Model(config)
    
    def forward(self, *args, **kwargs):
--- a/diffsynth/pipelines/flux2_image.py
+++ b/diffsynth/pipelines/flux2_image.py
@@ -1,4 +1,4 @@
-import torch, math, torchvision
+import torch, math
 from PIL import Image
 from typing import Union
 from tqdm import tqdm
@@ -6,28 +6,25 @@ from einops import rearrange
 import numpy as np
 from typing import Union, List, Optional, Tuple

-from ..core.device.npu_compatible_device import get_device_type
 from ..diffusion import FlowMatchScheduler
 from ..core import ModelConfig, gradient_checkpoint_forward
 from ..diffusion.base_pipeline import BasePipeline, PipelineUnit, ControlNetInput

-from transformers import AutoProcessor, AutoTokenizer
+from transformers import AutoProcessor
 from ..models.flux2_text_encoder import Flux2TextEncoder
 from ..models.flux2_dit import Flux2DiT
 from ..models.flux2_vae import Flux2VAE
-from ..models.z_image_text_encoder import ZImageTextEncoder


 class Flux2ImagePipeline(BasePipeline):

-    def __init__(self, device=get_device_type(), torch_dtype=torch.bfloat16):
+    def __init__(self, device="cuda", torch_dtype=torch.bfloat16):
        super().__init__(
            device=device, torch_dtype=torch_dtype,
            height_division_factor=16, width_division_factor=16,
        )
        self.scheduler = FlowMatchScheduler("FLUX.2")
        self.text_encoder: Flux2TextEncoder = None
-        self.text_encoder_qwen3: ZImageTextEncoder = None
        self.dit: Flux2DiT = None
        self.vae: Flux2VAE = None
        self.tokenizer: AutoProcessor = None
@@ -35,10 +32,8 @@ class Flux2ImagePipeline(BasePipeline):
        self.units = [
            Flux2Unit_ShapeChecker(),
            Flux2Unit_PromptEmbedder(),
-            Flux2Unit_Qwen3PromptEmbedder(),
            Flux2Unit_NoiseInitializer(),
            Flux2Unit_InputImageEmbedder(),
-            Flux2Unit_EditImageEmbedder(),
            Flux2Unit_ImageIDs(),
        ]
        self.model_fn = model_fn_flux2
@@ -47,7 +42,7 @@ class Flux2ImagePipeline(BasePipeline):
    @staticmethod
    def from_pretrained(
        torch_dtype: torch.dtype = torch.bfloat16,
-        device: Union[str, torch.device] = get_device_type(),
+        device: Union[str, torch.device] = "cuda",
        model_configs: list[ModelConfig] = [],
        tokenizer_config: ModelConfig = ModelConfig(model_id="black-forest-labs/FLUX.2-dev", origin_file_pattern="tokenizer/"),
        vram_limit: float = None,
@@ -58,12 +53,11 @@ class Flux2ImagePipeline(BasePipeline):
        
        # Fetch models
        pipe.text_encoder = model_pool.fetch_model("flux2_text_encoder")
-        pipe.text_encoder_qwen3 = model_pool.fetch_model("z_image_text_encoder")
        pipe.dit = model_pool.fetch_model("flux2_dit")
        pipe.vae = model_pool.fetch_model("flux2_vae")
        if tokenizer_config is not None:
            tokenizer_config.download_if_necessary()
-            pipe.tokenizer = AutoTokenizer.from_pretrained(tokenizer_config.path)
+            pipe.tokenizer = AutoProcessor.from_pretrained(tokenizer_config.path)
        
        # VRAM Management
        pipe.vram_management_enabled = pipe.check_vram_management_state()
@@ -81,9 +75,6 @@ class Flux2ImagePipeline(BasePipeline):
        # Image
        input_image: Image.Image = None,
        denoising_strength: float = 1.0,
-        # Edit
-        edit_image: Union[Image.Image, List[Image.Image]] = None,
-        edit_image_auto_resize: bool = True,
        # Shape
        height: int = 1024,
        width: int = 1024,
@@ -107,7 +98,6 @@ class Flux2ImagePipeline(BasePipeline):
        inputs_shared = {
            "cfg_scale": cfg_scale, "embedded_guidance": embedded_guidance,
            "input_image": input_image, "denoising_strength": denoising_strength,
-            "edit_image": edit_image, "edit_image_auto_resize": edit_image_auto_resize,
            "height": height, "width": width,
            "seed": seed, "rand_device": rand_device,
            "num_inference_steps": num_inference_steps,
@@ -285,10 +275,6 @@ class Flux2Unit_PromptEmbedder(PipelineUnit):
        return prompt_embeds, text_ids

    def process(self, pipe: Flux2ImagePipeline, prompt):
-        # Skip if Qwen3 text encoder is available (handled by Qwen3PromptEmbedder)
-        if pipe.text_encoder_qwen3 is not None:
-            return {}
-        
        pipe.load_models_to_device(self.onload_model_names)
        prompt_embeds, text_ids = self.encode_prompt(
            pipe.text_encoder, pipe.tokenizer, prompt,
@@ -297,135 +283,6 @@ class Flux2Unit_PromptEmbedder(PipelineUnit):
        return {"prompt_embeds": prompt_embeds, "text_ids": text_ids}


-class Flux2Unit_Qwen3PromptEmbedder(PipelineUnit):
-    def __init__(self):
-        super().__init__(
-            seperate_cfg=True,
-            input_params_posi={"prompt": "prompt"},
-            input_params_nega={"prompt": "negative_prompt"},
-            output_params=("prompt_emb", "prompt_emb_mask"),
-            onload_model_names=("text_encoder_qwen3",)
-        )
-        self.hidden_states_layers = (9, 18, 27)  # Qwen3 layers
-
-    def get_qwen3_prompt_embeds(
-        self,
-        text_encoder: ZImageTextEncoder,
-        tokenizer: AutoTokenizer,
-        prompt: Union[str, List[str]],
-        dtype: Optional[torch.dtype] = None,
-        device: Optional[torch.device] = None,
-        max_sequence_length: int = 512,
-    ):
-        dtype = text_encoder.dtype if dtype is None else dtype
-        device = text_encoder.device if device is None else device
-
-        prompt = [prompt] if isinstance(prompt, str) else prompt
-
-        all_input_ids = []
-        all_attention_masks = []
-
-        for single_prompt in prompt:
-            messages = [{"role": "user", "content": single_prompt}]
-            text = tokenizer.apply_chat_template(
-                messages,
-                tokenize=False,
-                add_generation_prompt=True,
-                enable_thinking=False,
-            )
-            inputs = tokenizer(
-                text,
-                return_tensors="pt",
-                padding="max_length",
-                truncation=True,
-                max_length=max_sequence_length,
-            )
-
-            all_input_ids.append(inputs["input_ids"])
-            all_attention_masks.append(inputs["attention_mask"])
-
-        input_ids = torch.cat(all_input_ids, dim=0).to(device)
-        attention_mask = torch.cat(all_attention_masks, dim=0).to(device)
-
-        # Forward pass through the model
-        output = text_encoder(
-            input_ids=input_ids,
-            attention_mask=attention_mask,
-            output_hidden_states=True,
-            use_cache=False,
-        )
-
-        # Only use outputs from intermediate layers and stack them
-        out = torch.stack([output.hidden_states[k] for k in self.hidden_states_layers], dim=1)
-        out = out.to(dtype=dtype, device=device)
-
-        batch_size, num_channels, seq_len, hidden_dim = out.shape
-        prompt_embeds = out.permute(0, 2, 1, 3).reshape(batch_size, seq_len, num_channels * hidden_dim)
-        return prompt_embeds
-
-    def prepare_text_ids(
-        self,
-        x: torch.Tensor,  # (B, L, D) or (L, D)
-        t_coord: Optional[torch.Tensor] = None,
-    ):
-        B, L, _ = x.shape
-        out_ids = []
-
-        for i in range(B):
-            t = torch.arange(1) if t_coord is None else t_coord[i]
-            h = torch.arange(1)
-            w = torch.arange(1)
-            l = torch.arange(L)
-
-            coords = torch.cartesian_prod(t, h, w, l)
-            out_ids.append(coords)
-
-        return torch.stack(out_ids)
-
-    def encode_prompt(
-        self,
-        text_encoder: ZImageTextEncoder,
-        tokenizer: AutoTokenizer,
-        prompt: Union[str, List[str]],
-        dtype = None,
-        device: Optional[torch.device] = None,
-        num_images_per_prompt: int = 1,
-        prompt_embeds: Optional[torch.Tensor] = None,
-        max_sequence_length: int = 512,
-    ):
-        prompt = [prompt] if isinstance(prompt, str) else prompt
-
-        if prompt_embeds is None:
-            prompt_embeds = self.get_qwen3_prompt_embeds(
-                text_encoder=text_encoder,
-                tokenizer=tokenizer,
-                prompt=prompt,
-                dtype=dtype,
-                device=device,
-                max_sequence_length=max_sequence_length,
-            )
-
-        batch_size, seq_len, _ = prompt_embeds.shape
-        prompt_embeds = prompt_embeds.repeat(1, num_images_per_prompt, 1)
-        prompt_embeds = prompt_embeds.view(batch_size * num_images_per_prompt, seq_len, -1)
-
-        text_ids = self.prepare_text_ids(prompt_embeds)
-        text_ids = text_ids.to(device)
-        return prompt_embeds, text_ids
-
-    def process(self, pipe: Flux2ImagePipeline, prompt):
-        # Check if Qwen3 text encoder is available
-        if pipe.text_encoder_qwen3 is None:
-            return {}
-        
-        pipe.load_models_to_device(self.onload_model_names)
-        prompt_embeds, text_ids = self.encode_prompt(
-            pipe.text_encoder_qwen3, pipe.tokenizer, prompt,
-            dtype=pipe.torch_dtype, device=pipe.device,
-        )
-        return {"prompt_embeds": prompt_embeds, "text_ids": text_ids}
-
-
 class Flux2Unit_NoiseInitializer(PipelineUnit):
    def __init__(self):
        super().__init__(
@@ -461,75 +318,6 @@ class Flux2Unit_InputImageEmbedder(PipelineUnit):
            return {"latents": latents, "input_latents": input_latents}


-class Flux2Unit_EditImageEmbedder(PipelineUnit):
-    def __init__(self):
-        super().__init__(
-            input_params=("edit_image", "edit_image_auto_resize"),
-            output_params=("edit_latents", "edit_image_ids"),
-            onload_model_names=("vae",)
-        )
-
-    def calculate_dimensions(self, target_area, ratio):
-        import math
-        width = math.sqrt(target_area * ratio)
-        height = width / ratio
-        width = round(width / 32) * 32
-        height = round(height / 32) * 32
-        return width, height
-    
-    def crop_and_resize(self, image, target_height, target_width):
-        width, height = image.size
-        scale = max(target_width / width, target_height / height)
-        image = torchvision.transforms.functional.resize(
-            image,
-            (round(height*scale), round(width*scale)),
-            interpolation=torchvision.transforms.InterpolationMode.BILINEAR
-        )
-        image = torchvision.transforms.functional.center_crop(image, (target_height, target_width))
-        return image
-
-    def edit_image_auto_resize(self, edit_image):
-        calculated_width, calculated_height = self.calculate_dimensions(1024 * 1024, edit_image.size[0] / edit_image.size[1])
-        return self.crop_and_resize(edit_image, calculated_height, calculated_width)
-    
-    def process_image_ids(self, image_latents, scale=10):
-        t_coords = [scale + scale * t for t in torch.arange(0, len(image_latents))]
-        t_coords = [t.view(-1) for t in t_coords]
-
-        image_latent_ids = []
-        for x, t in zip(image_latents, t_coords):
-            x = x.squeeze(0)
-            _, height, width = x.shape
-
-            x_ids = torch.cartesian_prod(t, torch.arange(height), torch.arange(width), torch.arange(1))
-            image_latent_ids.append(x_ids)
-
-        image_latent_ids = torch.cat(image_latent_ids, dim=0)
-        image_latent_ids = image_latent_ids.unsqueeze(0)
-
-        return image_latent_ids
-
-    def process(self, pipe: Flux2ImagePipeline, edit_image, edit_image_auto_resize):
-        if edit_image is None:
-            return {}
-        pipe.load_models_to_device(self.onload_model_names)
-        if isinstance(edit_image, Image.Image):
-            edit_image = [edit_image]
-        resized_edit_image, edit_latents = [], []
-        for image in edit_image:
-            # Preprocess
-            if edit_image_auto_resize is None or edit_image_auto_resize:
-                image = self.edit_image_auto_resize(image)
-            resized_edit_image.append(image)
-            # Encode
-            image = pipe.preprocess_image(image)
-            latents = pipe.vae.encode(image)
-            edit_latents.append(latents)
-        edit_image_ids = self.process_image_ids(edit_latents).to(pipe.device)
-        edit_latents = torch.concat([rearrange(latents, "B C H W -> B (H W) C") for latents in edit_latents], dim=1)
-        return {"edit_latents": edit_latents, "edit_image_ids": edit_image_ids}
-
-
 class Flux2Unit_ImageIDs(PipelineUnit):
    def __init__(self):
        super().__init__(
@@ -564,17 +352,10 @@ def model_fn_flux2(
    prompt_embeds=None,
    text_ids=None,
    image_ids=None,
-    edit_latents=None,
-    edit_image_ids=None,
    use_gradient_checkpointing=False,
    use_gradient_checkpointing_offload=False,
    **kwargs,
 ):
-    image_seq_len = latents.shape[1]
-    if edit_latents is not None:
-        image_seq_len = latents.shape[1]
-        latents = torch.concat([latents, edit_latents], dim=1)
-        image_ids = torch.concat([image_ids, edit_image_ids], dim=1)
    embedded_guidance = torch.tensor([embedded_guidance], device=latents.device)
    model_output = dit(
        hidden_states=latents,
@@ -586,5 +367,4 @@ def model_fn_flux2(
        use_gradient_checkpointing=use_gradient_checkpointing,
        use_gradient_checkpointing_offload=use_gradient_checkpointing_offload,
    )
-    model_output = model_output[:, :image_seq_len]
    return model_output
--- a/diffsynth/pipelines/flux_image.py
+++ b/diffsynth/pipelines/flux_image.py
@@ -6,7 +6,6 @@ from einops import rearrange, repeat
 import numpy as np
 from transformers import CLIPTokenizer, T5TokenizerFast

-from ..core.device.npu_compatible_device import get_device_type
 from ..diffusion import FlowMatchScheduler
 from ..core import ModelConfig, gradient_checkpoint_forward, load_state_dict
 from ..diffusion.base_pipeline import BasePipeline, PipelineUnit, ControlNetInput
@@ -56,7 +55,7 @@ class MultiControlNet(torch.nn.Module):

 class FluxImagePipeline(BasePipeline):

-    def __init__(self, device=get_device_type(), torch_dtype=torch.bfloat16):
+    def __init__(self, device="cuda", torch_dtype=torch.bfloat16):
        super().__init__(
            device=device, torch_dtype=torch_dtype,
            height_division_factor=16, width_division_factor=16,
@@ -118,7 +117,7 @@ class FluxImagePipeline(BasePipeline):
    @staticmethod
    def from_pretrained(
        torch_dtype: torch.dtype = torch.bfloat16,
-        device: Union[str, torch.device] = get_device_type(),
+        device: Union[str, torch.device] = "cuda",
        model_configs: list[ModelConfig] = [],
        tokenizer_1_config: ModelConfig = ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="tokenizer/"),
        tokenizer_2_config: ModelConfig = ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="tokenizer_2/"),
@@ -378,7 +377,7 @@ class FluxImageUnit_PromptEmbedder(PipelineUnit):
        text_encoder_2,
        prompt,
        positive=True,
-        device=get_device_type(),
+        device="cuda",
        t5_sequence_length=512,
    ):
        pooled_prompt_emb = self.encode_prompt_using_clip(prompt, text_encoder_1, tokenizer_1, 77, device)
@@ -559,7 +558,7 @@ class FluxImageUnit_EntityControl(PipelineUnit):
        text_encoder_2,
        prompt,
        positive=True,
-        device=get_device_type(),
+        device="cuda",
        t5_sequence_length=512,
    ):
        pooled_prompt_emb = self.encode_prompt_using_clip(prompt, text_encoder_1, tokenizer_1, 77, device)
@@ -794,7 +793,7 @@ class FluxImageUnit_ValueControl(PipelineUnit):


 class InfinitYou(torch.nn.Module):
-    def __init__(self, device=get_device_type(), torch_dtype=torch.bfloat16):
+    def __init__(self, device="cuda", torch_dtype=torch.bfloat16):
        super().__init__()
        from facexlib.recognition import init_recognition_model
        from insightface.app import FaceAnalysis
--- a/diffsynth/pipelines/qwen_image.py
+++ b/diffsynth/pipelines/qwen_image.py
@@ -6,7 +6,6 @@ from einops import rearrange
 import numpy as np
 from math import prod

-from ..core.device.npu_compatible_device import get_device_type
 from ..diffusion import FlowMatchScheduler
 from ..core import ModelConfig, gradient_checkpoint_forward
 from ..diffusion.base_pipeline import BasePipeline, PipelineUnit, ControlNetInput
@@ -23,7 +22,7 @@ from ..models.qwen_image_image2lora import QwenImageImage2LoRAModel

 class QwenImagePipeline(BasePipeline):

-    def __init__(self, device=get_device_type(), torch_dtype=torch.bfloat16):
+    def __init__(self, device="cuda", torch_dtype=torch.bfloat16):
        super().__init__(
            device=device, torch_dtype=torch_dtype,
            height_division_factor=16, width_division_factor=16,
@@ -61,7 +60,7 @@ class QwenImagePipeline(BasePipeline):
    @staticmethod
    def from_pretrained(
        torch_dtype: torch.dtype = torch.bfloat16,
-        device: Union[str, torch.device] = get_device_type(),
+        device: Union[str, torch.device] = "cuda",
        model_configs: list[ModelConfig] = [],
        tokenizer_config: ModelConfig = ModelConfig(model_id="Qwen/Qwen-Image", origin_file_pattern="tokenizer/"),
        processor_config: ModelConfig = None,
--- a/diffsynth/pipelines/wan_video.py
+++ b/diffsynth/pipelines/wan_video.py
@@ -11,7 +11,6 @@ from typing import Optional
 from typing_extensions import Literal
 from transformers import Wav2Vec2Processor

-from ..core.device.npu_compatible_device import get_device_type
 from ..diffusion import FlowMatchScheduler
 from ..core import ModelConfig, gradient_checkpoint_forward
 from ..diffusion.base_pipeline import BasePipeline, PipelineUnit
@@ -31,7 +30,7 @@ from ..models.longcat_video_dit import LongCatVideoTransformer3DModel

 class WanVideoPipeline(BasePipeline):

-    def __init__(self, device=get_device_type(), torch_dtype=torch.bfloat16):
+    def __init__(self, device="cuda", torch_dtype=torch.bfloat16):
        super().__init__(
            device=device, torch_dtype=torch_dtype,
            height_division_factor=16, width_division_factor=16, time_division_factor=4, time_division_remainder=1
@@ -99,7 +98,7 @@ class WanVideoPipeline(BasePipeline):
    @staticmethod
    def from_pretrained(
        torch_dtype: torch.dtype = torch.bfloat16,
-        device: Union[str, torch.device] = get_device_type(),
+        device: Union[str, torch.device] = "cuda",
        model_configs: list[ModelConfig] = [],
        tokenizer_config: ModelConfig = ModelConfig(model_id="Wan-AI/Wan2.1-T2V-1.3B", origin_file_pattern="google/umt5-xxl/"),
        audio_processor_config: ModelConfig = None,
@@ -123,15 +122,11 @@ class WanVideoPipeline(BasePipeline):
                    model_config.model_id = redirect_dict[model_config.origin_file_pattern][0]
                    model_config.origin_file_pattern = redirect_dict[model_config.origin_file_pattern][1]
        
+        # Initialize pipeline
+        pipe = WanVideoPipeline(device=device, torch_dtype=torch_dtype)
        if use_usp:
            from ..utils.xfuser import initialize_usp
            initialize_usp(device)
-            import torch.distributed as dist
-            from ..core.device.npu_compatible_device import get_device_name
-            if dist.is_available() and dist.is_initialized():
-                device = get_device_name()
-        # Initialize pipeline
-        pipe = WanVideoPipeline(device=device, torch_dtype=torch_dtype)
        model_pool = pipe.download_and_load_models(model_configs, vram_limit)
        
        # Fetch models
@@ -965,7 +960,7 @@ class WanVideoUnit_AnimateInpaint(PipelineUnit):
            onload_model_names=("vae",)
        )
        
-    def get_i2v_mask(self, lat_t, lat_h, lat_w, mask_len=1, mask_pixel_values=None, device=get_device_type()):
+    def get_i2v_mask(self, lat_t, lat_h, lat_w, mask_len=1, mask_pixel_values=None, device="cuda"):
        if mask_pixel_values is None:
            msk = torch.zeros(1, (lat_t-1) * 4 + 1, lat_h, lat_w, device=device)
        else:
@@ -1321,6 +1316,11 @@ def model_fn_wan_video(
    if tea_cache_update:
        x = tea_cache.update(x)
    else:
+        def create_custom_forward(module):
+            def custom_forward(*inputs):
+                return module(*inputs)
+            return custom_forward
+        
        def create_custom_forward_vap(block, vap):
            def custom_forward(*inputs):
                return vap(block, *inputs)
@@ -1334,24 +1334,32 @@ def model_fn_wan_video(
                        x, x_vap = torch.utils.checkpoint.checkpoint(
                            create_custom_forward_vap(block, vap),
                            x, context, t_mod, freqs, x_vap, context_vap, t_mod_vap, freqs_vap, block_id,
-                            use_reentrant=False
+                            use_reentrant=False,
                        )
                elif use_gradient_checkpointing:
                    x, x_vap = torch.utils.checkpoint.checkpoint(
                        create_custom_forward_vap(block, vap),
                        x, context, t_mod, freqs, x_vap, context_vap, t_mod_vap, freqs_vap, block_id,
-                        use_reentrant=False
+                        use_reentrant=False,
                    )
                else:
                    x, x_vap = vap(block, x, context, t_mod, freqs, x_vap, context_vap, t_mod_vap, freqs_vap, block_id)
            else:
-                x = gradient_checkpoint_forward(
-                    block,
-                    use_gradient_checkpointing,
-                    use_gradient_checkpointing_offload,
-                    x, context, t_mod, freqs
-                )
-              
+                if use_gradient_checkpointing_offload:
+                    with torch.autograd.graph.save_on_cpu():
+                        x = torch.utils.checkpoint.checkpoint(
+                            create_custom_forward(block),
+                            x, context, t_mod, freqs,
+                            use_reentrant=False,
+                        )
+                elif use_gradient_checkpointing:
+                    x = torch.utils.checkpoint.checkpoint(
+                        create_custom_forward(block),
+                        x, context, t_mod, freqs,
+                        use_reentrant=False,
+                    )
+                else:
+                    x = block(x, context, t_mod, freqs)
            
            # VACE
            if vace_context is not None and block_id in vace.vace_layers_mapping:
@@ -1474,18 +1482,32 @@ def model_fn_wans2v(
        return custom_forward

    for block_id, block in enumerate(dit.blocks):
-        x = gradient_checkpoint_forward(
-                block,
-                use_gradient_checkpointing,
-                use_gradient_checkpointing_offload,
-                x, context, t_mod, seq_len_x, pre_compute_freqs[0]
+        if use_gradient_checkpointing_offload:
+            with torch.autograd.graph.save_on_cpu():
+                x = torch.utils.checkpoint.checkpoint(
+                    create_custom_forward(block),
+                    x, context, t_mod, seq_len_x, pre_compute_freqs[0],
+                    use_reentrant=False,
+                )
+                x = torch.utils.checkpoint.checkpoint(
+                    create_custom_forward(lambda x, block_id=block_id: dit.after_transformer_block(block_id, x, audio_emb_global, merged_audio_emb, seq_len_x)),
+                    x,
+                    use_reentrant=False,
+                )
+        elif use_gradient_checkpointing:
+            x = torch.utils.checkpoint.checkpoint(
+                create_custom_forward(block),
+                x, context, t_mod, seq_len_x, pre_compute_freqs[0],
+                use_reentrant=False,
            )
-        x = gradient_checkpoint_forward(
-            lambda x: dit.after_transformer_block(block_id, x, audio_emb_global, merged_audio_emb, seq_len_x),
-            use_gradient_checkpointing,
-            use_gradient_checkpointing_offload,
-            x
-        )
+            x = torch.utils.checkpoint.checkpoint(
+                create_custom_forward(lambda x, block_id=block_id: dit.after_transformer_block(block_id, x, audio_emb_global, merged_audio_emb, seq_len_x)),
+                x,
+                use_reentrant=False,
+            )
+        else:
+            x = block(x, context, t_mod, seq_len_x, pre_compute_freqs[0])
+            x = dit.after_transformer_block(block_id, x, audio_emb_global, merged_audio_emb, seq_len_x_global, use_unified_sequence_parallel)

    if use_unified_sequence_parallel and dist.is_initialized() and dist.get_world_size() > 1:
        x = get_sp_group().all_gather(x, dim=1)
--- a/diffsynth/pipelines/z_image.py
+++ b/diffsynth/pipelines/z_image.py
@@ -4,29 +4,23 @@ from typing import Union
 from tqdm import tqdm
 from einops import rearrange
 import numpy as np
-from typing import Union, List, Optional, Tuple, Iterable, Dict
+from typing import Union, List, Optional, Tuple, Iterable

-from ..core.device.npu_compatible_device import get_device_type
 from ..diffusion import FlowMatchScheduler
 from ..core import ModelConfig, gradient_checkpoint_forward
 from ..core.data.operators import ImageCropAndResize
 from ..diffusion.base_pipeline import BasePipeline, PipelineUnit, ControlNetInput
-from ..utils.lora import merge_lora

 from transformers import AutoTokenizer
 from ..models.z_image_text_encoder import ZImageTextEncoder
 from ..models.z_image_dit import ZImageDiT
 from ..models.flux_vae import FluxVAEEncoder, FluxVAEDecoder
 from ..models.siglip2_image_encoder import Siglip2ImageEncoder428M
-from ..models.z_image_controlnet import ZImageControlNet
-from ..models.siglip2_image_encoder import Siglip2ImageEncoder
-from ..models.dinov3_image_encoder import DINOv3ImageEncoder
-from ..models.z_image_image2lora import ZImageImage2LoRAModel


 class ZImagePipeline(BasePipeline):

-    def __init__(self, device=get_device_type(), torch_dtype=torch.bfloat16):
+    def __init__(self, device="cuda", torch_dtype=torch.bfloat16):
        super().__init__(
            device=device, torch_dtype=torch_dtype,
            height_division_factor=16, width_division_factor=16,
@@ -37,12 +31,8 @@ class ZImagePipeline(BasePipeline):
        self.vae_encoder: FluxVAEEncoder = None
        self.vae_decoder: FluxVAEDecoder = None
        self.image_encoder: Siglip2ImageEncoder428M = None
-        self.controlnet: ZImageControlNet = None
-        self.siglip2_image_encoder: Siglip2ImageEncoder = None
-        self.dinov3_image_encoder: DINOv3ImageEncoder = None
-        self.image2lora_style: ZImageImage2LoRAModel = None
        self.tokenizer: AutoTokenizer = None
-        self.in_iteration_models = ("dit", "controlnet")
+        self.in_iteration_models = ("dit",)
        self.units = [
            ZImageUnit_ShapeChecker(),
            ZImageUnit_PromptEmbedder(),
@@ -51,7 +41,6 @@ class ZImagePipeline(BasePipeline):
            ZImageUnit_EditImageAutoResize(),
            ZImageUnit_EditImageEmbedderVAE(),
            ZImageUnit_EditImageEmbedderSiglip(),
-            ZImageUnit_PAIControlNet(),
        ]
        self.model_fn = model_fn_z_image
    
@@ -59,7 +48,7 @@ class ZImagePipeline(BasePipeline):
    @staticmethod
    def from_pretrained(
        torch_dtype: torch.dtype = torch.bfloat16,
-        device: Union[str, torch.device] = get_device_type(),
+        device: Union[str, torch.device] = "cuda",
        model_configs: list[ModelConfig] = [],
        tokenizer_config: ModelConfig = ModelConfig(model_id="Tongyi-MAI/Z-Image-Turbo", origin_file_pattern="tokenizer/"),
        vram_limit: float = None,
@@ -74,10 +63,6 @@ class ZImagePipeline(BasePipeline):
        pipe.vae_encoder = model_pool.fetch_model("flux_vae_encoder")
        pipe.vae_decoder = model_pool.fetch_model("flux_vae_decoder")
        pipe.image_encoder = model_pool.fetch_model("siglip_vision_model_428m")
-        pipe.controlnet = model_pool.fetch_model("z_image_controlnet")
-        pipe.siglip2_image_encoder = model_pool.fetch_model("siglip2_image_encoder")
-        pipe.dinov3_image_encoder = model_pool.fetch_model("dinov3_image_encoder")
-        pipe.image2lora_style = model_pool.fetch_model("z_image_image2lora_style")
        if tokenizer_config is not None:
            tokenizer_config.download_if_necessary()
            pipe.tokenizer = AutoTokenizer.from_pretrained(tokenizer_config.path)
@@ -109,11 +94,6 @@ class ZImagePipeline(BasePipeline):
        # Steps
        num_inference_steps: int = 8,
        sigma_shift: float = None,
-        # ControlNet
-        controlnet_inputs: List[ControlNetInput] = None,
-        # Image to LoRA
-        image2lora_images: List[Image.Image] = None,
-        positive_only_lora: Dict[str, torch.Tensor] = None,
        # Progress bar
        progress_bar_cmd = tqdm,
    ):
@@ -134,8 +114,6 @@ class ZImagePipeline(BasePipeline):
            "seed": seed, "rand_device": rand_device,
            "num_inference_steps": num_inference_steps,
            "edit_image": edit_image, "edit_image_auto_resize": edit_image_auto_resize,
-            "controlnet_inputs": controlnet_inputs,
-            "image2lora_images": image2lora_images, "positive_only_lora": positive_only_lora,
        }
        for unit in self.units:
            inputs_shared, inputs_posi, inputs_nega = self.unit_runner(unit, self, inputs_shared, inputs_posi, inputs_nega)
@@ -353,9 +331,7 @@ class ZImageUnit_EditImageAutoResize(PipelineUnit):
        if edit_image_auto_resize is None or not edit_image_auto_resize:
            return {}
        operator = ImageCropAndResize(max_pixels=1024*1024, height_division_factor=16, width_division_factor=16)
-        if not isinstance(edit_image, list):
-            edit_image = [edit_image]
-        edit_image = [operator(i) for i in edit_image]
+        edit_image = operator(edit_image)
        return {"edit_image": edit_image}


@@ -400,49 +376,8 @@ class ZImageUnit_EditImageEmbedderVAE(PipelineUnit):
        return {"image_latents": image_latents}


-class ZImageUnit_PAIControlNet(PipelineUnit):
-    def __init__(self):
-        super().__init__(
-            input_params=("controlnet_inputs", "height", "width"),
-            output_params=("control_context", "control_scale"),
-            onload_model_names=("vae_encoder",)
-        )
-
-    def process(self, pipe: ZImagePipeline, controlnet_inputs: List[ControlNetInput], height, width):
-        if controlnet_inputs is None:
-            return {}
-        if len(controlnet_inputs) != 1:
-            print("Z-Image ControlNet doesn't support multi-ControlNet. Only one image will be used.")
-        controlnet_input = controlnet_inputs[0]
-        pipe.load_models_to_device(self.onload_model_names)
-
-        control_image = controlnet_input.image
-        if control_image is not None:
-            control_image = pipe.preprocess_image(control_image)
-            control_latents = pipe.vae_encoder(control_image)
-        else:
-            control_latents = torch.ones((1, 16, height // 8, width // 8), dtype=pipe.torch_dtype, device=pipe.device) * -1
-        
-        inpaint_mask = controlnet_input.inpaint_mask
-        if inpaint_mask is not None:
-            inpaint_mask = pipe.preprocess_image(inpaint_mask, min_value=0, max_value=1)
-            inpaint_image = controlnet_input.inpaint_image
-            inpaint_image = pipe.preprocess_image(inpaint_image)
-            inpaint_image = inpaint_image * (inpaint_mask < 0.5)
-            inpaint_mask = torch.nn.functional.interpolate(1 - inpaint_mask, (height // 8, width // 8), mode='nearest')[:, :1]
-        else:
-            inpaint_mask = torch.zeros((1, 1, height // 8, width // 8), dtype=pipe.torch_dtype, device=pipe.device)
-            inpaint_image = torch.zeros((1, 3, height, width), dtype=pipe.torch_dtype, device=pipe.device)
-        inpaint_latent = pipe.vae_encoder(inpaint_image)
-
-        control_context = torch.concat([control_latents, inpaint_mask, inpaint_latent], dim=1)
-        control_context = rearrange(control_context, "B C H W -> B C 1 H W")
-        return {"control_context": control_context, "control_scale": controlnet_input.scale}
-
-
 def model_fn_z_image(
    dit: ZImageDiT,
-    controlnet: ZImageControlNet = None,
    latents=None,
    timestep=None,
    prompt_embeds=None,
@@ -458,14 +393,13 @@ def model_fn_z_image(
    if dit.siglip_embedder is None:
        return model_fn_z_image_turbo(
            dit,
-            controlnet=controlnet,
-            latents=latents,
-            timestep=timestep,
-            prompt_embeds=prompt_embeds,
-            image_embeds=image_embeds,
-            image_latents=image_latents,
-            use_gradient_checkpointing=use_gradient_checkpointing,
-            use_gradient_checkpointing_offload=use_gradient_checkpointing_offload,
+            latents,
+            timestep,
+            prompt_embeds,
+            image_embeds,
+            image_latents,
+            use_gradient_checkpointing,
+            use_gradient_checkpointing_offload,
            **kwargs,
        )
    latents = [rearrange(latents, "B C H W -> C B H W")]
@@ -495,81 +429,13 @@ def model_fn_z_image(
    return model_output


-class ZImageUnit_Image2LoRAEncode(PipelineUnit):
-    def __init__(self):
-        super().__init__(
-            input_params=("image2lora_images",),
-            output_params=("image2lora_x",),
-            onload_model_names=("siglip2_image_encoder", "dinov3_image_encoder",),
-        )
-        from ..core.data.operators import ImageCropAndResize
-        self.processor_highres = ImageCropAndResize(height=1024, width=1024)
-    
-    def encode_images_using_siglip2(self, pipe: ZImagePipeline, images: list[Image.Image]):
-        pipe.load_models_to_device(["siglip2_image_encoder"])
-        embs = []
-        for image in images:
-            image = self.processor_highres(image)
-            embs.append(pipe.siglip2_image_encoder(image).to(pipe.torch_dtype))
-        embs = torch.stack(embs)
-        return embs
-    
-    def encode_images_using_dinov3(self, pipe: ZImagePipeline, images: list[Image.Image]):
-        pipe.load_models_to_device(["dinov3_image_encoder"])
-        embs = []
-        for image in images:
-            image = self.processor_highres(image)
-            embs.append(pipe.dinov3_image_encoder(image).to(pipe.torch_dtype))
-        embs = torch.stack(embs)
-        return embs
-
-    def encode_images(self, pipe: ZImagePipeline, images: list[Image.Image]):
-        if images is None:
-            return {}
-        if not isinstance(images, list):
-            images = [images]
-        embs_siglip2 = self.encode_images_using_siglip2(pipe, images)
-        embs_dinov3 = self.encode_images_using_dinov3(pipe, images)
-        x = torch.concat([embs_siglip2, embs_dinov3], dim=-1)
-        return x
-
-    def process(self, pipe: ZImagePipeline, image2lora_images):
-        if image2lora_images is None:
-            return {}
-        x = self.encode_images(pipe, image2lora_images)
-        return {"image2lora_x": x}
-
-
-class ZImageUnit_Image2LoRADecode(PipelineUnit):
-    def __init__(self):
-        super().__init__(
-            input_params=("image2lora_x",),
-            output_params=("lora",),
-            onload_model_names=("image2lora_style",),
-        )
-    
-    def process(self, pipe: ZImagePipeline, image2lora_x):
-        if image2lora_x is None:
-            return {}
-        loras = []
-        if pipe.image2lora_style is not None:
-            pipe.load_models_to_device(["image2lora_style"])
-            for x in image2lora_x:
-                loras.append(pipe.image2lora_style(x=x, residual=None))
-        lora = merge_lora(loras, alpha=1 / len(image2lora_x))
-        return {"lora": lora}
-
-
 def model_fn_z_image_turbo(
    dit: ZImageDiT,
-    controlnet: ZImageControlNet = None,
    latents=None,
    timestep=None,
    prompt_embeds=None,
    image_embeds=None,
    image_latents=None,
-    control_context=None,
-    control_scale=None,
    use_gradient_checkpointing=False,
    use_gradient_checkpointing_offload=False,
    **kwargs,
@@ -594,19 +460,11 @@ def model_fn_z_image_turbo(

    # Noise refine
    x = dit.all_x_embedder["2-1"](x)
-    x[torch.cat(patch_metadata.get("x_pad_mask"))] = dit.x_pad_token.to(dtype=x.dtype, device=x.device)
    x_freqs_cis = dit.rope_embedder(torch.cat(patch_metadata.get("x_pos_ids"), dim=0))
    x = rearrange(x, "L C -> 1 L C")
    x_freqs_cis = rearrange(x_freqs_cis, "L C -> 1 L C")
-
-    if control_context is not None:
-        kwargs = dict(attn_mask=None, freqs_cis=x_freqs_cis, adaln_input=t_noisy)
-        refiner_hints, control_context, control_context_item_seqlens = controlnet.forward_refiner(
-            dit, x, [cap_feats], control_context, kwargs, t=t_noisy, patch_size=2, f_patch_size=1,
-            use_gradient_checkpointing=use_gradient_checkpointing, use_gradient_checkpointing_offload=use_gradient_checkpointing_offload,
-        )
    
-    for layer_id, layer in enumerate(dit.noise_refiner):
+    for layer in dit.noise_refiner:
        x = gradient_checkpoint_forward(
            layer,
            use_gradient_checkpointing=use_gradient_checkpointing,
@@ -616,8 +474,6 @@ def model_fn_z_image_turbo(
            freqs_cis=x_freqs_cis,
            adaln_input=t_noisy,
        )
-        if control_context is not None:
-            x = x + refiner_hints[layer_id] * control_scale

    # Prompt refine
    cap_feats = dit.cap_embedder(cap_feats)
@@ -639,15 +495,7 @@ def model_fn_z_image_turbo(
    # Unified
    unified = torch.cat([x, cap_feats], dim=1)
    unified_freqs_cis = torch.cat([x_freqs_cis, cap_freqs_cis], dim=1)
-
-    if control_context is not None:
-        kwargs = dict(attn_mask=None, freqs_cis=unified_freqs_cis, adaln_input=t_noisy)
-        hints = controlnet.forward_layers(
-            unified, cap_feats, control_context, control_context_item_seqlens, kwargs,
-            use_gradient_checkpointing=use_gradient_checkpointing, use_gradient_checkpointing_offload=use_gradient_checkpointing_offload,
-        )
-
-    for layer_id, layer in enumerate(dit.layers):
+    for layer in dit.layers:
        unified = gradient_checkpoint_forward(
            layer,
            use_gradient_checkpointing=use_gradient_checkpointing,
@@ -657,9 +505,6 @@ def model_fn_z_image_turbo(
            freqs_cis=unified_freqs_cis,
            adaln_input=t_noisy,
        )
-        if control_context is not None:
-            if layer_id in controlnet.control_layers_mapping:
-                unified = unified + hints[controlnet.control_layers_mapping[layer_id]] * control_scale
    
    # Output
    unified = dit.all_final_layer["2-1"](unified, t_noisy)
--- a/diffsynth/utils/controlnet/annotator.py
+++ b/diffsynth/utils/controlnet/annotator.py
@@ -1,13 +1,12 @@
 from typing_extensions import Literal, TypeAlias

-from diffsynth.core.device.npu_compatible_device import get_device_type

 Processor_id: TypeAlias = Literal[
    "canny", "depth", "softedge", "lineart", "lineart_anime", "openpose", "normal", "tile", "none", "inpaint"
 ]

 class Annotator:
-    def __init__(self, processor_id: Processor_id, model_path="models/Annotators", detect_resolution=None, device=get_device_type(), skip_processor=False):
+    def __init__(self, processor_id: Processor_id, model_path="models/Annotators", detect_resolution=None, device='cuda', skip_processor=False):
        if not skip_processor:
            if processor_id == "canny":
                from controlnet_aux.processor import CannyDetector
--- a/diffsynth/utils/controlnet/controlnet_input.py
+++ b/diffsynth/utils/controlnet/controlnet_input.py
@@ -9,6 +9,5 @@ class ControlNetInput:
    start: float = 1.0
    end: float = 0.0
    image: Image.Image = None
-    inpaint_image: Image.Image = None
    inpaint_mask: Image.Image = None
    processor_id: str = None
--- a/diffsynth/utils/lora/flux.py
+++ b/diffsynth/utils/lora/flux.py
@@ -149,8 +149,6 @@ class FluxLoRALoader(GeneralLoRALoader):
                                        dtype=state_dict_[name].dtype)
                    else:
                        state_dict_.pop(name.replace(".a_to_q.", ".proj_in_besides_attn."))
-
-                    mlp = mlp.to(device=state_dict_[name].device)
                    if 'lora_A' in name:
                        param = torch.concat([
                            state_dict_.pop(name),
--- a/diffsynth/utils/state_dict_converters/flux_dit.py
+++ b/diffsynth/utils/state_dict_converters/flux_dit.py
@@ -89,109 +89,4 @@ def FluxDiTStateDictConverter(state_dict):
                state_dict_[rename] = state_dict[original_name]
        else:
            pass
-    return state_dict_
-
-
-def FluxDiTStateDictConverterFromDiffusers(state_dict):
-    global_rename_dict = {
-        "context_embedder": "context_embedder",
-        "x_embedder": "x_embedder",
-        "time_text_embed.timestep_embedder.linear_1": "time_embedder.timestep_embedder.0",
-        "time_text_embed.timestep_embedder.linear_2": "time_embedder.timestep_embedder.2",
-        "time_text_embed.guidance_embedder.linear_1": "guidance_embedder.timestep_embedder.0",
-        "time_text_embed.guidance_embedder.linear_2": "guidance_embedder.timestep_embedder.2",
-        "time_text_embed.text_embedder.linear_1": "pooled_text_embedder.0",
-        "time_text_embed.text_embedder.linear_2": "pooled_text_embedder.2",
-        "norm_out.linear": "final_norm_out.linear",
-        "proj_out": "final_proj_out",
-    }
-    rename_dict = {
-        "proj_out": "proj_out",
-        "norm1.linear": "norm1_a.linear",
-        "norm1_context.linear": "norm1_b.linear",
-        "attn.to_q": "attn.a_to_q",
-        "attn.to_k": "attn.a_to_k",
-        "attn.to_v": "attn.a_to_v",
-        "attn.to_out.0": "attn.a_to_out",
-        "attn.add_q_proj": "attn.b_to_q",
-        "attn.add_k_proj": "attn.b_to_k",
-        "attn.add_v_proj": "attn.b_to_v",
-        "attn.to_add_out": "attn.b_to_out",
-        "ff.net.0.proj": "ff_a.0",
-        "ff.net.2": "ff_a.2",
-        "ff_context.net.0.proj": "ff_b.0",
-        "ff_context.net.2": "ff_b.2",
-        "attn.norm_q": "attn.norm_q_a",
-        "attn.norm_k": "attn.norm_k_a",
-        "attn.norm_added_q": "attn.norm_q_b",
-        "attn.norm_added_k": "attn.norm_k_b",
-    }
-    rename_dict_single = {
-        "attn.to_q": "a_to_q",
-        "attn.to_k": "a_to_k",
-        "attn.to_v": "a_to_v",
-        "attn.norm_q": "norm_q_a",
-        "attn.norm_k": "norm_k_a",
-        "norm.linear": "norm.linear",
-        "proj_mlp": "proj_in_besides_attn",
-        "proj_out": "proj_out",
-    }
-    state_dict_ = {}
-    for name in state_dict:
-        param = state_dict[name]
-        if name.endswith(".weight") or name.endswith(".bias"):
-            suffix = ".weight" if name.endswith(".weight") else ".bias"
-            prefix = name[:-len(suffix)]
-            if prefix in global_rename_dict:
-                if global_rename_dict[prefix] == "final_norm_out.linear":
-                    param = torch.concat([param[3072:], param[:3072]], dim=0)
-                state_dict_[global_rename_dict[prefix] + suffix] = param
-            elif prefix.startswith("transformer_blocks."):
-                names = prefix.split(".")
-                names[0] = "blocks"
-                middle = ".".join(names[2:])
-                if middle in rename_dict:
-                    name_ = ".".join(names[:2] + [rename_dict[middle]] + [suffix[1:]])
-                    state_dict_[name_] = param
-            elif prefix.startswith("single_transformer_blocks."):
-                names = prefix.split(".")
-                names[0] = "single_blocks"
-                middle = ".".join(names[2:])
-                if middle in rename_dict_single:
-                    name_ = ".".join(names[:2] + [rename_dict_single[middle]] + [suffix[1:]])
-                    state_dict_[name_] = param
-                else:
-                    pass
-            else:
-                pass
-    for name in list(state_dict_.keys()):
-        if "single_blocks." in name and ".a_to_q." in name:
-            mlp = state_dict_.get(name.replace(".a_to_q.", ".proj_in_besides_attn."), None)
-            if mlp is None:
-                mlp = torch.zeros(4 * state_dict_[name].shape[0],
-                                    *state_dict_[name].shape[1:],
-                                    dtype=state_dict_[name].dtype)
-            else:
-                state_dict_.pop(name.replace(".a_to_q.", ".proj_in_besides_attn."))
-            param = torch.concat([
-                state_dict_.pop(name),
-                state_dict_.pop(name.replace(".a_to_q.", ".a_to_k.")),
-                state_dict_.pop(name.replace(".a_to_q.", ".a_to_v.")),
-                mlp,
-            ], dim=0)
-            name_ = name.replace(".a_to_q.", ".to_qkv_mlp.")
-            state_dict_[name_] = param
-    for name in list(state_dict_.keys()):
-        for component in ["a", "b"]:
-            if f".{component}_to_q." in name:
-                name_ = name.replace(f".{component}_to_q.", f".{component}_to_qkv.")
-                param = torch.concat([
-                    state_dict_[name.replace(f".{component}_to_q.", f".{component}_to_q.")],
-                    state_dict_[name.replace(f".{component}_to_q.", f".{component}_to_k.")],
-                    state_dict_[name.replace(f".{component}_to_q.", f".{component}_to_v.")],
-                ], dim=0)
-                state_dict_[name_] = param
-                state_dict_.pop(name.replace(f".{component}_to_q.", f".{component}_to_q."))
-                state_dict_.pop(name.replace(f".{component}_to_q.", f".{component}_to_k."))
-                state_dict_.pop(name.replace(f".{component}_to_q.", f".{component}_to_v."))
    return state_dict_
--- a/diffsynth/utils/state_dict_converters/z_image_text_encoder.py
+++ b/diffsynth/utils/state_dict_converters/z_image_text_encoder.py
@@ -1,6 +0,0 @@
-def ZImageTextEncoderStateDictConverter(state_dict):
-    state_dict_ = {}
-    for name in state_dict:
-        if name != "lm_head.weight":
-            state_dict_[name] = state_dict[name]
-    return state_dict_
--- a/diffsynth/utils/xfuser/xdit_context_parallel.py
+++ b/diffsynth/utils/xfuser/xdit_context_parallel.py
@@ -6,7 +6,6 @@ from xfuser.core.distributed import (get_sequence_parallel_rank,
                                     get_sp_group)
 from xfuser.core.long_ctx_attention import xFuserLongContextAttention
 from ...core.device import parse_nccl_backend, parse_device_type
-from ...core.gradient import gradient_checkpoint_forward


 def initialize_usp(device_type):
@@ -51,7 +50,7 @@ def rope_apply(x, freqs, num_heads):
    sp_rank = get_sequence_parallel_rank()
    freqs = pad_freqs(freqs, s_per_rank * sp_size)
    freqs_rank = freqs[(sp_rank * s_per_rank):((sp_rank + 1) * s_per_rank), :, :]
-    freqs_rank = freqs_rank.to(torch.complex64) if freqs_rank.device == "npu" else freqs_rank
+
    x_out = torch.view_as_real(x_out * freqs_rank).flatten(2)
    return x_out.to(x.dtype)

@@ -82,6 +81,11 @@ def usp_dit_forward(self,
        self.freqs[1][:h].view(1, h, 1, -1).expand(f, h, w, -1),
        self.freqs[2][:w].view(1, 1, w, -1).expand(f, h, w, -1)
    ], dim=-1).reshape(f * h * w, 1, -1).to(x.device)
+    
+    def create_custom_forward(module):
+        def custom_forward(*inputs):
+            return module(*inputs)
+        return custom_forward

    # Context Parallel
    chunks = torch.chunk(x, get_sequence_parallel_world_size(), dim=1)
@@ -90,13 +94,20 @@ def usp_dit_forward(self,
    x = chunks[get_sequence_parallel_rank()]

    for block in self.blocks:
-        if self.training:
-            x = gradient_checkpoint_forward(
-                block,
-                use_gradient_checkpointing,
-                use_gradient_checkpointing_offload,
-                x, context, t_mod, freqs
-            )
+        if self.training and use_gradient_checkpointing:
+            if use_gradient_checkpointing_offload:
+                with torch.autograd.graph.save_on_cpu():
+                    x = torch.utils.checkpoint.checkpoint(
+                        create_custom_forward(block),
+                        x, context, t_mod, freqs,
+                        use_reentrant=False,
+                    )
+            else:
+                x = torch.utils.checkpoint.checkpoint(
+                    create_custom_forward(block),
+                    x, context, t_mod, freqs,
+                    use_reentrant=False,
+                )
        else:
            x = block(x, context, t_mod, freqs)

--- a/docs/en/Model_Details/FLUX2.md
+++ b/docs/en/Model_Details/FLUX2.md
@@ -2,15 +2,6 @@

 FLUX.2 is an image generation model trained and open-sourced by Black Forest Labs.

-## Model Lineage
-
-```mermaid
-graph LR;
-    FLUX.2-Series-->black-forest-labs/FLUX.2-dev;
-    FLUX.2-Series-->black-forest-labs/FLUX.2-klein-4B;
-    FLUX.2-Series-->black-forest-labs/FLUX.2-klein-9B;
-```
-
 ## Installation

 Before using this project for model inference and training, please install DiffSynth-Studio first.
@@ -59,20 +50,16 @@ image.save("image.jpg")

 ## Model Overview

-| Model ID | Inference | Low VRAM Inference | Full Training | Validation After Full Training | LoRA Training | Validation After LoRA Training |
-| - | - | - | - | - | - | - |
-|[black-forest-labs/FLUX.2-dev](https://www.modelscope.cn/models/black-forest-labs/FLUX.2-dev)|[code](/examples/flux2/model_inference/FLUX.2-dev.py)|[code](/examples/flux2/model_inference_low_vram/FLUX.2-dev.py)|-|-|[code](/examples/flux2/model_training/lora/FLUX.2-dev.sh)|[code](/examples/flux2/model_training/validate_lora/FLUX.2-dev.py)|
-|[black-forest-labs/FLUX.2-klein-4B](https://www.modelscope.cn/models/black-forest-labs/FLUX.2-klein-4B)|[code](/examples/flux2/model_inference/FLUX.2-klein-4B.py)|[code](/examples/flux2/model_inference_low_vram/FLUX.2-klein-4B.py)|[code](/examples/flux2/model_training/full/FLUX.2-klein-4B.sh)|[code](/examples/flux2/model_training/validate_full/FLUX.2-klein-4B.py)|[code](/examples/flux2/model_training/lora/FLUX.2-klein-4B.sh)|[code](/examples/flux2/model_training/validate_lora/FLUX.2-klein-4B.py)|
-|[black-forest-labs/FLUX.2-klein-9B](https://www.modelscope.cn/models/black-forest-labs/FLUX.2-klein-9B)|[code](/examples/flux2/model_inference/FLUX.2-klein-9B.py)|[code](/examples/flux2/model_inference_low_vram/FLUX.2-klein-9B.py)|[code](/examples/flux2/model_training/full/FLUX.2-klein-9B.sh)|[code](/examples/flux2/model_training/validate_full/FLUX.2-klein-9B.py)|[code](/examples/flux2/model_training/lora/FLUX.2-klein-9B.sh)|[code](/examples/flux2/model_training/validate_lora/FLUX.2-klein-9B.py)|
-|[black-forest-labs/FLUX.2-klein-base-4B](https://www.modelscope.cn/models/black-forest-labs/FLUX.2-klein-base-4B)|[code](/examples/flux2/model_inference/FLUX.2-klein-base-4B.py)|[code](/examples/flux2/model_inference_low_vram/FLUX.2-klein-base-4B.py)|[code](/examples/flux2/model_training/full/FLUX.2-klein-base-4B.sh)|[code](/examples/flux2/model_training/validate_full/FLUX.2-klein-base-4B.py)|[code](/examples/flux2/model_training/lora/FLUX.2-klein-base-4B.sh)|[code](/examples/flux2/model_training/validate_lora/FLUX.2-klein-base-4B.py)|
-|[black-forest-labs/FLUX.2-klein-base-9B](https://www.modelscope.cn/models/black-forest-labs/FLUX.2-klein-base-9B)|[code](/examples/flux2/model_inference/FLUX.2-klein-base-9B.py)|[code](/examples/flux2/model_inference_low_vram/FLUX.2-klein-base-9B.py)|[code](/examples/flux2/model_training/full/FLUX.2-klein-base-9B.sh)|[code](/examples/flux2/model_training/validate_full/FLUX.2-klein-base-9B.py)|[code](/examples/flux2/model_training/lora/FLUX.2-klein-base-9B.sh)|[code](/examples/flux2/model_training/validate_lora/FLUX.2-klein-base-9B.py)|
+| Model ID | Inference | Low VRAM Inference | LoRA Training | Validation After LoRA Training |
+| - | - | - | - | - |
+| [black-forest-labs/FLUX.2-dev](https://www.modelscope.cn/models/black-forest-labs/FLUX.2-dev) | [code](/examples/flux2/model_inference/FLUX.2-dev.py) | [code](/examples/flux2/model_inference_low_vram/FLUX.2-dev.py) | [code](/examples/flux2/model_training/lora/FLUX.2-dev.sh) | [code](/examples/flux2/model_training/validate_lora/FLUX.2-dev.py) |

 Special Training Scripts:

-* Differential LoRA Training: [doc](/docs/en/Training/Differential_LoRA.md)
-* FP8 Precision Training: [doc](/docs/en/Training/FP8_Precision.md)
-* Two-stage Split Training: [doc](/docs/en/Training/Split_Training.md)
-* End-to-end Direct Distillation: [doc](/docs/en/Training/Direct_Distill.md)
+* Differential LoRA Training: [doc](/docs/en/Training/Differential_LoRA.md), [code](/examples/flux/model_training/special/differential_training/)
+* FP8 Precision Training: [doc](/docs/en/Training/FP8_Precision.md), [code](/examples/flux/model_training/special/fp8_training/)
+* Two-stage Split Training: [doc](/docs/en/Training/Split_Training.md), [code](/examples/flux/model_training/special/split_training/)
+* End-to-end Direct Distillation: [doc](/docs/en/Training/Direct_Distill.md), [code](/examples/flux/model_training/lora/FLUX.1-dev-Distill-LoRA.sh)

 ## Model Inference

@@ -148,4 +135,4 @@ We have built a sample image dataset for your testing. You can download this dat
 modelscope download --dataset DiffSynth-Studio/example_image_dataset --local_dir ./data/example_image_dataset
 ```

-We have written recommended training scripts for each model, please refer to the table in the "Model Overview" section above. For how to write model training scripts, please refer to [Model Training](/docs/en/Pipeline_Usage/Model_Training.md); for more advanced training algorithms, please refer to [Training Framework Detailed Explanation](/docs/Training/).
+We have written recommended training scripts for each model; please refer to the table in the "Model Overview" section above. For how to write model training scripts, please refer to [Model Training](/docs/en/Pipeline_Usage/Model_Training.md); for more advanced training algorithms, please refer to [Training Framework Detailed Explanation](/docs/Training/).
--- a/docs/en/Model_Details/Qwen-Image.md
+++ b/docs/en/Model_Details/Qwen-Image.md
@@ -86,7 +86,6 @@ graph LR;
 | [Qwen/Qwen-Image-Edit-2509](https://www.modelscope.cn/models/Qwen/Qwen-Image-Edit-2509) | [code](/examples/qwen_image/model_inference/Qwen-Image-Edit-2509.py) | [code](/examples/qwen_image/model_inference_low_vram/Qwen-Image-Edit-2509.py) | [code](/examples/qwen_image/model_training/full/Qwen-Image-Edit-2509.sh) | [code](/examples/qwen_image/model_training/validate_full/Qwen-Image-Edit-2509.py) | [code](/examples/qwen_image/model_training/lora/Qwen-Image-Edit-2509.sh) | [code](/examples/qwen_image/model_training/validate_lora/Qwen-Image-Edit-2509.py) |
 |[Qwen/Qwen-Image-Edit-2511](https://www.modelscope.cn/models/Qwen/Qwen-Image-Edit-2511)|[code](/examples/qwen_image/model_inference/Qwen-Image-Edit-2511.py)|[code](/examples/qwen_image/model_inference_low_vram/Qwen-Image-Edit-2511.py)|[code](/examples/qwen_image/model_training/full/Qwen-Image-Edit-2511.sh)|[code](/examples/qwen_image/model_training/validate_full/Qwen-Image-Edit-2511.py)|[code](/examples/qwen_image/model_training/lora/Qwen-Image-Edit-2511.sh)|[code](/examples/qwen_image/model_training/validate_lora/Qwen-Image-Edit-2511.py)|
 |[Qwen/Qwen-Image-Layered](https://www.modelscope.cn/models/Qwen/Qwen-Image-Layered)|[code](/examples/qwen_image/model_inference/Qwen-Image-Layered.py)|[code](/examples/qwen_image/model_inference_low_vram/Qwen-Image-Layered.py)|[code](/examples/qwen_image/model_training/full/Qwen-Image-Layered.sh)|[code](/examples/qwen_image/model_training/validate_full/Qwen-Image-Layered.py)|[code](/examples/qwen_image/model_training/lora/Qwen-Image-Layered.sh)|[code](/examples/qwen_image/model_training/validate_lora/Qwen-Image-Layered.py)|
-|[DiffSynth-Studio/Qwen-Image-Layered-Control](https://www.modelscope.cn/models/DiffSynth-Studio/Qwen-Image-Layered-Control)|[code](/examples/qwen_image/model_inference/Qwen-Image-Layered-Control.py)|[code](/examples/qwen_image/model_inference_low_vram/Qwen-Image-Layered-Control.py)|[code](/examples/qwen_image/model_training/full/Qwen-Image-Layered-Control.sh)|[code](/examples/qwen_image/model_training/validate_full/Qwen-Image-Layered-Control.py)|[code](/examples/qwen_image/model_training/lora/Qwen-Image-Layered-Control.sh)|[code](/examples/qwen_image/model_training/validate_lora/Qwen-Image-Layered-Control.py)|
 | [DiffSynth-Studio/Qwen-Image-EliGen](https://www.modelscope.cn/models/DiffSynth-Studio/Qwen-Image-EliGen) | [code](/examples/qwen_image/model_inference/Qwen-Image-EliGen.py) | [code](/examples/qwen_image/model_inference_low_vram/Qwen-Image-EliGen.py) | - | - | [code](/examples/qwen_image/model_training/lora/Qwen-Image-EliGen.sh) | [code](/examples/qwen_image/model_training/validate_lora/Qwen-Image-EliGen.py) |
 | [DiffSynth-Studio/Qwen-Image-EliGen-V2](https://www.modelscope.cn/models/DiffSynth-Studio/Qwen-Image-EliGen-V2) | [code](/examples/qwen_image/model_inference/Qwen-Image-EliGen-V2.py) | [code](/examples/qwen_image/model_inference_low_vram/Qwen-Image-EliGen-V2.py) | - | - | [code](/examples/qwen_image/model_training/lora/Qwen-Image-EliGen.sh) | [code](/examples/qwen_image/model_training/validate_lora/Qwen-Image-EliGen.py) |
 | [DiffSynth-Studio/Qwen-Image-EliGen-Poster](https://www.modelscope.cn/models/DiffSynth-Studio/Qwen-Image-EliGen-Poster) | [code](/examples/qwen_image/model_inference/Qwen-Image-EliGen-Poster.py) | [code](/examples/qwen_image/model_inference_low_vram/Qwen-Image-EliGen-Poster.py) | - | - | [code](/examples/qwen_image/model_training/lora/Qwen-Image-EliGen-Poster.sh) | [code](/examples/qwen_image/model_training/validate_lora/Qwen-Image-EliGen-Poster.py) |
--- a/docs/en/Pipeline_Usage/GPU_support.md
+++ b/docs/en/Pipeline_Usage/GPU_support.md
@@ -13,7 +13,7 @@ All sample code provided by this project supports NVIDIA GPUs by default, requir
 AMD provides PyTorch packages based on ROCm, so most models can run without code changes. A small number of models may not be compatible due to their reliance on CUDA-specific instructions.

 ## Ascend NPU
-### Inference
+
 When using Ascend NPU, you need to replace `"cuda"` with `"npu"` in your code.

 For example, here is the inference code for **Wan2.1-T2V-1.3B**, modified for Ascend NPU:
@@ -22,7 +22,6 @@ For example, here is the inference code for **Wan2.1-T2V-1.3B**, modified for As
 import torch
 from diffsynth.utils.data import save_video, VideoData
 from diffsynth.pipelines.wan_video import WanVideoPipeline, ModelConfig
-from diffsynth.core.device.npu_compatible_device import get_device_name

 vram_config = {
    "offload_dtype": "disk",
@@ -47,7 +46,7 @@ pipe = WanVideoPipeline.from_pretrained(
    ],
    tokenizer_config=ModelConfig(model_id="Wan-AI/Wan2.1-T2V-1.3B", origin_file_pattern="google/umt5-xxl/"),
 -   vram_limit=torch.cuda.mem_get_info("cuda")[1] / (1024 ** 3) - 2,
-+   vram_limit=torch.npu.mem_get_info(get_device_name())[1] / (1024 ** 3) - 2,
+   vram_limit=torch.npu.mem_get_info("npu:0")[1] / (1024 ** 3) - 2,
 )

 video = pipe(
@@ -57,28 +56,3 @@ video = pipe(
 )
 save_video(video, "video.mp4", fps=15, quality=5)
 ```
-
-### Training
-NPU startup script samples have been added for each type of model,the scripts are stored in the `examples/xxx/special/npu_training`, for example `examples/wanvideo/model_training/special/npu_training/Wan2.2-T2V-A14B-NPU.sh`.
-
-In the NPU training scripts, NPU specific environment variables that can optimize performance have been added, and relevant parameters have been enabled for specific models.
-
-#### Environment variables
-```shell
-export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
-```
-`expandable_segments:<value>`: Enable the memory pool expansion segment function, which is the virtual memory feature.
-
-```shell
-export CPU_AFFINITY_CONF=1
-```
-Set 0 or not set: indicates not enabling the binding function
-
-1: Indicates enabling coarse-grained kernel binding
-
-2: Indicates enabling fine-grained kernel binding
-
-#### Parameters for specific models
-| Model          | Parameter                 | Note              |
-|----------------|---------------------------|-------------------|
-| Wan 14B series | --initialize_model_on_cpu | The 14B model needs to be initialized on the CPU |
--- a/docs/en/Pipeline_Usage/Setup.md
+++ b/docs/en/Pipeline_Usage/Setup.md
@@ -30,16 +30,11 @@ pip install torch torchvision --index-url https://download.pytorch.org/whl/rocm6

 * **Ascend NPU**

-1. Install [CANN](https://www.hiascend.com/document/detail/zh/canncommercial/83RC1/softwareinst/instg/instg_quick.html?Mode=PmIns&InstallType=local&OS=openEuler&Software=cannToolKit) through official documentation.
+Ascend NPU support is provided by the `torch-npu` package. For example, to install version `2.1.0.post17` (current as of this document's last update on December 15, 2025), run the following command:

-2. Install from source
-   ```shell
-   git clone https://github.com/modelscope/DiffSynth-Studio.git
-   cd DiffSynth-Studio
-   # aarch64/ARM
-   pip install -e .[npu_aarch64] --extra-index-url "https://download.pytorch.org/whl/cpu"
-   # x86
-   pip install -e .[npu]
+```shell
+pip install torch-npu==2.1.0.post17
+```

 When using Ascend NPU, please replace `"cuda"` with `"npu"` in your Python code. For details, see [NPU Support](/docs/en/Pipeline_Usage/GPU_support.md#ascend-npu).

--- a/docs/zh/Model_Details/FLUX2.md
+++ b/docs/zh/Model_Details/FLUX2.md
@@ -2,15 +2,6 @@

 FLUX.2 是由 Black Forest Labs 训练并开源的图像生成模型。

-## 模型血缘
-
-```mermaid
-graph LR;
-    FLUX.2-Series-->black-forest-labs/FLUX.2-dev;
-    FLUX.2-Series-->black-forest-labs/FLUX.2-klein-4B;
-    FLUX.2-Series-->black-forest-labs/FLUX.2-klein-9B;
-```
-
 ## 安装

 在使用本项目进行模型推理和训练前，请先安装 DiffSynth-Studio。
@@ -59,20 +50,16 @@ image.save("image.jpg")

 ## 模型总览

-|模型 ID|推理|低显存推理|全量训练|全量训练后验证|LoRA 训练|LoRA 训练后验证|
-|-|-|-|-|-|-|-|
-|[black-forest-labs/FLUX.2-dev](https://www.modelscope.cn/models/black-forest-labs/FLUX.2-dev)|[code](/examples/flux2/model_inference/FLUX.2-dev.py)|[code](/examples/flux2/model_inference_low_vram/FLUX.2-dev.py)|-|-|[code](/examples/flux2/model_training/lora/FLUX.2-dev.sh)|[code](/examples/flux2/model_training/validate_lora/FLUX.2-dev.py)|
-|[black-forest-labs/FLUX.2-klein-4B](https://www.modelscope.cn/models/black-forest-labs/FLUX.2-klein-4B)|[code](/examples/flux2/model_inference/FLUX.2-klein-4B.py)|[code](/examples/flux2/model_inference_low_vram/FLUX.2-klein-4B.py)|[code](/examples/flux2/model_training/full/FLUX.2-klein-4B.sh)|[code](/examples/flux2/model_training/validate_full/FLUX.2-klein-4B.py)|[code](/examples/flux2/model_training/lora/FLUX.2-klein-4B.sh)|[code](/examples/flux2/model_training/validate_lora/FLUX.2-klein-4B.py)|
-|[black-forest-labs/FLUX.2-klein-9B](https://www.modelscope.cn/models/black-forest-labs/FLUX.2-klein-9B)|[code](/examples/flux2/model_inference/FLUX.2-klein-9B.py)|[code](/examples/flux2/model_inference_low_vram/FLUX.2-klein-9B.py)|[code](/examples/flux2/model_training/full/FLUX.2-klein-9B.sh)|[code](/examples/flux2/model_training/validate_full/FLUX.2-klein-9B.py)|[code](/examples/flux2/model_training/lora/FLUX.2-klein-9B.sh)|[code](/examples/flux2/model_training/validate_lora/FLUX.2-klein-9B.py)|
-|[black-forest-labs/FLUX.2-klein-base-4B](https://www.modelscope.cn/models/black-forest-labs/FLUX.2-klein-base-4B)|[code](/examples/flux2/model_inference/FLUX.2-klein-base-4B.py)|[code](/examples/flux2/model_inference_low_vram/FLUX.2-klein-base-4B.py)|[code](/examples/flux2/model_training/full/FLUX.2-klein-base-4B.sh)|[code](/examples/flux2/model_training/validate_full/FLUX.2-klein-base-4B.py)|[code](/examples/flux2/model_training/lora/FLUX.2-klein-base-4B.sh)|[code](/examples/flux2/model_training/validate_lora/FLUX.2-klein-base-4B.py)|
-|[black-forest-labs/FLUX.2-klein-base-9B](https://www.modelscope.cn/models/black-forest-labs/FLUX.2-klein-base-9B)|[code](/examples/flux2/model_inference/FLUX.2-klein-base-9B.py)|[code](/examples/flux2/model_inference_low_vram/FLUX.2-klein-base-9B.py)|[code](/examples/flux2/model_training/full/FLUX.2-klein-base-9B.sh)|[code](/examples/flux2/model_training/validate_full/FLUX.2-klein-base-9B.py)|[code](/examples/flux2/model_training/lora/FLUX.2-klein-base-9B.sh)|[code](/examples/flux2/model_training/validate_lora/FLUX.2-klein-base-9B.py)|
+|模型 ID|推理|低显存推理|LoRA 训练|LoRA 训练后验证|
+|-|-|-|-|-|
+|[black-forest-labs/FLUX.2-dev](https://www.modelscope.cn/models/black-forest-labs/FLUX.2-dev)|[code](/examples/flux2/model_inference/FLUX.2-dev.py)|[code](/examples/flux2/model_inference_low_vram/FLUX.2-dev.py)|[code](/examples/flux2/model_training/lora/FLUX.2-dev.sh)|[code](/examples/flux2/model_training/validate_lora/FLUX.2-dev.py)|

 特殊训练脚本：

-* 差分 LoRA 训练：[doc](/docs/zh/Training/Differential_LoRA.md)
-* FP8 精度训练：[doc](/docs/zh/Training/FP8_Precision.md)
-* 两阶段拆分训练：[doc](/docs/zh/Training/Split_Training.md)
-* 端到端直接蒸馏：[doc](/docs/zh/Training/Direct_Distill.md)
+* 差分 LoRA 训练：[doc](/docs/zh/Training/Differential_LoRA.md)、[code](/examples/flux/model_training/special/differential_training/)
+* FP8 精度训练：[doc](/docs/zh/Training/FP8_Precision.md)、[code](/examples/flux/model_training/special/fp8_training/)
+* 两阶段拆分训练：[doc](/docs/zh/Training/Split_Training.md)、[code](/examples/flux/model_training/special/split_training/)
+* 端到端直接蒸馏：[doc](/docs/zh/Training/Direct_Distill.md)、[code](/examples/flux/model_training/lora/FLUX.1-dev-Distill-LoRA.sh)

 ## 模型推理

@@ -148,4 +135,4 @@ FLUX.2 系列模型统一通过 [`examples/flux2/model_training/train.py`](/exam
 modelscope download --dataset DiffSynth-Studio/example_image_dataset --local_dir ./data/example_image_dataset
 ```

-我们为每个模型编写了推荐的训练脚本，请参考前文"模型总览"中的表格。关于如何编写模型训练脚本，请参考[模型训练](/docs/zh/Pipeline_Usage/Model_Training.md)；更多高阶训练算法，请参考[训练框架详解](/docs/Training/)。
+我们为每个模型编写了推荐的训练脚本，请参考前文"模型总览"中的表格。关于如何编写模型训练脚本，请参考[模型训练](/docs/zh/Pipeline_Usage/Model_Training.md)；更多高阶训练算法，请参考[训练框架详解](/docs/Training/)。
--- a/docs/zh/Model_Details/Qwen-Image.md
+++ b/docs/zh/Model_Details/Qwen-Image.md
@@ -86,7 +86,6 @@ graph LR;
 |[Qwen/Qwen-Image-Edit-2509](https://www.modelscope.cn/models/Qwen/Qwen-Image-Edit-2509)|[code](/examples/qwen_image/model_inference/Qwen-Image-Edit-2509.py)|[code](/examples/qwen_image/model_inference_low_vram/Qwen-Image-Edit-2509.py)|[code](/examples/qwen_image/model_training/full/Qwen-Image-Edit-2509.sh)|[code](/examples/qwen_image/model_training/validate_full/Qwen-Image-Edit-2509.py)|[code](/examples/qwen_image/model_training/lora/Qwen-Image-Edit-2509.sh)|[code](/examples/qwen_image/model_training/validate_lora/Qwen-Image-Edit-2509.py)|
 |[Qwen/Qwen-Image-Edit-2511](https://www.modelscope.cn/models/Qwen/Qwen-Image-Edit-2511)|[code](/examples/qwen_image/model_inference/Qwen-Image-Edit-2511.py)|[code](/examples/qwen_image/model_inference_low_vram/Qwen-Image-Edit-2511.py)|[code](/examples/qwen_image/model_training/full/Qwen-Image-Edit-2511.sh)|[code](/examples/qwen_image/model_training/validate_full/Qwen-Image-Edit-2511.py)|[code](/examples/qwen_image/model_training/lora/Qwen-Image-Edit-2511.sh)|[code](/examples/qwen_image/model_training/validate_lora/Qwen-Image-Edit-2511.py)|
 |[Qwen/Qwen-Image-Layered](https://www.modelscope.cn/models/Qwen/Qwen-Image-Layered)|[code](/examples/qwen_image/model_inference/Qwen-Image-Layered.py)|[code](/examples/qwen_image/model_inference_low_vram/Qwen-Image-Layered.py)|[code](/examples/qwen_image/model_training/full/Qwen-Image-Layered.sh)|[code](/examples/qwen_image/model_training/validate_full/Qwen-Image-Layered.py)|[code](/examples/qwen_image/model_training/lora/Qwen-Image-Layered.sh)|[code](/examples/qwen_image/model_training/validate_lora/Qwen-Image-Layered.py)|
-|[DiffSynth-Studio/Qwen-Image-Layered-Control](https://www.modelscope.cn/models/DiffSynth-Studio/Qwen-Image-Layered-Control)|[code](/examples/qwen_image/model_inference/Qwen-Image-Layered-Control.py)|[code](/examples/qwen_image/model_inference_low_vram/Qwen-Image-Layered-Control.py)|[code](/examples/qwen_image/model_training/full/Qwen-Image-Layered-Control.sh)|[code](/examples/qwen_image/model_training/validate_full/Qwen-Image-Layered-Control.py)|[code](/examples/qwen_image/model_training/lora/Qwen-Image-Layered-Control.sh)|[code](/examples/qwen_image/model_training/validate_lora/Qwen-Image-Layered-Control.py)|
 |[DiffSynth-Studio/Qwen-Image-EliGen](https://www.modelscope.cn/models/DiffSynth-Studio/Qwen-Image-EliGen)|[code](/examples/qwen_image/model_inference/Qwen-Image-EliGen.py)|[code](/examples/qwen_image/model_inference_low_vram/Qwen-Image-EliGen.py)|-|-|[code](/examples/qwen_image/model_training/lora/Qwen-Image-EliGen.sh)|[code](/examples/qwen_image/model_training/validate_lora/Qwen-Image-EliGen.py)|
 |[DiffSynth-Studio/Qwen-Image-EliGen-V2](https://www.modelscope.cn/models/DiffSynth-Studio/Qwen-Image-EliGen-V2)|[code](/examples/qwen_image/model_inference/Qwen-Image-EliGen-V2.py)|[code](/examples/qwen_image/model_inference_low_vram/Qwen-Image-EliGen-V2.py)|-|-|[code](/examples/qwen_image/model_training/lora/Qwen-Image-EliGen.sh)|[code](/examples/qwen_image/model_training/validate_lora/Qwen-Image-EliGen.py)|
 |[DiffSynth-Studio/Qwen-Image-EliGen-Poster](https://www.modelscope.cn/models/DiffSynth-Studio/Qwen-Image-EliGen-Poster)|[code](/examples/qwen_image/model_inference/Qwen-Image-EliGen-Poster.py)|[code](/examples/qwen_image/model_inference_low_vram/Qwen-Image-EliGen-Poster.py)|-|-|[code](/examples/qwen_image/model_training/lora/Qwen-Image-EliGen-Poster.sh)|[code](/examples/qwen_image/model_training/validate_lora/Qwen-Image-EliGen-Poster.py)|
--- a/docs/zh/Pipeline_Usage/GPU_support.md
+++ b/docs/zh/Pipeline_Usage/GPU_support.md
@@ -13,7 +13,7 @@
 AMD provides a ROCm-based torch package, so most models run without code changes; a few models cannot run because they depend on specific CUDA instructions.

 ## Ascend NPU
-### Inference
+
 When using Ascend NPU, change `"cuda"` to `"npu"` in the code.

 For example, the inference code for Wan2.1-T2V-1.3B:
@@ -22,7 +22,6 @@ AMD provides a ROCm-based torch package, so most models need no code changes
 import torch
 from diffsynth.utils.data import save_video, VideoData
 from diffsynth.pipelines.wan_video import WanVideoPipeline, ModelConfig
-from diffsynth.core.device.npu_compatible_device import get_device_name

 vram_config = {
    "offload_dtype": "disk",
@@ -34,7 +33,7 @@ vram_config = {
 +   "preparing_device": "npu",
    "computation_dtype": torch.bfloat16,
 -   "computation_device": "cuda",
-+   "computation_device": "npu",
+   "preparing_device": "npu",
 }
 pipe = WanVideoPipeline.from_pretrained(
    torch_dtype=torch.bfloat16,
@@ -47,7 +46,7 @@ pipe = WanVideoPipeline.from_pretrained(
    ],
    tokenizer_config=ModelConfig(model_id="Wan-AI/Wan2.1-T2V-1.3B", origin_file_pattern="google/umt5-xxl/"),
 -   vram_limit=torch.cuda.mem_get_info("cuda")[1] / (1024 ** 3) - 2,
-+   vram_limit=torch.npu.mem_get_info(get_device_name())[1] / (1024 ** 3) - 2,
+   vram_limit=torch.npu.mem_get_info("npu:0")[1] / (1024 ** 3) - 2,
 )

 video = pipe(
@@ -57,28 +56,3 @@ video = pipe(
 )
 save_video(video, "video.mp4", fps=15, quality=5)
 ```
-
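-### Training
-Sample NPU launch scripts have been added for each model family. They are stored under `examples/xxx/special/npu_training`, for example `examples/wanvideo/model_training/special/npu_training/Wan2.2-T2V-A14B-NPU.sh`.
-
-The NPU training scripts set NPU-specific environment variables that can improve performance, and enable the relevant parameters for specific models.
-
-#### Environment variables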
-```shell
-export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
-```
-`expandable_segments:<value>`: enables the expandable-segments feature of the memory pool, i.e. the virtual memory feature.
-
-```shell
-export CPU_AFFINITY_CONF=1
-```
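-0 or unset: core binding is disabled
-
-1: coarse-grained core binding is enabled
-
-2: fine-grained core binding is enabled
-
-#### Parameters required for specific models
-| Model | Parameter | Notes |
-|-----------|------|-------------------|
-| Wan 14B series | --initialize_model_on_cpu | The 14B models must be initialized on the CPU |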
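Reviewer note: the doc change above amounts to two mechanical steps, replacing every `"cuda"` device entry with `"npu"` and computing `vram_limit` as total device memory in GiB minus 2. A minimal sketch of both steps in plain Python, assuming hypothetical helper names (`retarget_vram_config` and `vram_limit_gib` are not part of DiffSynth-Studio) and omitting the dtype entries so that it runs without torch:

```python
from typing import Any, Dict

GIB = 1024 ** 3  # bytes per GiB, matching the `1024 ** 3` divisor in the docs


def retarget_vram_config(vram_config: Dict[str, Any], device: str = "npu") -> Dict[str, Any]:
    """Hypothetical helper: copy the config, replacing every "cuda" value with `device`."""
    return {key: (device if value == "cuda" else value) for key, value in vram_config.items()}


def vram_limit_gib(total_bytes: int, reserve_gib: float = 2.0) -> float:
    """Hypothetical helper: total memory in GiB minus a fixed reserve."""
    return total_bytes / GIB - reserve_gib


# Device entries only; the dtype entries from the documented config are left out here.
cuda_config = {
    "offload_device": "disk",
    "onload_device": "cpu",
    "preparing_device": "cuda",
    "computation_device": "cuda",
}
npu_config = retarget_vram_config(cuda_config)
print(npu_config["preparing_device"], npu_config["computation_device"])

# On real hardware the byte count would come from the second element of
# `torch.npu.mem_get_info("npu:0")`, as in the documented snippet.
print(vram_limit_gib(64 * GIB))
```

The helper returns a new dictionary rather than mutating its input, so the same base configuration can be reused for both CUDA and NPU runs.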
--- a/docs/zh/Pipeline_Usage/Setup.md
+++ b/docs/zh/Pipeline_Usage/Setup.md
@@ -30,16 +30,11 @@ pip install torch torchvision --index-url https://download.pytorch.org/whl/rocm6

 * Ascend NPU

-1. Install [CANN](https://www.hiascend.com/document/detail/zh/canncommercial/83RC1/softwareinst/instg/instg_quick.html?Mode=PmIns&InstallType=local&OS=openEuler&Software=cannToolKit) by following the official documentation
+Ascend NPU is supported through the `torch-npu` package. Taking version `2.1.0.post17` as an example (this document was updated on December 15, 2025), run the following command:

-2. Install from source
-   ```shell
-   git clone https://github.com/modelscope/DiffSynth-Studio.git
-   cd DiffSynth-Studio
-   # aarch64/ARM
-   pip install -e .[npu_aarch64] --extra-index-url "https://download.pytorch.org/whl/cpu"
-   # x86
-   pip install -e .[npu]
+```shell
+pip install torch-npu==2.1.0.post17
+```

 When using Ascend NPU, change `"cuda"` to `"npu"` in your Python code; see [NPU support](/docs/zh/Pipeline_Usage/GPU_support.md#ascend-npu) for details.

--- a/examples/dev_tools/unit_test.py
+++ b/examples/dev_tools/unit_test.py
@@ -108,14 +108,7 @@ def test_flux():
    run_inference("examples/flux/model_training/validate_lora")


-def test_z_image():
-    run_inference("examples/z_image/model_inference")
-    run_inference("examples/z_image/model_inference_low_vram")
-    run_train_multi_GPU("examples/z_image/model_training/full")
-    run_inference("examples/z_image/model_training/validate_full")
-    run_train_single_GPU("examples/z_image/model_training/lora")
-    run_inference("examples/z_image/model_training/validate_lora")
-
-
 if __name__ == "__main__":
-    test_z_image()
+    test_qwen_image()
+    test_flux()
+    test_wan()
--- a/examples/flux/model_training/full/accelerate_config_zero3.yaml
+++ b/examples/flux/model_training/full/accelerate_config_zero3.yaml
@@ -1,23 +0,0 @@
-compute_environment: LOCAL_MACHINE
-debug: false
-deepspeed_config:
-  gradient_accumulation_steps: 1
-  offload_optimizer_device: none
-  offload_param_device: none
-  zero3_init_flag: true
-  zero3_save_16bit_model: true
-  zero_stage: 3
-distributed_type: DEEPSPEED
-downcast_bf16: 'no'
-enable_cpu_affinity: false
-machine_rank: 0
-main_training_function: main
-mixed_precision: bf16
-num_machines: 1
-num_processes: 8
-rdzv_backend: static
-same_network: true
-tpu_env: []
-tpu_use_cluster: false
-tpu_use_sudo: false
-use_cpu: false
--- a/examples/flux/model_training/special/npu_training/FLUX.1-Kontext-dev-NPU.sh
+++ b/examples/flux/model_training/special/npu_training/FLUX.1-Kontext-dev-NPU.sh
@@ -1,17 +0,0 @@
-export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
-export CPU_AFFINITY_CONF=1
-
-accelerate launch --config_file examples/flux/model_training/full/accelerate_config_zero2offload.yaml examples/flux/model_training/train.py \
-  --dataset_base_path data/example_image_dataset \
-  --dataset_metadata_path data/example_image_dataset/metadata_kontext.csv \
-  --data_file_keys "image,kontext_images" \
-  --max_pixels 1048576 \
-  --dataset_repeat 400 \
-  --model_id_with_origin_paths "black-forest-labs/FLUX.1-Kontext-dev:flux1-kontext-dev.safetensors,black-forest-labs/FLUX.1-dev:text_encoder/model.safetensors,black-forest-labs/FLUX.1-dev:text_encoder_2/*.safetensors,black-forest-labs/FLUX.1-dev:ae.safetensors" \
-  --learning_rate 1e-5 \
-  --num_epochs 1 \
-  --remove_prefix_in_ckpt "pipe.dit." \
-  --output_path "./models/train/FLUX.1-Kontext-dev_full" \
-  --trainable_models "dit" \
-  --extra_inputs "kontext_images" \
-  --use_gradient_checkpointing
--- a/examples/flux/model_training/special/npu_training/FLUX.1-dev-NPU.sh
+++ b/examples/flux/model_training/special/npu_training/FLUX.1-dev-NPU.sh
@@ -1,15 +0,0 @@
-export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
-export CPU_AFFINITY_CONF=1
-
-accelerate launch --config_file examples/flux/model_training/full/accelerate_config_zero2offload.yaml examples/flux/model_training/train.py \
-  --dataset_base_path data/example_image_dataset \
-  --dataset_metadata_path data/example_image_dataset/metadata.csv \
-  --max_pixels 1048576 \
-  --dataset_repeat 400 \
-  --model_id_with_origin_paths "black-forest-labs/FLUX.1-dev:flux1-dev.safetensors,black-forest-labs/FLUX.1-dev:text_encoder/model.safetensors,black-forest-labs/FLUX.1-dev:text_encoder_2/*.safetensors,black-forest-labs/FLUX.1-dev:ae.safetensors" \
-  --learning_rate 1e-5 \
-  --num_epochs 1 \
-  --remove_prefix_in_ckpt "pipe.dit." \
-  --output_path "./models/train/FLUX.1-dev_full" \
-  --trainable_models "dit" \
-  --use_gradient_checkpointing
--- a/examples/flux2/model_inference/FLUX.2-klein-4B.py
+++ b/examples/flux2/model_inference/FLUX.2-klein-4B.py
@@ -1,21 +0,0 @@
-from diffsynth.pipelines.flux2_image import Flux2ImagePipeline, ModelConfig
-import torch
-
-
-pipe = Flux2ImagePipeline.from_pretrained(
-    torch_dtype=torch.bfloat16,
-    device="cuda",
-    model_configs=[
-        ModelConfig(model_id="black-forest-labs/FLUX.2-klein-4B", origin_file_pattern="text_encoder/*.safetensors"),
-        ModelConfig(model_id="black-forest-labs/FLUX.2-klein-4B", origin_file_pattern="transformer/*.safetensors"),
-        ModelConfig(model_id="black-forest-labs/FLUX.2-klein-4B", origin_file_pattern="vae/diffusion_pytorch_model.safetensors"),
-    ],
-    tokenizer_config=ModelConfig(model_id="black-forest-labs/FLUX.2-klein-4B", origin_file_pattern="tokenizer/"),
-)
-prompt = "Masterpiece, best quality. Anime-style portrait of a woman in a blue dress, underwater, surrounded by colorful bubbles."
-image = pipe(prompt, seed=0, rand_device="cuda", num_inference_steps=4)
-image.save("image_FLUX.2-klein-4B.jpg")
-
-prompt = "change the color of the clothes to red"
-image = pipe(prompt, edit_image=[image], seed=1, rand_device="cuda", num_inference_steps=4)
-image.save("image_edit_FLUX.2-klein-4B.jpg")
--- a/examples/flux2/model_inference/FLUX.2-klein-9B.py
+++ b/examples/flux2/model_inference/FLUX.2-klein-9B.py
@@ -1,21 +0,0 @@
-from diffsynth.pipelines.flux2_image import Flux2ImagePipeline, ModelConfig
-import torch
-
-
-pipe = Flux2ImagePipeline.from_pretrained(
-    torch_dtype=torch.bfloat16,
-    device="cuda",
-    model_configs=[
-        ModelConfig(model_id="black-forest-labs/FLUX.2-klein-9B", origin_file_pattern="text_encoder/*.safetensors"),
-        ModelConfig(model_id="black-forest-labs/FLUX.2-klein-9B", origin_file_pattern="transformer/*.safetensors"),
-        ModelConfig(model_id="black-forest-labs/FLUX.2-klein-9B", origin_file_pattern="vae/diffusion_pytorch_model.safetensors"),
-    ],
-    tokenizer_config=ModelConfig(model_id="black-forest-labs/FLUX.2-klein-9B", origin_file_pattern="tokenizer/"),
-)
-prompt = "Masterpiece, best quality. Anime-style portrait of a woman in a blue dress, underwater, surrounded by colorful bubbles."
-image = pipe(prompt, seed=0, rand_device="cuda", num_inference_steps=4)
-image.save("image_FLUX.2-klein-9B.jpg")
-
-prompt = "change the color of the clothes to red"
-image = pipe(prompt, edit_image=[image], seed=1, rand_device="cuda", num_inference_steps=4)
-image.save("image_edit_FLUX.2-klein-9B.jpg")
--- a/examples/flux2/model_inference/FLUX.2-klein-base-4B.py
+++ b/examples/flux2/model_inference/FLUX.2-klein-base-4B.py
@@ -1,21 +0,0 @@
-from diffsynth.pipelines.flux2_image import Flux2ImagePipeline, ModelConfig
-import torch
-
-
-pipe = Flux2ImagePipeline.from_pretrained(
-    torch_dtype=torch.bfloat16,
-    device="cuda",
-    model_configs=[
-        ModelConfig(model_id="black-forest-labs/FLUX.2-klein-4B", origin_file_pattern="text_encoder/*.safetensors"),
-        ModelConfig(model_id="black-forest-labs/FLUX.2-klein-base-4B", origin_file_pattern="transformer/*.safetensors"),
-        ModelConfig(model_id="black-forest-labs/FLUX.2-klein-4B", origin_file_pattern="vae/diffusion_pytorch_model.safetensors"),
-    ],
-    tokenizer_config=ModelConfig(model_id="black-forest-labs/FLUX.2-klein-4B", origin_file_pattern="tokenizer/"),
-)
-prompt = "Masterpiece, best quality. Anime-style portrait of a woman in a blue dress, underwater, surrounded by colorful bubbles."
-image = pipe(prompt, seed=0, rand_device="cuda", num_inference_steps=50, cfg_scale=4)
-image.save("image_FLUX.2-klein-base-4B.jpg")
-
-prompt = "change the color of the clothes to red"
-image = pipe(prompt, edit_image=[image], seed=1, rand_device="cuda", num_inference_steps=50, cfg_scale=4)
-image.save("image_edit_FLUX.2-klein-base-4B.jpg")
--- a/examples/flux2/model_inference/FLUX.2-klein-base-9B.py
+++ b/examples/flux2/model_inference/FLUX.2-klein-base-9B.py
@@ -1,21 +0,0 @@
-from diffsynth.pipelines.flux2_image import Flux2ImagePipeline, ModelConfig
-import torch
-
-
-pipe = Flux2ImagePipeline.from_pretrained(
-    torch_dtype=torch.bfloat16,
-    device="cuda",
-    model_configs=[
-        ModelConfig(model_id="black-forest-labs/FLUX.2-klein-9B", origin_file_pattern="text_encoder/*.safetensors"),
-        ModelConfig(model_id="black-forest-labs/FLUX.2-klein-base-9B", origin_file_pattern="transformer/*.safetensors"),
-        ModelConfig(model_id="black-forest-labs/FLUX.2-klein-9B", origin_file_pattern="vae/diffusion_pytorch_model.safetensors"),
-    ],
-    tokenizer_config=ModelConfig(model_id="black-forest-labs/FLUX.2-klein-9B", origin_file_pattern="tokenizer/"),
-)
-prompt = "Masterpiece, best quality. Anime-style portrait of a woman in a blue dress, underwater, surrounded by colorful bubbles."
-image = pipe(prompt, seed=0, rand_device="cuda", num_inference_steps=50, cfg_scale=4)
-image.save("image_FLUX.2-klein-base-9B.jpg")
-
-prompt = "change the color of the clothes to red"
-image = pipe(prompt, edit_image=[image], seed=1, rand_device="cuda", num_inference_steps=50, cfg_scale=4)
-image.save("image_edit_FLUX.2-klein-base-9B.jpg")
--- a/examples/flux2/model_inference_low_vram/FLUX.2-klein-4B.py
+++ b/examples/flux2/model_inference_low_vram/FLUX.2-klein-4B.py
@@ -1,31 +0,0 @@
-from diffsynth.pipelines.flux2_image import Flux2ImagePipeline, ModelConfig
-import torch
-
-
-vram_config = {
-    "offload_dtype": "disk",
-    "offload_device": "disk",
-    "onload_dtype": torch.float8_e4m3fn,
-    "onload_device": "cpu",
-    "preparing_dtype": torch.float8_e4m3fn,
-    "preparing_device": "cuda",
-    "computation_dtype": torch.bfloat16,
-    "computation_device": "cuda",
-}
-pipe = Flux2ImagePipeline.from_pretrained(
-    torch_dtype=torch.bfloat16,
-    device="cuda",
-    model_configs=[
-        ModelConfig(model_id="black-forest-labs/FLUX.2-klein-4B", origin_file_pattern="text_encoder/*.safetensors", **vram_config),
-        ModelConfig(model_id="black-forest-labs/FLUX.2-klein-4B", origin_file_pattern="transformer/*.safetensors", **vram_config),
-        ModelConfig(model_id="black-forest-labs/FLUX.2-klein-4B", origin_file_pattern="vae/diffusion_pytorch_model.safetensors"),
-    ],
-    tokenizer_config=ModelConfig(model_id="black-forest-labs/FLUX.2-klein-4B", origin_file_pattern="tokenizer/"),
-)
-prompt = "Masterpiece, best quality. Anime-style portrait of a woman in a blue dress, underwater, surrounded by colorful bubbles."
-image = pipe(prompt, seed=0, rand_device="cuda", num_inference_steps=4)
-image.save("image_FLUX.2-klein-4B.jpg")
-
-prompt = "change the color of the clothes to red"
-image = pipe(prompt, edit_image=[image], seed=1, rand_device="cuda", num_inference_steps=4)
-image.save("image_edit_FLUX.2-klein-4B.jpg")
--- a/examples/flux2/model_inference_low_vram/FLUX.2-klein-9B.py
+++ b/examples/flux2/model_inference_low_vram/FLUX.2-klein-9B.py
@@ -1,31 +0,0 @@
-from diffsynth.pipelines.flux2_image import Flux2ImagePipeline, ModelConfig
-import torch
-
-
-vram_config = {
-    "offload_dtype": "disk",
-    "offload_device": "disk",
-    "onload_dtype": torch.float8_e4m3fn,
-    "onload_device": "cpu",
-    "preparing_dtype": torch.float8_e4m3fn,
-    "preparing_device": "cuda",
-    "computation_dtype": torch.bfloat16,
-    "computation_device": "cuda",
-}
-pipe = Flux2ImagePipeline.from_pretrained(
-    torch_dtype=torch.bfloat16,
-    device="cuda",
-    model_configs=[
-        ModelConfig(model_id="black-forest-labs/FLUX.2-klein-9B", origin_file_pattern="text_encoder/*.safetensors", **vram_config),
-        ModelConfig(model_id="black-forest-labs/FLUX.2-klein-9B", origin_file_pattern="transformer/*.safetensors", **vram_config),
-        ModelConfig(model_id="black-forest-labs/FLUX.2-klein-4B", origin_file_pattern="vae/diffusion_pytorch_model.safetensors"),
-    ],
-    tokenizer_config=ModelConfig(model_id="black-forest-labs/FLUX.2-klein-4B", origin_file_pattern="tokenizer/"),
-)
-prompt = "Masterpiece, best quality. Anime-style portrait of a woman in a blue dress, underwater, surrounded by colorful bubbles."
-image = pipe(prompt, seed=0, rand_device="cuda", num_inference_steps=4)
-image.save("image_FLUX.2-klein-9B.jpg")
-
-prompt = "change the color of the clothes to red"
-image = pipe(prompt, edit_image=[image], seed=1, rand_device="cuda", num_inference_steps=4)
-image.save("image_edit_FLUX.2-klein-9B.jpg")
--- a/examples/flux2/model_inference_low_vram/FLUX.2-klein-base-4B.py
+++ b/examples/flux2/model_inference_low_vram/FLUX.2-klein-base-4B.py
@@ -1,31 +0,0 @@
-from diffsynth.pipelines.flux2_image import Flux2ImagePipeline, ModelConfig
-import torch
-
-
-vram_config = {
-    "offload_dtype": "disk",
-    "offload_device": "disk",
-    "onload_dtype": torch.float8_e4m3fn,
-    "onload_device": "cpu",
-    "preparing_dtype": torch.float8_e4m3fn,
-    "preparing_device": "cuda",
-    "computation_dtype": torch.bfloat16,
-    "computation_device": "cuda",
-}
-pipe = Flux2ImagePipeline.from_pretrained(
-    torch_dtype=torch.bfloat16,
-    device="cuda",
-    model_configs=[
-        ModelConfig(model_id="black-forest-labs/FLUX.2-klein-4B", origin_file_pattern="text_encoder/*.safetensors", **vram_config),
-        ModelConfig(model_id="black-forest-labs/FLUX.2-klein-base-4B", origin_file_pattern="transformer/*.safetensors", **vram_config),
-        ModelConfig(model_id="black-forest-labs/FLUX.2-klein-4B", origin_file_pattern="vae/diffusion_pytorch_model.safetensors"),
-    ],
-    tokenizer_config=ModelConfig(model_id="black-forest-labs/FLUX.2-klein-4B", origin_file_pattern="tokenizer/"),
-)
-prompt = "Masterpiece, best quality. Anime-style portrait of a woman in a blue dress, underwater, surrounded by colorful bubbles."
-image = pipe(prompt, seed=0, rand_device="cuda", num_inference_steps=50, cfg_scale=4)
-image.save("image_FLUX.2-klein-base-4B.jpg")
-
-prompt = "change the color of the clothes to red"
-image = pipe(prompt, edit_image=[image], seed=1, rand_device="cuda", num_inference_steps=50, cfg_scale=4)
-image.save("image_edit_FLUX.2-klein-base-4B.jpg")
--- a/examples/flux2/model_inference_low_vram/FLUX.2-klein-base-9B.py
+++ b/examples/flux2/model_inference_low_vram/FLUX.2-klein-base-9B.py
@@ -1,31 +0,0 @@
-from diffsynth.pipelines.flux2_image import Flux2ImagePipeline, ModelConfig
-import torch
-
-
-vram_config = {
-    "offload_dtype": "disk",
-    "offload_device": "disk",
-    "onload_dtype": torch.float8_e4m3fn,
-    "onload_device": "cpu",
-    "preparing_dtype": torch.float8_e4m3fn,
-    "preparing_device": "cuda",
-    "computation_dtype": torch.bfloat16,
-    "computation_device": "cuda",
-}
-pipe = Flux2ImagePipeline.from_pretrained(
-    torch_dtype=torch.bfloat16,
-    device="cuda",
-    model_configs=[
-        ModelConfig(model_id="black-forest-labs/FLUX.2-klein-9B", origin_file_pattern="text_encoder/*.safetensors", **vram_config),
-        ModelConfig(model_id="black-forest-labs/FLUX.2-klein-base-9B", origin_file_pattern="transformer/*.safetensors", **vram_config),
-        ModelConfig(model_id="black-forest-labs/FLUX.2-klein-9B", origin_file_pattern="vae/diffusion_pytorch_model.safetensors"),
-    ],
-    tokenizer_config=ModelConfig(model_id="black-forest-labs/FLUX.2-klein-9B", origin_file_pattern="tokenizer/"),
-)
-prompt = "Masterpiece, best quality. Anime-style portrait of a woman in a blue dress, underwater, surrounded by colorful bubbles."
-image = pipe(prompt, seed=0, rand_device="cuda", num_inference_steps=50, cfg_scale=4)
-image.save("image_FLUX.2-klein-base-9B.jpg")
-
-prompt = "change the color of the clothes to red"
-image = pipe(prompt, edit_image=[image], seed=1, rand_device="cuda", num_inference_steps=50, cfg_scale=4)
-image.save("image_edit_FLUX.2-klein-base-9B.jpg")
--- a/examples/flux2/model_training/full/FLUX.2-klein-4B.sh
+++ b/examples/flux2/model_training/full/FLUX.2-klein-4B.sh
@@ -1,30 +0,0 @@
-accelerate launch examples/flux2/model_training/train.py \
-  --dataset_base_path data/example_image_dataset \
-  --dataset_metadata_path data/example_image_dataset/metadata.csv \
-  --max_pixels 1048576 \
-  --dataset_repeat 50 \
-  --model_id_with_origin_paths "black-forest-labs/FLUX.2-klein-4B:text_encoder/*.safetensors,black-forest-labs/FLUX.2-klein-4B:transformer/*.safetensors,black-forest-labs/FLUX.2-klein-4B:vae/diffusion_pytorch_model.safetensors" \
-  --tokenizer_path "black-forest-labs/FLUX.2-klein-4B:tokenizer/" \
-  --learning_rate 1e-5 \
-  --num_epochs 2 \
-  --remove_prefix_in_ckpt "pipe.dit." \
-  --output_path "./models/train/FLUX.2-klein-4B_full" \
-  --trainable_models "dit" \
-  --use_gradient_checkpointing
-
-# Edit
-# accelerate launch examples/flux2/model_training/train.py \
-#   --dataset_base_path data/example_image_dataset \
-#   --dataset_metadata_path data/example_image_dataset/metadata_qwen_imgae_edit_multi.json \
-#   --data_file_keys "image,edit_image" \
-#   --extra_inputs "edit_image" \
-#   --max_pixels 1048576 \
-#   --dataset_repeat 50 \
-#   --model_id_with_origin_paths "black-forest-labs/FLUX.2-klein-4B:text_encoder/*.safetensors,black-forest-labs/FLUX.2-klein-4B:transformer/*.safetensors,black-forest-labs/FLUX.2-klein-4B:vae/diffusion_pytorch_model.safetensors" \
-#   --tokenizer_path "black-forest-labs/FLUX.2-klein-4B:tokenizer/" \
-#   --learning_rate 1e-5 \
-#   --num_epochs 2 \
-#   --remove_prefix_in_ckpt "pipe.dit." \
-#   --output_path "./models/train/FLUX.2-klein-4B_full" \
-#   --trainable_models "dit" \
-#   --use_gradient_checkpointing
--- a/examples/flux2/model_training/full/FLUX.2-klein-9B.sh
+++ b/examples/flux2/model_training/full/FLUX.2-klein-9B.sh
@@ -1,31 +0,0 @@
-# This script is tested on 8*A100
-accelerate launch --config_file examples/flux2/model_training/full/accelerate_config.yaml examples/flux2/model_training/train.py \
-  --dataset_base_path data/example_image_dataset \
-  --dataset_metadata_path data/example_image_dataset/metadata.csv \
-  --max_pixels 1048576 \
-  --dataset_repeat 50 \
-  --model_id_with_origin_paths "black-forest-labs/FLUX.2-klein-9B:text_encoder/*.safetensors,black-forest-labs/FLUX.2-klein-9B:transformer/*.safetensors,black-forest-labs/FLUX.2-klein-9B:vae/diffusion_pytorch_model.safetensors" \
-  --tokenizer_path "black-forest-labs/FLUX.2-klein-9B:tokenizer/" \
-  --learning_rate 1e-5 \
-  --num_epochs 2 \
-  --remove_prefix_in_ckpt "pipe.dit." \
-  --output_path "./models/train/FLUX.2-klein-9B_full" \
-  --trainable_models "dit" \
-  --use_gradient_checkpointing
-
-# Edit
-# accelerate launch --config_file examples/flux2/model_training/full/accelerate_config.yaml examples/flux2/model_training/train.py \
-#   --dataset_base_path data/example_image_dataset \
-#   --dataset_metadata_path data/example_image_dataset/metadata_qwen_imgae_edit_multi.json \
-#   --data_file_keys "image,edit_image" \
-#   --extra_inputs "edit_image" \
-#   --max_pixels 1048576 \
-#   --dataset_repeat 50 \
-#   --model_id_with_origin_paths "black-forest-labs/FLUX.2-klein-9B:text_encoder/*.safetensors,black-forest-labs/FLUX.2-klein-9B:transformer/*.safetensors,black-forest-labs/FLUX.2-klein-9B:vae/diffusion_pytorch_model.safetensors" \
-#   --tokenizer_path "black-forest-labs/FLUX.2-klein-9B:tokenizer/" \
-#   --learning_rate 1e-5 \
-#   --num_epochs 2 \
-#   --remove_prefix_in_ckpt "pipe.dit." \
-#   --output_path "./models/train/FLUX.2-klein-9B_full" \
-#   --trainable_models "dit" \
-#   --use_gradient_checkpointing
--- a/examples/flux2/model_training/full/FLUX.2-klein-base-4B.sh
+++ b/examples/flux2/model_training/full/FLUX.2-klein-base-4B.sh
@@ -1,30 +0,0 @@
-accelerate launch examples/flux2/model_training/train.py \
-  --dataset_base_path data/example_image_dataset \
-  --dataset_metadata_path data/example_image_dataset/metadata.csv \
-  --max_pixels 1048576 \
-  --dataset_repeat 50 \
-  --model_id_with_origin_paths "black-forest-labs/FLUX.2-klein-4B:text_encoder/*.safetensors,black-forest-labs/FLUX.2-klein-base-4B:transformer/*.safetensors,black-forest-labs/FLUX.2-klein-4B:vae/diffusion_pytorch_model.safetensors" \
-  --tokenizer_path "black-forest-labs/FLUX.2-klein-4B:tokenizer/" \
-  --learning_rate 1e-5 \
-  --num_epochs 2 \
-  --remove_prefix_in_ckpt "pipe.dit." \
-  --output_path "./models/train/FLUX.2-klein-base-4B_full" \
-  --trainable_models "dit" \
-  --use_gradient_checkpointing
-
-# Edit
-# accelerate launch examples/flux2/model_training/train.py \
-#   --dataset_base_path data/example_image_dataset \
-#   --dataset_metadata_path data/example_image_dataset/metadata_qwen_imgae_edit_multi.json \
-#   --data_file_keys "image,edit_image" \
-#   --extra_inputs "edit_image" \
-#   --max_pixels 1048576 \
-#   --dataset_repeat 50 \
-#   --model_id_with_origin_paths "black-forest-labs/FLUX.2-klein-4B:text_encoder/*.safetensors,black-forest-labs/FLUX.2-klein-base-4B:transformer/*.safetensors,black-forest-labs/FLUX.2-klein-4B:vae/diffusion_pytorch_model.safetensors" \
-#   --tokenizer_path "black-forest-labs/FLUX.2-klein-4B:tokenizer/" \
-#   --learning_rate 1e-5 \
-#   --num_epochs 2 \
-#   --remove_prefix_in_ckpt "pipe.dit." \
-#   --output_path "./models/train/FLUX.2-klein-base-4B_full" \
-#   --trainable_models "dit" \
-#   --use_gradient_checkpointing
--- a/examples/flux2/model_training/full/FLUX.2-klein-base-9B.sh
+++ b/examples/flux2/model_training/full/FLUX.2-klein-base-9B.sh
@@ -1,31 +0,0 @@
-# This script is tested on 8*A100
-accelerate launch --config_file examples/flux2/model_training/full/accelerate_config.yaml examples/flux2/model_training/train.py \
-  --dataset_base_path data/example_image_dataset \
-  --dataset_metadata_path data/example_image_dataset/metadata.csv \
-  --max_pixels 1048576 \
-  --dataset_repeat 50 \
-  --model_id_with_origin_paths "black-forest-labs/FLUX.2-klein-9B:text_encoder/*.safetensors,black-forest-labs/FLUX.2-klein-base-9B:transformer/*.safetensors,black-forest-labs/FLUX.2-klein-9B:vae/diffusion_pytorch_model.safetensors" \
-  --tokenizer_path "black-forest-labs/FLUX.2-klein-9B:tokenizer/" \
-  --learning_rate 1e-5 \
-  --num_epochs 2 \
-  --remove_prefix_in_ckpt "pipe.dit." \
-  --output_path "./models/train/FLUX.2-klein-base-9B_full" \
-  --trainable_models "dit" \
-  --use_gradient_checkpointing
-
-# Edit
-# accelerate launch --config_file examples/flux2/model_training/full/accelerate_config.yaml examples/flux2/model_training/train.py \
-#   --dataset_base_path data/example_image_dataset \
-#   --dataset_metadata_path data/example_image_dataset/metadata_qwen_imgae_edit_multi.json \
-#   --data_file_keys "image,edit_image" \
-#   --extra_inputs "edit_image" \
-#   --max_pixels 1048576 \
-#   --dataset_repeat 50 \
-#   --model_id_with_origin_paths "black-forest-labs/FLUX.2-klein-9B:text_encoder/*.safetensors,black-forest-labs/FLUX.2-klein-base-9B:transformer/*.safetensors,black-forest-labs/FLUX.2-klein-9B:vae/diffusion_pytorch_model.safetensors" \
-#   --tokenizer_path "black-forest-labs/FLUX.2-klein-9B:tokenizer/" \
-#   --learning_rate 1e-5 \
-#   --num_epochs 2 \
-#   --remove_prefix_in_ckpt "pipe.dit." \
-#   --output_path "./models/train/FLUX.2-klein-base-9B_full" \
-#   --trainable_models "dit" \
-#   --use_gradient_checkpointing
--- a/examples/flux2/model_training/full/accelerate_config.yaml
+++ b/examples/flux2/model_training/full/accelerate_config.yaml
@@ -1,22 +0,0 @@
-compute_environment: LOCAL_MACHINE
-debug: false
-deepspeed_config:
-  gradient_accumulation_steps: 1
-  offload_optimizer_device: none
-  offload_param_device: none
-  zero3_init_flag: false
-  zero_stage: 2
-distributed_type: DEEPSPEED
-downcast_bf16: 'no'
-enable_cpu_affinity: false
-machine_rank: 0
-main_training_function: main
-mixed_precision: bf16
-num_machines: 1
-num_processes: 8
-rdzv_backend: static
-same_network: true
-tpu_env: []
-tpu_use_cluster: false
-tpu_use_sudo: false
-use_cpu: false
--- a/examples/flux2/model_training/full/accelerate_config_zero3.yaml
+++ b/examples/flux2/model_training/full/accelerate_config_zero3.yaml
@@ -1,23 +0,0 @@
-compute_environment: LOCAL_MACHINE
-debug: false
-deepspeed_config:
-  gradient_accumulation_steps: 1
-  offload_optimizer_device: none
-  offload_param_device: none
-  zero3_init_flag: true
-  zero3_save_16bit_model: true
-  zero_stage: 3
-distributed_type: DEEPSPEED
-downcast_bf16: 'no'
-enable_cpu_affinity: false
-machine_rank: 0
-main_training_function: main
-mixed_precision: bf16
-num_machines: 1
-num_processes: 8
-rdzv_backend: static
-same_network: true
-tpu_env: []
-tpu_use_cluster: false
-tpu_use_sudo: false
-use_cpu: false
--- a/examples/flux2/model_training/lora/FLUX.2-klein-4B.sh
+++ b/examples/flux2/model_training/lora/FLUX.2-klein-4B.sh
@@ -1,34 +0,0 @@
-accelerate launch examples/flux2/model_training/train.py \
-  --dataset_base_path data/example_image_dataset \
-  --dataset_metadata_path data/example_image_dataset/metadata.csv \
-  --max_pixels 1048576 \
-  --dataset_repeat 50 \
-  --model_id_with_origin_paths "black-forest-labs/FLUX.2-klein-4B:text_encoder/*.safetensors,black-forest-labs/FLUX.2-klein-4B:transformer/*.safetensors,black-forest-labs/FLUX.2-klein-4B:vae/diffusion_pytorch_model.safetensors" \
-  --tokenizer_path "black-forest-labs/FLUX.2-klein-4B:tokenizer/" \
-  --learning_rate 1e-4 \
-  --num_epochs 5 \
-  --remove_prefix_in_ckpt "pipe.dit." \
-  --output_path "./models/train/FLUX.2-klein-4B_lora" \
-  --lora_base_model "dit" \
-  --lora_target_modules "to_q,to_k,to_v,to_out.0,add_q_proj,add_k_proj,add_v_proj,to_add_out,linear_in,linear_out,to_qkv_mlp_proj,single_transformer_blocks.0.attn.to_out,single_transformer_blocks.1.attn.to_out,single_transformer_blocks.2.attn.to_out,single_transformer_blocks.3.attn.to_out,single_transformer_blocks.4.attn.to_out,single_transformer_blocks.5.attn.to_out,single_transformer_blocks.6.attn.to_out,single_transformer_blocks.7.attn.to_out,single_transformer_blocks.8.attn.to_out,single_transformer_blocks.9.attn.to_out,single_transformer_blocks.10.attn.to_out,single_transformer_blocks.11.attn.to_out,single_transformer_blocks.12.attn.to_out,single_transformer_blocks.13.attn.to_out,single_transformer_blocks.14.attn.to_out,single_transformer_blocks.15.attn.to_out,single_transformer_blocks.16.attn.to_out,single_transformer_blocks.17.attn.to_out,single_transformer_blocks.18.attn.to_out,single_transformer_blocks.19.attn.to_out" \
-  --lora_rank 32 \
-  --use_gradient_checkpointing
-
-# Edit
-# accelerate launch examples/flux2/model_training/train.py \
-#   --dataset_base_path data/example_image_dataset \
-#   --dataset_metadata_path data/example_image_dataset/metadata_qwen_imgae_edit_multi.json \
-#   --data_file_keys "image,edit_image" \
-#   --extra_inputs "edit_image" \
-#   --max_pixels 1048576 \
-#   --dataset_repeat 50 \
-#   --model_id_with_origin_paths "black-forest-labs/FLUX.2-klein-4B:text_encoder/*.safetensors,black-forest-labs/FLUX.2-klein-4B:transformer/*.safetensors,black-forest-labs/FLUX.2-klein-4B:vae/diffusion_pytorch_model.safetensors" \
-#   --tokenizer_path "black-forest-labs/FLUX.2-klein-4B:tokenizer/" \
-#   --learning_rate 1e-4 \
-#   --num_epochs 5 \
-#   --remove_prefix_in_ckpt "pipe.dit." \
-#   --output_path "./models/train/FLUX.2-klein-4B_lora" \
-#   --lora_base_model "dit" \
-#   --lora_target_modules "to_q,to_k,to_v,to_out.0,add_q_proj,add_k_proj,add_v_proj,to_add_out,linear_in,linear_out,to_qkv_mlp_proj,single_transformer_blocks.0.attn.to_out,single_transformer_blocks.1.attn.to_out,single_transformer_blocks.2.attn.to_out,single_transformer_blocks.3.attn.to_out,single_transformer_blocks.4.attn.to_out,single_transformer_blocks.5.attn.to_out,single_transformer_blocks.6.attn.to_out,single_transformer_blocks.7.attn.to_out,single_transformer_blocks.8.attn.to_out,single_transformer_blocks.9.attn.to_out,single_transformer_blocks.10.attn.to_out,single_transformer_blocks.11.attn.to_out,single_transformer_blocks.12.attn.to_out,single_transformer_blocks.13.attn.to_out,single_transformer_blocks.14.attn.to_out,single_transformer_blocks.15.attn.to_out,single_transformer_blocks.16.attn.to_out,single_transformer_blocks.17.attn.to_out,single_transformer_blocks.18.attn.to_out,single_transformer_blocks.19.attn.to_out" \
-#   --lora_rank 32 \
-#   --use_gradient_checkpointing
--- a/examples/flux2/model_training/lora/FLUX.2-klein-9B.sh
+++ b/examples/flux2/model_training/lora/FLUX.2-klein-9B.sh
@@ -1,34 +0,0 @@
-accelerate launch examples/flux2/model_training/train.py \
-  --dataset_base_path data/example_image_dataset \
-  --dataset_metadata_path data/example_image_dataset/metadata.csv \
-  --max_pixels 1048576 \
-  --dataset_repeat 50 \
-  --model_id_with_origin_paths "black-forest-labs/FLUX.2-klein-9B:text_encoder/*.safetensors,black-forest-labs/FLUX.2-klein-9B:transformer/*.safetensors,black-forest-labs/FLUX.2-klein-9B:vae/diffusion_pytorch_model.safetensors" \
-  --tokenizer_path "black-forest-labs/FLUX.2-klein-9B:tokenizer/" \
-  --learning_rate 1e-4 \
-  --num_epochs 5 \
-  --remove_prefix_in_ckpt "pipe.dit." \
-  --output_path "./models/train/FLUX.2-klein-9B_lora" \
-  --lora_base_model "dit" \
-  --lora_target_modules "to_q,to_k,to_v,to_out.0,add_q_proj,add_k_proj,add_v_proj,to_add_out,linear_in,linear_out,to_qkv_mlp_proj,single_transformer_blocks.0.attn.to_out,single_transformer_blocks.1.attn.to_out,single_transformer_blocks.2.attn.to_out,single_transformer_blocks.3.attn.to_out,single_transformer_blocks.4.attn.to_out,single_transformer_blocks.5.attn.to_out,single_transformer_blocks.6.attn.to_out,single_transformer_blocks.7.attn.to_out,single_transformer_blocks.8.attn.to_out,single_transformer_blocks.9.attn.to_out,single_transformer_blocks.10.attn.to_out,single_transformer_blocks.11.attn.to_out,single_transformer_blocks.12.attn.to_out,single_transformer_blocks.13.attn.to_out,single_transformer_blocks.14.attn.to_out,single_transformer_blocks.15.attn.to_out,single_transformer_blocks.16.attn.to_out,single_transformer_blocks.17.attn.to_out,single_transformer_blocks.18.attn.to_out,single_transformer_blocks.19.attn.to_out,single_transformer_blocks.20.attn.to_out,single_transformer_blocks.21.attn.to_out,single_transformer_blocks.22.attn.to_out,single_transformer_blocks.23.attn.to_out" \
-  --lora_rank 32 \
-  --use_gradient_checkpointing
-
-# Edit
-# accelerate launch examples/flux2/model_training/train.py \
-#   --dataset_base_path data/example_image_dataset \
-#   --dataset_metadata_path data/example_image_dataset/metadata_qwen_imgae_edit_multi.json \
-#   --data_file_keys "image,edit_image" \
-#   --extra_inputs "edit_image" \
-#   --max_pixels 1048576 \
-#   --dataset_repeat 50 \
-#   --model_id_with_origin_paths "black-forest-labs/FLUX.2-klein-9B:text_encoder/*.safetensors,black-forest-labs/FLUX.2-klein-9B:transformer/*.safetensors,black-forest-labs/FLUX.2-klein-9B:vae/diffusion_pytorch_model.safetensors" \
-#   --tokenizer_path "black-forest-labs/FLUX.2-klein-9B:tokenizer/" \
-#   --learning_rate 1e-4 \
-#   --num_epochs 5 \
-#   --remove_prefix_in_ckpt "pipe.dit." \
-#   --output_path "./models/train/FLUX.2-klein-9B_lora" \
-#   --lora_base_model "dit" \
-#   --lora_target_modules "to_q,to_k,to_v,to_out.0,add_q_proj,add_k_proj,add_v_proj,to_add_out,linear_in,linear_out,to_qkv_mlp_proj,single_transformer_blocks.0.attn.to_out,single_transformer_blocks.1.attn.to_out,single_transformer_blocks.2.attn.to_out,single_transformer_blocks.3.attn.to_out,single_transformer_blocks.4.attn.to_out,single_transformer_blocks.5.attn.to_out,single_transformer_blocks.6.attn.to_out,single_transformer_blocks.7.attn.to_out,single_transformer_blocks.8.attn.to_out,single_transformer_blocks.9.attn.to_out,single_transformer_blocks.10.attn.to_out,single_transformer_blocks.11.attn.to_out,single_transformer_blocks.12.attn.to_out,single_transformer_blocks.13.attn.to_out,single_transformer_blocks.14.attn.to_out,single_transformer_blocks.15.attn.to_out,single_transformer_blocks.16.attn.to_out,single_transformer_blocks.17.attn.to_out,single_transformer_blocks.18.attn.to_out,single_transformer_blocks.19.attn.to_out,single_transformer_blocks.20.attn.to_out,single_transformer_blocks.21.attn.to_out,single_transformer_blocks.22.attn.to_out,single_transformer_blocks.23.attn.to_out" \
-#   --lora_rank 32 \
-#   --use_gradient_checkpointing
--- a/examples/flux2/model_training/lora/FLUX.2-klein-base-4B.sh
+++ b/examples/flux2/model_training/lora/FLUX.2-klein-base-4B.sh
@@ -1,34 +0,0 @@
-accelerate launch examples/flux2/model_training/train.py \
-  --dataset_base_path data/example_image_dataset \
-  --dataset_metadata_path data/example_image_dataset/metadata.csv \
-  --max_pixels 1048576 \
-  --dataset_repeat 50 \
-  --model_id_with_origin_paths "black-forest-labs/FLUX.2-klein-4B:text_encoder/*.safetensors,black-forest-labs/FLUX.2-klein-base-4B:transformer/*.safetensors,black-forest-labs/FLUX.2-klein-4B:vae/diffusion_pytorch_model.safetensors" \
-  --tokenizer_path "black-forest-labs/FLUX.2-klein-4B:tokenizer/" \
-  --learning_rate 1e-4 \
-  --num_epochs 5 \
-  --remove_prefix_in_ckpt "pipe.dit." \
-  --output_path "./models/train/FLUX.2-klein-base-4B_lora" \
-  --lora_base_model "dit" \
-  --lora_target_modules "to_q,to_k,to_v,to_out.0,add_q_proj,add_k_proj,add_v_proj,to_add_out,linear_in,linear_out,to_qkv_mlp_proj,single_transformer_blocks.0.attn.to_out,single_transformer_blocks.1.attn.to_out,single_transformer_blocks.2.attn.to_out,single_transformer_blocks.3.attn.to_out,single_transformer_blocks.4.attn.to_out,single_transformer_blocks.5.attn.to_out,single_transformer_blocks.6.attn.to_out,single_transformer_blocks.7.attn.to_out,single_transformer_blocks.8.attn.to_out,single_transformer_blocks.9.attn.to_out,single_transformer_blocks.10.attn.to_out,single_transformer_blocks.11.attn.to_out,single_transformer_blocks.12.attn.to_out,single_transformer_blocks.13.attn.to_out,single_transformer_blocks.14.attn.to_out,single_transformer_blocks.15.attn.to_out,single_transformer_blocks.16.attn.to_out,single_transformer_blocks.17.attn.to_out,single_transformer_blocks.18.attn.to_out,single_transformer_blocks.19.attn.to_out" \
-  --lora_rank 32 \
-  --use_gradient_checkpointing
-
-# Edit
-# accelerate launch examples/flux2/model_training/train.py \
-#   --dataset_base_path data/example_image_dataset \
-#   --dataset_metadata_path data/example_image_dataset/metadata_qwen_imgae_edit_multi.json \
-#   --data_file_keys "image,edit_image" \
-#   --extra_inputs "edit_image" \
-#   --max_pixels 1048576 \
-#   --dataset_repeat 50 \
-#   --model_id_with_origin_paths "black-forest-labs/FLUX.2-klein-4B:text_encoder/*.safetensors,black-forest-labs/FLUX.2-klein-base-4B:transformer/*.safetensors,black-forest-labs/FLUX.2-klein-4B:vae/diffusion_pytorch_model.safetensors" \
-#   --tokenizer_path "black-forest-labs/FLUX.2-klein-4B:tokenizer/" \
-#   --learning_rate 1e-4 \
-#   --num_epochs 5 \
-#   --remove_prefix_in_ckpt "pipe.dit." \
-#   --output_path "./models/train/FLUX.2-klein-base-4B_lora" \
-#   --lora_base_model "dit" \
-#   --lora_target_modules "to_q,to_k,to_v,to_out.0,add_q_proj,add_k_proj,add_v_proj,to_add_out,linear_in,linear_out,to_qkv_mlp_proj,single_transformer_blocks.0.attn.to_out,single_transformer_blocks.1.attn.to_out,single_transformer_blocks.2.attn.to_out,single_transformer_blocks.3.attn.to_out,single_transformer_blocks.4.attn.to_out,single_transformer_blocks.5.attn.to_out,single_transformer_blocks.6.attn.to_out,single_transformer_blocks.7.attn.to_out,single_transformer_blocks.8.attn.to_out,single_transformer_blocks.9.attn.to_out,single_transformer_blocks.10.attn.to_out,single_transformer_blocks.11.attn.to_out,single_transformer_blocks.12.attn.to_out,single_transformer_blocks.13.attn.to_out,single_transformer_blocks.14.attn.to_out,single_transformer_blocks.15.attn.to_out,single_transformer_blocks.16.attn.to_out,single_transformer_blocks.17.attn.to_out,single_transformer_blocks.18.attn.to_out,single_transformer_blocks.19.attn.to_out" \
-#   --lora_rank 32 \
-#   --use_gradient_checkpointing
--- a/examples/flux2/model_training/lora/FLUX.2-klein-base-9B.sh
+++ b/examples/flux2/model_training/lora/FLUX.2-klein-base-9B.sh
@@ -1,34 +0,0 @@
-accelerate launch examples/flux2/model_training/train.py \
-  --dataset_base_path data/example_image_dataset \
-  --dataset_metadata_path data/example_image_dataset/metadata.csv \
-  --max_pixels 1048576 \
-  --dataset_repeat 50 \
-  --model_id_with_origin_paths "black-forest-labs/FLUX.2-klein-9B:text_encoder/*.safetensors,black-forest-labs/FLUX.2-klein-base-9B:transformer/*.safetensors,black-forest-labs/FLUX.2-klein-9B:vae/diffusion_pytorch_model.safetensors" \
-  --tokenizer_path "black-forest-labs/FLUX.2-klein-9B:tokenizer/" \
-  --learning_rate 1e-4 \
-  --num_epochs 5 \
-  --remove_prefix_in_ckpt "pipe.dit." \
-  --output_path "./models/train/FLUX.2-klein-base-9B_lora" \
-  --lora_base_model "dit" \
-  --lora_target_modules "to_q,to_k,to_v,to_out.0,add_q_proj,add_k_proj,add_v_proj,to_add_out,linear_in,linear_out,to_qkv_mlp_proj,single_transformer_blocks.0.attn.to_out,single_transformer_blocks.1.attn.to_out,single_transformer_blocks.2.attn.to_out,single_transformer_blocks.3.attn.to_out,single_transformer_blocks.4.attn.to_out,single_transformer_blocks.5.attn.to_out,single_transformer_blocks.6.attn.to_out,single_transformer_blocks.7.attn.to_out,single_transformer_blocks.8.attn.to_out,single_transformer_blocks.9.attn.to_out,single_transformer_blocks.10.attn.to_out,single_transformer_blocks.11.attn.to_out,single_transformer_blocks.12.attn.to_out,single_transformer_blocks.13.attn.to_out,single_transformer_blocks.14.attn.to_out,single_transformer_blocks.15.attn.to_out,single_transformer_blocks.16.attn.to_out,single_transformer_blocks.17.attn.to_out,single_transformer_blocks.18.attn.to_out,single_transformer_blocks.19.attn.to_out,single_transformer_blocks.20.attn.to_out,single_transformer_blocks.21.attn.to_out,single_transformer_blocks.22.attn.to_out,single_transformer_blocks.23.attn.to_out" \
-  --lora_rank 32 \
-  --use_gradient_checkpointing
-
-# Edit
-# accelerate launch examples/flux2/model_training/train.py \
-#   --dataset_base_path data/example_image_dataset \
-#   --dataset_metadata_path data/example_image_dataset/metadata_qwen_imgae_edit_multi.json \
-#   --data_file_keys "image,edit_image" \
-#   --extra_inputs "edit_image" \
-#   --max_pixels 1048576 \
-#   --dataset_repeat 50 \
-#   --model_id_with_origin_paths "black-forest-labs/FLUX.2-klein-9B:text_encoder/*.safetensors,black-forest-labs/FLUX.2-klein-base-9B:transformer/*.safetensors,black-forest-labs/FLUX.2-klein-9B:vae/diffusion_pytorch_model.safetensors" \
-#   --tokenizer_path "black-forest-labs/FLUX.2-klein-9B:tokenizer/" \
-#   --learning_rate 1e-4 \
-#   --num_epochs 5 \
-#   --remove_prefix_in_ckpt "pipe.dit." \
-#   --output_path "./models/train/FLUX.2-klein-base-9B_lora" \
-#   --lora_base_model "dit" \
-#   --lora_target_modules "to_q,to_k,to_v,to_out.0,add_q_proj,add_k_proj,add_v_proj,to_add_out,linear_in,linear_out,to_qkv_mlp_proj,single_transformer_blocks.0.attn.to_out,single_transformer_blocks.1.attn.to_out,single_transformer_blocks.2.attn.to_out,single_transformer_blocks.3.attn.to_out,single_transformer_blocks.4.attn.to_out,single_transformer_blocks.5.attn.to_out,single_transformer_blocks.6.attn.to_out,single_transformer_blocks.7.attn.to_out,single_transformer_blocks.8.attn.to_out,single_transformer_blocks.9.attn.to_out,single_transformer_blocks.10.attn.to_out,single_transformer_blocks.11.attn.to_out,single_transformer_blocks.12.attn.to_out,single_transformer_blocks.13.attn.to_out,single_transformer_blocks.14.attn.to_out,single_transformer_blocks.15.attn.to_out,single_transformer_blocks.16.attn.to_out,single_transformer_blocks.17.attn.to_out,single_transformer_blocks.18.attn.to_out,single_transformer_blocks.19.attn.to_out,single_transformer_blocks.20.attn.to_out,single_transformer_blocks.21.attn.to_out,single_transformer_blocks.22.attn.to_out,single_transformer_blocks.23.attn.to_out" \
-#   --lora_rank 32 \
-#   --use_gradient_checkpointing
--- a/examples/flux2/model_training/special/npu_training/FLUX.2-dev-Lora-NPU.sh
+++ b/examples/flux2/model_training/special/npu_training/FLUX.2-dev-Lora-NPU.sh
@@ -1,36 +0,0 @@
-export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
-export CPU_AFFINITY_CONF=1
-
-accelerate launch examples/flux2/model_training/train.py \
-  --dataset_base_path data/example_image_dataset \
-  --dataset_metadata_path data/example_image_dataset/metadata.csv \
-  --max_pixels 1048576 \
-  --dataset_repeat 1 \
-  --model_id_with_origin_paths "black-forest-labs/FLUX.2-dev:text_encoder/*.safetensors,black-forest-labs/FLUX.2-dev:vae/diffusion_pytorch_model.safetensors" \
-  --learning_rate 1e-4 \
-  --num_epochs 5 \
-  --remove_prefix_in_ckpt "pipe.dit." \
-  --output_path "./models/train/FLUX.2-dev-LoRA-splited-cache" \
-  --lora_base_model "dit" \
-  --lora_target_modules "to_q,to_k,to_v,add_q_proj,add_k_proj,add_v_proj,to_qkv_mlp_proj,to_out.0,to_add_out,linear_in,linear_out,single_transformer_blocks.0.attn.to_out,single_transformer_blocks.1.attn.to_out,single_transformer_blocks.2.attn.to_out,single_transformer_blocks.3.attn.to_out,single_transformer_blocks.4.attn.to_out,single_transformer_blocks.5.attn.to_out,single_transformer_blocks.6.attn.to_out,single_transformer_blocks.7.attn.to_out,single_transformer_blocks.8.attn.to_out,single_transformer_blocks.9.attn.to_out,single_transformer_blocks.10.attn.to_out,single_transformer_blocks.11.attn.to_out,single_transformer_blocks.12.attn.to_out,single_transformer_blocks.13.attn.to_out,single_transformer_blocks.14.attn.to_out,single_transformer_blocks.15.attn.to_out,single_transformer_blocks.16.attn.to_out,single_transformer_blocks.17.attn.to_out,single_transformer_blocks.18.attn.to_out,single_transformer_blocks.19.attn.to_out,single_transformer_blocks.20.attn.to_out,single_transformer_blocks.21.attn.to_out,single_transformer_blocks.22.attn.to_out,single_transformer_blocks.23.attn.to_out,single_transformer_blocks.24.attn.to_out,single_transformer_blocks.25.attn.to_out,single_transformer_blocks.26.attn.to_out,single_transformer_blocks.27.attn.to_out,single_transformer_blocks.28.attn.to_out,single_transformer_blocks.29.attn.to_out,single_transformer_blocks.30.attn.to_out,single_transformer_blocks.31.attn.to_out,single_transformer_blocks.32.attn.to_out,single_transformer_blocks.33.attn.to_out,single_transformer_blocks.34.attn.to_out,single_transformer_blocks.35.attn.to_out,single_transformer_blocks.36.attn.to_out,single_transformer_blocks.37.attn.to_out,single_transformer_blocks.38.attn.to_out,single_transformer_blocks.39.attn.to_out,single_transformer_blocks.40.attn.to_out,single_transformer_blocks.41.attn.to_out,single_transformer_blocks.42.attn.to_out,single_transformer_blocks.43.attn.to_out,single_transformer_blocks.44.attn.to_out,single_transformer_blocks.45.attn.to_out,single_transformer_blocks.46.attn.to_out,single_transformer_blocks.47.attn.to_out" \
-  --lora_rank 32 \
-  --use_gradient_checkpointing \
-  --dataset_num_workers 8 \
-  --task "sft:data_process"
-
-accelerate launch --config_file examples/flux2/model_training/full/accelerate_config_zero3.yaml examples/flux2/model_training/train.py \
-  --dataset_base_path "./models/train/FLUX.2-dev-LoRA-splited-cache" \
-  --max_pixels 1048576 \
-  --dataset_repeat 50 \
-  --model_id_with_origin_paths "black-forest-labs/FLUX.2-dev:transformer/*.safetensors" \
-  --learning_rate 1e-4 \
-  --num_epochs 5 \
-  --remove_prefix_in_ckpt "pipe.dit." \
-  --output_path "./models/train/FLUX.2-dev-LoRA-splited" \
-  --lora_base_model "dit" \
-  --lora_target_modules "to_q,to_k,to_v,add_q_proj,add_k_proj,add_v_proj,to_qkv_mlp_proj,to_out.0,to_add_out,linear_in,linear_out,single_transformer_blocks.0.attn.to_out,single_transformer_blocks.1.attn.to_out,single_transformer_blocks.2.attn.to_out,single_transformer_blocks.3.attn.to_out,single_transformer_blocks.4.attn.to_out,single_transformer_blocks.5.attn.to_out,single_transformer_blocks.6.attn.to_out,single_transformer_blocks.7.attn.to_out,single_transformer_blocks.8.attn.to_out,single_transformer_blocks.9.attn.to_out,single_transformer_blocks.10.attn.to_out,single_transformer_blocks.11.attn.to_out,single_transformer_blocks.12.attn.to_out,single_transformer_blocks.13.attn.to_out,single_transformer_blocks.14.attn.to_out,single_transformer_blocks.15.attn.to_out,single_transformer_blocks.16.attn.to_out,single_transformer_blocks.17.attn.to_out,single_transformer_blocks.18.attn.to_out,single_transformer_blocks.19.attn.to_out,single_transformer_blocks.20.attn.to_out,single_transformer_blocks.21.attn.to_out,single_transformer_blocks.22.attn.to_out,single_transformer_blocks.23.attn.to_out,single_transformer_blocks.24.attn.to_out,single_transformer_blocks.25.attn.to_out,single_transformer_blocks.26.attn.to_out,single_transformer_blocks.27.attn.to_out,single_transformer_blocks.28.attn.to_out,single_transformer_blocks.29.attn.to_out,single_transformer_blocks.30.attn.to_out,single_transformer_blocks.31.attn.to_out,single_transformer_blocks.32.attn.to_out,single_transformer_blocks.33.attn.to_out,single_transformer_blocks.34.attn.to_out,single_transformer_blocks.35.attn.to_out,single_transformer_blocks.36.attn.to_out,single_transformer_blocks.37.attn.to_out,single_transformer_blocks.38.attn.to_out,single_transformer_blocks.39.attn.to_out,single_transformer_blocks.40.attn.to_out,single_transformer_blocks.41.attn.to_out,single_transformer_blocks.42.attn.to_out,single_transformer_blocks.43.attn.to_out,single_transformer_blocks.44.attn.to_out,single_transformer_blocks.45.attn.to_out,single_transformer_blocks.46.attn.to_out,single_transformer_blocks.47.attn.to_out" \
-  --lora_rank 32 \
-  --use_gradient_checkpointing \
-  --dataset_num_workers 8 \
-  --initialize_model_on_cpu \
-  --task "sft:train"
--- a/examples/flux2/model_training/special/npu_training/FLUX.2-klein-9B-NPU.sh
+++ b/examples/flux2/model_training/special/npu_training/FLUX.2-klein-9B-NPU.sh
@@ -1,34 +0,0 @@
-# This script is tested on 8*910B(NPU)
-export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
-export CPU_AFFINITY_CONF=1
-
-accelerate launch --config_file examples/flux2/model_training/full/accelerate_config.yaml examples/flux2/model_training/train.py \
-  --dataset_base_path data/example_image_dataset \
-  --dataset_metadata_path data/example_image_dataset/metadata.csv \
-  --max_pixels 1048576 \
-  --dataset_repeat 50 \
-  --model_id_with_origin_paths "black-forest-labs/FLUX.2-klein-9B:text_encoder/*.safetensors,black-forest-labs/FLUX.2-klein-9B:transformer/*.safetensors,black-forest-labs/FLUX.2-klein-9B:vae/diffusion_pytorch_model.safetensors" \
-  --tokenizer_path "black-forest-labs/FLUX.2-klein-9B:tokenizer/" \
-  --learning_rate 1e-5 \
-  --num_epochs 2 \
-  --remove_prefix_in_ckpt "pipe.dit." \
-  --output_path "./models/train/FLUX.2-klein-9B_full" \
-  --trainable_models "dit" \
-  --use_gradient_checkpointing
-
-# Edit
-# accelerate launch --config_file examples/flux2/model_training/full/accelerate_config.yaml examples/flux2/model_training/train.py \
-#   --dataset_base_path data/example_image_dataset \
-#   --dataset_metadata_path data/example_image_dataset/metadata_qwen_imgae_edit_multi.json \
-#   --data_file_keys "image,edit_image" \
-#   --extra_inputs "edit_image" \
-#   --max_pixels 1048576 \
-#   --dataset_repeat 50 \
-#   --model_id_with_origin_paths "black-forest-labs/FLUX.2-klein-9B:text_encoder/*.safetensors,black-forest-labs/FLUX.2-klein-9B:transformer/*.safetensors,black-forest-labs/FLUX.2-klein-9B:vae/diffusion_pytorch_model.safetensors" \
-#   --tokenizer_path "black-forest-labs/FLUX.2-klein-9B:tokenizer/" \
-#   --learning_rate 1e-5 \
-#   --num_epochs 2 \
-#   --remove_prefix_in_ckpt "pipe.dit." \
-#   --output_path "./models/train/FLUX.2-klein-9B_full" \
-#   --trainable_models "dit" \
-#   --use_gradient_checkpointing
--- a/examples/flux2/model_training/train.py
+++ b/examples/flux2/model_training/train.py
@@ -24,7 +24,7 @@ class Flux2ImageTrainingModule(DiffusionTrainingModule):
        super().__init__()
        # Load models
        model_configs = self.parse_model_configs(model_paths, model_id_with_origin_paths, fp8_models=fp8_models, offload_models=offload_models, device=device)
-        tokenizer_config = self.parse_path_or_model_id(tokenizer_path, default_value=ModelConfig(model_id="black-forest-labs/FLUX.2-dev", origin_file_pattern="tokenizer/"))
+        tokenizer_config = ModelConfig(model_id="black-forest-labs/FLUX.2-dev", origin_file_pattern="tokenizer/") if tokenizer_path is None else ModelConfig(tokenizer_path)
        self.pipe = Flux2ImagePipeline.from_pretrained(torch_dtype=torch.bfloat16, device=device, model_configs=model_configs, tokenizer_config=tokenizer_config)
        self.pipe = self.split_pipeline_units(task, self.pipe, trainable_models, lora_base_model)

@@ -85,7 +85,6 @@ def flux2_parser():
    parser = add_general_config(parser)
    parser = add_image_size_config(parser)
    parser.add_argument("--tokenizer_path", type=str, default=None, help="Path to tokenizer.")
-    parser.add_argument("--initialize_model_on_cpu", default=False, action="store_true", help="Whether to initialize models on CPU.")
    return parser


@@ -127,7 +126,7 @@ if __name__ == "__main__":
        fp8_models=args.fp8_models,
        offload_models=args.offload_models,
        task=args.task,
-        device="cpu" if args.initialize_model_on_cpu else accelerator.device,
+        device=accelerator.device,
    )
    model_logger = ModelLogger(
        args.output_path,
--- a/examples/flux2/model_training/validate_full/FLUX.2-klein-4B.py
+++ b/examples/flux2/model_training/validate_full/FLUX.2-klein-4B.py
@@ -1,20 +0,0 @@
-from diffsynth.pipelines.flux2_image import Flux2ImagePipeline, ModelConfig
-from diffsynth.core import load_state_dict
-import torch
-
-
-pipe = Flux2ImagePipeline.from_pretrained(
-    torch_dtype=torch.bfloat16,
-    device="cuda",
-    model_configs=[
-        ModelConfig(model_id="black-forest-labs/FLUX.2-klein-4B", origin_file_pattern="text_encoder/*.safetensors"),
-        ModelConfig(model_id="black-forest-labs/FLUX.2-klein-4B", origin_file_pattern="transformer/*.safetensors"),
-        ModelConfig(model_id="black-forest-labs/FLUX.2-klein-4B", origin_file_pattern="vae/diffusion_pytorch_model.safetensors"),
-    ],
-    tokenizer_config=ModelConfig(model_id="black-forest-labs/FLUX.2-klein-4B", origin_file_pattern="tokenizer/"),
-)
-state_dict = load_state_dict("./models/train/FLUX.2-klein-4B_full/epoch-1.safetensors", torch_dtype=torch.bfloat16)
-pipe.dit.load_state_dict(state_dict)
-prompt = "a dog"
-image = pipe(prompt=prompt, seed=0, num_inference_steps=40, cfg_scale=4, height=768, width=768)
-image.save("image.jpg")
--- a/examples/flux2/model_training/validate_full/FLUX.2-klein-9B.py
+++ b/examples/flux2/model_training/validate_full/FLUX.2-klein-9B.py
@@ -1,20 +0,0 @@
-from diffsynth.pipelines.flux2_image import Flux2ImagePipeline, ModelConfig
-from diffsynth.core import load_state_dict
-import torch
-
-
-pipe = Flux2ImagePipeline.from_pretrained(
-    torch_dtype=torch.bfloat16,
-    device="cuda",
-    model_configs=[
-        ModelConfig(model_id="black-forest-labs/FLUX.2-klein-9B", origin_file_pattern="text_encoder/*.safetensors"),
-        ModelConfig(model_id="black-forest-labs/FLUX.2-klein-9B", origin_file_pattern="transformer/*.safetensors"),
-        ModelConfig(model_id="black-forest-labs/FLUX.2-klein-9B", origin_file_pattern="vae/diffusion_pytorch_model.safetensors"),
-    ],
-    tokenizer_config=ModelConfig(model_id="black-forest-labs/FLUX.2-klein-9B", origin_file_pattern="tokenizer/"),
-)
-state_dict = load_state_dict("./models/train/FLUX.2-klein-9B_full/epoch-1.safetensors", torch_dtype=torch.bfloat16)
-pipe.dit.load_state_dict(state_dict)
-prompt = "a dog"
-image = pipe(prompt=prompt, seed=0, num_inference_steps=40, cfg_scale=4, height=768, width=768)
-image.save("image.jpg")
--- a/examples/flux2/model_training/validate_full/FLUX.2-klein-base-4B.py
+++ b/examples/flux2/model_training/validate_full/FLUX.2-klein-base-4B.py
@@ -1,20 +0,0 @@
-from diffsynth.pipelines.flux2_image import Flux2ImagePipeline, ModelConfig
-from diffsynth.core import load_state_dict
-import torch
-
-
-pipe = Flux2ImagePipeline.from_pretrained(
-    torch_dtype=torch.bfloat16,
-    device="cuda",
-    model_configs=[
-        ModelConfig(model_id="black-forest-labs/FLUX.2-klein-4B", origin_file_pattern="text_encoder/*.safetensors"),
-        ModelConfig(model_id="black-forest-labs/FLUX.2-klein-base-4B", origin_file_pattern="transformer/*.safetensors"),
-        ModelConfig(model_id="black-forest-labs/FLUX.2-klein-4B", origin_file_pattern="vae/diffusion_pytorch_model.safetensors"),
-    ],
-    tokenizer_config=ModelConfig(model_id="black-forest-labs/FLUX.2-klein-4B", origin_file_pattern="tokenizer/"),
-)
-state_dict = load_state_dict("./models/train/FLUX.2-klein-base-4B_full/epoch-1.safetensors", torch_dtype=torch.bfloat16)
-pipe.dit.load_state_dict(state_dict)
-prompt = "a dog"
-image = pipe(prompt=prompt, seed=0, num_inference_steps=40, cfg_scale=4, height=768, width=768)
-image.save("image.jpg")
--- a/examples/flux2/model_training/validate_full/FLUX.2-klein-base-9B.py
+++ b/examples/flux2/model_training/validate_full/FLUX.2-klein-base-9B.py
@@ -1,20 +0,0 @@
-from diffsynth.pipelines.flux2_image import Flux2ImagePipeline, ModelConfig
-from diffsynth.core import load_state_dict
-import torch
-
-
-pipe = Flux2ImagePipeline.from_pretrained(
-    torch_dtype=torch.bfloat16,
-    device="cuda",
-    model_configs=[
-        ModelConfig(model_id="black-forest-labs/FLUX.2-klein-9B", origin_file_pattern="text_encoder/*.safetensors"),
-        ModelConfig(model_id="black-forest-labs/FLUX.2-klein-base-9B", origin_file_pattern="transformer/*.safetensors"),
-        ModelConfig(model_id="black-forest-labs/FLUX.2-klein-9B", origin_file_pattern="vae/diffusion_pytorch_model.safetensors"),
-    ],
-    tokenizer_config=ModelConfig(model_id="black-forest-labs/FLUX.2-klein-9B", origin_file_pattern="tokenizer/"),
-)
-state_dict = load_state_dict("./models/train/FLUX.2-klein-base-9B_full/epoch-1.safetensors", torch_dtype=torch.bfloat16)
-pipe.dit.load_state_dict(state_dict)
-prompt = "a dog"
-image = pipe(prompt=prompt, seed=0, num_inference_steps=40, cfg_scale=4, height=768, width=768)
-image.save("image.jpg")
--- a/examples/flux2/model_training/validate_lora/FLUX.2-klein-4B.py
+++ b/examples/flux2/model_training/validate_lora/FLUX.2-klein-4B.py
@@ -1,18 +0,0 @@
-from diffsynth.pipelines.flux2_image import Flux2ImagePipeline, ModelConfig
-import torch
-
-
-pipe = Flux2ImagePipeline.from_pretrained(
-    torch_dtype=torch.bfloat16,
-    device="cuda",
-    model_configs=[
-        ModelConfig(model_id="black-forest-labs/FLUX.2-klein-4B", origin_file_pattern="text_encoder/*.safetensors"),
-        ModelConfig(model_id="black-forest-labs/FLUX.2-klein-4B", origin_file_pattern="transformer/*.safetensors"),
-        ModelConfig(model_id="black-forest-labs/FLUX.2-klein-4B", origin_file_pattern="vae/diffusion_pytorch_model.safetensors"),
-    ],
-    tokenizer_config=ModelConfig(model_id="black-forest-labs/FLUX.2-klein-4B", origin_file_pattern="tokenizer/"),
-)
-pipe.load_lora(pipe.dit, "./models/train/FLUX.2-klein-4B_lora/epoch-4.safetensors")
-prompt = "a dog"
-image = pipe(prompt=prompt, seed=0, num_inference_steps=40, cfg_scale=4, height=768, width=768)
-image.save("image.jpg")
--- a/examples/flux2/model_training/validate_lora/FLUX.2-klein-9B.py
+++ b/examples/flux2/model_training/validate_lora/FLUX.2-klein-9B.py
@@ -1,18 +0,0 @@
-from diffsynth.pipelines.flux2_image import Flux2ImagePipeline, ModelConfig
-import torch
-
-
-pipe = Flux2ImagePipeline.from_pretrained(
-    torch_dtype=torch.bfloat16,
-    device="cuda",
-    model_configs=[
-        ModelConfig(model_id="black-forest-labs/FLUX.2-klein-9B", origin_file_pattern="text_encoder/*.safetensors"),
-        ModelConfig(model_id="black-forest-labs/FLUX.2-klein-9B", origin_file_pattern="transformer/*.safetensors"),
-        ModelConfig(model_id="black-forest-labs/FLUX.2-klein-9B", origin_file_pattern="vae/diffusion_pytorch_model.safetensors"),
-    ],
-    tokenizer_config=ModelConfig(model_id="black-forest-labs/FLUX.2-klein-9B", origin_file_pattern="tokenizer/"),
-)
-pipe.load_lora(pipe.dit, "./models/train/FLUX.2-klein-9B_lora/epoch-4.safetensors")
-prompt = "a dog"
-image = pipe(prompt=prompt, seed=0, num_inference_steps=40, cfg_scale=4, height=768, width=768)
-image.save("image.jpg")
--- a/examples/flux2/model_training/validate_lora/FLUX.2-klein-base-4B.py
+++ b/examples/flux2/model_training/validate_lora/FLUX.2-klein-base-4B.py
@@ -1,18 +0,0 @@
-from diffsynth.pipelines.flux2_image import Flux2ImagePipeline, ModelConfig
-import torch
-
-
-pipe = Flux2ImagePipeline.from_pretrained(
-    torch_dtype=torch.bfloat16,
-    device="cuda",
-    model_configs=[
-        ModelConfig(model_id="black-forest-labs/FLUX.2-klein-4B", origin_file_pattern="text_encoder/*.safetensors"),
-        ModelConfig(model_id="black-forest-labs/FLUX.2-klein-base-4B", origin_file_pattern="transformer/*.safetensors"),
-        ModelConfig(model_id="black-forest-labs/FLUX.2-klein-4B", origin_file_pattern="vae/diffusion_pytorch_model.safetensors"),
-    ],
-    tokenizer_config=ModelConfig(model_id="black-forest-labs/FLUX.2-klein-base-4B", origin_file_pattern="tokenizer/"),
-)
-pipe.load_lora(pipe.dit, "./models/train/FLUX.2-klein-base-4B_lora/epoch-4.safetensors")
-prompt = "a dog"
-image = pipe(prompt=prompt, seed=0, num_inference_steps=40, cfg_scale=4, height=768, width=768)
-image.save("image.jpg")
--- a/examples/flux2/model_training/validate_lora/FLUX.2-klein-base-9B.py
+++ b/examples/flux2/model_training/validate_lora/FLUX.2-klein-base-9B.py
@@ -1,18 +0,0 @@
-from diffsynth.pipelines.flux2_image import Flux2ImagePipeline, ModelConfig
-import torch
-
-
-pipe = Flux2ImagePipeline.from_pretrained(
-    torch_dtype=torch.bfloat16,
-    device="cuda",
-    model_configs=[
-        ModelConfig(model_id="black-forest-labs/FLUX.2-klein-9B", origin_file_pattern="text_encoder/*.safetensors"),
-        ModelConfig(model_id="black-forest-labs/FLUX.2-klein-base-9B", origin_file_pattern="transformer/*.safetensors"),
-        ModelConfig(model_id="black-forest-labs/FLUX.2-klein-9B", origin_file_pattern="vae/diffusion_pytorch_model.safetensors"),
-    ],
-    tokenizer_config=ModelConfig(model_id="black-forest-labs/FLUX.2-klein-9B", origin_file_pattern="tokenizer/"),
-)
-pipe.load_lora(pipe.dit, "./models/train/FLUX.2-klein-base-9B_lora/epoch-4.safetensors")
-prompt = "a dog"
-image = pipe(prompt=prompt, seed=0, num_inference_steps=40, cfg_scale=4, height=768, width=768)
-image.save("image.jpg")
--- a/examples/qwen_image/model_inference/Qwen-Image-Layered-Control.py
+++ b/examples/qwen_image/model_inference/Qwen-Image-Layered-Control.py
@@ -1,34 +0,0 @@
-from diffsynth.pipelines.qwen_image import QwenImagePipeline, ModelConfig
-from modelscope import snapshot_download
-from PIL import Image
-import torch
-
-
-pipe = QwenImagePipeline.from_pretrained(
-    torch_dtype=torch.bfloat16,
-    device="cuda",
-    model_configs=[
-        ModelConfig(model_id="DiffSynth-Studio/Qwen-Image-Layered-Control", origin_file_pattern="transformer/diffusion_pytorch_model*.safetensors"),
-        ModelConfig(model_id="Qwen/Qwen-Image", origin_file_pattern="text_encoder/model*.safetensors"),
-        ModelConfig(model_id="Qwen/Qwen-Image-Layered", origin_file_pattern="vae/diffusion_pytorch_model.safetensors"),
-    ],
-    processor_config=ModelConfig(model_id="Qwen/Qwen-Image-Edit", origin_file_pattern="processor/"),
-)
-
-snapshot_download(
-    model_id="DiffSynth-Studio/Qwen-Image-Layered-Control",
-    allow_file_pattern="assets/image_1_input.png",
-    local_dir="data/layered_input"
-)
-
-prompt = "A cartoon skeleton character wearing a purple hat and holding a gift box"
-input_image = Image.open("data/layered_input/assets/image_1_input.png").convert("RGBA").resize((1024, 1024))
-images = pipe(
-    prompt,
-    seed=0,
-    num_inference_steps=30, cfg_scale=4,
-    height=1024, width=1024,
-    layer_input_image=input_image,
-    layer_num=0,
-)
-images[0].save("image.png")
--- a/examples/qwen_image/model_inference_low_vram/Qwen-Image-Layered-Control.py
+++ b/examples/qwen_image/model_inference_low_vram/Qwen-Image-Layered-Control.py
@@ -1,44 +0,0 @@
-from diffsynth.pipelines.qwen_image import QwenImagePipeline, ModelConfig
-from modelscope import snapshot_download
-from PIL import Image
-import torch
-
-
-vram_config = {
-    "offload_dtype": "disk",
-    "offload_device": "disk",
-    "onload_dtype": torch.float8_e4m3fn,
-    "onload_device": "cpu",
-    "preparing_dtype": torch.float8_e4m3fn,
-    "preparing_device": "cuda",
-    "computation_dtype": torch.bfloat16,
-    "computation_device": "cuda",
-}
-pipe = QwenImagePipeline.from_pretrained(
-    torch_dtype=torch.bfloat16,
-    device="cuda",
-    model_configs=[
-        ModelConfig(model_id="DiffSynth-Studio/Qwen-Image-Layered-Control", origin_file_pattern="transformer/diffusion_pytorch_model*.safetensors", **vram_config),
-        ModelConfig(model_id="Qwen/Qwen-Image", origin_file_pattern="text_encoder/model*.safetensors", **vram_config),
-        ModelConfig(model_id="Qwen/Qwen-Image-Layered", origin_file_pattern="vae/diffusion_pytorch_model.safetensors", **vram_config),
-    ],
-    processor_config=ModelConfig(model_id="Qwen/Qwen-Image-Edit", origin_file_pattern="processor/"),
-)
-
-snapshot_download(
-    model_id="DiffSynth-Studio/Qwen-Image-Layered-Control",
-    allow_file_pattern="assets/image_1_input.png",
-    local_dir="data/layered_input"
-)
-
-prompt = "A cartoon skeleton character wearing a purple hat and holding a gift box"
-input_image = Image.open("data/layered_input/assets/image_1_input.png").convert("RGBA").resize((1024, 1024))
-images = pipe(
-    prompt,
-    seed=0,
-    num_inference_steps=30, cfg_scale=4,
-    height=1024, width=1024,
-    layer_input_image=input_image,
-    layer_num=0,
-)
-images[0].save("image.png")
--- a/examples/qwen_image/model_training/full/Qwen-Image-Layered-Control.sh
+++ b/examples/qwen_image/model_training/full/Qwen-Image-Layered-Control.sh
@@ -1,18 +0,0 @@
-# Example Dataset: https://modelscope.cn/datasets/DiffSynth-Studio/example_image_dataset/tree/master/layer
-
-accelerate launch --config_file examples/qwen_image/model_training/full/accelerate_config_zero2offload.yaml examples/qwen_image/model_training/train.py \
-  --dataset_base_path data/example_image_dataset/layer \
-  --dataset_metadata_path data/example_image_dataset/layer/metadata_layered_control.json \
-  --data_file_keys "image,layer_input_image" \
-  --max_pixels 1048576 \
-  --dataset_repeat 50 \
-  --model_id_with_origin_paths "DiffSynth-Studio/Qwen-Image-Layered-Control:transformer/diffusion_pytorch_model*.safetensors,Qwen/Qwen-Image:text_encoder/model*.safetensors,Qwen/Qwen-Image-Layered:vae/diffusion_pytorch_model.safetensors" \
-  --learning_rate 1e-5 \
-  --num_epochs 2 \
-  --remove_prefix_in_ckpt "pipe.dit." \
-  --output_path "./models/train/Qwen-Image-Layered-Control_full" \
-  --trainable_models "dit" \
-  --extra_inputs "layer_num,layer_input_image" \
-  --use_gradient_checkpointing \
-  --dataset_num_workers 8 \
-  --find_unused_parameters
--- a/examples/qwen_image/model_training/full/accelerate_config_zero3.yaml
+++ b/examples/qwen_image/model_training/full/accelerate_config_zero3.yaml
@@ -1,23 +0,0 @@
-compute_environment: LOCAL_MACHINE
-debug: false
-deepspeed_config:
-  gradient_accumulation_steps: 1
-  offload_optimizer_device: none
-  offload_param_device: none
-  zero3_init_flag: true
-  zero3_save_16bit_model: true
-  zero_stage: 3
-distributed_type: DEEPSPEED
-downcast_bf16: 'no'
-enable_cpu_affinity: false
-machine_rank: 0
-main_training_function: main
-mixed_precision: bf16
-num_machines: 1
-num_processes: 8
-rdzv_backend: static
-same_network: true
-tpu_env: []
-tpu_use_cluster: false
-tpu_use_sudo: false
-use_cpu: false
--- a/examples/qwen_image/model_training/lora/Qwen-Image-Layered-Control.sh
+++ b/examples/qwen_image/model_training/lora/Qwen-Image-Layered-Control.sh
@@ -1,20 +0,0 @@
-# Example Dataset: https://modelscope.cn/datasets/DiffSynth-Studio/example_image_dataset/tree/master/layer
-
-accelerate launch examples/qwen_image/model_training/train.py \
-  --dataset_base_path data/example_image_dataset/layer \
-  --dataset_metadata_path data/example_image_dataset/layer/metadata_layered_control.json \
-  --data_file_keys "image,layer_input_image" \
-  --max_pixels 1048576 \
-  --dataset_repeat 50 \
-  --model_id_with_origin_paths "DiffSynth-Studio/Qwen-Image-Layered-Control:transformer/diffusion_pytorch_model*.safetensors,Qwen/Qwen-Image:text_encoder/model*.safetensors,Qwen/Qwen-Image-Layered:vae/diffusion_pytorch_model.safetensors" \
-  --learning_rate 1e-4 \
-  --num_epochs 5 \
-  --remove_prefix_in_ckpt "pipe.dit." \
-  --output_path "./models/train/Qwen-Image-Layered-Control_lora" \
-  --lora_base_model "dit" \
-  --lora_target_modules "to_q,to_k,to_v,add_q_proj,add_k_proj,add_v_proj,to_out.0,to_add_out,img_mlp.net.2,img_mod.1,txt_mlp.net.2,txt_mod.1" \
-  --lora_rank 32 \
-  --extra_inputs "layer_num,layer_input_image" \
-  --use_gradient_checkpointing \
-  --dataset_num_workers 8 \
-  --find_unused_parameters
--- a/examples/qwen_image/model_training/special/npu_training/Qwen-Image-Edit-2509-LoRA-NPU.sh
+++ b/examples/qwen_image/model_training/special/npu_training/Qwen-Image-Edit-2509-LoRA-NPU.sh
@@ -1,38 +0,0 @@
-# Due to memory limitations, split training is required to train the model on NPU
-export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
-export CPU_AFFINITY_CONF=1
-
-accelerate launch examples/qwen_image/model_training/train.py \
-  --dataset_base_path data/example_image_dataset \
-  --dataset_metadata_path data/example_image_dataset/metadata.csv \
-  --max_pixels 1048576 \
-  --dataset_repeat 1 \
-  --model_id_with_origin_paths "Qwen/Qwen-Image-Edit-2509:text_encoder/model*.safetensors,Qwen/Qwen-Image-Edit-2509:vae/diffusion_pytorch_model.safetensors" \
-  --learning_rate 1e-4 \
-  --num_epochs 5 \
-  --remove_prefix_in_ckpt "pipe.dit." \
-  --output_path "./models/train/Qwen-Image-Edit-2509-LoRA-splited-cache" \
-  --lora_base_model "dit" \
-  --lora_target_modules "to_q,to_k,to_v,add_q_proj,add_k_proj,add_v_proj,to_out.0,to_add_out,img_mlp.net.2,img_mod.1,txt_mlp.net.2,txt_mod.1" \
-  --lora_rank 32 \
-  --use_gradient_checkpointing \
-  --dataset_num_workers 8 \
-  --find_unused_parameters \
-  --task "sft:data_process"
-
-accelerate launch examples/qwen_image/model_training/train.py \
-  --dataset_base_path "./models/train/Qwen-Image-Edit-2509-LoRA-splited-cache" \
-  --max_pixels 1048576 \
-  --dataset_repeat 50 \
-  --model_id_with_origin_paths "Qwen/Qwen-Image-Edit-2509:transformer/diffusion_pytorch_model*.safetensors" \
-  --learning_rate 1e-4 \
-  --num_epochs 5 \
-  --remove_prefix_in_ckpt "pipe.dit." \
-  --output_path "./models/train/Qwen-Image-Edit-2509-LoRA-splited" \
-  --lora_base_model "dit" \
-  --lora_target_modules "to_q,to_k,to_v,add_q_proj,add_k_proj,add_v_proj,to_out.0,to_add_out,img_mlp.net.2,img_mod.1,txt_mlp.net.2,txt_mod.1" \
-  --lora_rank 32 \
-  --use_gradient_checkpointing \
-  --dataset_num_workers 8 \
-  --find_unused_parameters \
-  --task "sft:train"
--- a/examples/qwen_image/model_training/special/npu_training/Qwen-Image-Edit-2509-NPU.sh
+++ b/examples/qwen_image/model_training/special/npu_training/Qwen-Image-Edit-2509-NPU.sh
@@ -1,19 +0,0 @@
-# This script was tested using zero3 and on 8*910B(NPU)
-export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
-export CPU_AFFINITY_CONF=1
-
-accelerate launch --config_file examples/qwen_image/model_training/full/accelerate_config_zero3.yaml examples/qwen_image/model_training/train.py \
-  --dataset_base_path data/example_image_dataset \
-  --dataset_metadata_path data/example_image_dataset/metadata_qwen_imgae_edit_multi.json \
-  --data_file_keys "image,edit_image" \
-  --extra_inputs "edit_image" \
-  --max_pixels 1048576 \
-  --dataset_repeat 50 \
-  --model_id_with_origin_paths "Qwen/Qwen-Image-Edit-2509:transformer/diffusion_pytorch_model*.safetensors,Qwen/Qwen-Image:text_encoder/model*.safetensors,Qwen/Qwen-Image:vae/diffusion_pytorch_model.safetensors" \
-  --learning_rate 1e-5 \
-  --num_epochs 2 \
-  --remove_prefix_in_ckpt "pipe.dit." \
-  --output_path "./models/train/Qwen-Image-Edit-2509_full" \
-  --trainable_models "dit" \
-  --use_gradient_checkpointing \
-  --find_unused_parameters
--- a/examples/qwen_image/model_training/special/npu_training/Qwen-Image-LoRA-NPU.sh
+++ b/examples/qwen_image/model_training/special/npu_training/Qwen-Image-LoRA-NPU.sh
@@ -1,38 +0,0 @@
-# Due to memory limitations, split training is required to train the model on NPU
-export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
-export CPU_AFFINITY_CONF=1
-
-accelerate launch examples/qwen_image/model_training/train.py \
-  --dataset_base_path data/example_image_dataset \
-  --dataset_metadata_path data/example_image_dataset/metadata.csv \
-  --max_pixels 1048576 \
-  --dataset_repeat 1 \
-  --model_id_with_origin_paths "Qwen/Qwen-Image:text_encoder/model*.safetensors,Qwen/Qwen-Image:vae/diffusion_pytorch_model.safetensors" \
-  --learning_rate 1e-4 \
-  --num_epochs 5 \
-  --remove_prefix_in_ckpt "pipe.dit." \
-  --output_path "./models/train/Qwen-Image-LoRA-splited-cache" \
-  --lora_base_model "dit" \
-  --lora_target_modules "to_q,to_k,to_v,add_q_proj,add_k_proj,add_v_proj,to_out.0,to_add_out,img_mlp.net.2,img_mod.1,txt_mlp.net.2,txt_mod.1" \
-  --lora_rank 32 \
-  --use_gradient_checkpointing \
-  --dataset_num_workers 8 \
-  --find_unused_parameters \
-  --task "sft:data_process"
-
-accelerate launch examples/qwen_image/model_training/train.py \
-  --dataset_base_path "./models/train/Qwen-Image-LoRA-splited-cache" \
-  --max_pixels 1048576 \
-  --dataset_repeat 50 \
-  --model_id_with_origin_paths "Qwen/Qwen-Image:transformer/diffusion_pytorch_model*.safetensors" \
-  --learning_rate 1e-4 \
-  --num_epochs 5 \
-  --remove_prefix_in_ckpt "pipe.dit." \
-  --output_path "./models/train/Qwen-Image-LoRA-splited" \
-  --lora_base_model "dit" \
-  --lora_target_modules "to_q,to_k,to_v,add_q_proj,add_k_proj,add_v_proj,to_out.0,to_add_out,img_mlp.net.2,img_mod.1,txt_mlp.net.2,txt_mod.1" \
-  --lora_rank 32 \
-  --use_gradient_checkpointing \
-  --dataset_num_workers 8 \
-  --find_unused_parameters \
-  --task "sft:train"
--- a/examples/qwen_image/model_training/validate_full/Qwen-Image-Layered-Control.py
+++ b/examples/qwen_image/model_training/validate_full/Qwen-Image-Layered-Control.py
@@ -1,26 +0,0 @@
-from diffsynth.pipelines.qwen_image import QwenImagePipeline, ModelConfig
-from diffsynth import load_state_dict
-from PIL import Image
-import torch
-
-
-pipe = QwenImagePipeline.from_pretrained(
-    torch_dtype=torch.bfloat16,
-    device="cuda",
-    model_configs=[
-        ModelConfig(model_id="DiffSynth-Studio/Qwen-Image-Layered-Control", origin_file_pattern="transformer/diffusion_pytorch_model*.safetensors"),
-        ModelConfig(model_id="Qwen/Qwen-Image", origin_file_pattern="text_encoder/model*.safetensors"),
-        ModelConfig(model_id="Qwen/Qwen-Image-Layered", origin_file_pattern="vae/diffusion_pytorch_model.safetensors"),
-    ],
-    tokenizer_config=ModelConfig(model_id="Qwen/Qwen-Image", origin_file_pattern="tokenizer/"),
-)
-state_dict = load_state_dict("models/train/Qwen-Image-Layered-Control_full/epoch-1.safetensors")
-pipe.dit.load_state_dict(state_dict)
-prompt = "Text 'HELLO' and 'Have a great day'"
-input_image = Image.open("data/example_image_dataset/layer/image.png").convert("RGBA").resize((864, 480))
-images = pipe(
-    prompt, seed=0,
-    height=480, width=864,
-    layer_input_image=input_image, layer_num=0,
-)
-images[0].save("image.png")
--- a/examples/qwen_image/model_training/validate_lora/Qwen-Image-Layered-Control.py
+++ b/examples/qwen_image/model_training/validate_lora/Qwen-Image-Layered-Control.py
@@ -1,25 +0,0 @@
-from diffsynth.pipelines.qwen_image import QwenImagePipeline, ModelConfig
-from diffsynth import load_state_dict
-from PIL import Image
-import torch
-
-
-pipe = QwenImagePipeline.from_pretrained(
-    torch_dtype=torch.bfloat16,
-    device="cuda",
-    model_configs=[
-        ModelConfig(model_id="DiffSynth-Studio/Qwen-Image-Layered-Control", origin_file_pattern="transformer/diffusion_pytorch_model*.safetensors"),
-        ModelConfig(model_id="Qwen/Qwen-Image", origin_file_pattern="text_encoder/model*.safetensors"),
-        ModelConfig(model_id="Qwen/Qwen-Image-Layered", origin_file_pattern="vae/diffusion_pytorch_model.safetensors"),
-    ],
-    tokenizer_config=ModelConfig(model_id="Qwen/Qwen-Image", origin_file_pattern="tokenizer/"),
-)
-pipe.load_lora(pipe.dit, "models/train/Qwen-Image-Layered-Control_lora/epoch-4.safetensors")
-prompt = "Text 'HELLO' and 'Have a great day'"
-input_image = Image.open("data/example_image_dataset/layer/image.png").convert("RGBA").resize((864, 480))
-images = pipe(
-    prompt, seed=0,
-    height=480, width=864,
-    layer_input_image=input_image, layer_num=0,
-)
-images[0].save("image.png")
--- a/examples/wanvideo/model_training/full/Wan2.1-VACE-1.3B-Preview.sh
+++ b/examples/wanvideo/model_training/full/Wan2.1-VACE-1.3B-Preview.sh
@@ -7,11 +7,10 @@ accelerate launch examples/wanvideo/model_training/train.py \
  --num_frames 49 \
  --dataset_repeat 100 \
  --model_id_with_origin_paths "iic/VACE-Wan2.1-1.3B-Preview:diffusion_pytorch_model*.safetensors,iic/VACE-Wan2.1-1.3B-Preview:models_t5_umt5-xxl-enc-bf16.pth,iic/VACE-Wan2.1-1.3B-Preview:Wan2.1_VAE.pth" \
-  --learning_rate 5e-5 \
+  --learning_rate 1e-4 \
  --num_epochs 2 \
  --remove_prefix_in_ckpt "pipe.vace." \
  --output_path "./models/train/Wan2.1-VACE-1.3B-Preview_full" \
  --trainable_models "vace" \
  --extra_inputs "vace_video,vace_reference_image" \
-  --use_gradient_checkpointing_offload
-# The learning rate is kept consistent with the settings in the original paper
+  --use_gradient_checkpointing_offload
--- a/examples/wanvideo/model_training/full/Wan2.1-VACE-1.3B.sh
+++ b/examples/wanvideo/model_training/full/Wan2.1-VACE-1.3B.sh
@@ -7,11 +7,10 @@ accelerate launch examples/wanvideo/model_training/train.py \
  --num_frames 49 \
  --dataset_repeat 100 \
  --model_id_with_origin_paths "Wan-AI/Wan2.1-VACE-1.3B:diffusion_pytorch_model*.safetensors,Wan-AI/Wan2.1-VACE-1.3B:models_t5_umt5-xxl-enc-bf16.pth,Wan-AI/Wan2.1-VACE-1.3B:Wan2.1_VAE.pth" \
-  --learning_rate 5e-5 \
+  --learning_rate 1e-4 \
  --num_epochs 2 \
  --remove_prefix_in_ckpt "pipe.vace." \
  --output_path "./models/train/Wan2.1-VACE-1.3B_full" \
  --trainable_models "vace" \
  --extra_inputs "vace_video,vace_reference_image" \
-  --use_gradient_checkpointing_offload
-# The learning rate is kept consistent with the settings in the original paper
+  --use_gradient_checkpointing_offload
--- a/examples/wanvideo/model_training/full/Wan2.1-VACE-14B.sh
+++ b/examples/wanvideo/model_training/full/Wan2.1-VACE-14B.sh
@@ -7,11 +7,10 @@ accelerate launch --config_file examples/wanvideo/model_training/full/accelerate
  --num_frames 17 \
  --dataset_repeat 100 \
  --model_id_with_origin_paths "Wan-AI/Wan2.1-VACE-14B:diffusion_pytorch_model*.safetensors,Wan-AI/Wan2.1-VACE-14B:models_t5_umt5-xxl-enc-bf16.pth,Wan-AI/Wan2.1-VACE-14B:Wan2.1_VAE.pth" \
-  --learning_rate 5e-5 \
+  --learning_rate 1e-4 \
  --num_epochs 2 \
  --remove_prefix_in_ckpt "pipe.vace." \
  --output_path "./models/train/Wan2.1-VACE-14B_full" \
  --trainable_models "vace" \
  --extra_inputs "vace_video,vace_reference_image" \
-  --use_gradient_checkpointing_offload
-# The learning rate is kept consistent with the settings in the original paper
+  --use_gradient_checkpointing_offload
--- a/examples/wanvideo/model_training/full/Wan2.2-VACE-Fun-A14B.sh
+++ b/examples/wanvideo/model_training/full/Wan2.2-VACE-Fun-A14B.sh
@@ -7,7 +7,7 @@ accelerate launch --config_file examples/wanvideo/model_training/full/accelerate
  --num_frames 17 \
  --dataset_repeat 100 \
  --model_id_with_origin_paths "PAI/Wan2.2-VACE-Fun-A14B:high_noise_model/diffusion_pytorch_model*.safetensors,PAI/Wan2.2-VACE-Fun-A14B:models_t5_umt5-xxl-enc-bf16.pth,PAI/Wan2.2-VACE-Fun-A14B:Wan2.1_VAE.pth" \
-  --learning_rate 5e-5 \
+  --learning_rate 1e-4 \
  --num_epochs 2 \
  --remove_prefix_in_ckpt "pipe.vace." \
  --output_path "./models/train/Wan2.2-VACE-Fun-A14B_high_noise_full" \
@@ -18,7 +18,6 @@ accelerate launch --config_file examples/wanvideo/model_training/full/accelerate
  --min_timestep_boundary 0 \
  --initialize_model_on_cpu
 # boundary corresponds to timesteps [900, 1000]
-# The learning rate is kept consistent with the settings in the original paper


 accelerate launch --config_file examples/wanvideo/model_training/full/accelerate_config_14B.yaml examples/wanvideo/model_training/train.py \
@@ -30,7 +29,7 @@ accelerate launch --config_file examples/wanvideo/model_training/full/accelerate
  --num_frames 17 \
  --dataset_repeat 100 \
  --model_id_with_origin_paths "PAI/Wan2.2-VACE-Fun-A14B:low_noise_model/diffusion_pytorch_model*.safetensors,PAI/Wan2.2-VACE-Fun-A14B:models_t5_umt5-xxl-enc-bf16.pth,PAI/Wan2.2-VACE-Fun-A14B:Wan2.1_VAE.pth" \
-  --learning_rate 5e-5 \
+  --learning_rate 1e-4 \
  --num_epochs 2 \
  --remove_prefix_in_ckpt "pipe.vace." \
  --output_path "./models/train/Wan2.2-VACE-Fun-A14B_low_noise_full" \
@@ -40,5 +39,4 @@ accelerate launch --config_file examples/wanvideo/model_training/full/accelerate
  --max_timestep_boundary 1 \
  --min_timestep_boundary 0.358 \
  --initialize_model_on_cpu
-# boundary corresponds to timesteps [0, 900]
-# The learning rate is kept consistent with the settings in the original paper
+# boundary corresponds to timesteps [0, 900]
--- a/examples/wanvideo/model_training/full/accelerate_config_zero3.yaml
+++ b/examples/wanvideo/model_training/full/accelerate_config_zero3.yaml
@@ -1,23 +0,0 @@
-compute_environment: LOCAL_MACHINE
-debug: false
-deepspeed_config:
-  gradient_accumulation_steps: 1
-  offload_optimizer_device: none
-  offload_param_device: none
-  zero3_init_flag: true
-  zero3_save_16bit_model: true
-  zero_stage: 3
-distributed_type: DEEPSPEED
-downcast_bf16: 'no'
-enable_cpu_affinity: false
-machine_rank: 0
-main_training_function: main
-mixed_precision: bf16
-num_machines: 1
-num_processes: 8
-rdzv_backend: static
-same_network: true
-tpu_env: []
-tpu_use_cluster: false
-tpu_use_sudo: false
-use_cpu: false
--- a/examples/wanvideo/model_training/special/npu_training/Wan2.1-T2V-14B-NPU.sh
+++ b/examples/wanvideo/model_training/special/npu_training/Wan2.1-T2V-14B-NPU.sh
@@ -1,16 +0,0 @@
-export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
-export CPU_AFFINITY_CONF=1
-
-accelerate launch --config_file examples/wanvideo/model_training/full/accelerate_config_14B.yaml examples/wanvideo/model_training/train.py \
-  --dataset_base_path data/example_video_dataset \
-  --dataset_metadata_path data/example_video_dataset/metadata.csv \
-  --height 480 \
-  --width 832 \
-  --dataset_repeat 100 \
-  --model_id_with_origin_paths "Wan-AI/Wan2.1-T2V-14B:diffusion_pytorch_model*.safetensors,Wan-AI/Wan2.1-T2V-14B:models_t5_umt5-xxl-enc-bf16.pth,Wan-AI/Wan2.1-T2V-14B:Wan2.1_VAE.pth" \
-  --learning_rate 1e-5 \
-  --num_epochs 2 \
-  --remove_prefix_in_ckpt "pipe.dit." \
-  --output_path "./models/train/Wan2.1-T2V-14B_full" \
-  --trainable_models "dit" \
-  --initialize_model_on_cpu
--- a/examples/wanvideo/model_training/special/npu_training/Wan2.2-T2V-A14B-NPU.sh
+++ b/examples/wanvideo/model_training/special/npu_training/Wan2.2-T2V-A14B-NPU.sh
@@ -1,38 +0,0 @@
-export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
-export CPU_AFFINITY_CONF=1
-
-accelerate launch --config_file examples/wanvideo/model_training/full/accelerate_config_14B.yaml examples/wanvideo/model_training/train.py \
-  --dataset_base_path data/example_video_dataset \
-  --dataset_metadata_path data/example_video_dataset/metadata.csv \
-  --height 480 \
-  --width 832 \
-  --num_frames 49 \
-  --dataset_repeat 100 \
-  --model_id_with_origin_paths "Wan-AI/Wan2.2-T2V-A14B:high_noise_model/diffusion_pytorch_model*.safetensors,Wan-AI/Wan2.2-T2V-A14B:models_t5_umt5-xxl-enc-bf16.pth,Wan-AI/Wan2.2-T2V-A14B:Wan2.1_VAE.pth" \
-  --learning_rate 1e-5 \
-  --num_epochs 2 \
-  --remove_prefix_in_ckpt "pipe.dit." \
-  --output_path "./models/train/Wan2.2-T2V-A14B_high_noise_full" \
-  --trainable_models "dit" \
-  --max_timestep_boundary 0.417 \
-  --min_timestep_boundary 0 \
-  --initialize_model_on_cpu
-# boundary corresponds to timesteps [875, 1000]
-
-accelerate launch --config_file examples/wanvideo/model_training/full/accelerate_config_14B.yaml examples/wanvideo/model_training/train.py \
-  --dataset_base_path data/example_video_dataset \
-  --dataset_metadata_path data/example_video_dataset/metadata.csv \
-  --height 480 \
-  --width 832 \
-  --num_frames 49 \
-  --dataset_repeat 100 \
-  --model_id_with_origin_paths "Wan-AI/Wan2.2-T2V-A14B:low_noise_model/diffusion_pytorch_model*.safetensors,Wan-AI/Wan2.2-T2V-A14B:models_t5_umt5-xxl-enc-bf16.pth,Wan-AI/Wan2.2-T2V-A14B:Wan2.1_VAE.pth" \
-  --learning_rate 1e-5 \
-  --num_epochs 2 \
-  --remove_prefix_in_ckpt "pipe.dit." \
-  --output_path "./models/train/Wan2.2-T2V-A14B_low_noise_full" \
-  --trainable_models "dit" \
-  --max_timestep_boundary 1 \
-  --min_timestep_boundary 0.417 \
-  --initialize_model_on_cpu
-# boundary corresponds to timesteps [0, 875)
--- a/examples/wanvideo/model_training/special/npu_training/Wan2.2-VACE-Fun-A14B-NPU.sh
+++ b/examples/wanvideo/model_training/special/npu_training/Wan2.2-VACE-Fun-A14B-NPU.sh
@@ -1,45 +0,0 @@
-export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
-export CPU_AFFINITY_CONF=1
-
-accelerate launch --config_file examples/wanvideo/model_training/full/accelerate_config_14B.yaml examples/wanvideo/model_training/train.py \
-  --dataset_base_path data/example_video_dataset \
-  --dataset_metadata_path data/example_video_dataset/metadata_vace.csv \
-  --data_file_keys "video,vace_video,vace_reference_image" \
-  --height 480 \
-  --width 832 \
-  --num_frames 17 \
-  --dataset_repeat 100 \
-  --model_id_with_origin_paths "PAI/Wan2.2-VACE-Fun-A14B:high_noise_model/diffusion_pytorch_model*.safetensors,PAI/Wan2.2-VACE-Fun-A14B:models_t5_umt5-xxl-enc-bf16.pth,PAI/Wan2.2-VACE-Fun-A14B:Wan2.1_VAE.pth" \
-  --learning_rate 1e-4 \
-  --num_epochs 2 \
-  --remove_prefix_in_ckpt "pipe.vace." \
-  --output_path "./models/train/Wan2.2-VACE-Fun-A14B_high_noise_full" \
-  --trainable_models "vace" \
-  --extra_inputs "vace_video,vace_reference_image" \
-  --use_gradient_checkpointing_offload \
-  --max_timestep_boundary 0.358 \
-  --min_timestep_boundary 0 \
-  --initialize_model_on_cpu
-# boundary corresponds to timesteps [900, 1000]
-
-
-accelerate launch --config_file examples/wanvideo/model_training/full/accelerate_config_14B.yaml examples/wanvideo/model_training/train.py \
-  --dataset_base_path data/example_video_dataset \
-  --dataset_metadata_path data/example_video_dataset/metadata_vace.csv \
-  --data_file_keys "video,vace_video,vace_reference_image" \
-  --height 480 \
-  --width 832 \
-  --num_frames 17 \
-  --dataset_repeat 100 \
-  --model_id_with_origin_paths "PAI/Wan2.2-VACE-Fun-A14B:low_noise_model/diffusion_pytorch_model*.safetensors,PAI/Wan2.2-VACE-Fun-A14B:models_t5_umt5-xxl-enc-bf16.pth,PAI/Wan2.2-VACE-Fun-A14B:Wan2.1_VAE.pth" \
-  --learning_rate 1e-4 \
-  --num_epochs 2 \
-  --remove_prefix_in_ckpt "pipe.vace." \
-  --output_path "./models/train/Wan2.2-VACE-Fun-A14B_low_noise_full" \
-  --trainable_models "vace" \
-  --extra_inputs "vace_video,vace_reference_image" \
-  --use_gradient_checkpointing_offload \
-  --max_timestep_boundary 1 \
-  --min_timestep_boundary 0.358 \
-  --initialize_model_on_cpu
-# boundary corresponds to timesteps [0, 900]
--- a/examples/z_image/model_inference/Z-Image-Omni-Base-i2L.py
+++ b/examples/z_image/model_inference/Z-Image-Omni-Base-i2L.py
@@ -1,62 +0,0 @@
-from diffsynth.pipelines.z_image import (
-    ZImagePipeline, ModelConfig,
-    ZImageUnit_Image2LoRAEncode, ZImageUnit_Image2LoRADecode
-)
-from modelscope import snapshot_download
-from safetensors.torch import save_file
-import torch
-from PIL import Image
-
-# Use `vram_config` to enable LoRA hot-loading
-vram_config = {
-    "offload_dtype": torch.bfloat16,
-    "offload_device": "cuda",
-    "onload_dtype": torch.bfloat16,
-    "onload_device": "cuda",
-    "preparing_dtype": torch.bfloat16,
-    "preparing_device": "cuda",
-    "computation_dtype": torch.bfloat16,
-    "computation_device": "cuda",
-}
-
-# Load models
-pipe = ZImagePipeline.from_pretrained(
-    torch_dtype=torch.bfloat16,
-    device="cuda",
-    model_configs=[
-        ModelConfig(model_id="Tongyi-MAI/Z-Image-Omni-Base", origin_file_pattern="transformer/*.safetensors", **vram_config),
-        ModelConfig(model_id="Tongyi-MAI/Z-Image-Omni-Base", origin_file_pattern="siglip/model.safetensors"),
-        ModelConfig(model_id="Tongyi-MAI/Z-Image-Turbo", origin_file_pattern="text_encoder/*.safetensors"),
-        ModelConfig(model_id="Tongyi-MAI/Z-Image-Turbo", origin_file_pattern="vae/diffusion_pytorch_model.safetensors"),
-        ModelConfig(model_id="DiffSynth-Studio/General-Image-Encoders", origin_file_pattern="SigLIP2-G384/model.safetensors"),
-        ModelConfig(model_id="DiffSynth-Studio/General-Image-Encoders", origin_file_pattern="DINOv3-7B/model.safetensors"),
-        ModelConfig(model_id="DiffSynth-Studio/Z-Image-Omni-Base-i2L", origin_file_pattern="model.safetensors"),
-    ],
-    tokenizer_config=ModelConfig(model_id="Tongyi-MAI/Z-Image-Turbo", origin_file_pattern="tokenizer/"),
-)
-
-# Load images
-snapshot_download(
-    model_id="DiffSynth-Studio/Z-Image-Omni-Base-i2L",
-    allow_file_pattern="assets/style/*",
-    local_dir="data/style_input"
-)
-images = [Image.open(f"data/style_input/assets/style/1/{i}.jpg") for i in range(6)]
-
-# Image to LoRA
-with torch.no_grad():
-    embs = ZImageUnit_Image2LoRAEncode().process(pipe, image2lora_images=images)
-    lora = ZImageUnit_Image2LoRADecode().process(pipe, **embs)["lora"]
-save_file(lora, "lora.safetensors")
-
-# Generate images
-prompt = "a cat"
-negative_prompt = "泛黄，发绿，模糊，低分辨率，低质量图像，扭曲的肢体，诡异的外观，丑陋，AI感，噪点，网格感，JPEG压缩条纹，异常的肢体，水印，乱码，意义不明的字符"
-image = pipe(
-    prompt=prompt,
-    negative_prompt=negative_prompt,
-    seed=0, cfg_scale=7, num_inference_steps=50,
-    positive_only_lora=lora,
-    sigma_shift=8
-)
-image.save("image.jpg")