Compare commits

...

28 Commits

Author SHA1 Message Date
Artiprocher
add6f88324 bugfix 2026-03-03 15:33:42 +08:00
Zhongjie Duan
430b495100 Merge pull request #1321 from mi804/bugfix
fix qwen_text_encoder bug in transformers>=5.2.0
2026-03-03 13:02:45 +08:00
mi804
62ba8a3f2e fix qwen_text_encoder bug in transformers>=5.2.0 2026-03-03 12:44:36 +08:00
Zhongjie Duan
237d178733 Fix LoRA compatibility issues. (#1320) 2026-03-03 11:08:31 +08:00
Zhongjie Duan
b3ef224042 support Anima gradient checkpointing (#1319) 2026-03-02 19:06:55 +08:00
Zhongjie Duan
f43b18ec21 Update docs (#1318)
* update docs
2026-03-02 18:59:13 +08:00
Zhongjie Duan
6d671db5d2 Support Anima (#1317)
* support Anima

Co-authored-by: mi804 <1576993271@qq.com>
2026-03-02 18:49:02 +08:00
Zhongjie Duan
880231b4be Merge pull request #1315 from modelscope/docs2.0
update ltx-2 docs
2026-03-02 11:02:20 +08:00
mi804
b3f6c3275f update ltx-2 2026-03-02 10:58:02 +08:00
Zhongjie Duan
29cd5c7612 Merge pull request #1275 from Mr-Neutr0n/fix-dit-none-check
Fix AttributeError when pipe.dit is None during split training
2026-03-02 10:25:11 +08:00
Zhongjie Duan
ff4be1c7c7 Merge pull request #1293 from Mr-Neutr0n/fix/trajectory-loss-div-by-zero
fix: prevent division by zero in TrajectoryImitationLoss at final denoising step
2026-03-02 10:21:39 +08:00
Zhongjie Duan
6b0fb1601f Merge pull request #1296 from Explorer-Dong/fix/wan_vae
fix: WanVAE2.2 encode and decode error
2026-03-02 10:19:36 +08:00
Zhongjie Duan
4b400c07eb Merge pull request #1297 from Feng0w0/npu_fused
[doc][NPU]Documentation on modifications, NPU environment installation, and additional parameter
2026-03-02 10:16:01 +08:00
Zhongjie Duan
6a6ae6d791 Merge pull request #1312 from mi804/ltx2-iclora
Ltx2 iclora
2026-02-28 12:45:16 +08:00
mi804
1a380a6b62 minor fix 2026-02-28 11:09:10 +08:00
mi804
5ca74923e8 add readme 2026-02-28 10:56:08 +08:00
mi804
8b9a094c1b ltx iclora train 2026-02-27 18:43:53 +08:00
mi804
5996c2b068 support inference 2026-02-27 16:48:16 +08:00
Zhongjie Duan
8fc7e005a6 Merge pull request #1309 from mi804/ltx2-train
support ltx2 gradient_checkpointing
2026-02-26 19:31:04 +08:00
mi804
a18966c300 support ltx2 gradient_checkpointing 2026-02-26 19:19:59 +08:00
feng0w0
96143aa26b Merge branch 'npu_fused' of https://github.com/Feng0w0/DiffSynth-Studio into npu_fused 2026-02-13 10:06:39 +08:00
feng0w0
71cea4371c [doc][NPU]Documentation on modifications, NPU environment installation, and additional parameter 2026-02-13 09:58:27 +08:00
Mr_Dwj
fc11fd4297 chore: remove invalid comment code 2026-02-13 09:38:14 +08:00
Mr_Dwj
bd3c5822a1 fix: WanVAE2.2 decode error 2026-02-13 01:13:08 +08:00
Mr_Dwj
96fb0f3afe fix: unpack Resample38 output 2026-02-12 23:51:56 +08:00
Mr-Neutr0n
b68663426f fix: preserve sign of denominator in clamp to avoid inverting gradient direction
The previous .clamp(min=1e-6) on (sigma_ - sigma) flips the sign when
the denominator is negative (which is the typical case since sigmas
decrease monotonically). This would invert the target and cause
training divergence.

Use torch.sign(denom) * torch.clamp(denom.abs(), min=1e-6) instead,
which prevents division by zero while preserving the correct sign.
2026-02-11 21:04:55 +05:30
Mr-Neutr0n
0e6976a0ae fix: prevent division by zero in trajectory imitation loss at last step 2026-02-11 19:51:25 +05:30
Mr-Neutr0n
6383ec358c Fix AttributeError when pipe.dit is None
When using split training with 'sft:data_process' task, the DiT model
is not loaded but the attribute 'dit' exists with value None. The
existing hasattr check returns True but then accessing siglip_embedder
fails.

Add an explicit None check before accessing pipe.dit.siglip_embedder.

Fixes #1246
2026-02-07 05:23:11 +05:30
53 changed files with 2859 additions and 78 deletions

View File

@@ -32,6 +32,9 @@ We believe that a well-developed open-source code framework can lower the thresh
> DiffSynth-Studio has undergone major version updates, and some old features are no longer maintained. If you need to use old features, please switch to the [last historical version](https://github.com/modelscope/DiffSynth-Studio/tree/afd101f3452c9ecae0c87b79adfa2e22d65ffdc3) before the major version update. > DiffSynth-Studio has undergone major version updates, and some old features are no longer maintained. If you need to use old features, please switch to the [last historical version](https://github.com/modelscope/DiffSynth-Studio/tree/afd101f3452c9ecae0c87b79adfa2e22d65ffdc3) before the major version update.
> Currently, the development personnel of this project are limited, with most of the work handled by [Artiprocher](https://github.com/Artiprocher). Therefore, the progress of new feature development will be relatively slow, and the speed of responding to and resolving issues is limited. We apologize for this and ask developers to understand. > Currently, the development personnel of this project are limited, with most of the work handled by [Artiprocher](https://github.com/Artiprocher). Therefore, the progress of new feature development will be relatively slow, and the speed of responding to and resolving issues is limited. We apologize for this and ask developers to understand.
- **March 2, 2026** Added support for [Anima](https://modelscope.cn/models/circlestone-labs/Anima). For details, please refer to the [documentation](docs/en/Model_Details/Anima.md). This is an interesting anime-style image generation model. We look forward to its future updates.
- **February 26, 2026** Added full and lora training support for the LTX-2 audio-video generation model. See the [documentation](/docs/en/Model_Details/LTX-2.md) for details. - **February 26, 2026** Added full and lora training support for the LTX-2 audio-video generation model. See the [documentation](/docs/en/Model_Details/LTX-2.md) for details.
- **February 10, 2026** Added inference support for the LTX-2 audio-video generation model. See the [documentation](/docs/en/Model_Details/LTX-2.md) for details. Support for model training will be implemented in the future. - **February 10, 2026** Added inference support for the LTX-2 audio-video generation model. See the [documentation](/docs/en/Model_Details/LTX-2.md) for details. Support for model training will be implemented in the future.
@@ -343,6 +346,60 @@ Example code for FLUX.2 is available at: [/examples/flux2/](/examples/flux2/)
</details> </details>
#### Anima: [/docs/en/Model_Details/Anima.md](/docs/en/Model_Details/Anima.md)
<details>
<summary>Quick Start</summary>
Run the following code to quickly load the [circlestone-labs/Anima](https://www.modelscope.cn/models/circlestone-labs/Anima) model and perform inference. VRAM management is enabled, and the framework will automatically control the loading of model parameters based on available VRAM. The model can run with a minimum of 8GB VRAM.
```python
from diffsynth.pipelines.anima_image import AnimaImagePipeline, ModelConfig
import torch
vram_config = {
"offload_dtype": "disk",
"offload_device": "disk",
"onload_dtype": "disk",
"onload_device": "disk",
"preparing_dtype": torch.bfloat16,
"preparing_device": "cuda",
"computation_dtype": torch.bfloat16,
"computation_device": "cuda",
}
pipe = AnimaImagePipeline.from_pretrained(
torch_dtype=torch.bfloat16,
device="cuda",
model_configs=[
ModelConfig(model_id="circlestone-labs/Anima", origin_file_pattern="split_files/diffusion_models/anima-preview.safetensors", **vram_config),
ModelConfig(model_id="circlestone-labs/Anima", origin_file_pattern="split_files/text_encoders/qwen_3_06b_base.safetensors", **vram_config),
ModelConfig(model_id="circlestone-labs/Anima", origin_file_pattern="split_files/vae/qwen_image_vae.safetensors", **vram_config),
],
tokenizer_config=ModelConfig(model_id="Qwen/Qwen3-0.6B", origin_file_pattern="./"),
tokenizer_t5xxl_config=ModelConfig(model_id="stabilityai/stable-diffusion-3.5-large", origin_file_pattern="tokenizer_3/"),
vram_limit=torch.cuda.mem_get_info("cuda")[1] / (1024 ** 3) - 0.5,
)
prompt = "Masterpiece, best quality, solo, long hair, wavy hair, silver hair, blue eyes, blue dress, medium breasts, dress, underwater, air bubble, floating hair, refraction, portrait."
negative_prompt = "worst quality, low quality, monochrome, zombie, interlocked fingers, Aissist, cleavage, nsfw,"
image = pipe(prompt, seed=0, num_inference_steps=50)
image.save("image.jpg")
```
</details>
<details>
<summary>Examples</summary>
Example code for Anima is located at: [/examples/anima/](/examples/anima/)
| Model ID | Inference | Low VRAM Inference | Full Training | Validation after Full Training | LoRA Training | Validation after LoRA Training |
|-|-|-|-|-|-|-|
|[circlestone-labs/Anima](https://www.modelscope.cn/models/circlestone-labs/Anima)|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/anima/model_inference/anima-preview.py)|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/anima/model_inference_low_vram/anima-preview.py)|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/anima/model_training/full/anima-preview.sh)|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/anima/model_training/validate_full/anima-preview.py)|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/anima/model_training/lora/anima-preview.sh)|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/anima/model_training/validate_lora/anima-preview.py)|
</details>
#### Qwen-Image: [/docs/en/Model_Details/Qwen-Image.md](/docs/en/Model_Details/Qwen-Image.md) #### Qwen-Image: [/docs/en/Model_Details/Qwen-Image.md](/docs/en/Model_Details/Qwen-Image.md)
<details> <details>
@@ -645,6 +702,8 @@ Example code for LTX-2 is available at: [/examples/ltx2/](/examples/ltx2/)
| Model ID | Extra Args | Inference | Low-VRAM Inference | Full Training | Full Training Validation | LoRA Training | LoRA Training Validation | | Model ID | Extra Args | Inference | Low-VRAM Inference | Full Training | Full Training Validation | LoRA Training | LoRA Training Validation |
|-|-|-|-|-|-|-|-| |-|-|-|-|-|-|-|-|
|[Lightricks/LTX-2: OneStagePipeline-T2AV](https://www.modelscope.cn/models/Lightricks/LTX-2)||[code](/examples/ltx2/model_inference/LTX-2-T2AV-OneStage.py)|[code](/examples/ltx2/model_inference_low_vram/LTX-2-T2AV-OneStage.py)|[code](/examples/ltx2/model_training/full/LTX-2-T2AV-splited.sh)|[code](/examples/ltx2/model_training/validate_full/LTX-2-T2AV.py)|[code](/examples/ltx2/model_training/lora/LTX-2-T2AV-splited.sh)|[code](/examples/ltx2/model_training/validate_lora/LTX-2-T2AV.py)| |[Lightricks/LTX-2: OneStagePipeline-T2AV](https://www.modelscope.cn/models/Lightricks/LTX-2)||[code](/examples/ltx2/model_inference/LTX-2-T2AV-OneStage.py)|[code](/examples/ltx2/model_inference_low_vram/LTX-2-T2AV-OneStage.py)|[code](/examples/ltx2/model_training/full/LTX-2-T2AV-splited.sh)|[code](/examples/ltx2/model_training/validate_full/LTX-2-T2AV.py)|[code](/examples/ltx2/model_training/lora/LTX-2-T2AV-splited.sh)|[code](/examples/ltx2/model_training/validate_lora/LTX-2-T2AV.py)|
|[Lightricks/LTX-2-19b-IC-LoRA-Union-Control](https://www.modelscope.cn/models/Lightricks/LTX-2-19b-IC-LoRA-Union-Control)|`in_context_videos`,`in_context_downsample_factor`|[code](/examples/ltx2/model_inference/LTX-2-T2AV-IC-LoRA-Union-Control.py)|[code](/examples/ltx2/model_inference_low_vram/LTX-2-T2AV-IC-LoRA-Union-Control.py)|-|-|[code](/examples/ltx2/model_training/lora/LTX-2-T2AV-IC-LoRA-splited.sh)|[code](/examples/ltx2/model_training/validate_lora/LTX-2-T2AV-IC-LoRA.py)|
|[Lightricks/LTX-2-19b-IC-LoRA-Detailer](https://www.modelscope.cn/models/Lightricks/LTX-2-19b-IC-LoRA-Detailer)|`in_context_videos`,`in_context_downsample_factor`|[code](/examples/ltx2/model_inference/LTX-2-T2AV-IC-LoRA-Detailer.py)|[code](/examples/ltx2/model_inference_low_vram/LTX-2-T2AV-IC-LoRA-Detailer.py)|-|-|[code](/examples/ltx2/model_training/lora/LTX-2-T2AV-IC-LoRA-splited.sh)|[code](/examples/ltx2/model_training/validate_lora/LTX-2-T2AV-IC-LoRA.py)|
|[Lightricks/LTX-2: TwoStagePipeline-T2AV](https://www.modelscope.cn/models/Lightricks/LTX-2)||[code](/examples/ltx2/model_inference/LTX-2-T2AV-TwoStage.py)|[code](/examples/ltx2/model_inference_low_vram/LTX-2-T2AV-TwoStage.py)|-|-|-|-| |[Lightricks/LTX-2: TwoStagePipeline-T2AV](https://www.modelscope.cn/models/Lightricks/LTX-2)||[code](/examples/ltx2/model_inference/LTX-2-T2AV-TwoStage.py)|[code](/examples/ltx2/model_inference_low_vram/LTX-2-T2AV-TwoStage.py)|-|-|-|-|
|[Lightricks/LTX-2: DistilledPipeline-T2AV](https://www.modelscope.cn/models/Lightricks/LTX-2)||[code](/examples/ltx2/model_inference/LTX-2-T2AV-DistilledPipeline.py)|[code](/examples/ltx2/model_inference_low_vram/LTX-2-T2AV-DistilledPipeline.py)|-|-|-|-| |[Lightricks/LTX-2: DistilledPipeline-T2AV](https://www.modelscope.cn/models/Lightricks/LTX-2)||[code](/examples/ltx2/model_inference/LTX-2-T2AV-DistilledPipeline.py)|[code](/examples/ltx2/model_inference_low_vram/LTX-2-T2AV-DistilledPipeline.py)|-|-|-|-|
|[Lightricks/LTX-2: OneStagePipeline-I2AV](https://www.modelscope.cn/models/Lightricks/LTX-2)|`input_images`|[code](/examples/ltx2/model_inference/LTX-2-I2AV-OneStage.py)|[code](/examples/ltx2/model_inference_low_vram/LTX-2-I2AV-OneStage.py)|-|-|-|-| |[Lightricks/LTX-2: OneStagePipeline-I2AV](https://www.modelscope.cn/models/Lightricks/LTX-2)|`input_images`|[code](/examples/ltx2/model_inference/LTX-2-I2AV-OneStage.py)|[code](/examples/ltx2/model_inference_low_vram/LTX-2-I2AV-OneStage.py)|-|-|-|-|

View File

@@ -32,6 +32,9 @@ DiffSynth 目前包括两个开源项目:
> DiffSynth-Studio 经历了大版本更新,部分旧功能已停止维护,如需使用旧版功能,请切换到大版本更新前的[最后一个历史版本](https://github.com/modelscope/DiffSynth-Studio/tree/afd101f3452c9ecae0c87b79adfa2e22d65ffdc3)。 > DiffSynth-Studio 经历了大版本更新,部分旧功能已停止维护,如需使用旧版功能,请切换到大版本更新前的[最后一个历史版本](https://github.com/modelscope/DiffSynth-Studio/tree/afd101f3452c9ecae0c87b79adfa2e22d65ffdc3)。
> 目前本项目的开发人员有限,大部分工作由 [Artiprocher](https://github.com/Artiprocher) 负责因此新功能的开发进展会比较缓慢issue 的回复和解决速度有限,我们对此感到非常抱歉,请各位开发者理解。 > 目前本项目的开发人员有限,大部分工作由 [Artiprocher](https://github.com/Artiprocher) 负责因此新功能的开发进展会比较缓慢issue 的回复和解决速度有限,我们对此感到非常抱歉,请各位开发者理解。
- **2026年3月2日** 新增对[Anima](https://modelscope.cn/models/circlestone-labs/Anima)的支持,详见[文档](docs/zh/Model_Details/Anima.md)。这是一个有趣的动漫风格图像生成模型,我们期待其后续的模型更新。
- **2026年2月26日** 新增对[LTX-2](https://www.modelscope.cn/models/Lightricks/LTX-2)音视频生成模型全量微调与LoRA训练支持详见[文档](docs/zh/Model_Details/LTX-2.md)。 - **2026年2月26日** 新增对[LTX-2](https://www.modelscope.cn/models/Lightricks/LTX-2)音视频生成模型全量微调与LoRA训练支持详见[文档](docs/zh/Model_Details/LTX-2.md)。
- **2026年2月10日** 新增对[LTX-2](https://www.modelscope.cn/models/Lightricks/LTX-2)音视频生成模型的推理支持,详见[文档](docs/zh/Model_Details/LTX-2.md),后续将推进模型训练的支持。 - **2026年2月10日** 新增对[LTX-2](https://www.modelscope.cn/models/Lightricks/LTX-2)音视频生成模型的推理支持,详见[文档](docs/zh/Model_Details/LTX-2.md),后续将推进模型训练的支持。
@@ -343,6 +346,60 @@ FLUX.2 的示例代码位于:[/examples/flux2/](/examples/flux2/)
</details> </details>
#### Anima: [/docs/zh/Model_Details/Anima.md](/docs/zh/Model_Details/Anima.md)
<details>
<summary>快速开始</summary>
运行以下代码可以快速加载 [circlestone-labs/Anima](https://www.modelscope.cn/models/circlestone-labs/Anima) 模型并进行推理。显存管理已启动,框架会自动根据剩余显存控制模型参数的加载,最低 8G 显存即可运行。
```python
from diffsynth.pipelines.anima_image import AnimaImagePipeline, ModelConfig
import torch
vram_config = {
"offload_dtype": "disk",
"offload_device": "disk",
"onload_dtype": "disk",
"onload_device": "disk",
"preparing_dtype": torch.bfloat16,
"preparing_device": "cuda",
"computation_dtype": torch.bfloat16,
"computation_device": "cuda",
}
pipe = AnimaImagePipeline.from_pretrained(
torch_dtype=torch.bfloat16,
device="cuda",
model_configs=[
ModelConfig(model_id="circlestone-labs/Anima", origin_file_pattern="split_files/diffusion_models/anima-preview.safetensors", **vram_config),
ModelConfig(model_id="circlestone-labs/Anima", origin_file_pattern="split_files/text_encoders/qwen_3_06b_base.safetensors", **vram_config),
ModelConfig(model_id="circlestone-labs/Anima", origin_file_pattern="split_files/vae/qwen_image_vae.safetensors", **vram_config),
],
tokenizer_config=ModelConfig(model_id="Qwen/Qwen3-0.6B", origin_file_pattern="./"),
tokenizer_t5xxl_config=ModelConfig(model_id="stabilityai/stable-diffusion-3.5-large", origin_file_pattern="tokenizer_3/"),
vram_limit=torch.cuda.mem_get_info("cuda")[1] / (1024 ** 3) - 0.5,
)
prompt = "Masterpiece, best quality, solo, long hair, wavy hair, silver hair, blue eyes, blue dress, medium breasts, dress, underwater, air bubble, floating hair, refraction, portrait."
negative_prompt = "worst quality, low quality, monochrome, zombie, interlocked fingers, Aissist, cleavage, nsfw,"
image = pipe(prompt, seed=0, num_inference_steps=50)
image.save("image.jpg")
```
</details>
<details>
<summary>示例代码</summary>
Anima 的示例代码位于:[/examples/anima/](/examples/anima/)
|模型 ID|推理|低显存推理|全量训练|全量训练后验证|LoRA 训练|LoRA 训练后验证|
|-|-|-|-|-|-|-|
|[circlestone-labs/Anima](https://www.modelscope.cn/models/circlestone-labs/Anima)|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/anima/model_inference/anima-preview.py)|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/anima/model_inference_low_vram/anima-preview.py)|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/anima/model_training/full/anima-preview.sh)|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/anima/model_training/validate_full/anima-preview.py)|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/anima/model_training/lora/anima-preview.sh)|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/anima/model_training/validate_lora/anima-preview.py)|
</details>
#### Qwen-Image: [/docs/zh/Model_Details/Qwen-Image.md](/docs/zh/Model_Details/Qwen-Image.md) #### Qwen-Image: [/docs/zh/Model_Details/Qwen-Image.md](/docs/zh/Model_Details/Qwen-Image.md)
<details> <details>
@@ -645,6 +702,8 @@ LTX-2 的示例代码位于:[/examples/ltx2/](/examples/ltx2/)
|模型 ID|额外参数|推理|低显存推理|全量训练|全量训练后验证|LoRA 训练|LoRA 训练后验证| |模型 ID|额外参数|推理|低显存推理|全量训练|全量训练后验证|LoRA 训练|LoRA 训练后验证|
|-|-|-|-|-|-|-|-| |-|-|-|-|-|-|-|-|
|[Lightricks/LTX-2: OneStagePipeline-T2AV](https://www.modelscope.cn/models/Lightricks/LTX-2)||[code](/examples/ltx2/model_inference/LTX-2-T2AV-OneStage.py)|[code](/examples/ltx2/model_inference_low_vram/LTX-2-T2AV-OneStage.py)|[code](/examples/ltx2/model_training/full/LTX-2-T2AV-splited.sh)|[code](/examples/ltx2/model_training/validate_full/LTX-2-T2AV.py)|[code](/examples/ltx2/model_training/lora/LTX-2-T2AV-splited.sh)|[code](/examples/ltx2/model_training/validate_lora/LTX-2-T2AV.py)| |[Lightricks/LTX-2: OneStagePipeline-T2AV](https://www.modelscope.cn/models/Lightricks/LTX-2)||[code](/examples/ltx2/model_inference/LTX-2-T2AV-OneStage.py)|[code](/examples/ltx2/model_inference_low_vram/LTX-2-T2AV-OneStage.py)|[code](/examples/ltx2/model_training/full/LTX-2-T2AV-splited.sh)|[code](/examples/ltx2/model_training/validate_full/LTX-2-T2AV.py)|[code](/examples/ltx2/model_training/lora/LTX-2-T2AV-splited.sh)|[code](/examples/ltx2/model_training/validate_lora/LTX-2-T2AV.py)|
|[Lightricks/LTX-2-19b-IC-LoRA-Union-Control](https://www.modelscope.cn/models/Lightricks/LTX-2-19b-IC-LoRA-Union-Control)|`in_context_videos`,`in_context_downsample_factor`|[code](/examples/ltx2/model_inference/LTX-2-T2AV-IC-LoRA-Union-Control.py)|[code](/examples/ltx2/model_inference_low_vram/LTX-2-T2AV-IC-LoRA-Union-Control.py)|-|-|[code](/examples/ltx2/model_training/lora/LTX-2-T2AV-IC-LoRA-splited.sh)|[code](/examples/ltx2/model_training/validate_lora/LTX-2-T2AV-IC-LoRA.py)|
|[Lightricks/LTX-2-19b-IC-LoRA-Detailer](https://www.modelscope.cn/models/Lightricks/LTX-2-19b-IC-LoRA-Detailer)|`in_context_videos`,`in_context_downsample_factor`|[code](/examples/ltx2/model_inference/LTX-2-T2AV-IC-LoRA-Detailer.py)|[code](/examples/ltx2/model_inference_low_vram/LTX-2-T2AV-IC-LoRA-Detailer.py)|-|-|[code](/examples/ltx2/model_training/lora/LTX-2-T2AV-IC-LoRA-splited.sh)|[code](/examples/ltx2/model_training/validate_lora/LTX-2-T2AV-IC-LoRA.py)|
|[Lightricks/LTX-2: TwoStagePipeline-T2AV](https://www.modelscope.cn/models/Lightricks/LTX-2)||[code](/examples/ltx2/model_inference/LTX-2-T2AV-TwoStage.py)|[code](/examples/ltx2/model_inference_low_vram/LTX-2-T2AV-TwoStage.py)|-|-|-|-| |[Lightricks/LTX-2: TwoStagePipeline-T2AV](https://www.modelscope.cn/models/Lightricks/LTX-2)||[code](/examples/ltx2/model_inference/LTX-2-T2AV-TwoStage.py)|[code](/examples/ltx2/model_inference_low_vram/LTX-2-T2AV-TwoStage.py)|-|-|-|-|
|[Lightricks/LTX-2: DistilledPipeline-T2AV](https://www.modelscope.cn/models/Lightricks/LTX-2)||[code](/examples/ltx2/model_inference/LTX-2-T2AV-DistilledPipeline.py)|[code](/examples/ltx2/model_inference_low_vram/LTX-2-T2AV-DistilledPipeline.py)|-|-|-|-| |[Lightricks/LTX-2: DistilledPipeline-T2AV](https://www.modelscope.cn/models/Lightricks/LTX-2)||[code](/examples/ltx2/model_inference/LTX-2-T2AV-DistilledPipeline.py)|[code](/examples/ltx2/model_inference_low_vram/LTX-2-T2AV-DistilledPipeline.py)|-|-|-|-|
|[Lightricks/LTX-2: OneStagePipeline-I2AV](https://www.modelscope.cn/models/Lightricks/LTX-2)|`input_images`|[code](/examples/ltx2/model_inference/LTX-2-I2AV-OneStage.py)|[code](/examples/ltx2/model_inference_low_vram/LTX-2-I2AV-OneStage.py)|-|-|-|-| |[Lightricks/LTX-2: OneStagePipeline-I2AV](https://www.modelscope.cn/models/Lightricks/LTX-2)|`input_images`|[code](/examples/ltx2/model_inference/LTX-2-I2AV-OneStage.py)|[code](/examples/ltx2/model_inference_low_vram/LTX-2-I2AV-OneStage.py)|-|-|-|-|

View File

@@ -1,2 +1,2 @@
from .model_configs import MODEL_CONFIGS from .model_configs import MODEL_CONFIGS
from .vram_management_module_maps import VRAM_MANAGEMENT_MODULE_MAPS from .vram_management_module_maps import VRAM_MANAGEMENT_MODULE_MAPS, VERSION_CHECKER_MAPS

View File

@@ -719,4 +719,20 @@ ltx2_series = [
"model_class": "diffsynth.models.ltx2_upsampler.LTX2LatentUpsampler", "model_class": "diffsynth.models.ltx2_upsampler.LTX2LatentUpsampler",
}, },
] ]
MODEL_CONFIGS = qwen_image_series + wan_series + flux_series + flux2_series + z_image_series + ltx2_series anima_series = [
{
# Example: ModelConfig(model_id="circlestone-labs/Anima", origin_file_pattern="split_files/vae/qwen_image_vae.safetensors")
"model_hash": "a9995952c2d8e63cf82e115005eb61b9",
"model_name": "z_image_text_encoder",
"model_class": "diffsynth.models.z_image_text_encoder.ZImageTextEncoder",
"extra_kwargs": {"model_size": "0.6B"},
},
{
# Example: ModelConfig(model_id="circlestone-labs/Anima", origin_file_pattern="split_files/diffusion_models/anima-preview.safetensors")
"model_hash": "417673936471e79e31ed4d186d7a3f4a",
"model_name": "anima_dit",
"model_class": "diffsynth.models.anima_dit.AnimaDiT",
"state_dict_converter": "diffsynth.utils.state_dict_converters.anima_dit.AnimaDiTStateDictConverter",
}
]
MODEL_CONFIGS = qwen_image_series + wan_series + flux_series + flux2_series + z_image_series + ltx2_series + anima_series

View File

@@ -243,4 +243,24 @@ VRAM_MANAGEMENT_MODULE_MAPS = {
"transformers.models.gemma3.modeling_gemma3.Gemma3RMSNorm": "diffsynth.core.vram.layers.AutoWrappedModule", "transformers.models.gemma3.modeling_gemma3.Gemma3RMSNorm": "diffsynth.core.vram.layers.AutoWrappedModule",
"transformers.models.gemma3.modeling_gemma3.Gemma3TextScaledWordEmbedding": "diffsynth.core.vram.layers.AutoWrappedModule", "transformers.models.gemma3.modeling_gemma3.Gemma3TextScaledWordEmbedding": "diffsynth.core.vram.layers.AutoWrappedModule",
}, },
"diffsynth.models.anima_dit.AnimaDiT": {
"torch.nn.Linear": "diffsynth.core.vram.layers.AutoWrappedLinear",
"torch.nn.LayerNorm": "diffsynth.core.vram.layers.AutoWrappedModule",
"torch.nn.RMSNorm": "diffsynth.core.vram.layers.AutoWrappedModule",
"torch.nn.Embedding": "diffsynth.core.vram.layers.AutoWrappedModule",
},
}
def QwenImageTextEncoder_Module_Map_Updater():
current = VRAM_MANAGEMENT_MODULE_MAPS["diffsynth.models.qwen_image_text_encoder.QwenImageTextEncoder"]
from packaging import version
import transformers
if version.parse(transformers.__version__) >= version.parse("5.2.0"):
# The Qwen2RMSNorm in transformers 5.2.0+ has been renamed to Qwen2_5_VLRMSNorm, so we need to update the module map accordingly
current.pop("transformers.models.qwen2_5_vl.modeling_qwen2_5_vl.Qwen2RMSNorm", None)
current["transformers.models.qwen2_5_vl.modeling_qwen2_5_vl.Qwen2_5_VLRMSNorm"] = "diffsynth.core.vram.layers.AutoWrappedModule"
return current
VERSION_CHECKER_MAPS = {
"diffsynth.models.qwen_image_text_encoder.QwenImageTextEncoder": QwenImageTextEncoder_Module_Map_Updater,
} }

View File

@@ -94,19 +94,22 @@ class BasePipeline(torch.nn.Module):
return self return self
def check_resize_height_width(self, height, width, num_frames=None): def check_resize_height_width(self, height, width, num_frames=None, verbose=1):
# Shape check # Shape check
if height % self.height_division_factor != 0: if height % self.height_division_factor != 0:
height = (height + self.height_division_factor - 1) // self.height_division_factor * self.height_division_factor height = (height + self.height_division_factor - 1) // self.height_division_factor * self.height_division_factor
if verbose > 0:
print(f"height % {self.height_division_factor} != 0. We round it up to {height}.") print(f"height % {self.height_division_factor} != 0. We round it up to {height}.")
if width % self.width_division_factor != 0: if width % self.width_division_factor != 0:
width = (width + self.width_division_factor - 1) // self.width_division_factor * self.width_division_factor width = (width + self.width_division_factor - 1) // self.width_division_factor * self.width_division_factor
if verbose > 0:
print(f"width % {self.width_division_factor} != 0. We round it up to {width}.") print(f"width % {self.width_division_factor} != 0. We round it up to {width}.")
if num_frames is None: if num_frames is None:
return height, width return height, width
else: else:
if num_frames % self.time_division_factor != self.time_division_remainder: if num_frames % self.time_division_factor != self.time_division_remainder:
num_frames = (num_frames + self.time_division_factor - 1) // self.time_division_factor * self.time_division_factor + self.time_division_remainder num_frames = (num_frames + self.time_division_factor - 1) // self.time_division_factor * self.time_division_factor + self.time_division_remainder
if verbose > 0:
print(f"num_frames % {self.time_division_factor} != {self.time_division_remainder}. We round it up to {num_frames}.") print(f"num_frames % {self.time_division_factor} != {self.time_division_remainder}. We round it up to {num_frames}.")
return height, width, num_frames return height, width, num_frames

View File

@@ -121,7 +121,9 @@ class TrajectoryImitationLoss(torch.nn.Module):
progress_id_teacher = torch.argmin((timesteps_teacher - pipe.scheduler.timesteps[progress_id + 1]).abs()) progress_id_teacher = torch.argmin((timesteps_teacher - pipe.scheduler.timesteps[progress_id + 1]).abs())
latents_ = trajectory_teacher[progress_id_teacher] latents_ = trajectory_teacher[progress_id_teacher]
target = (latents_ - inputs_shared["latents"]) / (sigma_ - sigma) denom = sigma_ - sigma
denom = torch.sign(denom) * torch.clamp(denom.abs(), min=1e-6)
target = (latents_ - inputs_shared["latents"]) / denom
loss = loss + torch.nn.functional.mse_loss(noise_pred.float(), target.float()) * pipe.scheduler.training_weight(timestep) loss = loss + torch.nn.functional.mse_loss(noise_pred.float(), target.float()) * pipe.scheduler.training_weight(timestep)
return loss return loss

File diff suppressed because it is too large Load Diff

View File

@@ -8,6 +8,7 @@ import torch
from einops import rearrange from einops import rearrange
from .ltx2_common import rms_norm, Modality from .ltx2_common import rms_norm, Modality
from ..core.attention.attention import attention_forward from ..core.attention.attention import attention_forward
from ..core import gradient_checkpoint_forward
def get_timestep_embedding( def get_timestep_embedding(
@@ -1352,24 +1353,17 @@ class LTXModel(torch.nn.Module):
video: TransformerArgs | None, video: TransformerArgs | None,
audio: TransformerArgs | None, audio: TransformerArgs | None,
perturbations: BatchedPerturbationConfig, perturbations: BatchedPerturbationConfig,
use_gradient_checkpointing: bool = False,
use_gradient_checkpointing_offload: bool = False,
) -> tuple[TransformerArgs, TransformerArgs]: ) -> tuple[TransformerArgs, TransformerArgs]:
"""Process transformer blocks for LTXAV.""" """Process transformer blocks for LTXAV."""
# Process transformer blocks # Process transformer blocks
for block in self.transformer_blocks: for block in self.transformer_blocks:
if self._enable_gradient_checkpointing and self.training: video, audio = gradient_checkpoint_forward(
# Use gradient checkpointing to save memory during training.
# With use_reentrant=False, we can pass dataclasses directly -
# PyTorch will track all tensor leaves in the computation graph.
video, audio = torch.utils.checkpoint.checkpoint(
block, block,
video, use_gradient_checkpointing,
audio, use_gradient_checkpointing_offload,
perturbations,
use_reentrant=False,
)
else:
video, audio = block(
video=video, video=video,
audio=audio, audio=audio,
perturbations=perturbations, perturbations=perturbations,
@@ -1398,7 +1392,12 @@ class LTXModel(torch.nn.Module):
return x return x
def _forward( def _forward(
self, video: Modality | None, audio: Modality | None, perturbations: BatchedPerturbationConfig self,
video: Modality | None,
audio: Modality | None,
perturbations: BatchedPerturbationConfig,
use_gradient_checkpointing: bool = False,
use_gradient_checkpointing_offload: bool = False,
) -> tuple[torch.Tensor, torch.Tensor]: ) -> tuple[torch.Tensor, torch.Tensor]:
""" """
Forward pass for LTX models. Forward pass for LTX models.
@@ -1417,6 +1416,8 @@ class LTXModel(torch.nn.Module):
video=video_args, video=video_args,
audio=audio_args, audio=audio_args,
perturbations=perturbations, perturbations=perturbations,
use_gradient_checkpointing=use_gradient_checkpointing,
use_gradient_checkpointing_offload=use_gradient_checkpointing_offload,
) )
# Process output # Process output
@@ -1440,12 +1441,12 @@ class LTXModel(torch.nn.Module):
) )
return vx, ax return vx, ax
def forward(self, video_latents, video_positions, video_context, video_timesteps, audio_latents, audio_positions, audio_context, audio_timesteps): def forward(self, video_latents, video_positions, video_context, video_timesteps, audio_latents, audio_positions, audio_context, audio_timesteps, use_gradient_checkpointing=False, use_gradient_checkpointing_offload=False):
cross_pe_max_pos = None cross_pe_max_pos = None
if self.model_type.is_video_enabled() and self.model_type.is_audio_enabled(): if self.model_type.is_video_enabled() and self.model_type.is_audio_enabled():
cross_pe_max_pos = max(self.positional_embedding_max_pos[0], self.audio_positional_embedding_max_pos[0]) cross_pe_max_pos = max(self.positional_embedding_max_pos[0], self.audio_positional_embedding_max_pos[0])
self._init_preprocessors(cross_pe_max_pos) self._init_preprocessors(cross_pe_max_pos)
video = Modality(video_latents, video_timesteps, video_positions, video_context) video = Modality(video_latents, video_timesteps, video_positions, video_context)
audio = Modality(audio_latents, audio_timesteps, audio_positions, audio_context) if audio_latents is not None else None audio = Modality(audio_latents, audio_timesteps, audio_positions, audio_context) if audio_latents is not None else None
vx, ax = self._forward(video=video, audio=audio, perturbations=None) vx, ax = self._forward(video=video, audio=audio, perturbations=None, use_gradient_checkpointing=use_gradient_checkpointing, use_gradient_checkpointing_offload=use_gradient_checkpointing_offload)
return vx, ax return vx, ax

View File

@@ -1,6 +1,6 @@
from ..core.loader import load_model, hash_model_file from ..core.loader import load_model, hash_model_file
from ..core.vram import AutoWrappedModule from ..core.vram import AutoWrappedModule
from ..configs import MODEL_CONFIGS, VRAM_MANAGEMENT_MODULE_MAPS from ..configs import MODEL_CONFIGS, VRAM_MANAGEMENT_MODULE_MAPS, VERSION_CHECKER_MAPS
import importlib, json, torch import importlib, json, torch
@@ -22,7 +22,8 @@ class ModelPool:
def fetch_module_map(self, model_class, vram_config): def fetch_module_map(self, model_class, vram_config):
if self.need_to_enable_vram_management(vram_config): if self.need_to_enable_vram_management(vram_config):
if model_class in VRAM_MANAGEMENT_MODULE_MAPS: if model_class in VRAM_MANAGEMENT_MODULE_MAPS:
module_map = {self.import_model_class(source): self.import_model_class(target) for source, target in VRAM_MANAGEMENT_MODULE_MAPS[model_class].items()} vram_module_map = VRAM_MANAGEMENT_MODULE_MAPS[model_class] if model_class not in VERSION_CHECKER_MAPS else VERSION_CHECKER_MAPS[model_class]()
module_map = {self.import_model_class(source): self.import_model_class(target) for source, target in vram_module_map.items()}
else: else:
module_map = {self.import_model_class(model_class): AutoWrappedModule} module_map = {self.import_model_class(model_class): AutoWrappedModule}
else: else:

View File

@@ -469,7 +469,7 @@ class Down_ResidualBlock(nn.Module):
def forward(self, x, feat_cache=None, feat_idx=[0]): def forward(self, x, feat_cache=None, feat_idx=[0]):
x_copy = x.clone() x_copy = x.clone()
for module in self.downsamples: for module in self.downsamples:
x = module(x, feat_cache, feat_idx) x, feat_cache, feat_idx = module(x, feat_cache, feat_idx)
return x + self.avg_shortcut(x_copy), feat_cache, feat_idx return x + self.avg_shortcut(x_copy), feat_cache, feat_idx
@@ -506,10 +506,10 @@ class Up_ResidualBlock(nn.Module):
def forward(self, x, feat_cache=None, feat_idx=[0], first_chunk=False): def forward(self, x, feat_cache=None, feat_idx=[0], first_chunk=False):
x_main = x.clone() x_main = x.clone()
for module in self.upsamples: for module in self.upsamples:
x_main = module(x_main, feat_cache, feat_idx) x_main, feat_cache, feat_idx = module(x_main, feat_cache, feat_idx)
if self.avg_shortcut is not None: if self.avg_shortcut is not None:
x_shortcut = self.avg_shortcut(x, first_chunk) x_shortcut = self.avg_shortcut(x, first_chunk)
return x_main + x_shortcut return x_main + x_shortcut, feat_cache, feat_idx
else: else:
return x_main, feat_cache, feat_idx return x_main, feat_cache, feat_idx

View File

@@ -0,0 +1,263 @@
import torch, math
from PIL import Image
from typing import Union
from tqdm import tqdm
from einops import rearrange
import numpy as np
from math import prod
from transformers import AutoTokenizer
from ..core.device.npu_compatible_device import get_device_type
from ..diffusion import FlowMatchScheduler
from ..core import ModelConfig, gradient_checkpoint_forward
from ..diffusion.base_pipeline import BasePipeline, PipelineUnit, ControlNetInput
from ..utils.lora.merge import merge_lora
from ..models.anima_dit import AnimaDiT
from ..models.z_image_text_encoder import ZImageTextEncoder
from ..models.wan_video_vae import WanVideoVAE
class AnimaImagePipeline(BasePipeline):
def __init__(self, device=get_device_type(), torch_dtype=torch.bfloat16):
super().__init__(
device=device, torch_dtype=torch_dtype,
height_division_factor=16, width_division_factor=16,
)
self.scheduler = FlowMatchScheduler("Z-Image")
self.text_encoder: ZImageTextEncoder = None
self.dit: AnimaDiT = None
self.vae: WanVideoVAE = None
self.tokenizer: AutoTokenizer = None
self.tokenizer_t5xxl: AutoTokenizer = None
self.in_iteration_models = ("dit",)
self.units = [
AnimaUnit_ShapeChecker(),
AnimaUnit_NoiseInitializer(),
AnimaUnit_InputImageEmbedder(),
AnimaUnit_PromptEmbedder(),
]
self.model_fn = model_fn_anima
@staticmethod
def from_pretrained(
torch_dtype: torch.dtype = torch.bfloat16,
device: Union[str, torch.device] = get_device_type(),
model_configs: list[ModelConfig] = [],
tokenizer_config: ModelConfig = ModelConfig(model_id="Qwen/Qwen3-0.6B", origin_file_pattern="./"),
tokenizer_t5xxl_config: ModelConfig = ModelConfig(model_id="stabilityai/stable-diffusion-3.5-large", origin_file_pattern="tokenizer_3/"),
vram_limit: float = None,
):
# Initialize pipeline
pipe = AnimaImagePipeline(device=device, torch_dtype=torch_dtype)
model_pool = pipe.download_and_load_models(model_configs, vram_limit)
# Fetch models
pipe.text_encoder = model_pool.fetch_model("z_image_text_encoder")
pipe.dit = model_pool.fetch_model("anima_dit")
pipe.vae = model_pool.fetch_model("wan_video_vae")
if tokenizer_config is not None:
tokenizer_config.download_if_necessary()
pipe.tokenizer = AutoTokenizer.from_pretrained(tokenizer_config.path)
if tokenizer_t5xxl_config is not None:
tokenizer_t5xxl_config.download_if_necessary()
pipe.tokenizer_t5xxl = AutoTokenizer.from_pretrained(tokenizer_t5xxl_config.path)
# VRAM Management
pipe.vram_management_enabled = pipe.check_vram_management_state()
return pipe
@torch.no_grad()
def __call__(
self,
# Prompt
prompt: str,
negative_prompt: str = "",
cfg_scale: float = 4.0,
# Image
input_image: Image.Image = None,
denoising_strength: float = 1.0,
# Shape
height: int = 1024,
width: int = 1024,
# Randomness
seed: int = None,
rand_device: str = "cpu",
# Steps
num_inference_steps: int = 30,
sigma_shift: float = None,
# Progress bar
progress_bar_cmd = tqdm,
):
# Scheduler
self.scheduler.set_timesteps(num_inference_steps, denoising_strength=denoising_strength, shift=sigma_shift)
# Parameters
inputs_posi = {
"prompt": prompt,
}
inputs_nega = {
"negative_prompt": negative_prompt,
}
inputs_shared = {
"cfg_scale": cfg_scale,
"input_image": input_image, "denoising_strength": denoising_strength,
"height": height, "width": width,
"seed": seed, "rand_device": rand_device,
"num_inference_steps": num_inference_steps,
}
for unit in self.units:
inputs_shared, inputs_posi, inputs_nega = self.unit_runner(unit, self, inputs_shared, inputs_posi, inputs_nega)
# Denoise
self.load_models_to_device(self.in_iteration_models)
models = {name: getattr(self, name) for name in self.in_iteration_models}
for progress_id, timestep in enumerate(progress_bar_cmd(self.scheduler.timesteps)):
timestep = timestep.unsqueeze(0).to(dtype=self.torch_dtype, device=self.device)
noise_pred = self.cfg_guided_model_fn(
self.model_fn, cfg_scale,
inputs_shared, inputs_posi, inputs_nega,
**models, timestep=timestep, progress_id=progress_id
)
inputs_shared["latents"] = self.step(self.scheduler, progress_id=progress_id, noise_pred=noise_pred, **inputs_shared)
# Decode
self.load_models_to_device(['vae'])
image = self.vae.decode(inputs_shared["latents"].unsqueeze(2), device=self.device).squeeze(2)
image = self.vae_output_to_image(image)
self.load_models_to_device([])
return image
class AnimaUnit_ShapeChecker(PipelineUnit):
def __init__(self):
super().__init__(
input_params=("height", "width"),
output_params=("height", "width"),
)
def process(self, pipe: AnimaImagePipeline, height, width):
height, width = pipe.check_resize_height_width(height, width)
return {"height": height, "width": width}
class AnimaUnit_NoiseInitializer(PipelineUnit):
def __init__(self):
super().__init__(
input_params=("height", "width", "seed", "rand_device"),
output_params=("noise",),
)
def process(self, pipe: AnimaImagePipeline, height, width, seed, rand_device):
noise = pipe.generate_noise((1, 16, height//8, width//8), seed=seed, rand_device=rand_device, rand_torch_dtype=pipe.torch_dtype)
return {"noise": noise}
class AnimaUnit_InputImageEmbedder(PipelineUnit):
def __init__(self):
super().__init__(
input_params=("input_image", "noise"),
output_params=("latents", "input_latents"),
onload_model_names=("vae",)
)
def process(self, pipe: AnimaImagePipeline, input_image, noise):
if input_image is None:
return {"latents": noise, "input_latents": None}
pipe.load_models_to_device(['vae'])
if isinstance(input_image, list):
input_latents = []
for image in input_image:
image = pipe.preprocess_image(image).to(device=pipe.device, dtype=pipe.torch_dtype)
input_latents.append(pipe.vae.encode(image))
input_latents = torch.concat(input_latents, dim=0)
else:
image = pipe.preprocess_image(input_image).to(device=pipe.device, dtype=pipe.torch_dtype)
input_latents = pipe.vae.encode(image.unsqueeze(2), device=pipe.device).squeeze(2)
if pipe.scheduler.training:
return {"latents": noise, "input_latents": input_latents}
else:
latents = pipe.scheduler.add_noise(input_latents, noise, timestep=pipe.scheduler.timesteps[0])
return {"latents": latents, "input_latents": input_latents}
class AnimaUnit_PromptEmbedder(PipelineUnit):
def __init__(self):
super().__init__(
seperate_cfg=True,
input_params_posi={"prompt": "prompt"},
input_params_nega={"prompt": "negative_prompt"},
output_params=("prompt_emb",),
onload_model_names=("text_encoder",)
)
def encode_prompt(
self,
pipe: AnimaImagePipeline,
prompt,
device = None,
max_sequence_length: int = 512,
):
if isinstance(prompt, str):
prompt = [prompt]
text_inputs = pipe.tokenizer(
prompt,
padding="max_length",
max_length=max_sequence_length,
truncation=True,
return_tensors="pt",
)
text_input_ids = text_inputs.input_ids.to(device)
prompt_masks = text_inputs.attention_mask.to(device).bool()
prompt_embeds = pipe.text_encoder(
input_ids=text_input_ids,
attention_mask=prompt_masks,
output_hidden_states=True,
).hidden_states[-1]
t5xxl_text_inputs = pipe.tokenizer_t5xxl(
prompt,
max_length=max_sequence_length,
truncation=True,
return_tensors="pt",
)
t5xxl_ids = t5xxl_text_inputs.input_ids.to(device)
return prompt_embeds.to(pipe.torch_dtype), t5xxl_ids
def process(self, pipe: AnimaImagePipeline, prompt):
pipe.load_models_to_device(self.onload_model_names)
prompt_embeds, t5xxl_ids = self.encode_prompt(pipe, prompt, pipe.device)
return {"prompt_emb": prompt_embeds, "t5xxl_ids": t5xxl_ids}
def model_fn_anima(
dit: AnimaDiT = None,
latents=None,
timestep=None,
prompt_emb=None,
t5xxl_ids=None,
use_gradient_checkpointing=False,
use_gradient_checkpointing_offload=False,
**kwargs
):
latents = latents.unsqueeze(2)
timestep = timestep / 1000
model_output = dit(
x=latents,
timesteps=timestep,
context=prompt_emb,
t5xxl_ids=t5xxl_ids,
use_gradient_checkpointing=use_gradient_checkpointing,
use_gradient_checkpointing_offload=use_gradient_checkpointing_offload,
)
model_output = model_output.squeeze(2)
return model_output

View File

@@ -61,6 +61,7 @@ class LTX2AudioVideoPipeline(BasePipeline):
LTX2AudioVideoUnit_InputAudioEmbedder(), LTX2AudioVideoUnit_InputAudioEmbedder(),
LTX2AudioVideoUnit_InputVideoEmbedder(), LTX2AudioVideoUnit_InputVideoEmbedder(),
LTX2AudioVideoUnit_InputImagesEmbedder(), LTX2AudioVideoUnit_InputImagesEmbedder(),
LTX2AudioVideoUnit_InContextVideoEmbedder(),
] ]
self.model_fn = model_fn_ltx2 self.model_fn = model_fn_ltx2
@@ -105,6 +106,8 @@ class LTX2AudioVideoPipeline(BasePipeline):
def stage2_denoise(self, inputs_shared, inputs_posi, inputs_nega, progress_bar_cmd=tqdm): def stage2_denoise(self, inputs_shared, inputs_posi, inputs_nega, progress_bar_cmd=tqdm):
if inputs_shared["use_two_stage_pipeline"]: if inputs_shared["use_two_stage_pipeline"]:
if inputs_shared.get("clear_lora_before_state_two", False):
self.clear_lora()
latent = self.video_vae_encoder.per_channel_statistics.un_normalize(inputs_shared["video_latents"]) latent = self.video_vae_encoder.per_channel_statistics.un_normalize(inputs_shared["video_latents"])
self.load_models_to_device('upsampler',) self.load_models_to_device('upsampler',)
latent = self.upsampler(latent) latent = self.upsampler(latent)
@@ -112,11 +115,17 @@ class LTX2AudioVideoPipeline(BasePipeline):
self.scheduler.set_timesteps(special_case="stage2") self.scheduler.set_timesteps(special_case="stage2")
inputs_shared.update({k.replace("stage2_", ""): v for k, v in inputs_shared.items() if k.startswith("stage2_")}) inputs_shared.update({k.replace("stage2_", ""): v for k, v in inputs_shared.items() if k.startswith("stage2_")})
denoise_mask_video = 1.0 denoise_mask_video = 1.0
# input image
if inputs_shared.get("input_images", None) is not None: if inputs_shared.get("input_images", None) is not None:
latent, denoise_mask_video, initial_latents = self.apply_input_images_to_latents( latent, denoise_mask_video, initial_latents = self.apply_input_images_to_latents(
latent, inputs_shared.pop("input_latents"), inputs_shared["input_images_indexes"], latent, inputs_shared.pop("input_latents"), inputs_shared["input_images_indexes"],
inputs_shared["input_images_strength"], latent.clone()) inputs_shared["input_images_strength"], latent.clone())
inputs_shared.update({"input_latents_video": initial_latents, "denoise_mask_video": denoise_mask_video}) inputs_shared.update({"input_latents_video": initial_latents, "denoise_mask_video": denoise_mask_video})
# remove in-context video control in stage 2
inputs_shared.pop("in_context_video_latents", None)
inputs_shared.pop("in_context_video_positions", None)
# initialize latents for stage 2
inputs_shared["video_latents"] = self.scheduler.sigmas[0] * denoise_mask_video * inputs_shared[ inputs_shared["video_latents"] = self.scheduler.sigmas[0] * denoise_mask_video * inputs_shared[
"video_noise"] + (1 - self.scheduler.sigmas[0] * denoise_mask_video) * latent "video_noise"] + (1 - self.scheduler.sigmas[0] * denoise_mask_video) * latent
inputs_shared["audio_latents"] = self.scheduler.sigmas[0] * inputs_shared["audio_noise"] + ( inputs_shared["audio_latents"] = self.scheduler.sigmas[0] * inputs_shared["audio_noise"] + (
@@ -145,11 +154,14 @@ class LTX2AudioVideoPipeline(BasePipeline):
# Prompt # Prompt
prompt: str, prompt: str,
negative_prompt: Optional[str] = "", negative_prompt: Optional[str] = "",
# Image-to-video
denoising_strength: float = 1.0, denoising_strength: float = 1.0,
# Image-to-video
input_images: Optional[list[Image.Image]] = None, input_images: Optional[list[Image.Image]] = None,
input_images_indexes: Optional[list[int]] = None, input_images_indexes: Optional[list[int]] = None,
input_images_strength: Optional[float] = 1.0, input_images_strength: Optional[float] = 1.0,
# In-Context Video Control
in_context_videos: Optional[list[list[Image.Image]]] = None,
in_context_downsample_factor: Optional[int] = 2,
# Randomness # Randomness
seed: Optional[int] = None, seed: Optional[int] = None,
rand_device: Optional[str] = "cpu", rand_device: Optional[str] = "cpu",
@@ -157,6 +169,7 @@ class LTX2AudioVideoPipeline(BasePipeline):
height: Optional[int] = 512, height: Optional[int] = 512,
width: Optional[int] = 768, width: Optional[int] = 768,
num_frames=121, num_frames=121,
frame_rate=24,
# Classifier-free guidance # Classifier-free guidance
cfg_scale: Optional[float] = 3.0, cfg_scale: Optional[float] = 3.0,
# Scheduler # Scheduler
@@ -169,6 +182,7 @@ class LTX2AudioVideoPipeline(BasePipeline):
tile_overlap_in_frames: Optional[int] = 24, tile_overlap_in_frames: Optional[int] = 24,
# Special Pipelines # Special Pipelines
use_two_stage_pipeline: Optional[bool] = False, use_two_stage_pipeline: Optional[bool] = False,
clear_lora_before_state_two: Optional[bool] = False,
use_distilled_pipeline: Optional[bool] = False, use_distilled_pipeline: Optional[bool] = False,
# progress_bar # progress_bar
progress_bar_cmd=tqdm, progress_bar_cmd=tqdm,
@@ -185,12 +199,13 @@ class LTX2AudioVideoPipeline(BasePipeline):
} }
inputs_shared = { inputs_shared = {
"input_images": input_images, "input_images_indexes": input_images_indexes, "input_images_strength": input_images_strength, "input_images": input_images, "input_images_indexes": input_images_indexes, "input_images_strength": input_images_strength,
"in_context_videos": in_context_videos, "in_context_downsample_factor": in_context_downsample_factor,
"seed": seed, "rand_device": rand_device, "seed": seed, "rand_device": rand_device,
"height": height, "width": width, "num_frames": num_frames, "height": height, "width": width, "num_frames": num_frames, "frame_rate": frame_rate,
"cfg_scale": cfg_scale, "cfg_scale": cfg_scale,
"tiled": tiled, "tile_size_in_pixels": tile_size_in_pixels, "tile_overlap_in_pixels": tile_overlap_in_pixels, "tiled": tiled, "tile_size_in_pixels": tile_size_in_pixels, "tile_overlap_in_pixels": tile_overlap_in_pixels,
"tile_size_in_frames": tile_size_in_frames, "tile_overlap_in_frames": tile_overlap_in_frames, "tile_size_in_frames": tile_size_in_frames, "tile_overlap_in_frames": tile_overlap_in_frames,
"use_two_stage_pipeline": use_two_stage_pipeline, "use_distilled_pipeline": use_distilled_pipeline, "use_two_stage_pipeline": use_two_stage_pipeline, "use_distilled_pipeline": use_distilled_pipeline, "clear_lora_before_state_two": clear_lora_before_state_two,
"video_patchifier": self.video_patchifier, "audio_patchifier": self.audio_patchifier, "video_patchifier": self.video_patchifier, "audio_patchifier": self.audio_patchifier,
} }
for unit in self.units: for unit in self.units:
@@ -417,8 +432,8 @@ class LTX2AudioVideoUnit_PromptEmbedder(PipelineUnit):
class LTX2AudioVideoUnit_NoiseInitializer(PipelineUnit): class LTX2AudioVideoUnit_NoiseInitializer(PipelineUnit):
def __init__(self): def __init__(self):
super().__init__( super().__init__(
input_params=("height", "width", "num_frames", "seed", "rand_device", "use_two_stage_pipeline"), input_params=("height", "width", "num_frames", "seed", "rand_device", "frame_rate", "use_two_stage_pipeline"),
output_params=("video_noise", "audio_noise",), output_params=("video_noise", "audio_noise", "video_positions", "audio_positions", "video_latent_shape", "audio_latent_shape")
) )
def process_stage(self, pipe: LTX2AudioVideoPipeline, height, width, num_frames, seed, rand_device, frame_rate=24.0): def process_stage(self, pipe: LTX2AudioVideoPipeline, height, width, num_frames, seed, rand_device, frame_rate=24.0):
@@ -471,7 +486,6 @@ class LTX2AudioVideoUnit_InputVideoEmbedder(PipelineUnit):
if pipe.scheduler.training: if pipe.scheduler.training:
return {"video_latents": input_latents, "input_latents": input_latents} return {"video_latents": input_latents, "input_latents": input_latents}
else: else:
# TODO: implement video-to-video
raise NotImplementedError("Video-to-video not implemented yet.") raise NotImplementedError("Video-to-video not implemented yet.")
class LTX2AudioVideoUnit_InputAudioEmbedder(PipelineUnit): class LTX2AudioVideoUnit_InputAudioEmbedder(PipelineUnit):
@@ -495,14 +509,13 @@ class LTX2AudioVideoUnit_InputAudioEmbedder(PipelineUnit):
if pipe.scheduler.training: if pipe.scheduler.training:
return {"audio_latents": audio_input_latents, "audio_input_latents": audio_input_latents, "audio_positions": audio_positions, "audio_latent_shape": audio_latent_shape} return {"audio_latents": audio_input_latents, "audio_input_latents": audio_input_latents, "audio_positions": audio_positions, "audio_latent_shape": audio_latent_shape}
else: else:
# TODO: implement video-to-video raise NotImplementedError("Audio-to-video not supported.")
raise NotImplementedError("Video-to-video not implemented yet.")
class LTX2AudioVideoUnit_InputImagesEmbedder(PipelineUnit): class LTX2AudioVideoUnit_InputImagesEmbedder(PipelineUnit):
def __init__(self): def __init__(self):
super().__init__( super().__init__(
input_params=("input_images", "input_images_indexes", "input_images_strength", "video_latents", "height", "width", "num_frames", "tiled", "tile_size_in_pixels", "tile_overlap_in_pixels", "use_two_stage_pipeline"), input_params=("input_images", "input_images_indexes", "input_images_strength", "video_latents", "height", "width", "num_frames", "tiled", "tile_size_in_pixels", "tile_overlap_in_pixels", "use_two_stage_pipeline"),
output_params=("video_latents"), output_params=("video_latents", "denoise_mask_video", "input_latents_video", "stage2_input_latents"),
onload_model_names=("video_vae_encoder") onload_model_names=("video_vae_encoder")
) )
@@ -537,6 +550,54 @@ class LTX2AudioVideoUnit_InputImagesEmbedder(PipelineUnit):
return output_dicts return output_dicts
class LTX2AudioVideoUnit_InContextVideoEmbedder(PipelineUnit):
def __init__(self):
super().__init__(
input_params=("in_context_videos", "height", "width", "num_frames", "frame_rate", "in_context_downsample_factor", "tiled", "tile_size_in_pixels", "tile_overlap_in_pixels", "use_two_stage_pipeline"),
output_params=("in_context_video_latents", "in_context_video_positions"),
onload_model_names=("video_vae_encoder")
)
def check_in_context_video(self, pipe, in_context_video, height, width, num_frames, in_context_downsample_factor, use_two_stage_pipeline=True):
if in_context_video is None or len(in_context_video) == 0:
raise ValueError("In-context video is None or empty.")
in_context_video = in_context_video[:num_frames]
expected_height = height // in_context_downsample_factor // 2 if use_two_stage_pipeline else height // in_context_downsample_factor
expected_width = width // in_context_downsample_factor // 2 if use_two_stage_pipeline else width // in_context_downsample_factor
current_h, current_w, current_f = in_context_video[0].size[1], in_context_video[0].size[0], len(in_context_video)
h, w, f = pipe.check_resize_height_width(expected_height, expected_width, current_f, verbose=0)
if current_h != h or current_w != w:
in_context_video = [img.resize((w, h)) for img in in_context_video]
if current_f != f:
# pad black frames at the end
in_context_video = in_context_video + [Image.new("RGB", (w, h), (0, 0, 0))] * (f - current_f)
return in_context_video
def process(self, pipe: LTX2AudioVideoPipeline, in_context_videos, height, width, num_frames, frame_rate, in_context_downsample_factor, tiled, tile_size_in_pixels, tile_overlap_in_pixels, use_two_stage_pipeline=True):
if in_context_videos is None or len(in_context_videos) == 0:
return {}
else:
pipe.load_models_to_device(self.onload_model_names)
latents, positions = [], []
for in_context_video in in_context_videos:
in_context_video = self.check_in_context_video(pipe, in_context_video, height, width, num_frames, in_context_downsample_factor, use_two_stage_pipeline)
in_context_video = pipe.preprocess_video(in_context_video)
in_context_latents = pipe.video_vae_encoder.encode(in_context_video, tiled, tile_size_in_pixels, tile_overlap_in_pixels).to(dtype=pipe.torch_dtype, device=pipe.device)
latent_coords = pipe.video_patchifier.get_patch_grid_bounds(output_shape=VideoLatentShape.from_torch_shape(in_context_latents.shape), device=pipe.device)
video_positions = get_pixel_coords(latent_coords, VIDEO_SCALE_FACTORS, True).float()
video_positions[:, 0, ...] = video_positions[:, 0, ...] / frame_rate
video_positions[:, 1, ...] *= in_context_downsample_factor # height axis
video_positions[:, 2, ...] *= in_context_downsample_factor # width axis
video_positions = video_positions.to(pipe.torch_dtype)
latents.append(in_context_latents)
positions.append(video_positions)
latents = torch.cat(latents, dim=1)
positions = torch.cat(positions, dim=1)
return {"in_context_video_latents": latents, "in_context_video_positions": positions}
def model_fn_ltx2( def model_fn_ltx2(
dit: LTXModel, dit: LTXModel,
video_latents=None, video_latents=None,
@@ -549,6 +610,8 @@ def model_fn_ltx2(
audio_patchifier=None, audio_patchifier=None,
timestep=None, timestep=None,
denoise_mask_video=None, denoise_mask_video=None,
in_context_video_latents=None,
in_context_video_positions=None,
use_gradient_checkpointing=False, use_gradient_checkpointing=False,
use_gradient_checkpointing_offload=False, use_gradient_checkpointing_offload=False,
**kwargs, **kwargs,
@@ -558,16 +621,25 @@ def model_fn_ltx2(
# patchify # patchify
b, c_v, f, h, w = video_latents.shape b, c_v, f, h, w = video_latents.shape
video_latents = video_patchifier.patchify(video_latents) video_latents = video_patchifier.patchify(video_latents)
seq_len_video = video_latents.shape[1]
video_timesteps = timestep.repeat(1, video_latents.shape[1], 1) video_timesteps = timestep.repeat(1, video_latents.shape[1], 1)
if denoise_mask_video is not None: if denoise_mask_video is not None:
video_timesteps = video_patchifier.patchify(denoise_mask_video) * video_timesteps video_timesteps = video_patchifier.patchify(denoise_mask_video) * video_timesteps
if in_context_video_latents is not None:
in_context_video_latents = video_patchifier.patchify(in_context_video_latents)
in_context_video_timesteps = timestep.repeat(1, in_context_video_latents.shape[1], 1) * 0.
video_latents = torch.cat([video_latents, in_context_video_latents], dim=1)
video_positions = torch.cat([video_positions, in_context_video_positions], dim=2)
video_timesteps = torch.cat([video_timesteps, in_context_video_timesteps], dim=1)
if audio_latents is not None: if audio_latents is not None:
_, c_a, _, mel_bins = audio_latents.shape _, c_a, _, mel_bins = audio_latents.shape
audio_latents = audio_patchifier.patchify(audio_latents) audio_latents = audio_patchifier.patchify(audio_latents)
audio_timesteps = timestep.repeat(1, audio_latents.shape[1], 1) audio_timesteps = timestep.repeat(1, audio_latents.shape[1], 1)
else: else:
audio_timesteps = None audio_timesteps = None
#TODO: support gradient checkpointing in training
vx, ax = dit( vx, ax = dit(
video_latents=video_latents, video_latents=video_latents,
video_positions=video_positions, video_positions=video_positions,
@@ -577,7 +649,11 @@ def model_fn_ltx2(
audio_positions=audio_positions, audio_positions=audio_positions,
audio_context=audio_context, audio_context=audio_context,
audio_timesteps=audio_timesteps, audio_timesteps=audio_timesteps,
use_gradient_checkpointing=use_gradient_checkpointing,
use_gradient_checkpointing_offload=use_gradient_checkpointing_offload,
) )
vx = vx[:, :seq_len_video, ...]
# unpatchify # unpatchify
vx = video_patchifier.unpatchify_video(vx, f, h, w) vx = video_patchifier.unpatchify_video(vx, f, h, w)
ax = audio_patchifier.unpatchify_audio(ax, c_a, mel_bins) if ax is not None else None ax = audio_patchifier.unpatchify_audio(ax, c_a, mel_bins) if ax is not None else None

View File

@@ -299,7 +299,7 @@ class ZImageUnit_PromptEmbedder(PipelineUnit):
def process(self, pipe: ZImagePipeline, prompt, edit_image): def process(self, pipe: ZImagePipeline, prompt, edit_image):
pipe.load_models_to_device(self.onload_model_names) pipe.load_models_to_device(self.onload_model_names)
if hasattr(pipe, "dit") and pipe.dit.siglip_embedder is not None: if hasattr(pipe, "dit") and pipe.dit is not None and pipe.dit.siglip_embedder is not None:
# Z-Image-Turbo and Z-Image-Omni-Base use different prompt encoding methods. # Z-Image-Turbo and Z-Image-Omni-Base use different prompt encoding methods.
# We determine which encoding method to use based on the model architecture. # We determine which encoding method to use based on the model architecture.
# If you are using two-stage split training, # If you are using two-stage split training,

View File

@@ -116,7 +116,7 @@ class VideoData:
if self.height is not None and self.width is not None: if self.height is not None and self.width is not None:
return self.height, self.width return self.height, self.width
else: else:
height, width, _ = self.__getitem__(0).shape width, height = self.__getitem__(0).size
return height, width return height, width
def __getitem__(self, item): def __getitem__(self, item):

View File

@@ -1,4 +1,4 @@
import torch import torch, warnings
class GeneralLoRALoader: class GeneralLoRALoader:
@@ -26,7 +26,11 @@ class GeneralLoRALoader:
keys.pop(0) keys.pop(0)
keys.pop(-1) keys.pop(-1)
target_name = ".".join(keys) target_name = ".".join(keys)
lora_name_dict[target_name] = (key, key.replace(lora_B_key, lora_A_key)) # Alpha: Deprecated but retained for compatibility.
key_alpha = key.replace(lora_B_key + ".weight", "alpha").replace(lora_B_key + ".default.weight", "alpha")
if key_alpha == key or key_alpha not in lora_state_dict:
key_alpha = None
lora_name_dict[target_name] = (key, key.replace(lora_B_key, lora_A_key), key_alpha)
return lora_name_dict return lora_name_dict
@@ -36,6 +40,10 @@ class GeneralLoRALoader:
for name in name_dict: for name in name_dict:
weight_up = state_dict[name_dict[name][0]] weight_up = state_dict[name_dict[name][0]]
weight_down = state_dict[name_dict[name][1]] weight_down = state_dict[name_dict[name][1]]
if name_dict[name][2] is not None:
warnings.warn("Alpha detected in the LoRA file. This may be a LoRA model not trained by DiffSynth-Studio. To ensure compatibility, the LoRA weights will be converted to weight * alpha / rank.")
alpha = state_dict[name_dict[name][2]] / weight_down.shape[0]
weight_down = weight_down * alpha
state_dict_[name + f".lora_B{suffix}"] = weight_up state_dict_[name + f".lora_B{suffix}"] = weight_up
state_dict_[name + f".lora_A{suffix}"] = weight_down state_dict_[name + f".lora_A{suffix}"] = weight_down
return state_dict_ return state_dict_

View File

@@ -0,0 +1,6 @@
def AnimaDiTStateDictConverter(state_dict):
new_state_dict = {}
for key in state_dict:
value = state_dict[key]
new_state_dict[key.replace("net.", "")] = value
return new_state_dict

View File

@@ -0,0 +1,139 @@
# Anima
Anima is an image generation model trained and open-sourced by CircleStone Labs and Comfy Org.
## Installation
Before using this project for model inference and training, please install DiffSynth-Studio first.
```shell
git clone https://github.com/modelscope/DiffSynth-Studio.git
cd DiffSynth-Studio
pip install -e .
```
For more installation information, please refer to [Install Dependencies](../Pipeline_Usage/Setup.md).
## Quick Start
The following code demonstrates how to quickly load the [circlestone-labs/Anima](https://www.modelscope.cn/models/circlestone-labs/Anima) model for inference. VRAM management is enabled by default, allowing the framework to automatically control model parameter loading based on available VRAM. Minimum 8GB VRAM required.
```python
from diffsynth.pipelines.anima_image import AnimaImagePipeline, ModelConfig
import torch
vram_config = {
"offload_dtype": "disk",
"offload_device": "disk",
"onload_dtype": "disk",
"onload_device": "disk",
"preparing_dtype": torch.bfloat16,
"preparing_device": "cuda",
"computation_dtype": torch.bfloat16,
"computation_device": "cuda",
}
pipe = AnimaImagePipeline.from_pretrained(
torch_dtype=torch.bfloat16,
device="cuda",
model_configs=[
ModelConfig(model_id="circlestone-labs/Anima", origin_file_pattern="split_files/diffusion_models/anima-preview.safetensors", **vram_config),
ModelConfig(model_id="circlestone-labs/Anima", origin_file_pattern="split_files/text_encoders/qwen_3_06b_base.safetensors", **vram_config),
ModelConfig(model_id="circlestone-labs/Anima", origin_file_pattern="split_files/vae/qwen_image_vae.safetensors", **vram_config),
],
tokenizer_config=ModelConfig(model_id="Qwen/Qwen3-0.6B", origin_file_pattern="./"),
tokenizer_t5xxl_config=ModelConfig(model_id="stabilityai/stable-diffusion-3.5-large", origin_file_pattern="tokenizer_3/"),
vram_limit=torch.cuda.mem_get_info("cuda")[1] / (1024 ** 3) - 0.5,
)
prompt = "Masterpiece, best quality, solo, long hair, wavy hair, silver hair, blue eyes, blue dress, medium breasts, dress, underwater, air bubble, floating hair, refraction, portrait."
negative_prompt = "worst quality, low quality, monochrome, zombie, interlocked fingers, Aissist, cleavage, nsfw,"
image = pipe(prompt, seed=0, num_inference_steps=50)
image.save("image.jpg")
```
## Model Overview
|Model ID|Inference|Low VRAM Inference|Full Training|Validation after Full Training|LoRA Training|Validation after LoRA Training|
|-|-|-|-|-|-|-|
|[circlestone-labs/Anima](https://www.modelscope.cn/models/circlestone-labs/Anima)|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/anima/model_inference/anima-preview.py)|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/anima/model_inference_low_vram/anima-preview.py)|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/anima/model_training/full/anima-preview.sh)|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/anima/model_training/validate_full/anima-preview.py)|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/anima/model_training/lora/anima-preview.sh)|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/anima/model_training/validate_lora/anima-preview.py)|
Special training scripts:
* Differential LoRA Training: [doc](../Training/Differential_LoRA.md)
* FP8 Precision Training: [doc](../Training/FP8_Precision.md)
* Two-Stage Split Training: [doc](../Training/Split_Training.md)
* End-to-End Direct Distillation: [doc](../Training/Direct_Distill.md)
## Model Inference
Models are loaded through `AnimaImagePipeline.from_pretrained`, see [Model Inference](../Pipeline_Usage/Model_Inference.md#loading-models) for details.
Input parameters for `AnimaImagePipeline` inference include:
* `prompt`: Text description of the desired image content.
* `negative_prompt`: Content to exclude from the generated image (default: `""`).
* `cfg_scale`: Classifier-free guidance parameter (default: 4.0).
* `input_image`: Input image for image-to-image generation (default: `None`).
* `denoising_strength`: Controls similarity to input image (default: 1.0).
* `height`: Image height (must be multiple of 16, default: 1024).
* `width`: Image width (must be multiple of 16, default: 1024).
* `seed`: Random seed (default: `None`).
* `rand_device`: Device for random noise generation (default: `"cpu"`).
* `num_inference_steps`: Inference steps (default: 30).
* `sigma_shift`: Scheduler sigma offset (default: `None`).
* `progress_bar_cmd`: Progress bar implementation (default: `tqdm.tqdm`).
For VRAM constraints, enable [VRAM Management](../Pipeline_Usage/VRAM_management.md). Recommended low-VRAM configurations are provided in the "Model Overview" table above.
## Model Training
Anima models are trained through [`examples/anima/model_training/train.py`](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/anima/model_training/train.py) with parameters including:
* General Training Parameters
* Dataset Configuration
* `--dataset_base_path`: Dataset root directory.
* `--dataset_metadata_path`: Metadata file path.
* `--dataset_repeat`: Dataset repetition per epoch.
* `--dataset_num_workers`: Dataloader worker count.
* `--data_file_keys`: Metadata fields to load (comma-separated).
* Model Loading
* `--model_paths`: Model paths (JSON format).
* `--model_id_with_origin_paths`: Model IDs with origin paths (e.g., `"anima-team/anima-1B:text_encoder/*.safetensors"`).
* `--extra_inputs`: Additional pipeline inputs (e.g., `controlnet_inputs` for ControlNet).
* `--fp8_models`: FP8-formatted models (same format as `--model_paths`).
* Training Configuration
* `--learning_rate`: Learning rate.
* `--num_epochs`: Training epochs.
* `--trainable_models`: Trainable components (e.g., `dit`, `vae`, `text_encoder`).
* `--find_unused_parameters`: Handle unused parameters in DDP training.
* `--weight_decay`: Weight decay value.
* `--task`: Training task (default: `sft`).
* Output Configuration
* `--output_path`: Model output directory.
* `--remove_prefix_in_ckpt`: Remove state dict prefixes.
* `--save_steps`: Model saving interval.
* LoRA Configuration
* `--lora_base_model`: Target model for LoRA.
* `--lora_target_modules`: Target modules for LoRA.
* `--lora_rank`: LoRA rank.
* `--lora_checkpoint`: LoRA checkpoint path.
* `--preset_lora_path`: Preloaded LoRA checkpoint path.
* `--preset_lora_model`: Model to merge LoRA with (e.g., `dit`).
* Gradient Configuration
* `--use_gradient_checkpointing`: Enable gradient checkpointing.
* `--use_gradient_checkpointing_offload`: Offload checkpointing to CPU.
* `--gradient_accumulation_steps`: Gradient accumulation steps.
* Image Resolution
* `--height`: Image height (empty for dynamic resolution).
* `--width`: Image width (empty for dynamic resolution).
* `--max_pixels`: Maximum pixel area for dynamic resolution.
* Anima-Specific Parameters
* `--tokenizer_path`: Tokenizer path for text-to-image models.
* `--tokenizer_t5xxl_path`: T5-XXL tokenizer path.
We provide a sample image dataset for testing:
```shell
modelscope download --dataset DiffSynth-Studio/example_image_dataset --local_dir ./data/example_image_dataset
```
For training script details, refer to [Model Training](../Pipeline_Usage/Model_Training.md). For advanced training techniques, see [Training Framework Documentation](https://github.com/modelscope/DiffSynth-Studio/tree/main/docs/zh/Training/).

View File

@@ -112,6 +112,8 @@ write_video_audio_ltx2(
|Model ID|Additional Parameters|Inference|Low VRAM Inference|Full Training|Validation After Full Training|LoRA Training|Validation After LoRA Training| |Model ID|Additional Parameters|Inference|Low VRAM Inference|Full Training|Validation After Full Training|LoRA Training|Validation After LoRA Training|
|-|-|-|-|-|-|-|-| |-|-|-|-|-|-|-|-|
|[Lightricks/LTX-2: OneStagePipeline-T2AV](https://www.modelscope.cn/models/Lightricks/LTX-2)||[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/ltx2/model_inference/LTX-2-T2AV-OneStage.py)|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/ltx2/model_inference_low_vram/LTX-2-T2AV-OneStage.py)|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/ltx2/model_training/full/LTX-2-T2AV-splited.sh)|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/ltx2/model_training/validate_full/LTX-2-T2AV.py)|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/ltx2/model_training/lora/LTX-2-T2AV-splited.sh)|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/ltx2/model_training/validate_lora/LTX-2-T2AV.py)| |[Lightricks/LTX-2: OneStagePipeline-T2AV](https://www.modelscope.cn/models/Lightricks/LTX-2)||[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/ltx2/model_inference/LTX-2-T2AV-OneStage.py)|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/ltx2/model_inference_low_vram/LTX-2-T2AV-OneStage.py)|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/ltx2/model_training/full/LTX-2-T2AV-splited.sh)|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/ltx2/model_training/validate_full/LTX-2-T2AV.py)|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/ltx2/model_training/lora/LTX-2-T2AV-splited.sh)|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/ltx2/model_training/validate_lora/LTX-2-T2AV.py)|
|[Lightricks/LTX-2-19b-IC-LoRA-Union-Control](https://www.modelscope.cn/models/Lightricks/LTX-2-19b-IC-LoRA-Union-Control)|`in_context_videos`,`in_context_downsample_factor`|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/ltx2/model_inference/LTX-2-T2AV-IC-LoRA-Union-Control.py)|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/ltx2/model_inference_low_vram/LTX-2-T2AV-IC-LoRA-Union-Control.py)|-|-|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/ltx2/model_training/lora/LTX-2-T2AV-IC-LoRA-splited.sh)|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/ltx2/model_training/validate_lora/LTX-2-T2AV-IC-LoRA.py)|
|[Lightricks/LTX-2-19b-IC-LoRA-Detailer](https://www.modelscope.cn/models/Lightricks/LTX-2-19b-IC-LoRA-Detailer)|`in_context_videos`,`in_context_downsample_factor`|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/ltx2/model_inference/LTX-2-T2AV-IC-LoRA-Detailer.py)|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/ltx2/model_inference_low_vram/LTX-2-T2AV-IC-LoRA-Detailer.py)|-|-|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/ltx2/model_training/lora/LTX-2-T2AV-IC-LoRA-splited.sh)|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/ltx2/model_training/validate_lora/LTX-2-T2AV-IC-LoRA.py)|
|[Lightricks/LTX-2: TwoStagePipeline-T2AV](https://www.modelscope.cn/models/Lightricks/LTX-2)||[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/ltx2/model_inference/LTX-2-T2AV-TwoStage.py)|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/ltx2/model_inference_low_vram/LTX-2-T2AV-TwoStage.py)|-|-|-|-| |[Lightricks/LTX-2: TwoStagePipeline-T2AV](https://www.modelscope.cn/models/Lightricks/LTX-2)||[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/ltx2/model_inference/LTX-2-T2AV-TwoStage.py)|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/ltx2/model_inference_low_vram/LTX-2-T2AV-TwoStage.py)|-|-|-|-|
|[Lightricks/LTX-2: DistilledPipeline-T2AV](https://www.modelscope.cn/models/Lightricks/LTX-2)||[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/ltx2/model_inference/LTX-2-T2AV-DistilledPipeline.py)|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/ltx2/model_inference_low_vram/LTX-2-T2AV-DistilledPipeline.py)|-|-|-|-| |[Lightricks/LTX-2: DistilledPipeline-T2AV](https://www.modelscope.cn/models/Lightricks/LTX-2)||[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/ltx2/model_inference/LTX-2-T2AV-DistilledPipeline.py)|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/ltx2/model_inference_low_vram/LTX-2-T2AV-DistilledPipeline.py)|-|-|-|-|
|[Lightricks/LTX-2: OneStagePipeline-I2AV](https://www.modelscope.cn/models/Lightricks/LTX-2)|`input_images`|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/ltx2/model_inference/LTX-2-I2AV-OneStage.py)|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/ltx2/model_inference_low_vram/LTX-2-I2AV-OneStage.py)|-|-|-|-| |[Lightricks/LTX-2: OneStagePipeline-I2AV](https://www.modelscope.cn/models/Lightricks/LTX-2)|`input_images`|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/ltx2/model_inference/LTX-2-I2AV-OneStage.py)|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/ltx2/model_inference_low_vram/LTX-2-I2AV-OneStage.py)|-|-|-|-|

View File

@@ -91,3 +91,4 @@ Set 0 or not set: indicates not enabling the binding function
|----------------|---------------------------|-------------------| |----------------|---------------------------|-------------------|
| Wan 14B series | --initialize_model_on_cpu | The 14B model needs to be initialized on the CPU | | Wan 14B series | --initialize_model_on_cpu | The 14B model needs to be initialized on the CPU |
| Qwen-Image series | --initialize_model_on_cpu | The model needs to be initialized on the CPU | | Qwen-Image series | --initialize_model_on_cpu | The model needs to be initialized on the CPU |
| Z-Image series | --enable_npu_patch | Using NPU fusion operator to replace the corresponding operator in Z-image model to improve the performance of the model on NPU |

View File

@@ -37,9 +37,9 @@ pip install torch torchvision --index-url https://download.pytorch.org/whl/rocm6
git clone https://github.com/modelscope/DiffSynth-Studio.git git clone https://github.com/modelscope/DiffSynth-Studio.git
cd DiffSynth-Studio cd DiffSynth-Studio
# aarch64/ARM # aarch64/ARM
pip install -e .[npu_aarch64] --extra-index-url "https://download.pytorch.org/whl/cpu" pip install -e .[npu_aarch64]
# x86 # x86
pip install -e .[npu] pip install -e .[npu] --extra-index-url "https://download.pytorch.org/whl/cpu"
When using Ascend NPU, please replace `"cuda"` with `"npu"` in your Python code. For details, see [NPU Support](../Pipeline_Usage/GPU_support.md#ascend-npu). When using Ascend NPU, please replace `"cuda"` with `"npu"` in your Python code. For details, see [NPU Support](../Pipeline_Usage/GPU_support.md#ascend-npu).

View File

@@ -42,6 +42,8 @@ This section introduces the Diffusion models supported by `DiffSynth-Studio`. So
* [Qwen-Image](./Model_Details/Qwen-Image.md) * [Qwen-Image](./Model_Details/Qwen-Image.md)
* [FLUX.2](./Model_Details/FLUX2.md) * [FLUX.2](./Model_Details/FLUX2.md)
* [Z-Image](./Model_Details/Z-Image.md) * [Z-Image](./Model_Details/Z-Image.md)
* [Anima](./Model_Details/Anima.md)
* [LTX-2](./Model_Details/LTX-2.md)
## Section 3: Training Framework ## Section 3: Training Framework

View File

@@ -27,6 +27,8 @@ Welcome to DiffSynth-Studio's Documentation
Model_Details/Qwen-Image Model_Details/Qwen-Image
Model_Details/FLUX2 Model_Details/FLUX2
Model_Details/Z-Image Model_Details/Z-Image
Model_Details/Anima
Model_Details/LTX-2
.. toctree:: .. toctree::
:maxdepth: 2 :maxdepth: 2

View File

@@ -0,0 +1,139 @@
# Anima
Anima 是由 CircleStone Labs 与 Comfy Org 训练并开源的图像生成模型。
## 安装
在使用本项目进行模型推理和训练前,请先安装 DiffSynth-Studio。
```shell
git clone https://github.com/modelscope/DiffSynth-Studio.git
cd DiffSynth-Studio
pip install -e .
```
更多关于安装的信息,请参考[安装依赖](../Pipeline_Usage/Setup.md)。
## 快速开始
运行以下代码可以快速加载 [circlestone-labs/Anima](https://www.modelscope.cn/models/circlestone-labs/Anima) 模型并进行推理。显存管理已启动,框架会自动根据剩余显存控制模型参数的加载,最低 8G 显存即可运行。
```python
from diffsynth.pipelines.anima_image import AnimaImagePipeline, ModelConfig
import torch
vram_config = {
"offload_dtype": "disk",
"offload_device": "disk",
"onload_dtype": "disk",
"onload_device": "disk",
"preparing_dtype": torch.bfloat16,
"preparing_device": "cuda",
"computation_dtype": torch.bfloat16,
"computation_device": "cuda",
}
pipe = AnimaImagePipeline.from_pretrained(
torch_dtype=torch.bfloat16,
device="cuda",
model_configs=[
ModelConfig(model_id="circlestone-labs/Anima", origin_file_pattern="split_files/diffusion_models/anima-preview.safetensors", **vram_config),
ModelConfig(model_id="circlestone-labs/Anima", origin_file_pattern="split_files/text_encoders/qwen_3_06b_base.safetensors", **vram_config),
ModelConfig(model_id="circlestone-labs/Anima", origin_file_pattern="split_files/vae/qwen_image_vae.safetensors", **vram_config),
],
tokenizer_config=ModelConfig(model_id="Qwen/Qwen3-0.6B", origin_file_pattern="./"),
tokenizer_t5xxl_config=ModelConfig(model_id="stabilityai/stable-diffusion-3.5-large", origin_file_pattern="tokenizer_3/"),
vram_limit=torch.cuda.mem_get_info("cuda")[1] / (1024 ** 3) - 0.5,
)
prompt = "Masterpiece, best quality, solo, long hair, wavy hair, silver hair, blue eyes, blue dress, medium breasts, dress, underwater, air bubble, floating hair, refraction, portrait."
negative_prompt = "worst quality, low quality, monochrome, zombie, interlocked fingers, Aissist, cleavage, nsfw,"
image = pipe(prompt, seed=0, num_inference_steps=50)
image.save("image.jpg")
```
## 模型总览
|模型 ID|推理|低显存推理|全量训练|全量训练后验证|LoRA 训练|LoRA 训练后验证|
|-|-|-|-|-|-|-|
|[circlestone-labs/Anima](https://www.modelscope.cn/models/circlestone-labs/Anima)|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/anima/model_inference/anima-preview.py)|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/anima/model_inference_low_vram/anima-preview.py)|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/anima/model_training/full/anima-preview.sh)|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/anima/model_training/validate_full/anima-preview.py)|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/anima/model_training/lora/anima-preview.sh)|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/anima/model_training/validate_lora/anima-preview.py)|
特殊训练脚本:
* 差分 LoRA 训练:[doc](../Training/Differential_LoRA.md)
* FP8 精度训练:[doc](../Training/FP8_Precision.md)
* 两阶段拆分训练:[doc](../Training/Split_Training.md)
* 端到端直接蒸馏:[doc](../Training/Direct_Distill.md)
## 模型推理
模型通过 `AnimaImagePipeline.from_pretrained` 加载,详见[加载模型](../Pipeline_Usage/Model_Inference.md#加载模型)。
`AnimaImagePipeline` 推理的输入参数包括:
* `prompt`: 提示词,描述画面中出现的内容。
* `negative_prompt`: 负向提示词,描述画面中不应该出现的内容,默认值为 `""`
* `cfg_scale`: Classifier-free guidance 的参数,默认值为 4.0。
* `input_image`: 输入图像,用于图像到图像的生成。默认为 `None`
* `denoising_strength`: 去噪强度,控制生成图像与输入图像的相似度,默认值为 1.0。
* `height`: 图像高度,需保证高度为 16 的倍数,默认值为 1024。
* `width`: 图像宽度,需保证宽度为 16 的倍数,默认值为 1024。
* `seed`: 随机种子。默认为 `None`,即完全随机。
* `rand_device`: 生成随机高斯噪声矩阵的计算设备,默认为 `"cpu"`。当设置为 `cuda` 时,在不同 GPU 上会导致不同的生成结果。
* `num_inference_steps`: 推理次数,默认值为 30。
* `sigma_shift`: 调度器的 sigma 偏移量,默认为 `None`
* `progress_bar_cmd`: 进度条,默认为 `tqdm.tqdm`。可通过设置为 `lambda x:x` 来屏蔽进度条。
如果显存不足,请开启[显存管理](../Pipeline_Usage/VRAM_management.md),我们在示例代码中提供了每个模型推荐的低显存配置,详见前文"模型总览"中的表格。
## 模型训练
Anima 系列模型统一通过 [`examples/anima/model_training/train.py`](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/anima/model_training/train.py) 进行训练,脚本的参数包括:
* 通用训练参数
* 数据集基础配置
* `--dataset_base_path`: 数据集的根目录。
* `--dataset_metadata_path`: 数据集的元数据文件路径。
* `--dataset_repeat`: 每个 epoch 中数据集重复的次数。
* `--dataset_num_workers`: 每个 Dataloder 的进程数量。
* `--data_file_keys`: 元数据中需要加载的字段名称,通常是图像或视频文件的路径,以 `,` 分隔。
* 模型加载配置
* `--model_paths`: 要加载的模型路径。JSON 格式。
* `--model_id_with_origin_paths`: 带原始路径的模型 ID例如 `"anima-team/anima-1B:text_encoder/*.safetensors"`。用逗号分隔。
* `--extra_inputs`: 模型 Pipeline 所需的额外输入参数,例如训练 ControlNet 模型时需要额外参数 `controlnet_inputs`,以 `,` 分隔。
* `--fp8_models`:以 FP8 格式加载的模型,格式与 `--model_paths``--model_id_with_origin_paths` 一致,目前仅支持参数不被梯度更新的模型(不需要梯度回传,或梯度仅更新其 LoRA
* 训练基础配置
* `--learning_rate`: 学习率。
* `--num_epochs`: 轮数Epoch
* `--trainable_models`: 可训练的模型,例如 `dit``vae``text_encoder`
* `--find_unused_parameters`: DDP 训练中是否存在未使用的参数,少数模型包含不参与梯度计算的冗余参数,需开启这一设置避免在多 GPU 训练中报错。
* `--weight_decay`:权重衰减大小,详见 [torch.optim.AdamW](https://docs.pytorch.org/docs/stable/generated/torch.optim.AdamW.html)。
* `--task`: 训练任务,默认为 `sft`,部分模型支持更多训练模式,请参考每个特定模型的文档。
* 输出配置
* `--output_path`: 模型保存路径。
* `--remove_prefix_in_ckpt`: 在模型文件的 state dict 中移除前缀。
* `--save_steps`: 保存模型的训练步数间隔,若此参数留空,则每个 epoch 保存一次。
* LoRA 配置
* `--lora_base_model`: LoRA 添加到哪个模型上。
* `--lora_target_modules`: LoRA 添加到哪些层上。
* `--lora_rank`: LoRA 的秩Rank
* `--lora_checkpoint`: LoRA 检查点的路径。如果提供此路径LoRA 将从此检查点加载。
* `--preset_lora_path`: 预置 LoRA 检查点路径,如果提供此路径,这一 LoRA 将会以融入基础模型的形式加载。此参数用于 LoRA 差分训练。
* `--preset_lora_model`: 预置 LoRA 融入的模型,例如 `dit`
* 梯度配置
* `--use_gradient_checkpointing`: 是否启用 gradient checkpointing。
* `--use_gradient_checkpointing_offload`: 是否将 gradient checkpointing 卸载到内存中。
* `--gradient_accumulation_steps`: 梯度累积步数。
* 图像宽高配置(适用于图像生成模型和视频生成模型)
* `--height`: 图像或视频的高度。将 `height``width` 留空以启用动态分辨率。
* `--width`: 图像或视频的宽度。将 `height``width` 留空以启用动态分辨率。
* `--max_pixels`: 图像或视频帧的最大像素面积,当启用动态分辨率时,分辨率大于这个数值的图片都会被缩小,分辨率小于这个数值的图片保持不变。
* Anima 专有参数
* `--tokenizer_path`: tokenizer 的路径,适用于文生图模型,留空则自动从远程下载。
* `--tokenizer_t5xxl_path`: T5-XXL tokenizer 的路径,适用于文生图模型,留空则自动从远程下载。
我们构建了一个样例图像数据集,以方便您进行测试,通过以下命令可以下载这个数据集:
```shell
modelscope download --dataset DiffSynth-Studio/example_image_dataset --local_dir ./data/example_image_dataset
```
我们为每个模型编写了推荐的训练脚本,请参考前文"模型总览"中的表格。关于如何编写模型训练脚本,请参考[模型训练](../Pipeline_Usage/Model_Training.md);更多高阶训练算法,请参考[训练框架详解](https://github.com/modelscope/DiffSynth-Studio/tree/main/docs/zh/Training/)。

View File

@@ -112,6 +112,8 @@ write_video_audio_ltx2(
|模型 ID|额外参数|推理|低显存推理|全量训练|全量训练后验证|LoRA 训练|LoRA 训练后验证| |模型 ID|额外参数|推理|低显存推理|全量训练|全量训练后验证|LoRA 训练|LoRA 训练后验证|
|-|-|-|-|-|-|-|-| |-|-|-|-|-|-|-|-|
|[Lightricks/LTX-2: OneStagePipeline-T2AV](https://www.modelscope.cn/models/Lightricks/LTX-2)||[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/ltx2/model_inference/LTX-2-T2AV-OneStage.py)|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/ltx2/model_inference_low_vram/LTX-2-T2AV-OneStage.py)|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/ltx2/model_training/full/LTX-2-T2AV-splited.sh)|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/ltx2/model_training/validate_full/LTX-2-T2AV.py)|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/ltx2/model_training/lora/LTX-2-T2AV-splited.sh)|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/ltx2/model_training/validate_lora/LTX-2-T2AV.py)| |[Lightricks/LTX-2: OneStagePipeline-T2AV](https://www.modelscope.cn/models/Lightricks/LTX-2)||[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/ltx2/model_inference/LTX-2-T2AV-OneStage.py)|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/ltx2/model_inference_low_vram/LTX-2-T2AV-OneStage.py)|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/ltx2/model_training/full/LTX-2-T2AV-splited.sh)|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/ltx2/model_training/validate_full/LTX-2-T2AV.py)|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/ltx2/model_training/lora/LTX-2-T2AV-splited.sh)|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/ltx2/model_training/validate_lora/LTX-2-T2AV.py)|
|[Lightricks/LTX-2-19b-IC-LoRA-Union-Control](https://www.modelscope.cn/models/Lightricks/LTX-2-19b-IC-LoRA-Union-Control)|`in_context_videos`,`in_context_downsample_factor`|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/ltx2/model_inference/LTX-2-T2AV-IC-LoRA-Union-Control.py)|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/ltx2/model_inference_low_vram/LTX-2-T2AV-IC-LoRA-Union-Control.py)|-|-|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/ltx2/model_training/lora/LTX-2-T2AV-IC-LoRA-splited.sh)|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/ltx2/model_training/validate_lora/LTX-2-T2AV-IC-LoRA.py)|
|[Lightricks/LTX-2-19b-IC-LoRA-Detailer](https://www.modelscope.cn/models/Lightricks/LTX-2-19b-IC-LoRA-Detailer)|`in_context_videos`,`in_context_downsample_factor`|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/ltx2/model_inference/LTX-2-T2AV-IC-LoRA-Detailer.py)|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/ltx2/model_inference_low_vram/LTX-2-T2AV-IC-LoRA-Detailer.py)|-|-|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/ltx2/model_training/lora/LTX-2-T2AV-IC-LoRA-splited.sh)|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/ltx2/model_training/validate_lora/LTX-2-T2AV-IC-LoRA.py)|
|[Lightricks/LTX-2: TwoStagePipeline-T2AV](https://www.modelscope.cn/models/Lightricks/LTX-2)||[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/ltx2/model_inference/LTX-2-T2AV-TwoStage.py)|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/ltx2/model_inference_low_vram/LTX-2-T2AV-TwoStage.py)|-|-|-|-| |[Lightricks/LTX-2: TwoStagePipeline-T2AV](https://www.modelscope.cn/models/Lightricks/LTX-2)||[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/ltx2/model_inference/LTX-2-T2AV-TwoStage.py)|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/ltx2/model_inference_low_vram/LTX-2-T2AV-TwoStage.py)|-|-|-|-|
|[Lightricks/LTX-2: DistilledPipeline-T2AV](https://www.modelscope.cn/models/Lightricks/LTX-2)||[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/ltx2/model_inference/LTX-2-T2AV-DistilledPipeline.py)|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/ltx2/model_inference_low_vram/LTX-2-T2AV-DistilledPipeline.py)|-|-|-|-| |[Lightricks/LTX-2: DistilledPipeline-T2AV](https://www.modelscope.cn/models/Lightricks/LTX-2)||[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/ltx2/model_inference/LTX-2-T2AV-DistilledPipeline.py)|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/ltx2/model_inference_low_vram/LTX-2-T2AV-DistilledPipeline.py)|-|-|-|-|
|[Lightricks/LTX-2: OneStagePipeline-I2AV](https://www.modelscope.cn/models/Lightricks/LTX-2)|`input_images`|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/ltx2/model_inference/LTX-2-I2AV-OneStage.py)|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/ltx2/model_inference_low_vram/LTX-2-I2AV-OneStage.py)|-|-|-|-| |[Lightricks/LTX-2: OneStagePipeline-I2AV](https://www.modelscope.cn/models/Lightricks/LTX-2)|`input_images`|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/ltx2/model_inference/LTX-2-I2AV-OneStage.py)|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/ltx2/model_inference_low_vram/LTX-2-I2AV-OneStage.py)|-|-|-|-|

View File

@@ -90,3 +90,4 @@ export CPU_AFFINITY_CONF=1
|-----------|------|-------------------| |-----------|------|-------------------|
| Wan 14B系列 | --initialize_model_on_cpu | 14B模型需要在cpu上进行初始化 | | Wan 14B系列 | --initialize_model_on_cpu | 14B模型需要在cpu上进行初始化 |
| Qwen-Image系列 | --initialize_model_on_cpu | 模型需要在cpu上进行初始化 | | Qwen-Image系列 | --initialize_model_on_cpu | 模型需要在cpu上进行初始化 |
| Z-Image 系列 | --enable_npu_patch | 使用NPU融合算子来替换Z-image模型中的对应算子以提升模型在NPU上的性能 |

View File

@@ -37,9 +37,9 @@ pip install torch torchvision --index-url https://download.pytorch.org/whl/rocm6
git clone https://github.com/modelscope/DiffSynth-Studio.git git clone https://github.com/modelscope/DiffSynth-Studio.git
cd DiffSynth-Studio cd DiffSynth-Studio
# aarch64/ARM # aarch64/ARM
pip install -e .[npu_aarch64] --extra-index-url "https://download.pytorch.org/whl/cpu" pip install -e .[npu_aarch64]
# x86 # x86
pip install -e .[npu] pip install -e .[npu] --extra-index-url "https://download.pytorch.org/whl/cpu"
使用 Ascend NPU 时,请将 Python 代码中的 `"cuda"` 改为 `"npu"`,详见[NPU 支持](../Pipeline_Usage/GPU_support.md#ascend-npu)。 使用 Ascend NPU 时,请将 Python 代码中的 `"cuda"` 改为 `"npu"`,详见[NPU 支持](../Pipeline_Usage/GPU_support.md#ascend-npu)。

View File

@@ -42,6 +42,8 @@ graph LR;
* [Qwen-Image](./Model_Details/Qwen-Image.md) * [Qwen-Image](./Model_Details/Qwen-Image.md)
* [FLUX.2](./Model_Details/FLUX2.md) * [FLUX.2](./Model_Details/FLUX2.md)
* [Z-Image](./Model_Details/Z-Image.md) * [Z-Image](./Model_Details/Z-Image.md)
* [Anima](./Model_Details/Anima.md)
* [LTX-2](./Model_Details/LTX-2.md)
## Section 3: 训练框架 ## Section 3: 训练框架

View File

@@ -27,6 +27,8 @@
Model_Details/Qwen-Image Model_Details/Qwen-Image
Model_Details/FLUX2 Model_Details/FLUX2
Model_Details/Z-Image Model_Details/Z-Image
Model_Details/Anima
Model_Details/LTX-2
.. toctree:: .. toctree::
:maxdepth: 2 :maxdepth: 2

View File

@@ -0,0 +1,19 @@
from diffsynth.pipelines.anima_image import AnimaImagePipeline, ModelConfig
import torch
pipe = AnimaImagePipeline.from_pretrained(
torch_dtype=torch.bfloat16,
device="cuda",
model_configs=[
ModelConfig(model_id="circlestone-labs/Anima", origin_file_pattern="split_files/diffusion_models/anima-preview.safetensors"),
ModelConfig(model_id="circlestone-labs/Anima", origin_file_pattern="split_files/text_encoders/qwen_3_06b_base.safetensors"),
ModelConfig(model_id="circlestone-labs/Anima", origin_file_pattern="split_files/vae/qwen_image_vae.safetensors"),
],
tokenizer_config=ModelConfig(model_id="Qwen/Qwen3-0.6B", origin_file_pattern="./"),
tokenizer_t5xxl_config=ModelConfig(model_id="stabilityai/stable-diffusion-3.5-large", origin_file_pattern="tokenizer_3/")
)
prompt = "Masterpiece, best quality, solo, long hair, wavy hair, silver hair, blue eyes, blue dress, medium breasts, dress, underwater, air bubble, floating hair, refraction, portrait."
negative_prompt = "worst quality, low quality, monochrome, zombie, interlocked fingers, Aissist, cleavage, nsfw,"
image = pipe(prompt, seed=0, num_inference_steps=50)
image.save("image.jpg")

View File

@@ -0,0 +1,30 @@
from diffsynth.pipelines.anima_image import AnimaImagePipeline, ModelConfig
import torch
vram_config = {
"offload_dtype": "disk",
"offload_device": "disk",
"onload_dtype": "disk",
"onload_device": "disk",
"preparing_dtype": torch.bfloat16,
"preparing_device": "cuda",
"computation_dtype": torch.bfloat16,
"computation_device": "cuda",
}
pipe = AnimaImagePipeline.from_pretrained(
torch_dtype=torch.bfloat16,
device="cuda",
model_configs=[
ModelConfig(model_id="circlestone-labs/Anima", origin_file_pattern="split_files/diffusion_models/anima-preview.safetensors", **vram_config),
ModelConfig(model_id="circlestone-labs/Anima", origin_file_pattern="split_files/text_encoders/qwen_3_06b_base.safetensors", **vram_config),
ModelConfig(model_id="circlestone-labs/Anima", origin_file_pattern="split_files/vae/qwen_image_vae.safetensors", **vram_config),
],
tokenizer_config=ModelConfig(model_id="Qwen/Qwen3-0.6B", origin_file_pattern="./"),
tokenizer_t5xxl_config=ModelConfig(model_id="stabilityai/stable-diffusion-3.5-large", origin_file_pattern="tokenizer_3/"),
vram_limit=torch.cuda.mem_get_info("cuda")[1] / (1024 ** 3) - 0.5,
)
prompt = "Masterpiece, best quality, solo, long hair, wavy hair, silver hair, blue eyes, blue dress, medium breasts, dress, underwater, air bubble, floating hair, refraction, portrait."
negative_prompt = "worst quality, low quality, monochrome, zombie, interlocked fingers, Aissist, cleavage, nsfw,"
image = pipe(prompt, seed=0, num_inference_steps=50)
image.save("image.jpg")

View File

@@ -0,0 +1,14 @@
accelerate launch examples/anima/model_training/train.py \
--dataset_base_path data/example_image_dataset \
--dataset_metadata_path data/example_image_dataset/metadata.csv \
--max_pixels 1048576 \
--dataset_repeat 50 \
--model_id_with_origin_paths "circlestone-labs/Anima:split_files/diffusion_models/anima-preview.safetensors,circlestone-labs/Anima:split_files/text_encoders/qwen_3_06b_base.safetensors,circlestone-labs/Anima:split_files/vae/qwen_image_vae.safetensors" \
--tokenizer_path "Qwen/Qwen3-0.6B:./" \
--tokenizer_t5xxl_path "stabilityai/stable-diffusion-3.5-large:tokenizer_3/" \
--learning_rate 1e-5 \
--num_epochs 2 \
--remove_prefix_in_ckpt "pipe.dit." \
--output_path "./models/train/anima-preview_full" \
--trainable_models "dit" \
--use_gradient_checkpointing

View File

@@ -0,0 +1,16 @@
accelerate launch examples/anima/model_training/train.py \
--dataset_base_path data/example_image_dataset \
--dataset_metadata_path data/example_image_dataset/metadata.csv \
--max_pixels 1048576 \
--dataset_repeat 50 \
--model_id_with_origin_paths "circlestone-labs/Anima:split_files/diffusion_models/anima-preview.safetensors,circlestone-labs/Anima:split_files/text_encoders/qwen_3_06b_base.safetensors,circlestone-labs/Anima:split_files/vae/qwen_image_vae.safetensors" \
--tokenizer_path "Qwen/Qwen3-0.6B:./" \
--tokenizer_t5xxl_path "stabilityai/stable-diffusion-3.5-large:tokenizer_3/" \
--learning_rate 1e-4 \
--num_epochs 5 \
--remove_prefix_in_ckpt "pipe.dit." \
--output_path "./models/train/anima-preview_lora" \
--lora_base_model "dit" \
--lora_target_modules "" \
--lora_rank 32 \
--use_gradient_checkpointing

View File

@@ -0,0 +1,145 @@
import torch, os, argparse, accelerate
from diffsynth.core import UnifiedDataset
from diffsynth.pipelines.anima_image import AnimaImagePipeline, ModelConfig
from diffsynth.diffusion import *
os.environ["TOKENIZERS_PARALLELISM"] = "false"
class AnimaTrainingModule(DiffusionTrainingModule):
def __init__(
self,
model_paths=None, model_id_with_origin_paths=None,
tokenizer_path=None, tokenizer_t5xxl_path=None,
trainable_models=None,
lora_base_model=None, lora_target_modules="", lora_rank=32, lora_checkpoint=None,
preset_lora_path=None, preset_lora_model=None,
use_gradient_checkpointing=True,
use_gradient_checkpointing_offload=False,
extra_inputs=None,
fp8_models=None,
offload_models=None,
device="cpu",
task="sft",
):
super().__init__()
# Load models
model_configs = self.parse_model_configs(model_paths, model_id_with_origin_paths, fp8_models=fp8_models, offload_models=offload_models, device=device)
tokenizer_config = self.parse_path_or_model_id(tokenizer_path, ModelConfig(model_id="Qwen/Qwen3-0.6B", origin_file_pattern="./"))
tokenizer_t5xxl_config = self.parse_path_or_model_id(tokenizer_t5xxl_path, ModelConfig(model_id="stabilityai/stable-diffusion-3.5-large", origin_file_pattern="tokenizer_3/"))
self.pipe = AnimaImagePipeline.from_pretrained(torch_dtype=torch.bfloat16, device=device, model_configs=model_configs, tokenizer_config=tokenizer_config, tokenizer_t5xxl_config=tokenizer_t5xxl_config)
self.pipe = self.split_pipeline_units(task, self.pipe, trainable_models, lora_base_model)
# Training mode
self.switch_pipe_to_training_mode(
self.pipe, trainable_models,
lora_base_model, lora_target_modules, lora_rank, lora_checkpoint,
preset_lora_path, preset_lora_model,
task=task,
)
# Other configs
self.use_gradient_checkpointing = use_gradient_checkpointing
self.use_gradient_checkpointing_offload = use_gradient_checkpointing_offload
self.extra_inputs = extra_inputs.split(",") if extra_inputs is not None else []
self.fp8_models = fp8_models
self.task = task
self.task_to_loss = {
"sft:data_process": lambda pipe, *args: args,
"direct_distill:data_process": lambda pipe, *args: args,
"sft": lambda pipe, inputs_shared, inputs_posi, inputs_nega: FlowMatchSFTLoss(pipe, **inputs_shared, **inputs_posi),
"sft:train": lambda pipe, inputs_shared, inputs_posi, inputs_nega: FlowMatchSFTLoss(pipe, **inputs_shared, **inputs_posi),
"direct_distill": lambda pipe, inputs_shared, inputs_posi, inputs_nega: DirectDistillLoss(pipe, **inputs_shared, **inputs_posi),
"direct_distill:train": lambda pipe, inputs_shared, inputs_posi, inputs_nega: DirectDistillLoss(pipe, **inputs_shared, **inputs_posi),
}
def get_pipeline_inputs(self, data):
inputs_posi = {"prompt": data["prompt"]}
inputs_nega = {"negative_prompt": ""}
inputs_shared = {
# Assume you are using this pipeline for inference,
# please fill in the input parameters.
"input_image": data["image"],
"height": data["image"].size[1],
"width": data["image"].size[0],
# Please do not modify the following parameters
# unless you clearly know what this will cause.
"cfg_scale": 1,
"rand_device": self.pipe.device,
"use_gradient_checkpointing": self.use_gradient_checkpointing,
"use_gradient_checkpointing_offload": self.use_gradient_checkpointing_offload,
}
inputs_shared = self.parse_extra_inputs(data, self.extra_inputs, inputs_shared)
return inputs_shared, inputs_posi, inputs_nega
def forward(self, data, inputs=None):
if inputs is None: inputs = self.get_pipeline_inputs(data)
inputs = self.transfer_data_to_device(inputs, self.pipe.device, self.pipe.torch_dtype)
for unit in self.pipe.units:
inputs = self.pipe.unit_runner(unit, self.pipe, *inputs)
loss = self.task_to_loss[self.task](self.pipe, *inputs)
return loss
def anima_parser():
parser = argparse.ArgumentParser(description="Training script for Anima models.")
parser = add_general_config(parser)
parser = add_image_size_config(parser)
parser.add_argument("--tokenizer_path", type=str, default=None, help="Path to tokenizer.")
parser.add_argument("--tokenizer_t5xxl_path", type=str, default=None, help="Path to tokenizer_t5xxl.")
return parser
if __name__ == "__main__":
parser = anima_parser()
args = parser.parse_args()
accelerator = accelerate.Accelerator(
gradient_accumulation_steps=args.gradient_accumulation_steps,
kwargs_handlers=[accelerate.DistributedDataParallelKwargs(find_unused_parameters=args.find_unused_parameters)],
)
dataset = UnifiedDataset(
base_path=args.dataset_base_path,
metadata_path=args.dataset_metadata_path,
repeat=args.dataset_repeat,
data_file_keys=args.data_file_keys.split(","),
main_data_operator=UnifiedDataset.default_image_operator(
base_path=args.dataset_base_path,
max_pixels=args.max_pixels,
height=args.height,
width=args.width,
height_division_factor=16,
width_division_factor=16,
)
)
model = AnimaTrainingModule(
model_paths=args.model_paths,
model_id_with_origin_paths=args.model_id_with_origin_paths,
tokenizer_path=args.tokenizer_path,
tokenizer_t5xxl_path=args.tokenizer_t5xxl_path,
trainable_models=args.trainable_models,
lora_base_model=args.lora_base_model,
lora_target_modules=args.lora_target_modules,
lora_rank=args.lora_rank,
lora_checkpoint=args.lora_checkpoint,
preset_lora_path=args.preset_lora_path,
preset_lora_model=args.preset_lora_model,
use_gradient_checkpointing=args.use_gradient_checkpointing,
use_gradient_checkpointing_offload=args.use_gradient_checkpointing_offload,
extra_inputs=args.extra_inputs,
fp8_models=args.fp8_models,
offload_models=args.offload_models,
task=args.task,
device=accelerator.device,
)
model_logger = ModelLogger(
args.output_path,
remove_prefix_in_ckpt=args.remove_prefix_in_ckpt,
)
launcher_map = {
"sft:data_process": launch_data_process_task,
"direct_distill:data_process": launch_data_process_task,
"sft": launch_training_task,
"sft:train": launch_training_task,
"direct_distill": launch_training_task,
"direct_distill:train": launch_training_task,
}
launcher_map[args.task](accelerator, dataset, model, model_logger, args=args)

View File

@@ -0,0 +1,21 @@
from diffsynth.pipelines.anima_image import AnimaImagePipeline, ModelConfig
from diffsynth.core import load_state_dict
import torch
pipe = AnimaImagePipeline.from_pretrained(
torch_dtype=torch.bfloat16,
device="cuda",
model_configs=[
ModelConfig(model_id="circlestone-labs/Anima", origin_file_pattern="split_files/diffusion_models/anima-preview.safetensors"),
ModelConfig(model_id="circlestone-labs/Anima", origin_file_pattern="split_files/text_encoders/qwen_3_06b_base.safetensors"),
ModelConfig(model_id="circlestone-labs/Anima", origin_file_pattern="split_files/vae/qwen_image_vae.safetensors"),
],
tokenizer_config=ModelConfig(model_id="Qwen/Qwen3-0.6B", origin_file_pattern="./"),
tokenizer_t5xxl_config=ModelConfig(model_id="stabilityai/stable-diffusion-3.5-large", origin_file_pattern="tokenizer_3/")
)
state_dict = load_state_dict("./models/train/anima-preview_full/epoch-1.safetensors", torch_dtype=torch.bfloat16)
pipe.dit.load_state_dict(state_dict)
prompt = "a dog"
image = pipe(prompt=prompt, seed=0)
image.save("image.jpg")

View File

@@ -0,0 +1,19 @@
from diffsynth.pipelines.anima_image import AnimaImagePipeline, ModelConfig
import torch
pipe = AnimaImagePipeline.from_pretrained(
torch_dtype=torch.bfloat16,
device="cuda",
model_configs=[
ModelConfig(model_id="circlestone-labs/Anima", origin_file_pattern="split_files/diffusion_models/anima-preview.safetensors"),
ModelConfig(model_id="circlestone-labs/Anima", origin_file_pattern="split_files/text_encoders/qwen_3_06b_base.safetensors"),
ModelConfig(model_id="circlestone-labs/Anima", origin_file_pattern="split_files/vae/qwen_image_vae.safetensors"),
],
tokenizer_config=ModelConfig(model_id="Qwen/Qwen3-0.6B", origin_file_pattern="./"),
tokenizer_t5xxl_config=ModelConfig(model_id="stabilityai/stable-diffusion-3.5-large", origin_file_pattern="tokenizer_3/")
)
pipe.load_lora(pipe.dit, "./models/train/anima-preview_lora/epoch-4.safetensors")
prompt = "a dog"
image = pipe(prompt=prompt, seed=0)
image.save("image.jpg")

View File

@@ -46,7 +46,6 @@ negative_prompt = (
"inconsistent tone, cinematic oversaturation, stylized filters, or AI artifacts." "inconsistent tone, cinematic oversaturation, stylized filters, or AI artifacts."
) )
height, width, num_frames = 512 * 2, 768 * 2, 121 height, width, num_frames = 512 * 2, 768 * 2, 121
height, width, num_frames = 512 * 2, 768 * 2, 121
dataset_snapshot_download( dataset_snapshot_download(
dataset_id="DiffSynth-Studio/examples_in_diffsynth", dataset_id="DiffSynth-Studio/examples_in_diffsynth",
local_dir="./", local_dir="./",

View File

@@ -0,0 +1,77 @@
import torch
from diffsynth.pipelines.ltx2_audio_video import LTX2AudioVideoPipeline, ModelConfig
from diffsynth.utils.data.media_io_ltx2 import write_video_audio_ltx2
from diffsynth.utils.data import VideoData
from modelscope import dataset_snapshot_download
vram_config = {
"offload_dtype": torch.bfloat16,
"offload_device": "cpu",
"onload_dtype": torch.bfloat16,
"onload_device": "cuda",
"preparing_dtype": torch.bfloat16,
"preparing_device": "cuda",
"computation_dtype": torch.bfloat16,
"computation_device": "cuda",
}
pipe = LTX2AudioVideoPipeline.from_pretrained(
torch_dtype=torch.bfloat16,
device="cuda",
model_configs=[
ModelConfig(model_id="google/gemma-3-12b-it-qat-q4_0-unquantized", origin_file_pattern="model-*.safetensors", **vram_config),
ModelConfig(model_id="DiffSynth-Studio/LTX-2-Repackage", origin_file_pattern="transformer.safetensors", **vram_config),
ModelConfig(model_id="DiffSynth-Studio/LTX-2-Repackage", origin_file_pattern="text_encoder_post_modules.safetensors", **vram_config),
ModelConfig(model_id="DiffSynth-Studio/LTX-2-Repackage", origin_file_pattern="video_vae_decoder.safetensors", **vram_config),
ModelConfig(model_id="DiffSynth-Studio/LTX-2-Repackage", origin_file_pattern="audio_vae_decoder.safetensors", **vram_config),
ModelConfig(model_id="DiffSynth-Studio/LTX-2-Repackage", origin_file_pattern="audio_vocoder.safetensors", **vram_config),
ModelConfig(model_id="DiffSynth-Studio/LTX-2-Repackage", origin_file_pattern="video_vae_encoder.safetensors", **vram_config),
ModelConfig(model_id="Lightricks/LTX-2", origin_file_pattern="ltx-2-spatial-upscaler-x2-1.0.safetensors", **vram_config),
],
tokenizer_config=ModelConfig(model_id="google/gemma-3-12b-it-qat-q4_0-unquantized"),
stage2_lora_config=ModelConfig(model_id="Lightricks/LTX-2", origin_file_pattern="ltx-2-19b-distilled-lora-384.safetensors"),
)
pipe.load_lora(pipe.dit, ModelConfig(model_id="Lightricks/LTX-2-19b-IC-LoRA-Detailer", origin_file_pattern="ltx-2-19b-ic-lora-detailer.safetensors"))
dataset_snapshot_download("DiffSynth-Studio/example_video_dataset", allow_file_pattern="ltx2/*", local_dir="data/example_video_dataset")
prompt = "[VISUAL]:Two cute orange cats, wearing boxing gloves, stand on a boxing ring and fight each other. [SOUNDS]:the sound of two cats boxing"
negative_prompt = (
"blurry, out of focus, overexposed, underexposed, low contrast, washed out colors, excessive noise, "
"grainy texture, poor lighting, flickering, motion blur, distorted proportions, unnatural skin tones, "
"deformed facial features, asymmetrical face, missing facial features, extra limbs, disfigured hands, "
"wrong hand count, artifacts around text, inconsistent perspective, camera shake, incorrect depth of "
"field, background too sharp, background clutter, distracting reflections, harsh shadows, inconsistent "
"lighting direction, color banding, cartoonish rendering, 3D CGI look, unrealistic materials, uncanny "
"valley effect, incorrect ethnicity, wrong gender, exaggerated expressions, wrong gaze direction, "
"mismatched lip sync, silent or muted audio, distorted voice, robotic voice, echo, background noise, "
"off-sync audio, incorrect dialogue, added dialogue, repetitive speech, jittery movement, awkward "
"pauses, incorrect timing, unnatural transitions, inconsistent framing, tilted camera, flat lighting, "
"inconsistent tone, cinematic oversaturation, stylized filters, or AI artifacts."
)
height, width, num_frames = 512 * 2, 768 * 2, 121
ref_scale_factor = 1
frame_rate = 24
# the frame rate of the video should better be the same with the reference video
# the spatial resolution of the first frame should be the resolution of stage 1 video generation divided by ref_scale_factor
input_video = VideoData("data/example_video_dataset/ltx2/video1.mp4", height=height // ref_scale_factor // 2, width=width // ref_scale_factor // 2)
input_video = input_video.raw_data()
video, audio = pipe(
prompt=prompt,
negative_prompt=negative_prompt,
seed=43,
height=height,
width=width,
num_frames=num_frames,
frame_rate=frame_rate,
in_context_videos=[input_video],
in_context_downsample_factor=ref_scale_factor,
tiled=True,
use_two_stage_pipeline=True,
clear_lora_before_state_two=True,
)
write_video_audio_ltx2(
video=video,
audio=audio,
output_path='ltx2_twostage_iclora.mp4',
fps=frame_rate,
audio_sample_rate=24000,
)

View File

@@ -0,0 +1,77 @@
import torch
from diffsynth.pipelines.ltx2_audio_video import LTX2AudioVideoPipeline, ModelConfig
from diffsynth.utils.data.media_io_ltx2 import write_video_audio_ltx2
from diffsynth.utils.data import VideoData
from modelscope import dataset_snapshot_download
vram_config = {
"offload_dtype": torch.bfloat16,
"offload_device": "cpu",
"onload_dtype": torch.bfloat16,
"onload_device": "cuda",
"preparing_dtype": torch.bfloat16,
"preparing_device": "cuda",
"computation_dtype": torch.bfloat16,
"computation_device": "cuda",
}
pipe = LTX2AudioVideoPipeline.from_pretrained(
torch_dtype=torch.bfloat16,
device="cuda",
model_configs=[
ModelConfig(model_id="google/gemma-3-12b-it-qat-q4_0-unquantized", origin_file_pattern="model-*.safetensors", **vram_config),
ModelConfig(model_id="DiffSynth-Studio/LTX-2-Repackage", origin_file_pattern="transformer.safetensors", **vram_config),
ModelConfig(model_id="DiffSynth-Studio/LTX-2-Repackage", origin_file_pattern="text_encoder_post_modules.safetensors", **vram_config),
ModelConfig(model_id="DiffSynth-Studio/LTX-2-Repackage", origin_file_pattern="video_vae_decoder.safetensors", **vram_config),
ModelConfig(model_id="DiffSynth-Studio/LTX-2-Repackage", origin_file_pattern="audio_vae_decoder.safetensors", **vram_config),
ModelConfig(model_id="DiffSynth-Studio/LTX-2-Repackage", origin_file_pattern="audio_vocoder.safetensors", **vram_config),
ModelConfig(model_id="DiffSynth-Studio/LTX-2-Repackage", origin_file_pattern="video_vae_encoder.safetensors", **vram_config),
ModelConfig(model_id="Lightricks/LTX-2", origin_file_pattern="ltx-2-spatial-upscaler-x2-1.0.safetensors", **vram_config),
],
tokenizer_config=ModelConfig(model_id="google/gemma-3-12b-it-qat-q4_0-unquantized"),
stage2_lora_config=ModelConfig(model_id="Lightricks/LTX-2", origin_file_pattern="ltx-2-19b-distilled-lora-384.safetensors"),
)
pipe.load_lora(pipe.dit, ModelConfig(model_id="Lightricks/LTX-2-19b-IC-LoRA-Union-Control", origin_file_pattern="ltx-2-19b-ic-lora-union-control-ref0.5.safetensors"))
dataset_snapshot_download("DiffSynth-Studio/example_video_dataset", allow_file_pattern="ltx2/*", local_dir="data/example_video_dataset")
prompt = "[VISUAL]:Two cute orange cats, wearing boxing gloves, stand on a boxing ring and fight each other. [SOUNDS]:the sound of two cats boxing"
negative_prompt = (
"blurry, out of focus, overexposed, underexposed, low contrast, washed out colors, excessive noise, "
"grainy texture, poor lighting, flickering, motion blur, distorted proportions, unnatural skin tones, "
"deformed facial features, asymmetrical face, missing facial features, extra limbs, disfigured hands, "
"wrong hand count, artifacts around text, inconsistent perspective, camera shake, incorrect depth of "
"field, background too sharp, background clutter, distracting reflections, harsh shadows, inconsistent "
"lighting direction, color banding, cartoonish rendering, 3D CGI look, unrealistic materials, uncanny "
"valley effect, incorrect ethnicity, wrong gender, exaggerated expressions, wrong gaze direction, "
"mismatched lip sync, silent or muted audio, distorted voice, robotic voice, echo, background noise, "
"off-sync audio, incorrect dialogue, added dialogue, repetitive speech, jittery movement, awkward "
"pauses, incorrect timing, unnatural transitions, inconsistent framing, tilted camera, flat lighting, "
"inconsistent tone, cinematic oversaturation, stylized filters, or AI artifacts."
)
height, width, num_frames = 512 * 2, 768 * 2, 121
ref_scale_factor = 2
frame_rate = 24
# the frame rate of the video should better be the same with the reference video
# the spatial resolution of the first frame should be the resolution of stage 1 video generation divided by ref_scale_factor
input_video = VideoData("data/example_video_dataset/ltx2/depth_video.mp4", height=height // ref_scale_factor // 2, width=width // ref_scale_factor // 2)
input_video = input_video.raw_data()
video, audio = pipe(
prompt=prompt,
negative_prompt=negative_prompt,
seed=43,
height=height,
width=width,
num_frames=num_frames,
frame_rate=frame_rate,
in_context_videos=[input_video],
in_context_downsample_factor=ref_scale_factor,
tiled=True,
use_two_stage_pipeline=True,
clear_lora_before_state_two=True,
)
write_video_audio_ltx2(
video=video,
audio=audio,
output_path='ltx2_twostage_iclora.mp4',
fps=frame_rate,
audio_sample_rate=24000,
)

View File

@@ -0,0 +1,77 @@
import torch
from diffsynth.pipelines.ltx2_audio_video import LTX2AudioVideoPipeline, ModelConfig
from diffsynth.utils.data.media_io_ltx2 import write_video_audio_ltx2
from diffsynth.utils.data import VideoData
from modelscope import dataset_snapshot_download
vram_config = {
"offload_dtype": torch.float8_e5m2,
"offload_device": "cpu",
"onload_dtype": torch.float8_e5m2,
"onload_device": "cpu",
"preparing_dtype": torch.float8_e5m2,
"preparing_device": "cuda",
"computation_dtype": torch.bfloat16,
"computation_device": "cuda",
}
pipe = LTX2AudioVideoPipeline.from_pretrained(
torch_dtype=torch.bfloat16,
device="cuda",
model_configs=[
ModelConfig(model_id="google/gemma-3-12b-it-qat-q4_0-unquantized", origin_file_pattern="model-*.safetensors", **vram_config),
ModelConfig(model_id="DiffSynth-Studio/LTX-2-Repackage", origin_file_pattern="transformer.safetensors", **vram_config),
ModelConfig(model_id="DiffSynth-Studio/LTX-2-Repackage", origin_file_pattern="text_encoder_post_modules.safetensors", **vram_config),
ModelConfig(model_id="DiffSynth-Studio/LTX-2-Repackage", origin_file_pattern="video_vae_decoder.safetensors", **vram_config),
ModelConfig(model_id="DiffSynth-Studio/LTX-2-Repackage", origin_file_pattern="audio_vae_decoder.safetensors", **vram_config),
ModelConfig(model_id="DiffSynth-Studio/LTX-2-Repackage", origin_file_pattern="audio_vocoder.safetensors", **vram_config),
ModelConfig(model_id="DiffSynth-Studio/LTX-2-Repackage", origin_file_pattern="video_vae_encoder.safetensors", **vram_config),
ModelConfig(model_id="Lightricks/LTX-2", origin_file_pattern="ltx-2-spatial-upscaler-x2-1.0.safetensors", **vram_config),
],
tokenizer_config=ModelConfig(model_id="google/gemma-3-12b-it-qat-q4_0-unquantized"),
stage2_lora_config=ModelConfig(model_id="Lightricks/LTX-2", origin_file_pattern="ltx-2-19b-distilled-lora-384.safetensors"),
)
pipe.load_lora(pipe.dit, ModelConfig(model_id="Lightricks/LTX-2-19b-IC-LoRA-Detailer", origin_file_pattern="ltx-2-19b-ic-lora-detailer.safetensors"))
dataset_snapshot_download("DiffSynth-Studio/example_video_dataset", allow_file_pattern="ltx2/*", local_dir="data/example_video_dataset")
prompt = "[VISUAL]:Two cute orange cats, wearing boxing gloves, stand on a boxing ring and fight each other. [SOUNDS]:the sound of two cats boxing"
negative_prompt = (
"blurry, out of focus, overexposed, underexposed, low contrast, washed out colors, excessive noise, "
"grainy texture, poor lighting, flickering, motion blur, distorted proportions, unnatural skin tones, "
"deformed facial features, asymmetrical face, missing facial features, extra limbs, disfigured hands, "
"wrong hand count, artifacts around text, inconsistent perspective, camera shake, incorrect depth of "
"field, background too sharp, background clutter, distracting reflections, harsh shadows, inconsistent "
"lighting direction, color banding, cartoonish rendering, 3D CGI look, unrealistic materials, uncanny "
"valley effect, incorrect ethnicity, wrong gender, exaggerated expressions, wrong gaze direction, "
"mismatched lip sync, silent or muted audio, distorted voice, robotic voice, echo, background noise, "
"off-sync audio, incorrect dialogue, added dialogue, repetitive speech, jittery movement, awkward "
"pauses, incorrect timing, unnatural transitions, inconsistent framing, tilted camera, flat lighting, "
"inconsistent tone, cinematic oversaturation, stylized filters, or AI artifacts."
)
height, width, num_frames = 512 * 2, 768 * 2, 121
ref_scale_factor = 1
frame_rate = 24
# the frame rate of the video should better be the same with the reference video
# the spatial resolution of the first frame should be the resolution of stage 1 video generation divided by ref_scale_factor
input_video = VideoData("data/example_video_dataset/ltx2/video1.mp4", height=height // ref_scale_factor // 2, width=width // ref_scale_factor // 2)
input_video = input_video.raw_data()
video, audio = pipe(
prompt=prompt,
negative_prompt=negative_prompt,
seed=43,
height=height,
width=width,
num_frames=num_frames,
frame_rate=frame_rate,
in_context_videos=[input_video],
in_context_downsample_factor=ref_scale_factor,
tiled=True,
use_two_stage_pipeline=True,
clear_lora_before_state_two=True,
)
write_video_audio_ltx2(
video=video,
audio=audio,
output_path='ltx2_twostage_iclora.mp4',
fps=frame_rate,
audio_sample_rate=24000,
)

View File

@@ -0,0 +1,77 @@
import torch
from diffsynth.pipelines.ltx2_audio_video import LTX2AudioVideoPipeline, ModelConfig
from diffsynth.utils.data.media_io_ltx2 import write_video_audio_ltx2
from diffsynth.utils.data import VideoData
from modelscope import dataset_snapshot_download
vram_config = {
"offload_dtype": torch.float8_e5m2,
"offload_device": "cpu",
"onload_dtype": torch.float8_e5m2,
"onload_device": "cpu",
"preparing_dtype": torch.float8_e5m2,
"preparing_device": "cuda",
"computation_dtype": torch.bfloat16,
"computation_device": "cuda",
}
pipe = LTX2AudioVideoPipeline.from_pretrained(
torch_dtype=torch.bfloat16,
device="cuda",
model_configs=[
ModelConfig(model_id="google/gemma-3-12b-it-qat-q4_0-unquantized", origin_file_pattern="model-*.safetensors", **vram_config),
ModelConfig(model_id="DiffSynth-Studio/LTX-2-Repackage", origin_file_pattern="transformer.safetensors", **vram_config),
ModelConfig(model_id="DiffSynth-Studio/LTX-2-Repackage", origin_file_pattern="text_encoder_post_modules.safetensors", **vram_config),
ModelConfig(model_id="DiffSynth-Studio/LTX-2-Repackage", origin_file_pattern="video_vae_decoder.safetensors", **vram_config),
ModelConfig(model_id="DiffSynth-Studio/LTX-2-Repackage", origin_file_pattern="audio_vae_decoder.safetensors", **vram_config),
ModelConfig(model_id="DiffSynth-Studio/LTX-2-Repackage", origin_file_pattern="audio_vocoder.safetensors", **vram_config),
ModelConfig(model_id="DiffSynth-Studio/LTX-2-Repackage", origin_file_pattern="video_vae_encoder.safetensors", **vram_config),
ModelConfig(model_id="Lightricks/LTX-2", origin_file_pattern="ltx-2-spatial-upscaler-x2-1.0.safetensors", **vram_config),
],
tokenizer_config=ModelConfig(model_id="google/gemma-3-12b-it-qat-q4_0-unquantized"),
stage2_lora_config=ModelConfig(model_id="Lightricks/LTX-2", origin_file_pattern="ltx-2-19b-distilled-lora-384.safetensors"),
)
pipe.load_lora(pipe.dit, ModelConfig(model_id="Lightricks/LTX-2-19b-IC-LoRA-Union-Control", origin_file_pattern="ltx-2-19b-ic-lora-union-control-ref0.5.safetensors"))
dataset_snapshot_download("DiffSynth-Studio/example_video_dataset", allow_file_pattern="ltx2/*", local_dir="data/example_video_dataset")
prompt = "[VISUAL]:Two cute orange cats, wearing boxing gloves, stand on a boxing ring and fight each other. [SOUNDS]:the sound of two cats boxing"
negative_prompt = (
"blurry, out of focus, overexposed, underexposed, low contrast, washed out colors, excessive noise, "
"grainy texture, poor lighting, flickering, motion blur, distorted proportions, unnatural skin tones, "
"deformed facial features, asymmetrical face, missing facial features, extra limbs, disfigured hands, "
"wrong hand count, artifacts around text, inconsistent perspective, camera shake, incorrect depth of "
"field, background too sharp, background clutter, distracting reflections, harsh shadows, inconsistent "
"lighting direction, color banding, cartoonish rendering, 3D CGI look, unrealistic materials, uncanny "
"valley effect, incorrect ethnicity, wrong gender, exaggerated expressions, wrong gaze direction, "
"mismatched lip sync, silent or muted audio, distorted voice, robotic voice, echo, background noise, "
"off-sync audio, incorrect dialogue, added dialogue, repetitive speech, jittery movement, awkward "
"pauses, incorrect timing, unnatural transitions, inconsistent framing, tilted camera, flat lighting, "
"inconsistent tone, cinematic oversaturation, stylized filters, or AI artifacts."
)
height, width, num_frames = 512 * 2, 768 * 2, 121
ref_scale_factor = 2
frame_rate = 24
# the frame rate of the video should better be the same with the reference video
# the spatial resolution of the first frame should be the resolution of stage 1 video generation divided by ref_scale_factor
input_video = VideoData("data/example_video_dataset/ltx2/depth_video.mp4", height=height // ref_scale_factor // 2, width=width // ref_scale_factor // 2)
input_video = input_video.raw_data()
video, audio = pipe(
prompt=prompt,
negative_prompt=negative_prompt,
seed=43,
height=height,
width=width,
num_frames=num_frames,
frame_rate=frame_rate,
in_context_videos=[input_video],
in_context_downsample_factor=ref_scale_factor,
tiled=True,
use_two_stage_pipeline=True,
clear_lora_before_state_two=True,
)
write_video_audio_ltx2(
video=video,
audio=audio,
output_path='ltx2_twostage_iclora.mp4',
fps=frame_rate,
audio_sample_rate=24000,
)

View File

@@ -6,7 +6,7 @@ accelerate launch examples/ltx2/model_training/train.py \
--extra_inputs "input_audio" \ --extra_inputs "input_audio" \
--height 512 \ --height 512 \
--width 768 \ --width 768 \
--num_frames 49 \ --num_frames 121 \
--dataset_repeat 1 \ --dataset_repeat 1 \
--model_id_with_origin_paths "DiffSynth-Studio/LTX-2-Repackage:text_encoder_post_modules.safetensors,DiffSynth-Studio/LTX-2-Repackage:video_vae_encoder.safetensors,DiffSynth-Studio/LTX-2-Repackage:audio_vae_encoder.safetensors,google/gemma-3-12b-it-qat-q4_0-unquantized:model-*.safetensors" \ --model_id_with_origin_paths "DiffSynth-Studio/LTX-2-Repackage:text_encoder_post_modules.safetensors,DiffSynth-Studio/LTX-2-Repackage:video_vae_encoder.safetensors,DiffSynth-Studio/LTX-2-Repackage:audio_vae_encoder.safetensors,google/gemma-3-12b-it-qat-q4_0-unquantized:model-*.safetensors" \
--learning_rate 1e-4 \ --learning_rate 1e-4 \
@@ -23,7 +23,7 @@ accelerate launch --config_file examples/qwen_image/model_training/full/accelera
--extra_inputs "input_audio" \ --extra_inputs "input_audio" \
--height 512 \ --height 512 \
--width 768 \ --width 768 \
--num_frames 49 \ --num_frames 121 \
--dataset_repeat 100 \ --dataset_repeat 100 \
--model_id_with_origin_paths "DiffSynth-Studio/LTX-2-Repackage:transformer.safetensors" \ --model_id_with_origin_paths "DiffSynth-Studio/LTX-2-Repackage:transformer.safetensors" \
--learning_rate 1e-4 \ --learning_rate 1e-4 \

View File

@@ -0,0 +1,39 @@
# Splited Training
accelerate launch examples/ltx2/model_training/train.py \
--dataset_base_path data/example_video_dataset/ltx2 \
--dataset_metadata_path data/example_video_dataset/ltx2_t2av_iclora.json \
--data_file_keys "video,input_audio,in_context_videos" \
--extra_inputs "input_audio,in_context_videos,in_context_downsample_factor,frame_rate" \
--height 512 \
--width 768 \
--num_frames 81 \
--dataset_repeat 1 \
--model_id_with_origin_paths "DiffSynth-Studio/LTX-2-Repackage:text_encoder_post_modules.safetensors,DiffSynth-Studio/LTX-2-Repackage:video_vae_encoder.safetensors,DiffSynth-Studio/LTX-2-Repackage:audio_vae_encoder.safetensors,google/gemma-3-12b-it-qat-q4_0-unquantized:model-*.safetensors" \
--learning_rate 1e-4 \
--num_epochs 5 \
--remove_prefix_in_ckpt "pipe.dit." \
--output_path "./models/train/LTX2-T2AV-IC-LoRA-splited-cache" \
--lora_base_model "dit" \
--lora_target_modules "to_k,to_q,to_v,to_out.0" \
--lora_rank 32 \
--use_gradient_checkpointing \
--task "sft:data_process"
accelerate launch examples/ltx2/model_training/train.py \
--dataset_base_path ./models/train/LTX2-T2AV-IC-LoRA-splited-cache \
--data_file_keys "video,input_audio,in_context_videos" \
--extra_inputs "input_audio,in_context_videos,in_context_downsample_factor,frame_rate" \
--height 512 \
--width 768 \
--num_frames 81 \
--dataset_repeat 100 \
--model_id_with_origin_paths "DiffSynth-Studio/LTX-2-Repackage:transformer.safetensors" \
--learning_rate 1e-4 \
--num_epochs 5 \
--remove_prefix_in_ckpt "pipe.dit." \
--output_path "./models/train/LTX2-T2AV-IC-LoRA" \
--lora_base_model "dit" \
--lora_target_modules "to_k,to_q,to_v,to_out.0" \
--lora_rank 32 \
--use_gradient_checkpointing \
--task "sft:train"

View File

@@ -24,7 +24,7 @@ accelerate launch examples/ltx2/model_training/train.py \
--dataset_metadata_path data/example_video_dataset/ltx2_t2av.csv \ --dataset_metadata_path data/example_video_dataset/ltx2_t2av.csv \
--height 512 \ --height 512 \
--width 768 \ --width 768 \
--num_frames 49\ --num_frames 121\
--dataset_repeat 1 \ --dataset_repeat 1 \
--model_id_with_origin_paths "DiffSynth-Studio/LTX-2-Repackage:text_encoder_post_modules.safetensors,DiffSynth-Studio/LTX-2-Repackage:video_vae_encoder.safetensors,DiffSynth-Studio/LTX-2-Repackage:audio_vae_encoder.safetensors,google/gemma-3-12b-it-qat-q4_0-unquantized:model-*.safetensors" \ --model_id_with_origin_paths "DiffSynth-Studio/LTX-2-Repackage:text_encoder_post_modules.safetensors,DiffSynth-Studio/LTX-2-Repackage:video_vae_encoder.safetensors,DiffSynth-Studio/LTX-2-Repackage:audio_vae_encoder.safetensors,google/gemma-3-12b-it-qat-q4_0-unquantized:model-*.safetensors" \
--learning_rate 1e-4 \ --learning_rate 1e-4 \
@@ -42,7 +42,7 @@ accelerate launch examples/ltx2/model_training/train.py \
--dataset_base_path ./models/train/LTX2-T2AV-noaudio_lora-splited-cache \ --dataset_base_path ./models/train/LTX2-T2AV-noaudio_lora-splited-cache \
--height 512 \ --height 512 \
--width 768 \ --width 768 \
--num_frames 49\ --num_frames 121\
--dataset_repeat 100 \ --dataset_repeat 100 \
--model_id_with_origin_paths "DiffSynth-Studio/LTX-2-Repackage:transformer.safetensors" \ --model_id_with_origin_paths "DiffSynth-Studio/LTX-2-Repackage:transformer.safetensors" \
--learning_rate 1e-4 \ --learning_rate 1e-4 \

View File

@@ -27,7 +27,7 @@ accelerate launch examples/ltx2/model_training/train.py \
--extra_inputs "input_audio" \ --extra_inputs "input_audio" \
--height 512 \ --height 512 \
--width 768 \ --width 768 \
--num_frames 49 \ --num_frames 121 \
--dataset_repeat 1 \ --dataset_repeat 1 \
--model_id_with_origin_paths "DiffSynth-Studio/LTX-2-Repackage:text_encoder_post_modules.safetensors,DiffSynth-Studio/LTX-2-Repackage:video_vae_encoder.safetensors,DiffSynth-Studio/LTX-2-Repackage:audio_vae_encoder.safetensors,google/gemma-3-12b-it-qat-q4_0-unquantized:model-*.safetensors" \ --model_id_with_origin_paths "DiffSynth-Studio/LTX-2-Repackage:text_encoder_post_modules.safetensors,DiffSynth-Studio/LTX-2-Repackage:video_vae_encoder.safetensors,DiffSynth-Studio/LTX-2-Repackage:audio_vae_encoder.safetensors,google/gemma-3-12b-it-qat-q4_0-unquantized:model-*.safetensors" \
--learning_rate 1e-4 \ --learning_rate 1e-4 \
@@ -46,7 +46,7 @@ accelerate launch examples/ltx2/model_training/train.py \
--extra_inputs "input_audio" \ --extra_inputs "input_audio" \
--height 512 \ --height 512 \
--width 768 \ --width 768 \
--num_frames 49 \ --num_frames 121 \
--dataset_repeat 100 \ --dataset_repeat 100 \
--model_id_with_origin_paths "DiffSynth-Studio/LTX-2-Repackage:transformer.safetensors" \ --model_id_with_origin_paths "DiffSynth-Studio/LTX-2-Repackage:transformer.safetensors" \
--learning_rate 1e-4 \ --learning_rate 1e-4 \

View File

@@ -1,7 +1,6 @@
import torch, os, argparse, accelerate, warnings import torch, os, argparse, accelerate, warnings
from diffsynth.core import UnifiedDataset from diffsynth.core import UnifiedDataset
from diffsynth.core.data.operators import LoadAudioWithTorchaudio, ToAbsolutePath from diffsynth.core.data.operators import LoadAudioWithTorchaudio, ToAbsolutePath, RouteByType, SequencialProcess
from diffsynth.pipelines.wan_video import WanVideoPipeline, ModelConfig
from diffsynth.pipelines.ltx2_audio_video import LTX2AudioVideoPipeline, ModelConfig from diffsynth.pipelines.ltx2_audio_video import LTX2AudioVideoPipeline, ModelConfig
from diffsynth.diffusion import * from diffsynth.diffusion import *
os.environ["TOKENIZERS_PARALLELISM"] = "false" os.environ["TOKENIZERS_PARALLELISM"] = "false"
@@ -69,6 +68,7 @@ class LTX2TrainingModule(DiffusionTrainingModule):
"height": data["video"][0].size[1], "height": data["video"][0].size[1],
"width": data["video"][0].size[0], "width": data["video"][0].size[0],
"num_frames": len(data["video"]), "num_frames": len(data["video"]),
"frame_rate": data.get("frame_rate", 24),
# Please do not modify the following parameters # Please do not modify the following parameters
# unless you clearly know what this will cause. # unless you clearly know what this will cause.
"cfg_scale": 1, "cfg_scale": 1,
@@ -108,24 +108,29 @@ if __name__ == "__main__":
gradient_accumulation_steps=args.gradient_accumulation_steps, gradient_accumulation_steps=args.gradient_accumulation_steps,
kwargs_handlers=[accelerate.DistributedDataParallelKwargs(find_unused_parameters=args.find_unused_parameters)], kwargs_handlers=[accelerate.DistributedDataParallelKwargs(find_unused_parameters=args.find_unused_parameters)],
) )
video_processor = UnifiedDataset.default_video_operator(
base_path=args.dataset_base_path,
max_pixels=args.max_pixels,
height=args.height,
width=args.width,
height_division_factor=32,
width_division_factor=32,
num_frames=args.num_frames,
time_division_factor=8,
time_division_remainder=1,
)
dataset = UnifiedDataset( dataset = UnifiedDataset(
base_path=args.dataset_base_path, base_path=args.dataset_base_path,
metadata_path=args.dataset_metadata_path, metadata_path=args.dataset_metadata_path,
repeat=args.dataset_repeat, repeat=args.dataset_repeat,
data_file_keys=args.data_file_keys.split(","), data_file_keys=args.data_file_keys.split(","),
main_data_operator=UnifiedDataset.default_video_operator( main_data_operator=video_processor,
base_path=args.dataset_base_path,
max_pixels=args.max_pixels,
height=args.height,
width=args.width,
height_division_factor=16,
width_division_factor=16,
num_frames=args.num_frames,
time_division_factor=4,
time_division_remainder=1,
),
special_operator_map={ special_operator_map={
"input_audio": ToAbsolutePath(args.dataset_base_path) >> LoadAudioWithTorchaudio(duration=float(args.num_frames) / float(args.frame_rate)), "input_audio": ToAbsolutePath(args.dataset_base_path) >> LoadAudioWithTorchaudio(duration=float(args.num_frames) / float(args.frame_rate)),
"in_context_videos": RouteByType(operator_map=[
(str, video_processor),
(list, SequencialProcess(video_processor)),
]),
} }
) )
model = LTX2TrainingModule( model = LTX2TrainingModule(

View File

@@ -27,7 +27,7 @@ pipe = LTX2AudioVideoPipeline.from_pretrained(
) )
prompt = "A beautiful sunset over the ocean." prompt = "A beautiful sunset over the ocean."
negative_prompt = "blurry, out of focus, overexposed, underexposed, low contrast, washed out colors, excessive noise, grainy texture, poor lighting, flickering, motion blur, distorted proportions, unnatural skin tones, deformed facial features, asymmetrical face, missing facial features, extra limbs, disfigured hands, wrong hand count, artifacts around text, inconsistent perspective, camera shake, incorrect depth of field, background too sharp, background clutter, distracting reflections, harsh shadows, inconsistent lighting direction, color banding, cartoonish rendering, 3D CGI look, unrealistic materials, uncanny valley effect, incorrect ethnicity, wrong gender, exaggerated expressions, wrong gaze direction, mismatched lip sync, silent or muted audio, distorted voice, robotic voice, echo, background noise, off-sync audio, incorrect dialogue, added dialogue, repetitive speech, jittery movement, awkward pauses, incorrect timing, unnatural transitions, inconsistent framing, tilted camera, flat lighting, inconsistent tone, cinematic oversaturation, stylized filters, or AI artifacts." negative_prompt = "blurry, out of focus, overexposed, underexposed, low contrast, washed out colors, excessive noise, grainy texture, poor lighting, flickering, motion blur, distorted proportions, unnatural skin tones, deformed facial features, asymmetrical face, missing facial features, extra limbs, disfigured hands, wrong hand count, artifacts around text, inconsistent perspective, camera shake, incorrect depth of field, background too sharp, background clutter, distracting reflections, harsh shadows, inconsistent lighting direction, color banding, cartoonish rendering, 3D CGI look, unrealistic materials, uncanny valley effect, incorrect ethnicity, wrong gender, exaggerated expressions, wrong gaze direction, mismatched lip sync, silent or muted audio, distorted voice, robotic voice, echo, background noise, off-sync audio, incorrect dialogue, added dialogue, repetitive speech, jittery movement, awkward pauses, incorrect timing, unnatural transitions, inconsistent framing, tilted camera, flat lighting, inconsistent tone, cinematic oversaturation, stylized filters, or AI artifacts."
height, width, num_frames = 512, 768, 49 height, width, num_frames = 512, 768, 121
video, audio = pipe( video, audio = pipe(
prompt=prompt, prompt=prompt,
negative_prompt=negative_prompt, negative_prompt=negative_prompt,

View File

@@ -0,0 +1,56 @@
import torch
from diffsynth.pipelines.ltx2_audio_video import LTX2AudioVideoPipeline, ModelConfig
from diffsynth.utils.data.media_io_ltx2 import write_video_audio_ltx2
from diffsynth.utils.data import VideoData
vram_config = {
"offload_dtype": torch.bfloat16,
"offload_device": "cpu",
"onload_dtype": torch.bfloat16,
"onload_device": "cuda",
"preparing_dtype": torch.bfloat16,
"preparing_device": "cuda",
"computation_dtype": torch.bfloat16,
"computation_device": "cuda",
}
pipe = LTX2AudioVideoPipeline.from_pretrained(
torch_dtype=torch.bfloat16,
device="cuda",
model_configs=[
ModelConfig(model_id="google/gemma-3-12b-it-qat-q4_0-unquantized", origin_file_pattern="model-*.safetensors", **vram_config),
ModelConfig(model_id="DiffSynth-Studio/LTX-2-Repackage", origin_file_pattern="transformer.safetensors", **vram_config),
ModelConfig(model_id="DiffSynth-Studio/LTX-2-Repackage", origin_file_pattern="text_encoder_post_modules.safetensors", **vram_config),
ModelConfig(model_id="DiffSynth-Studio/LTX-2-Repackage", origin_file_pattern="video_vae_decoder.safetensors", **vram_config),
ModelConfig(model_id="DiffSynth-Studio/LTX-2-Repackage", origin_file_pattern="audio_vae_decoder.safetensors", **vram_config),
ModelConfig(model_id="DiffSynth-Studio/LTX-2-Repackage", origin_file_pattern="audio_vocoder.safetensors", **vram_config),
ModelConfig(model_id="DiffSynth-Studio/LTX-2-Repackage", origin_file_pattern="video_vae_encoder.safetensors", **vram_config),
],
tokenizer_config=ModelConfig(model_id="google/gemma-3-12b-it-qat-q4_0-unquantized"),
)
pipe.load_lora(pipe.dit, "./models/train/LTX2-T2AV-IC-LoRA/epoch-4.safetensors")
prompt = "[VISUAL]:Two cute orange cats, wearing boxing gloves, stand on a boxing ring and fight each other. [SOUNDS]:the sound of two cats boxing"
negative_prompt = "blurry, out of focus, overexposed, underexposed, low contrast, washed out colors, excessive noise, grainy texture, poor lighting, flickering, motion blur, distorted proportions, unnatural skin tones, deformed facial features, asymmetrical face, missing facial features, extra limbs, disfigured hands, wrong hand count, artifacts around text, inconsistent perspective, camera shake, incorrect depth of field, background too sharp, background clutter, distracting reflections, harsh shadows, inconsistent lighting direction, color banding, cartoonish rendering, 3D CGI look, unrealistic materials, uncanny valley effect, incorrect ethnicity, wrong gender, exaggerated expressions, wrong gaze direction, mismatched lip sync, silent or muted audio, distorted voice, robotic voice, echo, background noise, off-sync audio, incorrect dialogue, added dialogue, repetitive speech, jittery movement, awkward pauses, incorrect timing, unnatural transitions, inconsistent framing, tilted camera, flat lighting, inconsistent tone, cinematic oversaturation, stylized filters, or AI artifacts."
height, width, num_frames = 512, 768, 81
ref_scale_factor = 2
frame_rate = 24
input_video = VideoData("data/example_video_dataset/ltx2/depth_video.mp4", height=height // ref_scale_factor // 2, width=width // ref_scale_factor // 2)
input_video = input_video.raw_data()
video, audio = pipe(
prompt=prompt,
negative_prompt=negative_prompt,
seed=43,
height=height,
width=width,
num_frames=num_frames,
frame_rate=frame_rate,
tiled=True,
in_context_videos=[input_video],
in_context_downsample_factor=ref_scale_factor,
)
write_video_audio_ltx2(
video=video,
audio=audio,
output_path='ltx2_onestage_ic.mp4',
fps=frame_rate,
audio_sample_rate=24000,
)

View File

@@ -28,7 +28,7 @@ pipe = LTX2AudioVideoPipeline.from_pretrained(
pipe.load_lora(pipe.dit, "models/train/LTX2-T2AV_lora/epoch-4.safetensors") pipe.load_lora(pipe.dit, "models/train/LTX2-T2AV_lora/epoch-4.safetensors")
prompt = "A beautiful sunset over the ocean." prompt = "A beautiful sunset over the ocean."
negative_prompt = "blurry, out of focus, overexposed, underexposed, low contrast, washed out colors, excessive noise, grainy texture, poor lighting, flickering, motion blur, distorted proportions, unnatural skin tones, deformed facial features, asymmetrical face, missing facial features, extra limbs, disfigured hands, wrong hand count, artifacts around text, inconsistent perspective, camera shake, incorrect depth of field, background too sharp, background clutter, distracting reflections, harsh shadows, inconsistent lighting direction, color banding, cartoonish rendering, 3D CGI look, unrealistic materials, uncanny valley effect, incorrect ethnicity, wrong gender, exaggerated expressions, wrong gaze direction, mismatched lip sync, silent or muted audio, distorted voice, robotic voice, echo, background noise, off-sync audio, incorrect dialogue, added dialogue, repetitive speech, jittery movement, awkward pauses, incorrect timing, unnatural transitions, inconsistent framing, tilted camera, flat lighting, inconsistent tone, cinematic oversaturation, stylized filters, or AI artifacts." negative_prompt = "blurry, out of focus, overexposed, underexposed, low contrast, washed out colors, excessive noise, grainy texture, poor lighting, flickering, motion blur, distorted proportions, unnatural skin tones, deformed facial features, asymmetrical face, missing facial features, extra limbs, disfigured hands, wrong hand count, artifacts around text, inconsistent perspective, camera shake, incorrect depth of field, background too sharp, background clutter, distracting reflections, harsh shadows, inconsistent lighting direction, color banding, cartoonish rendering, 3D CGI look, unrealistic materials, uncanny valley effect, incorrect ethnicity, wrong gender, exaggerated expressions, wrong gaze direction, mismatched lip sync, silent or muted audio, distorted voice, robotic voice, echo, background noise, off-sync audio, incorrect dialogue, added dialogue, repetitive speech, jittery movement, awkward pauses, incorrect timing, unnatural transitions, inconsistent framing, tilted camera, flat lighting, inconsistent tone, cinematic oversaturation, stylized filters, or AI artifacts."
height, width, num_frames = 512, 768, 49 height, width, num_frames = 512, 768, 121
video, audio = pipe( video, audio = pipe(
prompt=prompt, prompt=prompt,
negative_prompt=negative_prompt, negative_prompt=negative_prompt,

View File

@@ -28,7 +28,7 @@ pipe = LTX2AudioVideoPipeline.from_pretrained(
pipe.load_lora(pipe.dit, "models/train/LTX2-T2AV-noaudio_lora/epoch-4.safetensors") pipe.load_lora(pipe.dit, "models/train/LTX2-T2AV-noaudio_lora/epoch-4.safetensors")
prompt = "A beautiful sunset over the ocean." prompt = "A beautiful sunset over the ocean."
negative_prompt = "blurry, out of focus, overexposed, underexposed, low contrast, washed out colors, excessive noise, grainy texture, poor lighting, flickering, motion blur, distorted proportions, unnatural skin tones, deformed facial features, asymmetrical face, missing facial features, extra limbs, disfigured hands, wrong hand count, artifacts around text, inconsistent perspective, camera shake, incorrect depth of field, background too sharp, background clutter, distracting reflections, harsh shadows, inconsistent lighting direction, color banding, cartoonish rendering, 3D CGI look, unrealistic materials, uncanny valley effect, incorrect ethnicity, wrong gender, exaggerated expressions, wrong gaze direction, mismatched lip sync, silent or muted audio, distorted voice, robotic voice, echo, background noise, off-sync audio, incorrect dialogue, added dialogue, repetitive speech, jittery movement, awkward pauses, incorrect timing, unnatural transitions, inconsistent framing, tilted camera, flat lighting, inconsistent tone, cinematic oversaturation, stylized filters, or AI artifacts." negative_prompt = "blurry, out of focus, overexposed, underexposed, low contrast, washed out colors, excessive noise, grainy texture, poor lighting, flickering, motion blur, distorted proportions, unnatural skin tones, deformed facial features, asymmetrical face, missing facial features, extra limbs, disfigured hands, wrong hand count, artifacts around text, inconsistent perspective, camera shake, incorrect depth of field, background too sharp, background clutter, distracting reflections, harsh shadows, inconsistent lighting direction, color banding, cartoonish rendering, 3D CGI look, unrealistic materials, uncanny valley effect, incorrect ethnicity, wrong gender, exaggerated expressions, wrong gaze direction, mismatched lip sync, silent or muted audio, distorted voice, robotic voice, echo, background noise, off-sync audio, incorrect dialogue, added dialogue, repetitive speech, jittery movement, awkward pauses, incorrect timing, unnatural transitions, inconsistent framing, tilted camera, flat lighting, inconsistent tone, cinematic oversaturation, stylized filters, or AI artifacts."
height, width, num_frames = 512, 768, 49 height, width, num_frames = 512, 768, 121
video, audio = pipe( video, audio = pipe(
prompt=prompt, prompt=prompt,
negative_prompt=negative_prompt, negative_prompt=negative_prompt,

View File

@@ -18,7 +18,7 @@ lora = ModelConfig(
model_id="lightx2v/Qwen-Image-Edit-2511-Lightning", model_id="lightx2v/Qwen-Image-Edit-2511-Lightning",
origin_file_pattern="Qwen-Image-Edit-2511-Lightning-4steps-V1.0-bf16.safetensors" origin_file_pattern="Qwen-Image-Edit-2511-Lightning-4steps-V1.0-bf16.safetensors"
) )
pipe.load_lora(pipe.dit, lora, alpha=8/64) pipe.load_lora(pipe.dit, lora, alpha=1)
pipe.scheduler = FlowMatchScheduler("Qwen-Image-Lightning") pipe.scheduler = FlowMatchScheduler("Qwen-Image-Lightning")

View File

@@ -28,7 +28,7 @@ lora = ModelConfig(
model_id="lightx2v/Qwen-Image-Edit-2511-Lightning", model_id="lightx2v/Qwen-Image-Edit-2511-Lightning",
origin_file_pattern="Qwen-Image-Edit-2511-Lightning-4steps-V1.0-bf16.safetensors" origin_file_pattern="Qwen-Image-Edit-2511-Lightning-4steps-V1.0-bf16.safetensors"
) )
pipe.load_lora(pipe.dit, lora, alpha=8/64) pipe.load_lora(pipe.dit, lora, alpha=1)
pipe.scheduler = FlowMatchScheduler("Qwen-Image-Lightning") pipe.scheduler = FlowMatchScheduler("Qwen-Image-Lightning")

View File

@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
[project] [project]
name = "diffsynth" name = "diffsynth"
version = "2.0.4" version = "2.0.5"
description = "Enjoy the magic of Diffusion models!" description = "Enjoy the magic of Diffusion models!"
authors = [{name = "ModelScope Team"}] authors = [{name = "ModelScope Team"}]
license = {text = "Apache-2.0"} license = {text = "Apache-2.0"}