Mirror of https://github.com/modelscope/DiffSynth-Studio.git (synced 2026-03-19 14:58:12 +00:00)

Compare commits: cache_lear...docs2.0 (20 commits)
Commits:
b3f6c3275f
29cd5c7612
ff4be1c7c7
6b0fb1601f
4b400c07eb
6a6ae6d791
1a380a6b62
5ca74923e8
8b9a094c1b
5996c2b068
8fc7e005a6
a18966c300
96143aa26b
71cea4371c
fc11fd4297
bd3c5822a1
96fb0f3afe
b68663426f
0e6976a0ae
6383ec358c
@@ -645,6 +645,8 @@ Example code for LTX-2 is available at: [/examples/ltx2/](/examples/ltx2/)
 | Model ID | Extra Args | Inference | Low-VRAM Inference | Full Training | Full Training Validation | LoRA Training | LoRA Training Validation |
 |-|-|-|-|-|-|-|-|
 |[Lightricks/LTX-2: OneStagePipeline-T2AV](https://www.modelscope.cn/models/Lightricks/LTX-2)||[code](/examples/ltx2/model_inference/LTX-2-T2AV-OneStage.py)|[code](/examples/ltx2/model_inference_low_vram/LTX-2-T2AV-OneStage.py)|[code](/examples/ltx2/model_training/full/LTX-2-T2AV-splited.sh)|[code](/examples/ltx2/model_training/validate_full/LTX-2-T2AV.py)|[code](/examples/ltx2/model_training/lora/LTX-2-T2AV-splited.sh)|[code](/examples/ltx2/model_training/validate_lora/LTX-2-T2AV.py)|
+|[Lightricks/LTX-2-19b-IC-LoRA-Union-Control](https://www.modelscope.cn/models/Lightricks/LTX-2-19b-IC-LoRA-Union-Control)|`in_context_videos`,`in_context_downsample_factor`|[code](/examples/ltx2/model_inference/LTX-2-T2AV-IC-LoRA-Union-Control.py)|[code](/examples/ltx2/model_inference_low_vram/LTX-2-T2AV-IC-LoRA-Union-Control.py)|-|-|[code](/examples/ltx2/model_training/lora/LTX-2-T2AV-IC-LoRA-splited.sh)|[code](/examples/ltx2/model_training/validate_lora/LTX-2-T2AV-IC-LoRA.py)|
+|[Lightricks/LTX-2-19b-IC-LoRA-Detailer](https://www.modelscope.cn/models/Lightricks/LTX-2-19b-IC-LoRA-Detailer)|`in_context_videos`,`in_context_downsample_factor`|[code](/examples/ltx2/model_inference/LTX-2-T2AV-IC-LoRA-Detailer.py)|[code](/examples/ltx2/model_inference_low_vram/LTX-2-T2AV-IC-LoRA-Detailer.py)|-|-|[code](/examples/ltx2/model_training/lora/LTX-2-T2AV-IC-LoRA-splited.sh)|[code](/examples/ltx2/model_training/validate_lora/LTX-2-T2AV-IC-LoRA.py)|
 |[Lightricks/LTX-2: TwoStagePipeline-T2AV](https://www.modelscope.cn/models/Lightricks/LTX-2)||[code](/examples/ltx2/model_inference/LTX-2-T2AV-TwoStage.py)|[code](/examples/ltx2/model_inference_low_vram/LTX-2-T2AV-TwoStage.py)|-|-|-|-|
 |[Lightricks/LTX-2: DistilledPipeline-T2AV](https://www.modelscope.cn/models/Lightricks/LTX-2)||[code](/examples/ltx2/model_inference/LTX-2-T2AV-DistilledPipeline.py)|[code](/examples/ltx2/model_inference_low_vram/LTX-2-T2AV-DistilledPipeline.py)|-|-|-|-|
 |[Lightricks/LTX-2: OneStagePipeline-I2AV](https://www.modelscope.cn/models/Lightricks/LTX-2)|`input_images`|[code](/examples/ltx2/model_inference/LTX-2-I2AV-OneStage.py)|[code](/examples/ltx2/model_inference_low_vram/LTX-2-I2AV-OneStage.py)|-|-|-|-|

@@ -645,6 +645,8 @@ LTX-2 的示例代码位于:[/examples/ltx2/](/examples/ltx2/)
 |模型 ID|额外参数|推理|低显存推理|全量训练|全量训练后验证|LoRA 训练|LoRA 训练后验证|
 |-|-|-|-|-|-|-|-|
 |[Lightricks/LTX-2: OneStagePipeline-T2AV](https://www.modelscope.cn/models/Lightricks/LTX-2)||[code](/examples/ltx2/model_inference/LTX-2-T2AV-OneStage.py)|[code](/examples/ltx2/model_inference_low_vram/LTX-2-T2AV-OneStage.py)|[code](/examples/ltx2/model_training/full/LTX-2-T2AV-splited.sh)|[code](/examples/ltx2/model_training/validate_full/LTX-2-T2AV.py)|[code](/examples/ltx2/model_training/lora/LTX-2-T2AV-splited.sh)|[code](/examples/ltx2/model_training/validate_lora/LTX-2-T2AV.py)|
+|[Lightricks/LTX-2-19b-IC-LoRA-Union-Control](https://www.modelscope.cn/models/Lightricks/LTX-2-19b-IC-LoRA-Union-Control)|`in_context_videos`,`in_context_downsample_factor`|[code](/examples/ltx2/model_inference/LTX-2-T2AV-IC-LoRA-Union-Control.py)|[code](/examples/ltx2/model_inference_low_vram/LTX-2-T2AV-IC-LoRA-Union-Control.py)|-|-|[code](/examples/ltx2/model_training/lora/LTX-2-T2AV-IC-LoRA-splited.sh)|[code](/examples/ltx2/model_training/validate_lora/LTX-2-T2AV-IC-LoRA.py)|
+|[Lightricks/LTX-2-19b-IC-LoRA-Detailer](https://www.modelscope.cn/models/Lightricks/LTX-2-19b-IC-LoRA-Detailer)|`in_context_videos`,`in_context_downsample_factor`|[code](/examples/ltx2/model_inference/LTX-2-T2AV-IC-LoRA-Detailer.py)|[code](/examples/ltx2/model_inference_low_vram/LTX-2-T2AV-IC-LoRA-Detailer.py)|-|-|[code](/examples/ltx2/model_training/lora/LTX-2-T2AV-IC-LoRA-splited.sh)|[code](/examples/ltx2/model_training/validate_lora/LTX-2-T2AV-IC-LoRA.py)|
 |[Lightricks/LTX-2: TwoStagePipeline-T2AV](https://www.modelscope.cn/models/Lightricks/LTX-2)||[code](/examples/ltx2/model_inference/LTX-2-T2AV-TwoStage.py)|[code](/examples/ltx2/model_inference_low_vram/LTX-2-T2AV-TwoStage.py)|-|-|-|-|
 |[Lightricks/LTX-2: DistilledPipeline-T2AV](https://www.modelscope.cn/models/Lightricks/LTX-2)||[code](/examples/ltx2/model_inference/LTX-2-T2AV-DistilledPipeline.py)|[code](/examples/ltx2/model_inference_low_vram/LTX-2-T2AV-DistilledPipeline.py)|-|-|-|-|
 |[Lightricks/LTX-2: OneStagePipeline-I2AV](https://www.modelscope.cn/models/Lightricks/LTX-2)|`input_images`|[code](/examples/ltx2/model_inference/LTX-2-I2AV-OneStage.py)|[code](/examples/ltx2/model_inference_low_vram/LTX-2-I2AV-OneStage.py)|-|-|-|-|

@@ -94,20 +94,23 @@ class BasePipeline(torch.nn.Module):
         return self


-    def check_resize_height_width(self, height, width, num_frames=None):
+    def check_resize_height_width(self, height, width, num_frames=None, verbose=1):
         # Shape check
         if height % self.height_division_factor != 0:
             height = (height + self.height_division_factor - 1) // self.height_division_factor * self.height_division_factor
-            print(f"height % {self.height_division_factor} != 0. We round it up to {height}.")
+            if verbose > 0:
+                print(f"height % {self.height_division_factor} != 0. We round it up to {height}.")
         if width % self.width_division_factor != 0:
             width = (width + self.width_division_factor - 1) // self.width_division_factor * self.width_division_factor
-            print(f"width % {self.width_division_factor} != 0. We round it up to {width}.")
+            if verbose > 0:
+                print(f"width % {self.width_division_factor} != 0. We round it up to {width}.")
         if num_frames is None:
             return height, width
         else:
             if num_frames % self.time_division_factor != self.time_division_remainder:
                 num_frames = (num_frames + self.time_division_factor - 1) // self.time_division_factor * self.time_division_factor + self.time_division_remainder
-                print(f"num_frames % {self.time_division_factor} != {self.time_division_remainder}. We round it up to {num_frames}.")
+                if verbose > 0:
+                    print(f"num_frames % {self.time_division_factor} != {self.time_division_remainder}. We round it up to {num_frames}.")
             return height, width, num_frames

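The rounding rule used here is plain ceiling-to-multiple arithmetic; the new verbose flag only silences the warning when a caller (such as the in-context embedder below) rounds deliberately. A minimal, self-contained sketch (hypothetical round_up helper, factor 32 chosen only for illustration):

def round_up(value, factor):
    # Round value up to the nearest multiple of factor, as in check_resize_height_width.
    return (value + factor - 1) // factor * factor

print(round_up(500, 32))  # 512 -- off-sized inputs are padded up, silently when verbose=0
print(round_up(512, 32))  # 512 -- already aligned, unchanged
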
@@ -121,7 +121,9 @@ class TrajectoryImitationLoss(torch.nn.Module):
             progress_id_teacher = torch.argmin((timesteps_teacher - pipe.scheduler.timesteps[progress_id + 1]).abs())
             latents_ = trajectory_teacher[progress_id_teacher]

-            target = (latents_ - inputs_shared["latents"]) / (sigma_ - sigma)
+            denom = sigma_ - sigma
+            denom = torch.sign(denom) * torch.clamp(denom.abs(), min=1e-6)
+            target = (latents_ - inputs_shared["latents"]) / denom
             loss = loss + torch.nn.functional.mse_loss(noise_pred.float(), target.float()) * pipe.scheduler.training_weight(timestep)
         return loss

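The new denominator guard is a sign-preserving clamp away from zero, so the regression target stays finite when two adjacent sigmas nearly coincide. A small illustration with arbitrary values; note that torch.sign(0.) is 0., so an exactly zero gap still maps to zero rather than 1e-6:

import torch

denom = torch.tensor([0.5, -0.5, 1e-8, -1e-8, 0.0])
safe = torch.sign(denom) * torch.clamp(denom.abs(), min=1e-6)
print(safe)  # tensor([ 5.0000e-01, -5.0000e-01,  1.0000e-06, -1.0000e-06,  0.0000e+00])
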
@@ -8,6 +8,7 @@ import torch
 from einops import rearrange
 from .ltx2_common import rms_norm, Modality
 from ..core.attention.attention import attention_forward
+from ..core import gradient_checkpoint_forward


 def get_timestep_embedding(
@@ -1352,28 +1353,21 @@ class LTXModel(torch.nn.Module):
         video: TransformerArgs | None,
         audio: TransformerArgs | None,
         perturbations: BatchedPerturbationConfig,
+        use_gradient_checkpointing: bool = False,
+        use_gradient_checkpointing_offload: bool = False,
     ) -> tuple[TransformerArgs, TransformerArgs]:
         """Process transformer blocks for LTXAV."""

         # Process transformer blocks
         for block in self.transformer_blocks:
-            if self._enable_gradient_checkpointing and self.training:
-                # Use gradient checkpointing to save memory during training.
-                # With use_reentrant=False, we can pass dataclasses directly -
-                # PyTorch will track all tensor leaves in the computation graph.
-                video, audio = torch.utils.checkpoint.checkpoint(
-                    block,
-                    video,
-                    audio,
-                    perturbations,
-                    use_reentrant=False,
-                )
-            else:
-                video, audio = block(
-                    video=video,
-                    audio=audio,
-                    perturbations=perturbations,
-                )
+            video, audio = gradient_checkpoint_forward(
+                block,
+                use_gradient_checkpointing,
+                use_gradient_checkpointing_offload,
+                video=video,
+                audio=audio,
+                perturbations=perturbations,
+            )

         return video, audio

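gradient_checkpoint_forward replaces the hand-rolled torch.utils.checkpoint branch above. Its real implementation lives in the package's core module; the sketch below only mirrors the call signature used here, and the offload branch (keeping checkpointed activations on CPU) is an assumption rather than the library's actual code:

import torch

def gradient_checkpoint_forward(block, use_gradient_checkpointing, use_gradient_checkpointing_offload, **kwargs):
    # Plain call when checkpointing is disabled.
    if not (use_gradient_checkpointing or use_gradient_checkpointing_offload):
        return block(**kwargs)
    if use_gradient_checkpointing_offload:
        # Assumed behaviour: park saved activations in host memory to cut peak VRAM further.
        with torch.autograd.graph.save_on_cpu(pin_memory=True):
            return torch.utils.checkpoint.checkpoint(block, use_reentrant=False, **kwargs)
    return torch.utils.checkpoint.checkpoint(block, use_reentrant=False, **kwargs)
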
@@ -1398,7 +1392,12 @@ class LTXModel(torch.nn.Module):
         return x

     def _forward(
-        self, video: Modality | None, audio: Modality | None, perturbations: BatchedPerturbationConfig
+        self,
+        video: Modality | None,
+        audio: Modality | None,
+        perturbations: BatchedPerturbationConfig,
+        use_gradient_checkpointing: bool = False,
+        use_gradient_checkpointing_offload: bool = False,
     ) -> tuple[torch.Tensor, torch.Tensor]:
         """
         Forward pass for LTX models.
@@ -1417,6 +1416,8 @@ class LTXModel(torch.nn.Module):
             video=video_args,
             audio=audio_args,
             perturbations=perturbations,
+            use_gradient_checkpointing=use_gradient_checkpointing,
+            use_gradient_checkpointing_offload=use_gradient_checkpointing_offload,
         )

         # Process output
@@ -1440,12 +1441,12 @@ class LTXModel(torch.nn.Module):
         )
         return vx, ax

-    def forward(self, video_latents, video_positions, video_context, video_timesteps, audio_latents, audio_positions, audio_context, audio_timesteps):
+    def forward(self, video_latents, video_positions, video_context, video_timesteps, audio_latents, audio_positions, audio_context, audio_timesteps, use_gradient_checkpointing=False, use_gradient_checkpointing_offload=False):
         cross_pe_max_pos = None
         if self.model_type.is_video_enabled() and self.model_type.is_audio_enabled():
             cross_pe_max_pos = max(self.positional_embedding_max_pos[0], self.audio_positional_embedding_max_pos[0])
         self._init_preprocessors(cross_pe_max_pos)
         video = Modality(video_latents, video_timesteps, video_positions, video_context)
         audio = Modality(audio_latents, audio_timesteps, audio_positions, audio_context) if audio_latents is not None else None
-        vx, ax = self._forward(video=video, audio=audio, perturbations=None)
+        vx, ax = self._forward(video=video, audio=audio, perturbations=None, use_gradient_checkpointing=use_gradient_checkpointing, use_gradient_checkpointing_offload=use_gradient_checkpointing_offload)
         return vx, ax

@@ -469,7 +469,7 @@ class Down_ResidualBlock(nn.Module):
     def forward(self, x, feat_cache=None, feat_idx=[0]):
         x_copy = x.clone()
         for module in self.downsamples:
-            x = module(x, feat_cache, feat_idx)
+            x, feat_cache, feat_idx = module(x, feat_cache, feat_idx)

         return x + self.avg_shortcut(x_copy), feat_cache, feat_idx

@@ -506,10 +506,10 @@ class Up_ResidualBlock(nn.Module):
     def forward(self, x, feat_cache=None, feat_idx=[0], first_chunk=False):
         x_main = x.clone()
         for module in self.upsamples:
-            x_main = module(x_main, feat_cache, feat_idx)
+            x_main, feat_cache, feat_idx = module(x_main, feat_cache, feat_idx)
         if self.avg_shortcut is not None:
             x_shortcut = self.avg_shortcut(x, first_chunk)
-            return x_main + x_shortcut
+            return x_main + x_shortcut, feat_cache, feat_idx
         else:
             return x_main, feat_cache, feat_idx

@@ -61,6 +61,7 @@ class LTX2AudioVideoPipeline(BasePipeline):
             LTX2AudioVideoUnit_InputAudioEmbedder(),
             LTX2AudioVideoUnit_InputVideoEmbedder(),
             LTX2AudioVideoUnit_InputImagesEmbedder(),
+            LTX2AudioVideoUnit_InContextVideoEmbedder(),
         ]
         self.model_fn = model_fn_ltx2

@@ -105,6 +106,8 @@ class LTX2AudioVideoPipeline(BasePipeline):

     def stage2_denoise(self, inputs_shared, inputs_posi, inputs_nega, progress_bar_cmd=tqdm):
         if inputs_shared["use_two_stage_pipeline"]:
+            if inputs_shared.get("clear_lora_before_state_two", False):
+                self.clear_lora()
             latent = self.video_vae_encoder.per_channel_statistics.un_normalize(inputs_shared["video_latents"])
             self.load_models_to_device('upsampler',)
             latent = self.upsampler(latent)
@@ -112,11 +115,17 @@ class LTX2AudioVideoPipeline(BasePipeline):
             self.scheduler.set_timesteps(special_case="stage2")
             inputs_shared.update({k.replace("stage2_", ""): v for k, v in inputs_shared.items() if k.startswith("stage2_")})
             denoise_mask_video = 1.0
+            # input image
             if inputs_shared.get("input_images", None) is not None:
                 latent, denoise_mask_video, initial_latents = self.apply_input_images_to_latents(
                     latent, inputs_shared.pop("input_latents"), inputs_shared["input_images_indexes"],
                     inputs_shared["input_images_strength"], latent.clone())
                 inputs_shared.update({"input_latents_video": initial_latents, "denoise_mask_video": denoise_mask_video})
+            # remove in-context video control in stage 2
+            inputs_shared.pop("in_context_video_latents", None)
+            inputs_shared.pop("in_context_video_positions", None)
+
+            # initialize latents for stage 2
             inputs_shared["video_latents"] = self.scheduler.sigmas[0] * denoise_mask_video * inputs_shared[
                 "video_noise"] + (1 - self.scheduler.sigmas[0] * denoise_mask_video) * latent
             inputs_shared["audio_latents"] = self.scheduler.sigmas[0] * inputs_shared["audio_noise"] + (
@@ -145,11 +154,14 @@ class LTX2AudioVideoPipeline(BasePipeline):
         # Prompt
         prompt: str,
         negative_prompt: Optional[str] = "",
-        # Image-to-video
         denoising_strength: float = 1.0,
+        # Image-to-video
         input_images: Optional[list[Image.Image]] = None,
         input_images_indexes: Optional[list[int]] = None,
         input_images_strength: Optional[float] = 1.0,
+        # In-Context Video Control
+        in_context_videos: Optional[list[list[Image.Image]]] = None,
+        in_context_downsample_factor: Optional[int] = 2,
         # Randomness
         seed: Optional[int] = None,
         rand_device: Optional[str] = "cpu",
@@ -157,6 +169,7 @@ class LTX2AudioVideoPipeline(BasePipeline):
         height: Optional[int] = 512,
         width: Optional[int] = 768,
         num_frames=121,
+        frame_rate=24,
         # Classifier-free guidance
         cfg_scale: Optional[float] = 3.0,
         # Scheduler
@@ -169,6 +182,7 @@ class LTX2AudioVideoPipeline(BasePipeline):
         tile_overlap_in_frames: Optional[int] = 24,
         # Special Pipelines
         use_two_stage_pipeline: Optional[bool] = False,
+        clear_lora_before_state_two: Optional[bool] = False,
         use_distilled_pipeline: Optional[bool] = False,
         # progress_bar
         progress_bar_cmd=tqdm,
@@ -185,12 +199,13 @@ class LTX2AudioVideoPipeline(BasePipeline):
         }
         inputs_shared = {
             "input_images": input_images, "input_images_indexes": input_images_indexes, "input_images_strength": input_images_strength,
+            "in_context_videos": in_context_videos, "in_context_downsample_factor": in_context_downsample_factor,
             "seed": seed, "rand_device": rand_device,
-            "height": height, "width": width, "num_frames": num_frames,
+            "height": height, "width": width, "num_frames": num_frames, "frame_rate": frame_rate,
             "cfg_scale": cfg_scale,
             "tiled": tiled, "tile_size_in_pixels": tile_size_in_pixels, "tile_overlap_in_pixels": tile_overlap_in_pixels,
             "tile_size_in_frames": tile_size_in_frames, "tile_overlap_in_frames": tile_overlap_in_frames,
-            "use_two_stage_pipeline": use_two_stage_pipeline, "use_distilled_pipeline": use_distilled_pipeline,
+            "use_two_stage_pipeline": use_two_stage_pipeline, "use_distilled_pipeline": use_distilled_pipeline, "clear_lora_before_state_two": clear_lora_before_state_two,
             "video_patchifier": self.video_patchifier, "audio_patchifier": self.audio_patchifier,
         }
         for unit in self.units:
@@ -417,8 +432,8 @@ class LTX2AudioVideoUnit_PromptEmbedder(PipelineUnit):
 class LTX2AudioVideoUnit_NoiseInitializer(PipelineUnit):
     def __init__(self):
         super().__init__(
-            input_params=("height", "width", "num_frames", "seed", "rand_device", "use_two_stage_pipeline"),
-            output_params=("video_noise", "audio_noise",),
+            input_params=("height", "width", "num_frames", "seed", "rand_device", "frame_rate", "use_two_stage_pipeline"),
+            output_params=("video_noise", "audio_noise", "video_positions", "audio_positions", "video_latent_shape", "audio_latent_shape")
         )

     def process_stage(self, pipe: LTX2AudioVideoPipeline, height, width, num_frames, seed, rand_device, frame_rate=24.0):
@@ -471,7 +486,6 @@ class LTX2AudioVideoUnit_InputVideoEmbedder(PipelineUnit):
         if pipe.scheduler.training:
             return {"video_latents": input_latents, "input_latents": input_latents}
         else:
-            # TODO: implement video-to-video
             raise NotImplementedError("Video-to-video not implemented yet.")

 class LTX2AudioVideoUnit_InputAudioEmbedder(PipelineUnit):
@@ -495,14 +509,13 @@ class LTX2AudioVideoUnit_InputAudioEmbedder(PipelineUnit):
         if pipe.scheduler.training:
             return {"audio_latents": audio_input_latents, "audio_input_latents": audio_input_latents, "audio_positions": audio_positions, "audio_latent_shape": audio_latent_shape}
         else:
-            # TODO: implement video-to-video
-            raise NotImplementedError("Video-to-video not implemented yet.")
+            raise NotImplementedError("Audio-to-video not supported.")


 class LTX2AudioVideoUnit_InputImagesEmbedder(PipelineUnit):
     def __init__(self):
         super().__init__(
             input_params=("input_images", "input_images_indexes", "input_images_strength", "video_latents", "height", "width", "num_frames", "tiled", "tile_size_in_pixels", "tile_overlap_in_pixels", "use_two_stage_pipeline"),
-            output_params=("video_latents"),
+            output_params=("video_latents", "denoise_mask_video", "input_latents_video", "stage2_input_latents"),
             onload_model_names=("video_vae_encoder")
         )
@@ -537,6 +550,54 @@ class LTX2AudioVideoUnit_InputImagesEmbedder(PipelineUnit):
         return output_dicts


+class LTX2AudioVideoUnit_InContextVideoEmbedder(PipelineUnit):
+    def __init__(self):
+        super().__init__(
+            input_params=("in_context_videos", "height", "width", "num_frames", "frame_rate", "in_context_downsample_factor", "tiled", "tile_size_in_pixels", "tile_overlap_in_pixels", "use_two_stage_pipeline"),
+            output_params=("in_context_video_latents", "in_context_video_positions"),
+            onload_model_names=("video_vae_encoder")
+        )
+
+    def check_in_context_video(self, pipe, in_context_video, height, width, num_frames, in_context_downsample_factor, use_two_stage_pipeline=True):
+        if in_context_video is None or len(in_context_video) == 0:
+            raise ValueError("In-context video is None or empty.")
+        in_context_video = in_context_video[:num_frames]
+        expected_height = height // in_context_downsample_factor // 2 if use_two_stage_pipeline else height // in_context_downsample_factor
+        expected_width = width // in_context_downsample_factor // 2 if use_two_stage_pipeline else width // in_context_downsample_factor
+        current_h, current_w, current_f = in_context_video[0].size[1], in_context_video[0].size[0], len(in_context_video)
+        h, w, f = pipe.check_resize_height_width(expected_height, expected_width, current_f, verbose=0)
+        if current_h != h or current_w != w:
+            in_context_video = [img.resize((w, h)) for img in in_context_video]
+        if current_f != f:
+            # pad black frames at the end
+            in_context_video = in_context_video + [Image.new("RGB", (w, h), (0, 0, 0))] * (f - current_f)
+        return in_context_video
+
+    def process(self, pipe: LTX2AudioVideoPipeline, in_context_videos, height, width, num_frames, frame_rate, in_context_downsample_factor, tiled, tile_size_in_pixels, tile_overlap_in_pixels, use_two_stage_pipeline=True):
+        if in_context_videos is None or len(in_context_videos) == 0:
+            return {}
+        else:
+            pipe.load_models_to_device(self.onload_model_names)
+            latents, positions = [], []
+            for in_context_video in in_context_videos:
+                in_context_video = self.check_in_context_video(pipe, in_context_video, height, width, num_frames, in_context_downsample_factor, use_two_stage_pipeline)
+                in_context_video = pipe.preprocess_video(in_context_video)
+                in_context_latents = pipe.video_vae_encoder.encode(in_context_video, tiled, tile_size_in_pixels, tile_overlap_in_pixels).to(dtype=pipe.torch_dtype, device=pipe.device)
+
+                latent_coords = pipe.video_patchifier.get_patch_grid_bounds(output_shape=VideoLatentShape.from_torch_shape(in_context_latents.shape), device=pipe.device)
+                video_positions = get_pixel_coords(latent_coords, VIDEO_SCALE_FACTORS, True).float()
+                video_positions[:, 0, ...] = video_positions[:, 0, ...] / frame_rate
+                video_positions[:, 1, ...] *= in_context_downsample_factor  # height axis
+                video_positions[:, 2, ...] *= in_context_downsample_factor  # width axis
+                video_positions = video_positions.to(pipe.torch_dtype)
+
+                latents.append(in_context_latents)
+                positions.append(video_positions)
+            latents = torch.cat(latents, dim=1)
+            positions = torch.cat(positions, dim=1)
+            return {"in_context_video_latents": latents, "in_context_video_positions": positions}
+
+
 def model_fn_ltx2(
     dit: LTXModel,
     video_latents=None,
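check_in_context_video expects the reference frames at the stage-1 working resolution: the target size divided by in_context_downsample_factor, halved again when the two-stage pipeline is used (stage 1 renders at half resolution before the x2 spatial upsampler). A quick worked example, reusing the 1024x1536 target from the example script further below:

height, width = 512 * 2, 768 * 2          # final target resolution
in_context_downsample_factor = 1
use_two_stage_pipeline = True

expected_height = height // in_context_downsample_factor // 2 if use_two_stage_pipeline else height // in_context_downsample_factor
expected_width = width // in_context_downsample_factor // 2 if use_two_stage_pipeline else width // in_context_downsample_factor
print(expected_height, expected_width)    # 512 768 -- the size the reference video is resized to
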
@@ -549,6 +610,8 @@ def model_fn_ltx2(
     audio_patchifier=None,
     timestep=None,
     denoise_mask_video=None,
+    in_context_video_latents=None,
+    in_context_video_positions=None,
     use_gradient_checkpointing=False,
     use_gradient_checkpointing_offload=False,
     **kwargs,
@@ -558,16 +621,25 @@ def model_fn_ltx2(
     # patchify
     b, c_v, f, h, w = video_latents.shape
     video_latents = video_patchifier.patchify(video_latents)
+    seq_len_video = video_latents.shape[1]
     video_timesteps = timestep.repeat(1, video_latents.shape[1], 1)
     if denoise_mask_video is not None:
         video_timesteps = video_patchifier.patchify(denoise_mask_video) * video_timesteps
+
+    if in_context_video_latents is not None:
+        in_context_video_latents = video_patchifier.patchify(in_context_video_latents)
+        in_context_video_timesteps = timestep.repeat(1, in_context_video_latents.shape[1], 1) * 0.
+        video_latents = torch.cat([video_latents, in_context_video_latents], dim=1)
+        video_positions = torch.cat([video_positions, in_context_video_positions], dim=2)
+        video_timesteps = torch.cat([video_timesteps, in_context_video_timesteps], dim=1)
+
     if audio_latents is not None:
         _, c_a, _, mel_bins = audio_latents.shape
         audio_latents = audio_patchifier.patchify(audio_latents)
         audio_timesteps = timestep.repeat(1, audio_latents.shape[1], 1)
     else:
         audio_timesteps = None
-    #TODO: support gradient checkpointing in training
+
     vx, ax = dit(
         video_latents=video_latents,
         video_positions=video_positions,
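The in-context branch above appears to implement reference-video conditioning: the extra latents are appended along the token dimension with their per-token timestep forced to zero, so the DiT sees them as already-clean context rather than content to denoise, while seq_len_video records where the conditioning tokens begin so they can be sliced off the prediction afterwards (see the vx slice in the next hunk). A small illustration of the timestep trick, with made-up sizes:

import torch

timestep = torch.full((1, 1, 1), 0.7)
video_timesteps = timestep.repeat(1, 4, 1)            # 4 generated-video tokens at t=0.7
in_context_timesteps = timestep.repeat(1, 2, 1) * 0.  # 2 in-context tokens pinned at t=0
all_timesteps = torch.cat([video_timesteps, in_context_timesteps], dim=1)
print(all_timesteps.squeeze(-1))  # tensor([[0.7000, 0.7000, 0.7000, 0.7000, 0.0000, 0.0000]])
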
@@ -577,7 +649,11 @@ def model_fn_ltx2(
         audio_positions=audio_positions,
         audio_context=audio_context,
         audio_timesteps=audio_timesteps,
+        use_gradient_checkpointing=use_gradient_checkpointing,
+        use_gradient_checkpointing_offload=use_gradient_checkpointing_offload,
     )

+    vx = vx[:, :seq_len_video, ...]
     # unpatchify
     vx = video_patchifier.unpatchify_video(vx, f, h, w)
     ax = audio_patchifier.unpatchify_audio(ax, c_a, mel_bins) if ax is not None else None

@@ -299,7 +299,7 @@ class ZImageUnit_PromptEmbedder(PipelineUnit):

     def process(self, pipe: ZImagePipeline, prompt, edit_image):
         pipe.load_models_to_device(self.onload_model_names)
-        if hasattr(pipe, "dit") and pipe.dit.siglip_embedder is not None:
+        if hasattr(pipe, "dit") and pipe.dit is not None and pipe.dit.siglip_embedder is not None:
             # Z-Image-Turbo and Z-Image-Omni-Base use different prompt encoding methods.
             # We determine which encoding method to use based on the model architecture.
             # If you are using two-stage split training,

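A toy illustration (hypothetical Pipe class) of why the extra None check matters: hasattr is true for attributes that exist but are unset, so the old condition would dereference pipe.dit and crash when the DiT is not loaded.

class Pipe:
    dit = None  # attribute exists, but no model is loaded

pipe = Pipe()
print(hasattr(pipe, "dit"))                            # True
print(hasattr(pipe, "dit") and pipe.dit is not None)   # False -- the guarded form added above
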
@@ -116,7 +116,7 @@ class VideoData:
         if self.height is not None and self.width is not None:
             return self.height, self.width
         else:
-            height, width, _ = self.__getitem__(0).shape
+            width, height = self.__getitem__(0).size
             return height, width

     def __getitem__(self, item):

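Why this fix matters: a PIL image exposes (width, height) via .size, whereas a numpy frame exposes (height, width, channels) via .shape, so unpacking .shape from a PIL frame would fail or swap the axes. A minimal check:

from PIL import Image
import numpy as np

frame = Image.new("RGB", (768, 512))   # width=768, height=512
print(frame.size)                      # (768, 512)
print(np.asarray(frame).shape)         # (512, 768, 3)
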
@@ -112,6 +112,8 @@ write_video_audio_ltx2(
 |Model ID|Additional Parameters|Inference|Low VRAM Inference|Full Training|Validation After Full Training|LoRA Training|Validation After LoRA Training|
 |-|-|-|-|-|-|-|-|
 |[Lightricks/LTX-2: OneStagePipeline-T2AV](https://www.modelscope.cn/models/Lightricks/LTX-2)||[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/ltx2/model_inference/LTX-2-T2AV-OneStage.py)|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/ltx2/model_inference_low_vram/LTX-2-T2AV-OneStage.py)|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/ltx2/model_training/full/LTX-2-T2AV-splited.sh)|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/ltx2/model_training/validate_full/LTX-2-T2AV.py)|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/ltx2/model_training/lora/LTX-2-T2AV-splited.sh)|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/ltx2/model_training/validate_lora/LTX-2-T2AV.py)|
+|[Lightricks/LTX-2-19b-IC-LoRA-Union-Control](https://www.modelscope.cn/models/Lightricks/LTX-2-19b-IC-LoRA-Union-Control)|`in_context_videos`,`in_context_downsample_factor`|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/ltx2/model_inference/LTX-2-T2AV-IC-LoRA-Union-Control.py)|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/ltx2/model_inference_low_vram/LTX-2-T2AV-IC-LoRA-Union-Control.py)|-|-|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/ltx2/model_training/lora/LTX-2-T2AV-IC-LoRA-splited.sh)|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/ltx2/model_training/validate_lora/LTX-2-T2AV-IC-LoRA.py)|
+|[Lightricks/LTX-2-19b-IC-LoRA-Detailer](https://www.modelscope.cn/models/Lightricks/LTX-2-19b-IC-LoRA-Detailer)|`in_context_videos`,`in_context_downsample_factor`|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/ltx2/model_inference/LTX-2-T2AV-IC-LoRA-Detailer.py)|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/ltx2/model_inference_low_vram/LTX-2-T2AV-IC-LoRA-Detailer.py)|-|-|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/ltx2/model_training/lora/LTX-2-T2AV-IC-LoRA-splited.sh)|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/ltx2/model_training/validate_lora/LTX-2-T2AV-IC-LoRA.py)|
 |[Lightricks/LTX-2: TwoStagePipeline-T2AV](https://www.modelscope.cn/models/Lightricks/LTX-2)||[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/ltx2/model_inference/LTX-2-T2AV-TwoStage.py)|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/ltx2/model_inference_low_vram/LTX-2-T2AV-TwoStage.py)|-|-|-|-|
 |[Lightricks/LTX-2: DistilledPipeline-T2AV](https://www.modelscope.cn/models/Lightricks/LTX-2)||[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/ltx2/model_inference/LTX-2-T2AV-DistilledPipeline.py)|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/ltx2/model_inference_low_vram/LTX-2-T2AV-DistilledPipeline.py)|-|-|-|-|
 |[Lightricks/LTX-2: OneStagePipeline-I2AV](https://www.modelscope.cn/models/Lightricks/LTX-2)|`input_images`|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/ltx2/model_inference/LTX-2-I2AV-OneStage.py)|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/ltx2/model_inference_low_vram/LTX-2-I2AV-OneStage.py)|-|-|-|-|

@@ -91,3 +91,4 @@ Set 0 or not set: indicates not enabling the binding function
 |----------------|---------------------------|-------------------|
 | Wan 14B series | --initialize_model_on_cpu | The 14B model needs to be initialized on the CPU |
 | Qwen-Image series | --initialize_model_on_cpu | The model needs to be initialized on the CPU |
+| Z-Image series | --enable_npu_patch | Using NPU fusion operator to replace the corresponding operator in Z-image model to improve the performance of the model on NPU |
@@ -37,9 +37,9 @@ pip install torch torchvision --index-url https://download.pytorch.org/whl/rocm6
 git clone https://github.com/modelscope/DiffSynth-Studio.git
 cd DiffSynth-Studio
 # aarch64/ARM
-pip install -e .[npu_aarch64] --extra-index-url "https://download.pytorch.org/whl/cpu"
+pip install -e .[npu_aarch64]
 # x86
-pip install -e .[npu]
+pip install -e .[npu] --extra-index-url "https://download.pytorch.org/whl/cpu"

 When using Ascend NPU, please replace `"cuda"` with `"npu"` in your Python code. For details, see [NPU Support](../Pipeline_Usage/GPU_support.md#ascend-npu).

@@ -27,6 +27,7 @@ Welcome to DiffSynth-Studio's Documentation
    Model_Details/Qwen-Image
    Model_Details/FLUX2
    Model_Details/Z-Image
+   Model_Details/LTX-2

 .. toctree::
    :maxdepth: 2
@@ -112,6 +112,8 @@ write_video_audio_ltx2(
 |模型 ID|额外参数|推理|低显存推理|全量训练|全量训练后验证|LoRA 训练|LoRA 训练后验证|
 |-|-|-|-|-|-|-|-|
 |[Lightricks/LTX-2: OneStagePipeline-T2AV](https://www.modelscope.cn/models/Lightricks/LTX-2)||[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/ltx2/model_inference/LTX-2-T2AV-OneStage.py)|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/ltx2/model_inference_low_vram/LTX-2-T2AV-OneStage.py)|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/ltx2/model_training/full/LTX-2-T2AV-splited.sh)|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/ltx2/model_training/validate_full/LTX-2-T2AV.py)|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/ltx2/model_training/lora/LTX-2-T2AV-splited.sh)|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/ltx2/model_training/validate_lora/LTX-2-T2AV.py)|
+|[Lightricks/LTX-2-19b-IC-LoRA-Union-Control](https://www.modelscope.cn/models/Lightricks/LTX-2-19b-IC-LoRA-Union-Control)|`in_context_videos`,`in_context_downsample_factor`|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/ltx2/model_inference/LTX-2-T2AV-IC-LoRA-Union-Control.py)|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/ltx2/model_inference_low_vram/LTX-2-T2AV-IC-LoRA-Union-Control.py)|-|-|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/ltx2/model_training/lora/LTX-2-T2AV-IC-LoRA-splited.sh)|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/ltx2/model_training/validate_lora/LTX-2-T2AV-IC-LoRA.py)|
+|[Lightricks/LTX-2-19b-IC-LoRA-Detailer](https://www.modelscope.cn/models/Lightricks/LTX-2-19b-IC-LoRA-Detailer)|`in_context_videos`,`in_context_downsample_factor`|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/ltx2/model_inference/LTX-2-T2AV-IC-LoRA-Detailer.py)|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/ltx2/model_inference_low_vram/LTX-2-T2AV-IC-LoRA-Detailer.py)|-|-|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/ltx2/model_training/lora/LTX-2-T2AV-IC-LoRA-splited.sh)|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/ltx2/model_training/validate_lora/LTX-2-T2AV-IC-LoRA.py)|
 |[Lightricks/LTX-2: TwoStagePipeline-T2AV](https://www.modelscope.cn/models/Lightricks/LTX-2)||[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/ltx2/model_inference/LTX-2-T2AV-TwoStage.py)|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/ltx2/model_inference_low_vram/LTX-2-T2AV-TwoStage.py)|-|-|-|-|
 |[Lightricks/LTX-2: DistilledPipeline-T2AV](https://www.modelscope.cn/models/Lightricks/LTX-2)||[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/ltx2/model_inference/LTX-2-T2AV-DistilledPipeline.py)|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/ltx2/model_inference_low_vram/LTX-2-T2AV-DistilledPipeline.py)|-|-|-|-|
 |[Lightricks/LTX-2: OneStagePipeline-I2AV](https://www.modelscope.cn/models/Lightricks/LTX-2)|`input_images`|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/ltx2/model_inference/LTX-2-I2AV-OneStage.py)|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/ltx2/model_inference_low_vram/LTX-2-I2AV-OneStage.py)|-|-|-|-|

@@ -90,3 +90,4 @@ export CPU_AFFINITY_CONF=1
 |-----------|------|-------------------|
 | Wan 14B系列 | --initialize_model_on_cpu | 14B模型需要在cpu上进行初始化 |
 | Qwen-Image系列 | --initialize_model_on_cpu | 模型需要在cpu上进行初始化 |
+| Z-Image 系列 | --enable_npu_patch | 使用NPU融合算子来替换Z-image模型中的对应算子以提升模型在NPU上的性能 |
@@ -37,9 +37,9 @@ pip install torch torchvision --index-url https://download.pytorch.org/whl/rocm6
 git clone https://github.com/modelscope/DiffSynth-Studio.git
 cd DiffSynth-Studio
 # aarch64/ARM
-pip install -e .[npu_aarch64] --extra-index-url "https://download.pytorch.org/whl/cpu"
+pip install -e .[npu_aarch64]
 # x86
-pip install -e .[npu]
+pip install -e .[npu] --extra-index-url "https://download.pytorch.org/whl/cpu"

 使用 Ascend NPU 时,请将 Python 代码中的 `"cuda"` 改为 `"npu"`,详见[NPU 支持](../Pipeline_Usage/GPU_support.md#ascend-npu)。

@@ -27,6 +27,7 @@
    Model_Details/Qwen-Image
    Model_Details/FLUX2
    Model_Details/Z-Image
+   Model_Details/LTX-2

 .. toctree::
    :maxdepth: 2
@@ -46,7 +46,6 @@ negative_prompt = (
     "inconsistent tone, cinematic oversaturation, stylized filters, or AI artifacts."
 )
 height, width, num_frames = 512 * 2, 768 * 2, 121
-height, width, num_frames = 512 * 2, 768 * 2, 121
 dataset_snapshot_download(
     dataset_id="DiffSynth-Studio/examples_in_diffsynth",
     local_dir="./",
examples/ltx2/model_inference/LTX-2-T2AV-IC-LoRA-Detailer.py (new file, 77 lines)
@@ -0,0 +1,77 @@
|
|||||||
|
import torch
|
||||||
|
from diffsynth.pipelines.ltx2_audio_video import LTX2AudioVideoPipeline, ModelConfig
from diffsynth.utils.data.media_io_ltx2 import write_video_audio_ltx2
from diffsynth.utils.data import VideoData
from modelscope import dataset_snapshot_download

vram_config = {
    "offload_dtype": torch.bfloat16,
    "offload_device": "cpu",
    "onload_dtype": torch.bfloat16,
    "onload_device": "cuda",
    "preparing_dtype": torch.bfloat16,
    "preparing_device": "cuda",
    "computation_dtype": torch.bfloat16,
    "computation_device": "cuda",
}
pipe = LTX2AudioVideoPipeline.from_pretrained(
    torch_dtype=torch.bfloat16,
    device="cuda",
    model_configs=[
        ModelConfig(model_id="google/gemma-3-12b-it-qat-q4_0-unquantized", origin_file_pattern="model-*.safetensors", **vram_config),
        ModelConfig(model_id="DiffSynth-Studio/LTX-2-Repackage", origin_file_pattern="transformer.safetensors", **vram_config),
        ModelConfig(model_id="DiffSynth-Studio/LTX-2-Repackage", origin_file_pattern="text_encoder_post_modules.safetensors", **vram_config),
        ModelConfig(model_id="DiffSynth-Studio/LTX-2-Repackage", origin_file_pattern="video_vae_decoder.safetensors", **vram_config),
        ModelConfig(model_id="DiffSynth-Studio/LTX-2-Repackage", origin_file_pattern="audio_vae_decoder.safetensors", **vram_config),
        ModelConfig(model_id="DiffSynth-Studio/LTX-2-Repackage", origin_file_pattern="audio_vocoder.safetensors", **vram_config),
        ModelConfig(model_id="DiffSynth-Studio/LTX-2-Repackage", origin_file_pattern="video_vae_encoder.safetensors", **vram_config),
        ModelConfig(model_id="Lightricks/LTX-2", origin_file_pattern="ltx-2-spatial-upscaler-x2-1.0.safetensors", **vram_config),
    ],
    tokenizer_config=ModelConfig(model_id="google/gemma-3-12b-it-qat-q4_0-unquantized"),
    stage2_lora_config=ModelConfig(model_id="Lightricks/LTX-2", origin_file_pattern="ltx-2-19b-distilled-lora-384.safetensors"),
)
pipe.load_lora(pipe.dit, ModelConfig(model_id="Lightricks/LTX-2-19b-IC-LoRA-Detailer", origin_file_pattern="ltx-2-19b-ic-lora-detailer.safetensors"))
dataset_snapshot_download("DiffSynth-Studio/example_video_dataset", allow_file_pattern="ltx2/*", local_dir="data/example_video_dataset")

prompt = "[VISUAL]:Two cute orange cats, wearing boxing gloves, stand on a boxing ring and fight each other. [SOUNDS]:the sound of two cats boxing"
negative_prompt = (
    "blurry, out of focus, overexposed, underexposed, low contrast, washed out colors, excessive noise, "
    "grainy texture, poor lighting, flickering, motion blur, distorted proportions, unnatural skin tones, "
    "deformed facial features, asymmetrical face, missing facial features, extra limbs, disfigured hands, "
    "wrong hand count, artifacts around text, inconsistent perspective, camera shake, incorrect depth of "
    "field, background too sharp, background clutter, distracting reflections, harsh shadows, inconsistent "
    "lighting direction, color banding, cartoonish rendering, 3D CGI look, unrealistic materials, uncanny "
    "valley effect, incorrect ethnicity, wrong gender, exaggerated expressions, wrong gaze direction, "
    "mismatched lip sync, silent or muted audio, distorted voice, robotic voice, echo, background noise, "
    "off-sync audio, incorrect dialogue, added dialogue, repetitive speech, jittery movement, awkward "
    "pauses, incorrect timing, unnatural transitions, inconsistent framing, tilted camera, flat lighting, "
    "inconsistent tone, cinematic oversaturation, stylized filters, or AI artifacts."
)
height, width, num_frames = 512 * 2, 768 * 2, 121
ref_scale_factor = 1
frame_rate = 24
# The frame rate should ideally match that of the reference video.
# The reference video's spatial resolution should be the stage-1 generation resolution divided by ref_scale_factor.
input_video = VideoData("data/example_video_dataset/ltx2/video1.mp4", height=height // ref_scale_factor // 2, width=width // ref_scale_factor // 2)
input_video = input_video.raw_data()
video, audio = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    seed=43,
    height=height,
    width=width,
    num_frames=num_frames,
    frame_rate=frame_rate,
    in_context_videos=[input_video],
    in_context_downsample_factor=ref_scale_factor,
    tiled=True,
    use_two_stage_pipeline=True,
    clear_lora_before_state_two=True,
)
write_video_audio_ltx2(
    video=video,
    audio=audio,
    output_path='ltx2_twostage_iclora.mp4',
    fps=frame_rate,
    audio_sample_rate=24000,
)
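Editor's note, not part of the diff: in these two-stage IC-LoRA scripts the reference video is loaded at the stage-1 generation size (half of the final output, as the x2 spatial upscaler doubles it in stage 2) divided by `ref_scale_factor`. A minimal sketch of that arithmetic, using the values from the Detailer script above:

```python
# Illustrative arithmetic only; values are copied from the script above.
final_height, final_width = 512 * 2, 768 * 2                          # final output after the x2 spatial upscaler
stage1_height, stage1_width = final_height // 2, final_width // 2     # stage-1 generation size (512 x 768)
ref_scale_factor = 1                                                  # Detailer example: reference at stage-1 scale
ref_height = stage1_height // ref_scale_factor                        # 512 -> height used to load video1.mp4
ref_width = stage1_width // ref_scale_factor                          # 768 -> width used to load video1.mp4
print((ref_height, ref_width))                                        # (512, 768)
```

In the Union-Control scripts below, `ref_scale_factor = 2`, so the control video (depth_video.mp4) is loaded at 256 x 384 instead.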
@@ -0,0 +1,77 @@
import torch
from diffsynth.pipelines.ltx2_audio_video import LTX2AudioVideoPipeline, ModelConfig
from diffsynth.utils.data.media_io_ltx2 import write_video_audio_ltx2
from diffsynth.utils.data import VideoData
from modelscope import dataset_snapshot_download

vram_config = {
    "offload_dtype": torch.bfloat16,
    "offload_device": "cpu",
    "onload_dtype": torch.bfloat16,
    "onload_device": "cuda",
    "preparing_dtype": torch.bfloat16,
    "preparing_device": "cuda",
    "computation_dtype": torch.bfloat16,
    "computation_device": "cuda",
}
pipe = LTX2AudioVideoPipeline.from_pretrained(
    torch_dtype=torch.bfloat16,
    device="cuda",
    model_configs=[
        ModelConfig(model_id="google/gemma-3-12b-it-qat-q4_0-unquantized", origin_file_pattern="model-*.safetensors", **vram_config),
        ModelConfig(model_id="DiffSynth-Studio/LTX-2-Repackage", origin_file_pattern="transformer.safetensors", **vram_config),
        ModelConfig(model_id="DiffSynth-Studio/LTX-2-Repackage", origin_file_pattern="text_encoder_post_modules.safetensors", **vram_config),
        ModelConfig(model_id="DiffSynth-Studio/LTX-2-Repackage", origin_file_pattern="video_vae_decoder.safetensors", **vram_config),
        ModelConfig(model_id="DiffSynth-Studio/LTX-2-Repackage", origin_file_pattern="audio_vae_decoder.safetensors", **vram_config),
        ModelConfig(model_id="DiffSynth-Studio/LTX-2-Repackage", origin_file_pattern="audio_vocoder.safetensors", **vram_config),
        ModelConfig(model_id="DiffSynth-Studio/LTX-2-Repackage", origin_file_pattern="video_vae_encoder.safetensors", **vram_config),
        ModelConfig(model_id="Lightricks/LTX-2", origin_file_pattern="ltx-2-spatial-upscaler-x2-1.0.safetensors", **vram_config),
    ],
    tokenizer_config=ModelConfig(model_id="google/gemma-3-12b-it-qat-q4_0-unquantized"),
    stage2_lora_config=ModelConfig(model_id="Lightricks/LTX-2", origin_file_pattern="ltx-2-19b-distilled-lora-384.safetensors"),
)
pipe.load_lora(pipe.dit, ModelConfig(model_id="Lightricks/LTX-2-19b-IC-LoRA-Union-Control", origin_file_pattern="ltx-2-19b-ic-lora-union-control-ref0.5.safetensors"))
dataset_snapshot_download("DiffSynth-Studio/example_video_dataset", allow_file_pattern="ltx2/*", local_dir="data/example_video_dataset")

prompt = "[VISUAL]:Two cute orange cats, wearing boxing gloves, stand on a boxing ring and fight each other. [SOUNDS]:the sound of two cats boxing"
negative_prompt = (
    "blurry, out of focus, overexposed, underexposed, low contrast, washed out colors, excessive noise, "
    "grainy texture, poor lighting, flickering, motion blur, distorted proportions, unnatural skin tones, "
    "deformed facial features, asymmetrical face, missing facial features, extra limbs, disfigured hands, "
    "wrong hand count, artifacts around text, inconsistent perspective, camera shake, incorrect depth of "
    "field, background too sharp, background clutter, distracting reflections, harsh shadows, inconsistent "
    "lighting direction, color banding, cartoonish rendering, 3D CGI look, unrealistic materials, uncanny "
    "valley effect, incorrect ethnicity, wrong gender, exaggerated expressions, wrong gaze direction, "
    "mismatched lip sync, silent or muted audio, distorted voice, robotic voice, echo, background noise, "
    "off-sync audio, incorrect dialogue, added dialogue, repetitive speech, jittery movement, awkward "
    "pauses, incorrect timing, unnatural transitions, inconsistent framing, tilted camera, flat lighting, "
    "inconsistent tone, cinematic oversaturation, stylized filters, or AI artifacts."
)
height, width, num_frames = 512 * 2, 768 * 2, 121
ref_scale_factor = 2
frame_rate = 24
# The frame rate should ideally match that of the reference video.
# The reference video's spatial resolution should be the stage-1 generation resolution divided by ref_scale_factor.
input_video = VideoData("data/example_video_dataset/ltx2/depth_video.mp4", height=height // ref_scale_factor // 2, width=width // ref_scale_factor // 2)
input_video = input_video.raw_data()
video, audio = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    seed=43,
    height=height,
    width=width,
    num_frames=num_frames,
    frame_rate=frame_rate,
    in_context_videos=[input_video],
    in_context_downsample_factor=ref_scale_factor,
    tiled=True,
    use_two_stage_pipeline=True,
    clear_lora_before_state_two=True,
)
write_video_audio_ltx2(
    video=video,
    audio=audio,
    output_path='ltx2_twostage_iclora.mp4',
    fps=frame_rate,
    audio_sample_rate=24000,
)
@@ -0,0 +1,77 @@
import torch
from diffsynth.pipelines.ltx2_audio_video import LTX2AudioVideoPipeline, ModelConfig
from diffsynth.utils.data.media_io_ltx2 import write_video_audio_ltx2
from diffsynth.utils.data import VideoData
from modelscope import dataset_snapshot_download

vram_config = {
    "offload_dtype": torch.float8_e5m2,
    "offload_device": "cpu",
    "onload_dtype": torch.float8_e5m2,
    "onload_device": "cpu",
    "preparing_dtype": torch.float8_e5m2,
    "preparing_device": "cuda",
    "computation_dtype": torch.bfloat16,
    "computation_device": "cuda",
}
pipe = LTX2AudioVideoPipeline.from_pretrained(
    torch_dtype=torch.bfloat16,
    device="cuda",
    model_configs=[
        ModelConfig(model_id="google/gemma-3-12b-it-qat-q4_0-unquantized", origin_file_pattern="model-*.safetensors", **vram_config),
        ModelConfig(model_id="DiffSynth-Studio/LTX-2-Repackage", origin_file_pattern="transformer.safetensors", **vram_config),
        ModelConfig(model_id="DiffSynth-Studio/LTX-2-Repackage", origin_file_pattern="text_encoder_post_modules.safetensors", **vram_config),
        ModelConfig(model_id="DiffSynth-Studio/LTX-2-Repackage", origin_file_pattern="video_vae_decoder.safetensors", **vram_config),
        ModelConfig(model_id="DiffSynth-Studio/LTX-2-Repackage", origin_file_pattern="audio_vae_decoder.safetensors", **vram_config),
        ModelConfig(model_id="DiffSynth-Studio/LTX-2-Repackage", origin_file_pattern="audio_vocoder.safetensors", **vram_config),
        ModelConfig(model_id="DiffSynth-Studio/LTX-2-Repackage", origin_file_pattern="video_vae_encoder.safetensors", **vram_config),
        ModelConfig(model_id="Lightricks/LTX-2", origin_file_pattern="ltx-2-spatial-upscaler-x2-1.0.safetensors", **vram_config),
    ],
    tokenizer_config=ModelConfig(model_id="google/gemma-3-12b-it-qat-q4_0-unquantized"),
    stage2_lora_config=ModelConfig(model_id="Lightricks/LTX-2", origin_file_pattern="ltx-2-19b-distilled-lora-384.safetensors"),
)
pipe.load_lora(pipe.dit, ModelConfig(model_id="Lightricks/LTX-2-19b-IC-LoRA-Detailer", origin_file_pattern="ltx-2-19b-ic-lora-detailer.safetensors"))
dataset_snapshot_download("DiffSynth-Studio/example_video_dataset", allow_file_pattern="ltx2/*", local_dir="data/example_video_dataset")

prompt = "[VISUAL]:Two cute orange cats, wearing boxing gloves, stand on a boxing ring and fight each other. [SOUNDS]:the sound of two cats boxing"
negative_prompt = (
    "blurry, out of focus, overexposed, underexposed, low contrast, washed out colors, excessive noise, "
    "grainy texture, poor lighting, flickering, motion blur, distorted proportions, unnatural skin tones, "
    "deformed facial features, asymmetrical face, missing facial features, extra limbs, disfigured hands, "
    "wrong hand count, artifacts around text, inconsistent perspective, camera shake, incorrect depth of "
    "field, background too sharp, background clutter, distracting reflections, harsh shadows, inconsistent "
    "lighting direction, color banding, cartoonish rendering, 3D CGI look, unrealistic materials, uncanny "
    "valley effect, incorrect ethnicity, wrong gender, exaggerated expressions, wrong gaze direction, "
    "mismatched lip sync, silent or muted audio, distorted voice, robotic voice, echo, background noise, "
    "off-sync audio, incorrect dialogue, added dialogue, repetitive speech, jittery movement, awkward "
    "pauses, incorrect timing, unnatural transitions, inconsistent framing, tilted camera, flat lighting, "
    "inconsistent tone, cinematic oversaturation, stylized filters, or AI artifacts."
)
height, width, num_frames = 512 * 2, 768 * 2, 121
ref_scale_factor = 1
frame_rate = 24
# The frame rate should ideally match that of the reference video.
# The reference video's spatial resolution should be the stage-1 generation resolution divided by ref_scale_factor.
input_video = VideoData("data/example_video_dataset/ltx2/video1.mp4", height=height // ref_scale_factor // 2, width=width // ref_scale_factor // 2)
input_video = input_video.raw_data()
video, audio = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    seed=43,
    height=height,
    width=width,
    num_frames=num_frames,
    frame_rate=frame_rate,
    in_context_videos=[input_video],
    in_context_downsample_factor=ref_scale_factor,
    tiled=True,
    use_two_stage_pipeline=True,
    clear_lora_before_state_two=True,
)
write_video_audio_ltx2(
    video=video,
    audio=audio,
    output_path='ltx2_twostage_iclora.mp4',
    fps=frame_rate,
    audio_sample_rate=24000,
)
@@ -0,0 +1,77 @@
import torch
from diffsynth.pipelines.ltx2_audio_video import LTX2AudioVideoPipeline, ModelConfig
from diffsynth.utils.data.media_io_ltx2 import write_video_audio_ltx2
from diffsynth.utils.data import VideoData
from modelscope import dataset_snapshot_download

vram_config = {
    "offload_dtype": torch.float8_e5m2,
    "offload_device": "cpu",
    "onload_dtype": torch.float8_e5m2,
    "onload_device": "cpu",
    "preparing_dtype": torch.float8_e5m2,
    "preparing_device": "cuda",
    "computation_dtype": torch.bfloat16,
    "computation_device": "cuda",
}
pipe = LTX2AudioVideoPipeline.from_pretrained(
    torch_dtype=torch.bfloat16,
    device="cuda",
    model_configs=[
        ModelConfig(model_id="google/gemma-3-12b-it-qat-q4_0-unquantized", origin_file_pattern="model-*.safetensors", **vram_config),
        ModelConfig(model_id="DiffSynth-Studio/LTX-2-Repackage", origin_file_pattern="transformer.safetensors", **vram_config),
        ModelConfig(model_id="DiffSynth-Studio/LTX-2-Repackage", origin_file_pattern="text_encoder_post_modules.safetensors", **vram_config),
        ModelConfig(model_id="DiffSynth-Studio/LTX-2-Repackage", origin_file_pattern="video_vae_decoder.safetensors", **vram_config),
        ModelConfig(model_id="DiffSynth-Studio/LTX-2-Repackage", origin_file_pattern="audio_vae_decoder.safetensors", **vram_config),
        ModelConfig(model_id="DiffSynth-Studio/LTX-2-Repackage", origin_file_pattern="audio_vocoder.safetensors", **vram_config),
        ModelConfig(model_id="DiffSynth-Studio/LTX-2-Repackage", origin_file_pattern="video_vae_encoder.safetensors", **vram_config),
        ModelConfig(model_id="Lightricks/LTX-2", origin_file_pattern="ltx-2-spatial-upscaler-x2-1.0.safetensors", **vram_config),
    ],
    tokenizer_config=ModelConfig(model_id="google/gemma-3-12b-it-qat-q4_0-unquantized"),
    stage2_lora_config=ModelConfig(model_id="Lightricks/LTX-2", origin_file_pattern="ltx-2-19b-distilled-lora-384.safetensors"),
)
pipe.load_lora(pipe.dit, ModelConfig(model_id="Lightricks/LTX-2-19b-IC-LoRA-Union-Control", origin_file_pattern="ltx-2-19b-ic-lora-union-control-ref0.5.safetensors"))
dataset_snapshot_download("DiffSynth-Studio/example_video_dataset", allow_file_pattern="ltx2/*", local_dir="data/example_video_dataset")

prompt = "[VISUAL]:Two cute orange cats, wearing boxing gloves, stand on a boxing ring and fight each other. [SOUNDS]:the sound of two cats boxing"
negative_prompt = (
    "blurry, out of focus, overexposed, underexposed, low contrast, washed out colors, excessive noise, "
    "grainy texture, poor lighting, flickering, motion blur, distorted proportions, unnatural skin tones, "
    "deformed facial features, asymmetrical face, missing facial features, extra limbs, disfigured hands, "
    "wrong hand count, artifacts around text, inconsistent perspective, camera shake, incorrect depth of "
    "field, background too sharp, background clutter, distracting reflections, harsh shadows, inconsistent "
    "lighting direction, color banding, cartoonish rendering, 3D CGI look, unrealistic materials, uncanny "
    "valley effect, incorrect ethnicity, wrong gender, exaggerated expressions, wrong gaze direction, "
    "mismatched lip sync, silent or muted audio, distorted voice, robotic voice, echo, background noise, "
    "off-sync audio, incorrect dialogue, added dialogue, repetitive speech, jittery movement, awkward "
    "pauses, incorrect timing, unnatural transitions, inconsistent framing, tilted camera, flat lighting, "
    "inconsistent tone, cinematic oversaturation, stylized filters, or AI artifacts."
)
height, width, num_frames = 512 * 2, 768 * 2, 121
ref_scale_factor = 2
frame_rate = 24
# The frame rate should ideally match that of the reference video.
# The reference video's spatial resolution should be the stage-1 generation resolution divided by ref_scale_factor.
input_video = VideoData("data/example_video_dataset/ltx2/depth_video.mp4", height=height // ref_scale_factor // 2, width=width // ref_scale_factor // 2)
input_video = input_video.raw_data()
video, audio = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    seed=43,
    height=height,
    width=width,
    num_frames=num_frames,
    frame_rate=frame_rate,
    in_context_videos=[input_video],
    in_context_downsample_factor=ref_scale_factor,
    tiled=True,
    use_two_stage_pipeline=True,
    clear_lora_before_state_two=True,
)
write_video_audio_ltx2(
    video=video,
    audio=audio,
    output_path='ltx2_twostage_iclora.mp4',
    fps=frame_rate,
    audio_sample_rate=24000,
)
@@ -6,7 +6,7 @@ accelerate launch examples/ltx2/model_training/train.py \
 --extra_inputs "input_audio" \
 --height 512 \
 --width 768 \
---num_frames 49 \
+--num_frames 121 \
 --dataset_repeat 1 \
 --model_id_with_origin_paths "DiffSynth-Studio/LTX-2-Repackage:text_encoder_post_modules.safetensors,DiffSynth-Studio/LTX-2-Repackage:video_vae_encoder.safetensors,DiffSynth-Studio/LTX-2-Repackage:audio_vae_encoder.safetensors,google/gemma-3-12b-it-qat-q4_0-unquantized:model-*.safetensors" \
 --learning_rate 1e-4 \
@@ -23,7 +23,7 @@ accelerate launch --config_file examples/qwen_image/model_training/full/accelera
 --extra_inputs "input_audio" \
 --height 512 \
 --width 768 \
---num_frames 49 \
+--num_frames 121 \
 --dataset_repeat 100 \
 --model_id_with_origin_paths "DiffSynth-Studio/LTX-2-Repackage:transformer.safetensors" \
 --learning_rate 1e-4 \
@@ -0,0 +1,39 @@
# Split Training
accelerate launch examples/ltx2/model_training/train.py \
--dataset_base_path data/example_video_dataset/ltx2 \
--dataset_metadata_path data/example_video_dataset/ltx2_t2av_iclora.json \
--data_file_keys "video,input_audio,in_context_videos" \
--extra_inputs "input_audio,in_context_videos,in_context_downsample_factor,frame_rate" \
--height 512 \
--width 768 \
--num_frames 81 \
--dataset_repeat 1 \
--model_id_with_origin_paths "DiffSynth-Studio/LTX-2-Repackage:text_encoder_post_modules.safetensors,DiffSynth-Studio/LTX-2-Repackage:video_vae_encoder.safetensors,DiffSynth-Studio/LTX-2-Repackage:audio_vae_encoder.safetensors,google/gemma-3-12b-it-qat-q4_0-unquantized:model-*.safetensors" \
--learning_rate 1e-4 \
--num_epochs 5 \
--remove_prefix_in_ckpt "pipe.dit." \
--output_path "./models/train/LTX2-T2AV-IC-LoRA-splited-cache" \
--lora_base_model "dit" \
--lora_target_modules "to_k,to_q,to_v,to_out.0" \
--lora_rank 32 \
--use_gradient_checkpointing \
--task "sft:data_process"

accelerate launch examples/ltx2/model_training/train.py \
--dataset_base_path ./models/train/LTX2-T2AV-IC-LoRA-splited-cache \
--data_file_keys "video,input_audio,in_context_videos" \
--extra_inputs "input_audio,in_context_videos,in_context_downsample_factor,frame_rate" \
--height 512 \
--width 768 \
--num_frames 81 \
--dataset_repeat 100 \
--model_id_with_origin_paths "DiffSynth-Studio/LTX-2-Repackage:transformer.safetensors" \
--learning_rate 1e-4 \
--num_epochs 5 \
--remove_prefix_in_ckpt "pipe.dit." \
--output_path "./models/train/LTX2-T2AV-IC-LoRA" \
--lora_base_model "dit" \
--lora_target_modules "to_k,to_q,to_v,to_out.0" \
--lora_rank 32 \
--use_gradient_checkpointing \
--task "sft:train"
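Editor's note, not part of the diff: the first `accelerate launch` command (`--task "sft:data_process"`) writes a preprocessed cache to `./models/train/LTX2-T2AV-IC-LoRA-splited-cache`, which the second command (`--task "sft:train"`) consumes via `--dataset_base_path` to train the LoRA. A minimal sketch of loading the resulting checkpoint for inference, assuming training completes all 5 epochs and `pipe` is an `LTX2AudioVideoPipeline` built as in the inference examples earlier in this diff:

```python
# Sketch only; the path assumes --output_path and --num_epochs from the script above,
# with checkpoints named epoch-<n>.safetensors as in the validation scripts in this diff.
pipe.load_lora(pipe.dit, "./models/train/LTX2-T2AV-IC-LoRA/epoch-4.safetensors")
```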
@@ -24,7 +24,7 @@ accelerate launch examples/ltx2/model_training/train.py \
 --dataset_metadata_path data/example_video_dataset/ltx2_t2av.csv \
 --height 512 \
 --width 768 \
---num_frames 49\
+--num_frames 121\
 --dataset_repeat 1 \
 --model_id_with_origin_paths "DiffSynth-Studio/LTX-2-Repackage:text_encoder_post_modules.safetensors,DiffSynth-Studio/LTX-2-Repackage:video_vae_encoder.safetensors,DiffSynth-Studio/LTX-2-Repackage:audio_vae_encoder.safetensors,google/gemma-3-12b-it-qat-q4_0-unquantized:model-*.safetensors" \
 --learning_rate 1e-4 \
@@ -42,7 +42,7 @@ accelerate launch examples/ltx2/model_training/train.py \
 --dataset_base_path ./models/train/LTX2-T2AV-noaudio_lora-splited-cache \
 --height 512 \
 --width 768 \
---num_frames 49\
+--num_frames 121\
 --dataset_repeat 100 \
 --model_id_with_origin_paths "DiffSynth-Studio/LTX-2-Repackage:transformer.safetensors" \
 --learning_rate 1e-4 \
@@ -27,7 +27,7 @@ accelerate launch examples/ltx2/model_training/train.py \
 --extra_inputs "input_audio" \
 --height 512 \
 --width 768 \
---num_frames 49 \
+--num_frames 121 \
 --dataset_repeat 1 \
 --model_id_with_origin_paths "DiffSynth-Studio/LTX-2-Repackage:text_encoder_post_modules.safetensors,DiffSynth-Studio/LTX-2-Repackage:video_vae_encoder.safetensors,DiffSynth-Studio/LTX-2-Repackage:audio_vae_encoder.safetensors,google/gemma-3-12b-it-qat-q4_0-unquantized:model-*.safetensors" \
 --learning_rate 1e-4 \
@@ -46,7 +46,7 @@ accelerate launch examples/ltx2/model_training/train.py \
 --extra_inputs "input_audio" \
 --height 512 \
 --width 768 \
---num_frames 49 \
+--num_frames 121 \
 --dataset_repeat 100 \
 --model_id_with_origin_paths "DiffSynth-Studio/LTX-2-Repackage:transformer.safetensors" \
 --learning_rate 1e-4 \
@@ -1,7 +1,6 @@
 import torch, os, argparse, accelerate, warnings
 from diffsynth.core import UnifiedDataset
-from diffsynth.core.data.operators import LoadAudioWithTorchaudio, ToAbsolutePath
+from diffsynth.core.data.operators import LoadAudioWithTorchaudio, ToAbsolutePath, RouteByType, SequencialProcess
-from diffsynth.pipelines.wan_video import WanVideoPipeline, ModelConfig
 from diffsynth.pipelines.ltx2_audio_video import LTX2AudioVideoPipeline, ModelConfig
 from diffsynth.diffusion import *
 os.environ["TOKENIZERS_PARALLELISM"] = "false"
@@ -69,6 +68,7 @@ class LTX2TrainingModule(DiffusionTrainingModule):
 "height": data["video"][0].size[1],
 "width": data["video"][0].size[0],
 "num_frames": len(data["video"]),
+"frame_rate": data.get("frame_rate", 24),
 # Please do not modify the following parameters
 # unless you clearly know what this will cause.
 "cfg_scale": 1,
@@ -108,24 +108,29 @@ if __name__ == "__main__":
 gradient_accumulation_steps=args.gradient_accumulation_steps,
 kwargs_handlers=[accelerate.DistributedDataParallelKwargs(find_unused_parameters=args.find_unused_parameters)],
 )
+video_processor = UnifiedDataset.default_video_operator(
+base_path=args.dataset_base_path,
+max_pixels=args.max_pixels,
+height=args.height,
+width=args.width,
+height_division_factor=32,
+width_division_factor=32,
+num_frames=args.num_frames,
+time_division_factor=8,
+time_division_remainder=1,
+)
 dataset = UnifiedDataset(
 base_path=args.dataset_base_path,
 metadata_path=args.dataset_metadata_path,
 repeat=args.dataset_repeat,
 data_file_keys=args.data_file_keys.split(","),
-main_data_operator=UnifiedDataset.default_video_operator(
-base_path=args.dataset_base_path,
-max_pixels=args.max_pixels,
-height=args.height,
-width=args.width,
-height_division_factor=16,
-width_division_factor=16,
-num_frames=args.num_frames,
-time_division_factor=4,
-time_division_remainder=1,
-),
+main_data_operator=video_processor,
 special_operator_map={
 "input_audio": ToAbsolutePath(args.dataset_base_path) >> LoadAudioWithTorchaudio(duration=float(args.num_frames) / float(args.frame_rate)),
+"in_context_videos": RouteByType(operator_map=[
+(str, video_processor),
+(list, SequencialProcess(video_processor)),
+]),
 }
 )
 model = LTX2TrainingModule(
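Editor's note, not part of the diff: the new `special_operator_map` entry lets an `in_context_videos` field hold either a single path or a list of paths; `RouteByType` selects an operator by the value's Python type, and `SequencialProcess` applies the shared video operator to each list element. A rough, self-contained mimic of that dispatch (not the DiffSynth implementation; the names below are stand-ins):

```python
# Toy mimic of type-based routing; video_processor is replaced by a trivial stand-in.
def route_by_type(value, operator_map):
    # Pick the first operator whose registered type matches the value.
    for value_type, operator in operator_map:
        if isinstance(value, value_type):
            return operator(value)
    raise TypeError(f"No operator registered for {type(value).__name__}")

process_one = lambda path: f"processed:{path}"                 # stand-in for video_processor
process_each = lambda paths: [process_one(p) for p in paths]   # stand-in for SequencialProcess(video_processor)

operator_map = [(str, process_one), (list, process_each)]
print(route_by_type("depth_video.mp4", operator_map))          # a single reference video
print(route_by_type(["a.mp4", "b.mp4"], operator_map))         # several reference videos
```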
@@ -27,7 +27,7 @@ pipe = LTX2AudioVideoPipeline.from_pretrained(
 )
 prompt = "A beautiful sunset over the ocean."
 negative_prompt = "blurry, out of focus, overexposed, underexposed, low contrast, washed out colors, excessive noise, grainy texture, poor lighting, flickering, motion blur, distorted proportions, unnatural skin tones, deformed facial features, asymmetrical face, missing facial features, extra limbs, disfigured hands, wrong hand count, artifacts around text, inconsistent perspective, camera shake, incorrect depth of field, background too sharp, background clutter, distracting reflections, harsh shadows, inconsistent lighting direction, color banding, cartoonish rendering, 3D CGI look, unrealistic materials, uncanny valley effect, incorrect ethnicity, wrong gender, exaggerated expressions, wrong gaze direction, mismatched lip sync, silent or muted audio, distorted voice, robotic voice, echo, background noise, off-sync audio, incorrect dialogue, added dialogue, repetitive speech, jittery movement, awkward pauses, incorrect timing, unnatural transitions, inconsistent framing, tilted camera, flat lighting, inconsistent tone, cinematic oversaturation, stylized filters, or AI artifacts."
-height, width, num_frames = 512, 768, 49
+height, width, num_frames = 512, 768, 121
 video, audio = pipe(
 prompt=prompt,
 negative_prompt=negative_prompt,
@@ -0,0 +1,56 @@
import torch
from diffsynth.pipelines.ltx2_audio_video import LTX2AudioVideoPipeline, ModelConfig
from diffsynth.utils.data.media_io_ltx2 import write_video_audio_ltx2
from diffsynth.utils.data import VideoData

vram_config = {
    "offload_dtype": torch.bfloat16,
    "offload_device": "cpu",
    "onload_dtype": torch.bfloat16,
    "onload_device": "cuda",
    "preparing_dtype": torch.bfloat16,
    "preparing_device": "cuda",
    "computation_dtype": torch.bfloat16,
    "computation_device": "cuda",
}
pipe = LTX2AudioVideoPipeline.from_pretrained(
    torch_dtype=torch.bfloat16,
    device="cuda",
    model_configs=[
        ModelConfig(model_id="google/gemma-3-12b-it-qat-q4_0-unquantized", origin_file_pattern="model-*.safetensors", **vram_config),
        ModelConfig(model_id="DiffSynth-Studio/LTX-2-Repackage", origin_file_pattern="transformer.safetensors", **vram_config),
        ModelConfig(model_id="DiffSynth-Studio/LTX-2-Repackage", origin_file_pattern="text_encoder_post_modules.safetensors", **vram_config),
        ModelConfig(model_id="DiffSynth-Studio/LTX-2-Repackage", origin_file_pattern="video_vae_decoder.safetensors", **vram_config),
        ModelConfig(model_id="DiffSynth-Studio/LTX-2-Repackage", origin_file_pattern="audio_vae_decoder.safetensors", **vram_config),
        ModelConfig(model_id="DiffSynth-Studio/LTX-2-Repackage", origin_file_pattern="audio_vocoder.safetensors", **vram_config),
        ModelConfig(model_id="DiffSynth-Studio/LTX-2-Repackage", origin_file_pattern="video_vae_encoder.safetensors", **vram_config),
    ],
    tokenizer_config=ModelConfig(model_id="google/gemma-3-12b-it-qat-q4_0-unquantized"),
)
pipe.load_lora(pipe.dit, "./models/train/LTX2-T2AV-IC-LoRA/epoch-4.safetensors")
prompt = "[VISUAL]:Two cute orange cats, wearing boxing gloves, stand on a boxing ring and fight each other. [SOUNDS]:the sound of two cats boxing"
negative_prompt = "blurry, out of focus, overexposed, underexposed, low contrast, washed out colors, excessive noise, grainy texture, poor lighting, flickering, motion blur, distorted proportions, unnatural skin tones, deformed facial features, asymmetrical face, missing facial features, extra limbs, disfigured hands, wrong hand count, artifacts around text, inconsistent perspective, camera shake, incorrect depth of field, background too sharp, background clutter, distracting reflections, harsh shadows, inconsistent lighting direction, color banding, cartoonish rendering, 3D CGI look, unrealistic materials, uncanny valley effect, incorrect ethnicity, wrong gender, exaggerated expressions, wrong gaze direction, mismatched lip sync, silent or muted audio, distorted voice, robotic voice, echo, background noise, off-sync audio, incorrect dialogue, added dialogue, repetitive speech, jittery movement, awkward pauses, incorrect timing, unnatural transitions, inconsistent framing, tilted camera, flat lighting, inconsistent tone, cinematic oversaturation, stylized filters, or AI artifacts."
height, width, num_frames = 512, 768, 81
ref_scale_factor = 2
frame_rate = 24
input_video = VideoData("data/example_video_dataset/ltx2/depth_video.mp4", height=height // ref_scale_factor // 2, width=width // ref_scale_factor // 2)
input_video = input_video.raw_data()
video, audio = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    seed=43,
    height=height,
    width=width,
    num_frames=num_frames,
    frame_rate=frame_rate,
    tiled=True,
    in_context_videos=[input_video],
    in_context_downsample_factor=ref_scale_factor,
)
write_video_audio_ltx2(
    video=video,
    audio=audio,
    output_path='ltx2_onestage_ic.mp4',
    fps=frame_rate,
    audio_sample_rate=24000,
)
@@ -28,7 +28,7 @@ pipe = LTX2AudioVideoPipeline.from_pretrained(
 pipe.load_lora(pipe.dit, "models/train/LTX2-T2AV_lora/epoch-4.safetensors")
 prompt = "A beautiful sunset over the ocean."
 negative_prompt = "blurry, out of focus, overexposed, underexposed, low contrast, washed out colors, excessive noise, grainy texture, poor lighting, flickering, motion blur, distorted proportions, unnatural skin tones, deformed facial features, asymmetrical face, missing facial features, extra limbs, disfigured hands, wrong hand count, artifacts around text, inconsistent perspective, camera shake, incorrect depth of field, background too sharp, background clutter, distracting reflections, harsh shadows, inconsistent lighting direction, color banding, cartoonish rendering, 3D CGI look, unrealistic materials, uncanny valley effect, incorrect ethnicity, wrong gender, exaggerated expressions, wrong gaze direction, mismatched lip sync, silent or muted audio, distorted voice, robotic voice, echo, background noise, off-sync audio, incorrect dialogue, added dialogue, repetitive speech, jittery movement, awkward pauses, incorrect timing, unnatural transitions, inconsistent framing, tilted camera, flat lighting, inconsistent tone, cinematic oversaturation, stylized filters, or AI artifacts."
-height, width, num_frames = 512, 768, 49
+height, width, num_frames = 512, 768, 121
 video, audio = pipe(
 prompt=prompt,
 negative_prompt=negative_prompt,
@@ -28,7 +28,7 @@ pipe = LTX2AudioVideoPipeline.from_pretrained(
 pipe.load_lora(pipe.dit, "models/train/LTX2-T2AV-noaudio_lora/epoch-4.safetensors")
 prompt = "A beautiful sunset over the ocean."
 negative_prompt = "blurry, out of focus, overexposed, underexposed, low contrast, washed out colors, excessive noise, grainy texture, poor lighting, flickering, motion blur, distorted proportions, unnatural skin tones, deformed facial features, asymmetrical face, missing facial features, extra limbs, disfigured hands, wrong hand count, artifacts around text, inconsistent perspective, camera shake, incorrect depth of field, background too sharp, background clutter, distracting reflections, harsh shadows, inconsistent lighting direction, color banding, cartoonish rendering, 3D CGI look, unrealistic materials, uncanny valley effect, incorrect ethnicity, wrong gender, exaggerated expressions, wrong gaze direction, mismatched lip sync, silent or muted audio, distorted voice, robotic voice, echo, background noise, off-sync audio, incorrect dialogue, added dialogue, repetitive speech, jittery movement, awkward pauses, incorrect timing, unnatural transitions, inconsistent framing, tilted camera, flat lighting, inconsistent tone, cinematic oversaturation, stylized filters, or AI artifacts."
-height, width, num_frames = 512, 768, 49
+height, width, num_frames = 512, 768, 121
 video, audio = pipe(
 prompt=prompt,
 negative_prompt=negative_prompt,