support ltx2.3 inference

This commit is contained in:
mi804
2026-03-06 16:07:17 +08:00
parent c5aaa1da41
commit 73b13f4c86
17 changed files with 1608 additions and 351 deletions


@@ -251,11 +251,27 @@ class Modality:
Input data for a single modality (video or audio) in the transformer.
Bundles the latent tokens, timestep embeddings, positional information,
and text conditioning context for processing by the diffusion transformer.
Attributes:
latent: Patchified latent tokens, shape ``(B, T, D)`` where *B* is
the batch size, *T* is the total number of tokens (noisy +
conditioning), and *D* is the input dimension.
timesteps: Per-token timestep embeddings, shape ``(B, T)``.
positions: Positional coordinates, shape ``(B, 3, T)`` for video
(time, height, width) or ``(B, 1, T)`` for audio.
context: Text conditioning embeddings from the prompt encoder.
enabled: Whether this modality is active in the current forward pass.
context_mask: Optional mask for the text context tokens.
attention_mask: Optional self-attention mask, shape ``(B, T, T)``
(one 2-D ``(T, T)`` mask per batch element).
Values in ``[0, 1]`` where ``1`` = full attention and ``0`` = no
attention. ``None`` means unrestricted (full) attention between
all tokens. Built incrementally by conditioning items; see
:class:`~ltx_core.conditioning.types.attention_strength_wrapper.ConditioningItemAttentionStrengthWrapper`.
"""
latent: (
torch.Tensor
) # Shape: (B, T, D) where B is the batch size, T is the number of tokens, and D is input dimension
sigma: torch.Tensor # Shape: (B,). Current sigma value, used for cross-attention timestep calculation.
timesteps: torch.Tensor # Shape: (B, T) where T is the number of tokens
positions: (
torch.Tensor
@@ -263,6 +279,7 @@ class Modality:
context: torch.Tensor
enabled: bool = True
context_mask: torch.Tensor | None = None
attention_mask: torch.Tensor | None = None
def to_denoised(
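To illustrate the ``attention_mask`` semantics documented above, here is a minimal sketch (not part of this commit; the helper name, `cond_ranges` parameter, and blending scheme are assumptions) of building a ``(B, T, T)`` mask where attention toward conditioning-token ranges is scaled by a strength in ``[0, 1]``, with ``1`` meaning full attention and ``0`` meaning none:

```python
import torch

def build_attention_mask(batch: int, num_tokens: int,
                         cond_ranges: list[tuple[int, int]],
                         strength: float) -> torch.Tensor:
    # Start from unrestricted (full) attention between all tokens,
    # then scale attention *into* each conditioning-token range by
    # `strength`, mirroring the [0, 1] convention on
    # Modality.attention_mask (1 = full attention, 0 = no attention).
    mask = torch.ones(batch, num_tokens, num_tokens)
    for start, end in cond_ranges:
        mask[:, :, start:end] *= strength
    return mask

# Hypothetical usage: 6 tokens, the last 2 are conditioning tokens
# attended to at half strength.
mask = build_attention_mask(batch=1, num_tokens=6,
                            cond_ranges=[(4, 6)], strength=0.5)
print(mask.shape)  # torch.Size([1, 6, 6])
```

A ``None`` mask is equivalent to the all-ones case above, so conditioning items can start from ``None`` and only materialize a tensor when they need to weaken some attention edges.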