Wan video (#338)

2026-03-18 22:08:13 +00:00 · 2025-02-25 19:00:43 +08:00
parent 427232cbc0
commit af7d305f00
18 changed files with 3892 additions and 5 deletions
--- a/examples/wanvideo/README.md
+++ b/examples/wanvideo/README.md
@@ -0,0 +1,144 @@
+# Wan-Video
+
+Wan-Video is a collection of video synthesis models open-sourced by Alibaba.
+
+## Inference
+
+### Wan-Video-1.3B-T2V
+
+Wan-Video-1.3B-T2V supports text-to-video and video-to-video. See [`./wan_1.3b_text_to_video.py`](./wan_1.3b_text_to_video.py).
+
+Required VRAM: 6G
+
+https://github.com/user-attachments/assets/124397be-cd6a-4f29-a87c-e4c695aaabb8
+
+Put sunglasses on the dog.
+
+https://github.com/user-attachments/assets/272808d7-fbeb-4747-a6df-14a0860c75fb
+
+### Wan-Video-14B-T2V
+
+Wan-Video-14B-T2V is a larger, more capable version of Wan-Video-1.3B-T2V and needs more VRAM. We recommend adjusting the `torch_dtype` and `num_persistent_param_in_dit` settings to balance speed against VRAM usage. See [`./wan_14b_text_to_video.py`](./wan_14b_text_to_video.py).
+
+The table below lists the measured speed and VRAM usage for each configuration. All configurations were tested on a single A100.
+
+|`torch_dtype`|`num_persistent_param_in_dit`|Speed|Required VRAM|Default Setting|
+|-|-|-|-|-|
+|torch.bfloat16|None (unlimited)|18.5s/it|40G||
+|torch.bfloat16|7*10**9 (7B)|20.8s/it|24G||
+|torch.bfloat16|0|23.4s/it|10G||
+|torch.float8_e4m3fn|None (unlimited)|18.3s/it|24G|yes|
+|torch.float8_e4m3fn|0|24.0s/it|10G||
+
+https://github.com/user-attachments/assets/3908bc64-d451-485a-8b61-28f6d32dd92f
+
+### Wan-Video-14B-I2V
+
+Wan-Video-14B-I2V adds image-to-video support on top of Wan-Video-14B-T2V. The model size is unchanged, so the speed and VRAM requirements are the same. See [`./wan_14b_image_to_video.py`](./wan_14b_image_to_video.py).
+
+![Image](https://github.com/user-attachments/assets/adf8047f-7943-4aaa-a555-2b32dc415f39)
+
+https://github.com/user-attachments/assets/c0bdd5ca-292f-45ed-b9bc-afe193156e75
+
+## Train
+
+We support Wan-Video LoRA training. Here is a tutorial.
+
+Step 1: Install additional packages
+
+```
+pip install peft lightning pandas
+```
+
+Step 2: Prepare your dataset
+
+Organize the training videos as follows:
+
+```
+data/example_dataset/
+├── metadata.csv
+└── train
+    ├── video_00001.mp4
+    └── video_00002.mp4
+```
+
+`metadata.csv`:
+
+```
+file_name,text
+video_00001.mp4,"video description"
+video_00002.mp4,"video description"
+```
+
+Step 3: Process the data
+
+```shell
+CUDA_VISIBLE_DEVICES="0" python examples/wanvideo/train_wan_t2v.py \
+  --task data_process \
+  --dataset_path data/example_dataset \
+  --output_path ./models \
+  --text_encoder_path "models/Wan-AI/Wan2.1-T2V-1.3B/models_t5_umt5-xxl-enc-bf16.pth" \
+  --vae_path "models/Wan-AI/Wan2.1-T2V-1.3B/Wan2.1_VAE.pth" \
+  --tiled \
+  --num_frames 81 \
+  --height 480 \
+  --width 832
+```
+
+After that, some cached files will be stored in the dataset folder.
+
+```
+data/example_dataset/
+├── metadata.csv
+└── train
+    ├── video_00001.mp4
+    ├── video_00001.mp4.tensors.pth
+    ├── video_00002.mp4
+    └── video_00002.mp4.tensors.pth
+```
+
+Step 4: Train
+
+```shell
+CUDA_VISIBLE_DEVICES="0" python examples/wanvideo/train_wan_t2v.py \
+  --task train \
+  --dataset_path data/example_dataset \
+  --output_path ./models \
+  --dit_path "models/Wan-AI/Wan2.1-T2V-1.3B/diffusion_pytorch_model.safetensors" \
+  --steps_per_epoch 500 \
+  --max_epochs 10 \
+  --learning_rate 1e-4 \
+  --lora_rank 4 \
+  --lora_alpha 4 \
+  --lora_target_modules "q,k,v,o,ffn.0,ffn.2" \
+  --accumulate_grad_batches 1 \
+  --use_gradient_checkpointing
+```
+
+Step 5: Test
+
+```python
+import torch
+from diffsynth import ModelManager, WanVideoPipeline, save_video, VideoData
+
+
+model_manager = ModelManager(torch_dtype=torch.bfloat16, device="cpu")
+model_manager.load_models([
+    "models/Wan-AI/Wan2.1-T2V-1.3B/diffusion_pytorch_model.safetensors",
+    "models/Wan-AI/Wan2.1-T2V-1.3B/models_t5_umt5-xxl-enc-bf16.pth",
+    "models/Wan-AI/Wan2.1-T2V-1.3B/Wan2.1_VAE.pth",
+])
+model_manager.load_lora("models/lightning_logs/version_1/checkpoints/epoch=0-step=500.ckpt", lora_alpha=1.0)
+
+pipe = WanVideoPipeline.from_model_manager(model_manager, device="cuda")
+pipe.enable_vram_management(num_persistent_param_in_dit=None)
+
+# Text-to-video
+video = pipe(
+    prompt="...",
+    negative_prompt="...",
+    num_inference_steps=50,
+    seed=0, tiled=True
+)
+save_video(video, "video_with_lora.mp4", fps=30, quality=5)
+```
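
Review note on Step 2 of the README above: the data-processing task builds each video path as `<dataset_path>/train/<file_name>` from `metadata.csv`. A minimal sketch of a pre-flight check for that layout follows. `check_dataset` is a hypothetical helper written for this note, not part of DiffSynth-Studio, and it assumes only the `file_name` and `text` columns shown in the README.

```python
import csv
from collections import Counter
from pathlib import Path


def check_dataset(dataset_path):
    """Return (missing, duplicated) file names for a dataset folder.

    Assumes the README layout: <dataset_path>/metadata.csv with `file_name`
    and `text` columns, and the videos stored under <dataset_path>/train/.
    """
    root = Path(dataset_path)
    with open(root / "metadata.csv", newline="", encoding="utf-8") as f:
        names = [row["file_name"] for row in csv.DictReader(f)]

    # Rows whose video file is not present under train/.
    missing = [n for n in names if not (root / "train" / n).is_file()]
    # File names that appear in more than one metadata row.
    duplicated = sorted(n for n, count in Counter(names).items() if count > 1)
    return missing, duplicated
```

Calling `check_dataset("data/example_dataset")` before Step 3 returns two lists; both should be empty for a dataset that matches the layout.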
--- a/examples/wanvideo/train_wan_t2v.py
+++ b/examples/wanvideo/train_wan_t2v.py
@@ -0,0 +1,460 @@
+import torch, os, imageio, argparse
+from torchvision.transforms import v2
+from einops import rearrange
+import lightning as pl
+import pandas as pd
+from diffsynth import WanVideoPipeline, ModelManager
+from peft import LoraConfig, inject_adapter_in_model
+import torchvision
+from PIL import Image
+
+
+
+class TextVideoDataset(torch.utils.data.Dataset):
+    def __init__(self, base_path, metadata_path, max_num_frames=81, frame_interval=1, num_frames=81, height=480, width=832):
+        metadata = pd.read_csv(metadata_path)
+        self.path = [os.path.join(base_path, "train", file_name) for file_name in metadata["file_name"]]
+        self.text = metadata["text"].to_list()
+        
+        self.max_num_frames = max_num_frames
+        self.frame_interval = frame_interval
+        self.num_frames = num_frames
+        self.height = height
+        self.width = width
+            
+        self.frame_process = v2.Compose([
+            v2.Resize(size=(height, width), antialias=True),
+            v2.CenterCrop(size=(height, width)),
+            v2.ToTensor(),
+            v2.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
+        ])
+        
+        
+    def crop_and_resize(self, image):
+        width, height = image.size
+        scale = max(self.width / width, self.height / height)
+        image = torchvision.transforms.functional.resize(
+            image,
+            (round(height*scale), round(width*scale)),
+            interpolation=torchvision.transforms.InterpolationMode.BILINEAR
+        )
+        return image
+
+
+    def load_frames_using_imageio(self, file_path, max_num_frames, start_frame_id, interval, num_frames, frame_process):
+        reader = imageio.get_reader(file_path)
+        if reader.count_frames() < max_num_frames or reader.count_frames() - 1 < start_frame_id + (num_frames - 1) * interval:
+            reader.close()
+            return None
+        
+        frames = []
+        for frame_id in range(num_frames):
+            frame = reader.get_data(start_frame_id + frame_id * interval)
+            frame = Image.fromarray(frame)
+            frame = self.crop_and_resize(frame)
+            frame = frame_process(frame)
+            frames.append(frame)
+        reader.close()
+
+        frames = torch.stack(frames, dim=0)
+        frames = rearrange(frames, "T C H W -> C T H W")
+
+        return frames
+
+
+    def load_video(self, file_path):
+        start_frame_id = torch.randint(0, self.max_num_frames - (self.num_frames - 1) * self.frame_interval, (1,))[0]
+        frames = self.load_frames_using_imageio(file_path, self.max_num_frames, start_frame_id, self.frame_interval, self.num_frames, self.frame_process)
+        return frames
+    
+    
+    def load_text_video_raw_data(self, data_id):
+        text = self.text[data_id]
+        video = self.load_video(self.path[data_id])
+        data = {"text": text, "video": video}
+        return data
+
+
+    def __getitem__(self, data_id):
+        text = self.text[data_id]
+        path = self.path[data_id]
+        video = self.load_video(path)
+        data = {"text": text, "video": video, "path": path}
+        return data
+    
+
+    def __len__(self):
+        return len(self.path)
+
+
+
+class LightningModelForDataProcess(pl.LightningModule):
+    def __init__(self, text_encoder_path, vae_path, tiled=False, tile_size=(34, 34), tile_stride=(18, 16)):
+        super().__init__()
+        model_manager = ModelManager(torch_dtype=torch.bfloat16, device="cpu")
+        model_manager.load_models([text_encoder_path, vae_path])
+        self.pipe = WanVideoPipeline.from_model_manager(model_manager)
+
+        self.tiler_kwargs = {"tiled": tiled, "tile_size": tile_size, "tile_stride": tile_stride}
+        
+    def test_step(self, batch, batch_idx):
+        text, video, path = batch["text"][0], batch["video"], batch["path"][0]
+        self.pipe.device = self.device
+        if video is not None:
+            prompt_emb = self.pipe.encode_prompt(text)
+            latents = self.pipe.encode_video(video, **self.tiler_kwargs)[0]
+            data = {"latents": latents, "prompt_emb": prompt_emb}
+            torch.save(data, path + ".tensors.pth")
+
+
+
+class TensorDataset(torch.utils.data.Dataset):
+    def __init__(self, base_path, metadata_path, steps_per_epoch):
+        metadata = pd.read_csv(metadata_path)
+        self.path = [os.path.join(base_path, "train", file_name) for file_name in metadata["file_name"]]
+        print(len(self.path), "videos in metadata.")
+        self.path = [i + ".tensors.pth" for i in self.path if os.path.exists(i + ".tensors.pth")]
+        print(len(self.path), "tensors cached in metadata.")
+        assert len(self.path) > 0
+        
+        self.steps_per_epoch = steps_per_epoch
+
+
+    def __getitem__(self, index):
+        data_id = torch.randint(0, len(self.path), (1,))[0]
+        data_id = (data_id + index) % len(self.path) # For fixed seed.
+        path = self.path[data_id]
+        data = torch.load(path, weights_only=True, map_location="cpu")
+        return data
+    
+
+    def __len__(self):
+        return self.steps_per_epoch
+
+
+
+class LightningModelForTrain(pl.LightningModule):
+    def __init__(self, dit_path, learning_rate=1e-5, lora_rank=4, lora_alpha=4, lora_target_modules="q,k,v,o,ffn.0,ffn.2", init_lora_weights="kaiming", use_gradient_checkpointing=True):
+        super().__init__()
+        model_manager = ModelManager(torch_dtype=torch.bfloat16, device="cpu")
+        model_manager.load_models([dit_path])
+        
+        self.pipe = WanVideoPipeline.from_model_manager(model_manager)
+        self.pipe.scheduler.set_timesteps(1000, training=True)
+        self.freeze_parameters()
+        self.add_lora_to_model(
+            self.pipe.denoising_model(),
+            lora_rank=lora_rank,
+            lora_alpha=lora_alpha,
+            lora_target_modules=lora_target_modules,
+            init_lora_weights=init_lora_weights,
+        )
+        
+        self.learning_rate = learning_rate
+        self.use_gradient_checkpointing = use_gradient_checkpointing
+        
+        
+    def freeze_parameters(self):
+        # Freeze parameters
+        self.pipe.requires_grad_(False)
+        self.pipe.eval()
+        self.pipe.denoising_model().train()
+        
+        
+    def add_lora_to_model(self, model, lora_rank=4, lora_alpha=4, lora_target_modules="q,k,v,o,ffn.0,ffn.2", init_lora_weights="kaiming"):
+        # Add LoRA to the denoising model (DiT)
+        self.lora_alpha = lora_alpha
+        if init_lora_weights == "kaiming":
+            init_lora_weights = True
+            
+        lora_config = LoraConfig(
+            r=lora_rank,
+            lora_alpha=lora_alpha,
+            init_lora_weights=init_lora_weights,
+            target_modules=lora_target_modules.split(","),
+        )
+        model = inject_adapter_in_model(lora_config, model)
+        for param in model.parameters():
+            # Upcast LoRA parameters into fp32
+            if param.requires_grad:
+                param.data = param.to(torch.float32)
+    
+
+    def training_step(self, batch, batch_idx):
+        # Data
+        latents = batch["latents"].to(self.device)
+        prompt_emb = batch["prompt_emb"]
+        prompt_emb["context"] = [prompt_emb["context"][0][0].to(self.device)]
+        
+        # Loss
+        noise = torch.randn_like(latents)
+        timestep_id = torch.randint(0, self.pipe.scheduler.num_train_timesteps, (1,))
+        timestep = self.pipe.scheduler.timesteps[timestep_id].to(self.device)
+        extra_input = self.pipe.prepare_extra_input(latents)
+        noisy_latents = self.pipe.scheduler.add_noise(latents, noise, timestep)
+        training_target = self.pipe.scheduler.training_target(latents, noise, timestep)
+
+        # Compute loss
+        with torch.amp.autocast(dtype=torch.bfloat16, device_type=torch.device(self.device).type):
+            noise_pred = self.pipe.denoising_model()(
+                noisy_latents, timestep=timestep, **prompt_emb, **extra_input,
+                use_gradient_checkpointing=self.use_gradient_checkpointing
+            )
+            loss = torch.nn.functional.mse_loss(noise_pred.float(), training_target.float())
+            loss = loss * self.pipe.scheduler.training_weight(timestep)
+
+        # Record log
+        self.log("train_loss", loss, prog_bar=True)
+        return loss
+
+
+    def configure_optimizers(self):
+        trainable_modules = filter(lambda p: p.requires_grad, self.pipe.denoising_model().parameters())
+        optimizer = torch.optim.AdamW(trainable_modules, lr=self.learning_rate)
+        return optimizer
+    
+
+    def on_save_checkpoint(self, checkpoint):
+        checkpoint.clear()
+        trainable_param_names = list(filter(lambda named_param: named_param[1].requires_grad, self.pipe.denoising_model().named_parameters()))
+        trainable_param_names = set([named_param[0] for named_param in trainable_param_names])
+        state_dict = self.pipe.denoising_model().state_dict()
+        lora_state_dict = {}
+        for name, param in state_dict.items():
+            if name in trainable_param_names:
+                lora_state_dict[name] = param
+        checkpoint.update(lora_state_dict)
+
+
+
+def parse_args():
+    parser = argparse.ArgumentParser(description="Simple example of a training script.")
+    parser.add_argument(
+        "--task",
+        type=str,
+        default="data_process",
+        required=True,
+        choices=["data_process", "train"],
+        help="Task. `data_process` or `train`.",
+    )
+    parser.add_argument(
+        "--dataset_path",
+        type=str,
+        default=None,
+        required=True,
+        help="The path of the Dataset.",
+    )
+    parser.add_argument(
+        "--output_path",
+        type=str,
+        default="./",
+        help="Path to save the model.",
+    )
+    parser.add_argument(
+        "--text_encoder_path",
+        type=str,
+        default=None,
+        help="Path of text encoder.",
+    )
+    parser.add_argument(
+        "--vae_path",
+        type=str,
+        default=None,
+        help="Path of VAE.",
+    )
+    parser.add_argument(
+        "--dit_path",
+        type=str,
+        default=None,
+        help="Path of DiT.",
+    )
+    parser.add_argument(
+        "--tiled",
+        default=False,
+        action="store_true",
+        help="Whether to enable tiled encoding in the VAE. This option can reduce the VRAM required.",
+    )
+    parser.add_argument(
+        "--tile_size_height",
+        type=int,
+        default=34,
+        help="Tile size (height) in VAE.",
+    )
+    parser.add_argument(
+        "--tile_size_width",
+        type=int,
+        default=34,
+        help="Tile size (width) in VAE.",
+    )
+    parser.add_argument(
+        "--tile_stride_height",
+        type=int,
+        default=18,
+        help="Tile stride (height) in VAE.",
+    )
+    parser.add_argument(
+        "--tile_stride_width",
+        type=int,
+        default=16,
+        help="Tile stride (width) in VAE.",
+    )
+    parser.add_argument(
+        "--steps_per_epoch",
+        type=int,
+        default=500,
+        help="Number of steps per epoch.",
+    )
+    parser.add_argument(
+        "--num_frames",
+        type=int,
+        default=81,
+        help="Number of frames.",
+    )
+    parser.add_argument(
+        "--height",
+        type=int,
+        default=480,
+        help="Image height.",
+    )
+    parser.add_argument(
+        "--width",
+        type=int,
+        default=832,
+        help="Image width.",
+    )
+    parser.add_argument(
+        "--dataloader_num_workers",
+        type=int,
+        default=1,
+        help="Number of subprocesses to use for data loading. 0 means that the data will be loaded in the main process.",
+    )
+    parser.add_argument(
+        "--learning_rate",
+        type=float,
+        default=1e-5,
+        help="Learning rate.",
+    )
+    parser.add_argument(
+        "--accumulate_grad_batches",
+        type=int,
+        default=1,
+        help="The number of batches in gradient accumulation.",
+    )
+    parser.add_argument(
+        "--max_epochs",
+        type=int,
+        default=1,
+        help="Number of epochs.",
+    )
+    parser.add_argument(
+        "--lora_target_modules",
+        type=str,
+        default="q,k,v,o,ffn.0,ffn.2",
+        help="Layers with LoRA modules.",
+    )
+    parser.add_argument(
+        "--init_lora_weights",
+        type=str,
+        default="kaiming",
+        choices=["gaussian", "kaiming"],
+        help="The initializing method of LoRA weight.",
+    )
+    parser.add_argument(
+        "--training_strategy",
+        type=str,
+        default="auto",
+        choices=["auto", "deepspeed_stage_1", "deepspeed_stage_2", "deepspeed_stage_3"],
+        help="Training strategy",
+    )
+    parser.add_argument(
+        "--lora_rank",
+        type=int,
+        default=4,
+        help="The dimension of the LoRA update matrices.",
+    )
+    parser.add_argument(
+        "--lora_alpha",
+        type=float,
+        default=4.0,
+        help="The weight of the LoRA update matrices.",
+    )
+    parser.add_argument(
+        "--use_gradient_checkpointing",
+        default=False,
+        action="store_true",
+        help="Whether to use gradient checkpointing.",
+    )
+    args = parser.parse_args()
+    return args
+
+
+def data_process(args):
+    dataset = TextVideoDataset(
+        args.dataset_path,
+        os.path.join(args.dataset_path, "metadata.csv"),
+        max_num_frames=args.num_frames,
+        frame_interval=1,
+        num_frames=args.num_frames,
+        height=args.height,
+        width=args.width
+    )
+    dataloader = torch.utils.data.DataLoader(
+        dataset,
+        shuffle=False,
+        batch_size=1,
+        num_workers=args.dataloader_num_workers
+    )
+    model = LightningModelForDataProcess(
+        text_encoder_path=args.text_encoder_path,
+        vae_path=args.vae_path,
+        tiled=args.tiled,
+        tile_size=(args.tile_size_height, args.tile_size_width),
+        tile_stride=(args.tile_stride_height, args.tile_stride_width),
+    )
+    trainer = pl.Trainer(
+        accelerator="gpu",
+        devices="auto",
+        default_root_dir=args.output_path,
+    )
+    trainer.test(model, dataloader)
+    
+    
+def train(args):
+    dataset = TensorDataset(
+        args.dataset_path,
+        os.path.join(args.dataset_path, "metadata.csv"),
+        steps_per_epoch=args.steps_per_epoch,
+    )
+    dataloader = torch.utils.data.DataLoader(
+        dataset,
+        shuffle=True,
+        batch_size=1,
+        num_workers=args.dataloader_num_workers
+    )
+    model = LightningModelForTrain(
+        dit_path=args.dit_path,
+        learning_rate=args.learning_rate,
+        lora_rank=args.lora_rank,
+        lora_alpha=args.lora_alpha,
+        lora_target_modules=args.lora_target_modules,
+        init_lora_weights=args.init_lora_weights,
+        use_gradient_checkpointing=args.use_gradient_checkpointing
+    )
+    trainer = pl.Trainer(
+        max_epochs=args.max_epochs,
+        accelerator="gpu",
+        devices="auto",
+        strategy=args.training_strategy,
+        default_root_dir=args.output_path,
+        accumulate_grad_batches=args.accumulate_grad_batches,
+        callbacks=[pl.pytorch.callbacks.ModelCheckpoint(save_top_k=-1)]
+    )
+    trainer.fit(model, dataloader)
+
+
+if __name__ == '__main__':
+    args = parse_args()
+    if args.task == "data_process":
+        data_process(args)
+    elif args.task == "train":
+        train(args)
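
Review note on `TextVideoDataset` above: the frame-sampling window is spread across `load_video` (random start) and `load_frames_using_imageio` (length check). The sketch below restates that arithmetic in plain Python so the edge cases are easier to see. Both function names are hypothetical and introduced only for this note; the formulas are copied from the two methods.

```python
def num_valid_starts(max_num_frames, num_frames, interval):
    """Exclusive upper bound that `load_video` passes to torch.randint."""
    return max_num_frames - (num_frames - 1) * interval


def select_frame_indices(total_frames, max_num_frames, start_frame_id, interval, num_frames):
    """Return the frame indices to read, or None if the clip is too short.

    Mirrors the check in `load_frames_using_imageio`: the clip needs at least
    `max_num_frames` frames, and the last sampled index must exist.
    """
    last_index = start_frame_id + (num_frames - 1) * interval
    if total_frames < max_num_frames or total_frames - 1 < last_index:
        return None
    return [start_frame_id + i * interval for i in range(num_frames)]
```

With the values `data_process` passes (`max_num_frames` and `num_frames` both equal to `--num_frames`, interval 1), `num_valid_starts` is 1, so the only possible start is frame 0 and any clip shorter than `--num_frames` yields `None`.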
--- a/examples/wanvideo/wan_1.3b_text_to_video.py
+++ b/examples/wanvideo/wan_1.3b_text_to_video.py
@@ -0,0 +1,40 @@
+import torch
+from diffsynth import ModelManager, WanVideoPipeline, save_video, VideoData
+from modelscope import snapshot_download
+
+
+# Download models
+snapshot_download("Wan-AI/Wan2.1-T2V-1.3B", cache_dir="models")
+
+# Load models
+model_manager = ModelManager(device="cpu")
+model_manager.load_models(
+    [
+        "models/Wan-AI/Wan2.1-T2V-1.3B/diffusion_pytorch_model.safetensors",
+        "models/Wan-AI/Wan2.1-T2V-1.3B/models_t5_umt5-xxl-enc-bf16.pth",
+        "models/Wan-AI/Wan2.1-T2V-1.3B/Wan2.1_VAE.pth",
+    ],
+    torch_dtype=torch.bfloat16, # You can set `torch_dtype=torch.float8_e4m3fn` to enable FP8 quantization.
+)
+pipe = WanVideoPipeline.from_model_manager(model_manager, torch_dtype=torch.bfloat16, device="cuda")
+pipe.enable_vram_management(num_persistent_param_in_dit=None)
+
+# Text-to-video
+video = pipe(
+    prompt="纪实摄影风格画面，一只活泼的小狗在绿茵茵的草地上迅速奔跑。小狗毛色棕黄，两只耳朵立起，神情专注而欢快。阳光洒在它身上，使得毛发看上去格外柔软而闪亮。背景是一片开阔的草地，偶尔点缀着几朵野花，远处隐约可见蓝天和几片白云。透视感鲜明，捕捉小狗奔跑时的动感和四周草地的生机。中景侧面移动视角。",
+    negative_prompt="色调艳丽，过曝，静态，细节模糊不清，字幕，风格，作品，画作，画面，静止，整体发灰，最差质量，低质量，JPEG压缩残留，丑陋的，残缺的，多余的手指，画得不好的手部，画得不好的脸部，畸形的，毁容的，形态畸形的肢体，手指融合，静止不动的画面，杂乱的背景，三条腿，背景人很多，倒着走",
+    num_inference_steps=50,
+    seed=0, tiled=True
+)
+save_video(video, "video1.mp4", fps=15, quality=5)
+
+# Video-to-video
+video = VideoData("video1.mp4", height=480, width=832)
+video = pipe(
+    prompt="纪实摄影风格画面，一只活泼的小狗戴着黑色墨镜在绿茵茵的草地上迅速奔跑。小狗毛色棕黄，戴着黑色墨镜，两只耳朵立起，神情专注而欢快。阳光洒在它身上，使得毛发看上去格外柔软而闪亮。背景是一片开阔的草地，偶尔点缀着几朵野花，远处隐约可见蓝天和几片白云。透视感鲜明，捕捉小狗奔跑时的动感和四周草地的生机。中景侧面移动视角。",
+    negative_prompt="色调艳丽，过曝，静态，细节模糊不清，字幕，风格，作品，画作，画面，静止，整体发灰，最差质量，低质量，JPEG压缩残留，丑陋的，残缺的，多余的手指，画得不好的手部，画得不好的脸部，畸形的，毁容的，形态畸形的肢体，手指融合，静止不动的画面，杂乱的背景，三条腿，背景人很多，倒着走",
+    input_video=video, denoising_strength=0.7,
+    num_inference_steps=50,
+    seed=1, tiled=True
+)
+save_video(video, "video2.mp4", fps=15, quality=5)
--- a/examples/wanvideo/wan_14b_image_to_video.py
+++ b/examples/wanvideo/wan_14b_image_to_video.py
@@ -0,0 +1,48 @@
+import torch
+from diffsynth import ModelManager, WanVideoPipeline, save_video, VideoData
+from modelscope import snapshot_download, dataset_snapshot_download
+from PIL import Image
+
+
+# Download models
+snapshot_download("Wan-AI/Wan2.1-I2V-14B-480P", cache_dir="models")
+
+# Load models
+model_manager = ModelManager(device="cpu")
+model_manager.load_models(
+    [
+        [
+            "models/Wan-AI/Wan2.1-I2V-14B-480P/diffusion_pytorch_model-00001-of-00007.safetensors",
+            "models/Wan-AI/Wan2.1-I2V-14B-480P/diffusion_pytorch_model-00002-of-00007.safetensors",
+            "models/Wan-AI/Wan2.1-I2V-14B-480P/diffusion_pytorch_model-00003-of-00007.safetensors",
+            "models/Wan-AI/Wan2.1-I2V-14B-480P/diffusion_pytorch_model-00004-of-00007.safetensors",
+            "models/Wan-AI/Wan2.1-I2V-14B-480P/diffusion_pytorch_model-00005-of-00007.safetensors",
+            "models/Wan-AI/Wan2.1-I2V-14B-480P/diffusion_pytorch_model-00006-of-00007.safetensors",
+            "models/Wan-AI/Wan2.1-I2V-14B-480P/diffusion_pytorch_model-00007-of-00007.safetensors",
+        ],
+        "models/Wan-AI/Wan2.1-I2V-14B-480P/models_clip_open-clip-xlm-roberta-large-vit-huge-14.pth",
+        "models/Wan-AI/Wan2.1-I2V-14B-480P/models_t5_umt5-xxl-enc-bf16.pth",
+        "models/Wan-AI/Wan2.1-I2V-14B-480P/Wan2.1_VAE.pth",
+    ],
+    torch_dtype=torch.float8_e4m3fn, # You can set `torch_dtype=torch.bfloat16` to disable FP8 quantization.
+)
+pipe = WanVideoPipeline.from_model_manager(model_manager, torch_dtype=torch.bfloat16, device="cuda")
+pipe.enable_vram_management(num_persistent_param_in_dit=None) # You can set `num_persistent_param_in_dit` to a small number to reduce VRAM required.
+
+# Download example image
+dataset_snapshot_download(
+    dataset_id="DiffSynth-Studio/examples_in_diffsynth",
+    local_dir="./",
+    allow_file_pattern="data/examples/wan/input_image.jpg"
+)
+image = Image.open("data/examples/wan/input_image.jpg")
+
+# Image-to-video
+video = pipe(
+    prompt="一艘小船正勇敢地乘风破浪前行。蔚蓝的大海波涛汹涌，白色的浪花拍打着船身，但小船毫不畏惧，坚定地驶向远方。阳光洒在水面上，闪烁着金色的光芒，为这壮丽的场景增添了一抹温暖。镜头拉近，可以看到船上的旗帜迎风飘扬，象征着不屈的精神与冒险的勇气。这段画面充满力量，激励人心，展现了面对挑战时的无畏与执着。",
+    negative_prompt="色调艳丽，过曝，静态，细节模糊不清，字幕，风格，作品，画作，画面，静止，整体发灰，最差质量，低质量，JPEG压缩残留，丑陋的，残缺的，多余的手指，画得不好的手部，画得不好的脸部，畸形的，毁容的，形态畸形的肢体，手指融合，静止不动的画面，杂乱的背景，三条腿，背景人很多，倒着走",
+    input_image=image,
+    num_inference_steps=50,
+    seed=0, tiled=True
+)
+save_video(video, "video.mp4", fps=15, quality=5)
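
Review note on the shard list in the script above: the seven `diffusion_pytorch_model-0000N-of-00007.safetensors` paths follow a fixed pattern, so they could be generated instead of written out. This is only a sketch of that idea; `shard_paths` is a hypothetical helper, and the shard count must still match the files actually downloaded.

```python
def shard_paths(model_dir, num_shards, stem="diffusion_pytorch_model"):
    """Build the list of sharded safetensors paths, numbered from 1."""
    return [
        f"{model_dir}/{stem}-{i:05d}-of-{num_shards:05d}.safetensors"
        for i in range(1, num_shards + 1)
    ]
```

The result would take the place of the inner list passed to `model_manager.load_models`, for example `shard_paths("models/Wan-AI/Wan2.1-I2V-14B-480P", 7)`.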
--- a/examples/wanvideo/wan_14b_text_to_video.py
+++ b/examples/wanvideo/wan_14b_text_to_video.py
@@ -0,0 +1,38 @@
+import torch
+from diffsynth import ModelManager, WanVideoPipeline, save_video, VideoData
+from modelscope import snapshot_download
+
+
+# Download models
+snapshot_download("Wan-AI/Wan2.1-T2V-14B", cache_dir="models")
+
+# Load models
+model_manager = ModelManager(device="cpu")
+model_manager.load_models(
+    [
+        [
+            "models/Wan-AI/Wan2.1-T2V-14B/diffusion_pytorch_model-00001-of-00007.safetensors",
+            "models/Wan-AI/Wan2.1-T2V-14B/diffusion_pytorch_model-00002-of-00007.safetensors",
+            "models/Wan-AI/Wan2.1-T2V-14B/diffusion_pytorch_model-00003-of-00007.safetensors",
+            "models/Wan-AI/Wan2.1-T2V-14B/diffusion_pytorch_model-00004-of-00007.safetensors",
+            "models/Wan-AI/Wan2.1-T2V-14B/diffusion_pytorch_model-00005-of-00007.safetensors",
+            "models/Wan-AI/Wan2.1-T2V-14B/diffusion_pytorch_model-00006-of-00007.safetensors",
+            "models/Wan-AI/Wan2.1-T2V-14B/diffusion_pytorch_model-00007-of-00007.safetensors",
+        ],
+        "models/Wan-AI/Wan2.1-T2V-14B/models_clip_open-clip-xlm-roberta-large-vit-huge-14.pth",
+        "models/Wan-AI/Wan2.1-T2V-14B/models_t5_umt5-xxl-enc-bf16.pth",
+        "models/Wan-AI/Wan2.1-T2V-14B/Wan2.1_VAE.pth",
+    ],
+    torch_dtype=torch.float8_e4m3fn, # You can set `torch_dtype=torch.bfloat16` to disable FP8 quantization.
+)
+pipe = WanVideoPipeline.from_model_manager(model_manager, torch_dtype=torch.bfloat16, device="cuda")
+pipe.enable_vram_management(num_persistent_param_in_dit=None) # You can set `num_persistent_param_in_dit` to a small number to reduce VRAM required.
+
+# Text-to-video
+video = pipe(
+    prompt="一名宇航员身穿太空服，面朝镜头骑着一匹机械马在火星表面驰骋。红色的荒凉地表延伸至远方，点缀着巨大的陨石坑和奇特的岩石结构。机械马的步伐稳健，扬起微弱的尘埃，展现出未来科技与原始探索的完美结合。宇航员手持操控装置，目光坚定，仿佛正在开辟人类的新疆域。背景是深邃的宇宙和蔚蓝的地球，画面既科幻又充满希望，让人不禁畅想未来的星际生活。",
+    negative_prompt="色调艳丽，过曝，静态，细节模糊不清，字幕，风格，作品，画作，画面，静止，整体发灰，最差质量，低质量，JPEG压缩残留，丑陋的，残缺的，多余的手指，画得不好的手部，画得不好的脸部，畸形的，毁容的，形态畸形的肢体，手指融合，静止不动的画面，杂乱的背景，三条腿，背景人很多，倒着走",
+    num_inference_steps=50,
+    seed=0, tiled=True
+)
+save_video(video, "video1.mp4", fps=25, quality=5)
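
Review note on the `torch_dtype` and `num_persistent_param_in_dit` comments in the script above: the README in this PR gives measured speed and VRAM for five combinations. A small sketch of how that table could be used to pick a setting for a given VRAM budget follows. The rows are copied from the README; `fastest_setting` is a hypothetical helper, and the numbers come from a single A100, so other hardware may behave differently.

```python
# Rows copied from the table in examples/wanvideo/README.md (single A100).
# Each row: (torch_dtype name, num_persistent_param_in_dit, seconds per iteration, VRAM in GB)
SETTINGS = [
    ("torch.bfloat16", None, 18.5, 40),
    ("torch.bfloat16", 7 * 10**9, 20.8, 24),
    ("torch.bfloat16", 0, 23.4, 10),
    ("torch.float8_e4m3fn", None, 18.3, 24),
    ("torch.float8_e4m3fn", 0, 24.0, 10),
]


def fastest_setting(vram_budget_gb):
    """Return the fastest table row that fits in the budget, or None."""
    fitting = [row for row in SETTINGS if row[3] <= vram_budget_gb]
    return min(fitting, key=lambda row: row[2]) if fitting else None
```

The selection looks only at speed and VRAM. It ignores any difference in output quality between the two dtypes, which the README does not quantify.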