diff --git a/examples/flux/README_zh.md b/examples/flux/README_zh.md
index e4bab2c..9cdaafa 100644
--- a/examples/flux/README_zh.md
+++ b/examples/flux/README_zh.md
@@ -18,7 +18,7 @@ pip install -e .
 
 ## Quick Start
 
-You can quickly load the FLUX.1-dev model and run inference with the following code.
+You can quickly load the [black-forest-labs/FLUX.1-dev](https://www.modelscope.cn/models/black-forest-labs/FLUX.1-dev) model and run inference with the following code.
 
 ```python
 import torch
@@ -41,12 +41,20 @@ image.save("image.jpg")
 
 ## Model Overview
 
-**Support for the FLUX series of models in the new framework is under development. Stay tuned!**
-
-|Model ID|Extra Parameters|Inference|Full Training|Validation after Full Training|LoRA Training|Validation after LoRA Training|
-|-|-|-|-|-|-|-|
-|[black-forest-labs/FLUX.1-dev](https://modelscope.cn/models/black-forest-labs/FLUX.1-dev)||[code](./model_inference/FLUX.1-dev.py)|[code](./model_training/full/FLUX.1-dev.sh)|[code](./model_training/validate_full/FLUX.1-dev.py)|[code](./model_training/lora/FLUX.1-dev.sh)|[code](./model_training/validate_lora/FLUX.1-dev.py)|
-|[black-forest-labs/FLUX.1-Kontext-dev](https://modelscope.cn/models/black-forest-labs/FLUX.1-Kontext-dev)|`kontext_images`|[code](./model_inference/FLUX.1-Kontext-dev.py)|[code](./model_training/full/FLUX.1-Kontext-dev.sh)|[code](./model_training/validate_full/FLUX.1-Kontext-dev.py)|[code](./model_training/lora/FLUX.1-Kontext-dev.sh)|[code](./model_training/validate_lora/FLUX.1-Kontext-dev.py)|
+|Model ID|Extra Parameters|Inference|Low-VRAM Inference|Full Training|Validation after Full Training|LoRA Training|Validation after LoRA Training|
+|-|-|-|-|-|-|-|-|
+|[FLUX.1-dev](https://www.modelscope.cn/models/black-forest-labs/FLUX.1-dev)||[code](./model_inference/FLUX.1-dev.py)|[code](./model_inference_low_vram/FLUX.1-dev.py)|[code](./model_training/full/FLUX.1-dev.sh)|[code](./model_training/validate_full/FLUX.1-dev.py)|[code](./model_training/lora/FLUX.1-dev.sh)|[code](./model_training/validate_lora/FLUX.1-dev.py)|
+|[FLUX.1-Kontext-dev](https://www.modelscope.cn/models/black-forest-labs/FLUX.1-Kontext-dev)|`kontext_images`|[code](./model_inference/FLUX.1-Kontext-dev.py)|[code](./model_inference_low_vram/FLUX.1-Kontext-dev.py)|[code](./model_training/full/FLUX.1-Kontext-dev.sh)|[code](./model_training/validate_full/FLUX.1-Kontext-dev.py)|[code](./model_training/lora/FLUX.1-Kontext-dev.sh)|[code](./model_training/validate_lora/FLUX.1-Kontext-dev.py)|
+|[FLUX.1-dev-Controlnet-Inpainting-Beta](https://www.modelscope.cn/models/alimama-creative/FLUX.1-dev-Controlnet-Inpainting-Beta)|`controlnet_inputs`|[code](./model_inference/FLUX.1-dev-Controlnet-Inpainting-Beta.py)|[code](./model_inference_low_vram/FLUX.1-dev-Controlnet-Inpainting-Beta.py)|[code](./model_training/full/FLUX.1-dev-Controlnet-Inpainting-Beta.sh)|[code](./model_training/validate_full/FLUX.1-dev-Controlnet-Inpainting-Beta.py)|[code](./model_training/lora/FLUX.1-dev-Controlnet-Inpainting-Beta.sh)|[code](./model_training/validate_lora/FLUX.1-dev-Controlnet-Inpainting-Beta.py)|
+|[FLUX.1-dev-Controlnet-Union-alpha](https://www.modelscope.cn/models/InstantX/FLUX.1-dev-Controlnet-Union-alpha)|`controlnet_inputs`|[code](./model_inference/FLUX.1-dev-Controlnet-Union-alpha.py)|[code](./model_inference_low_vram/FLUX.1-dev-Controlnet-Union-alpha.py)|[code](./model_training/full/FLUX.1-dev-Controlnet-Union-alpha.sh)|[code](./model_training/validate_full/FLUX.1-dev-Controlnet-Union-alpha.py)|[code](./model_training/lora/FLUX.1-dev-Controlnet-Union-alpha.sh)|[code](./model_training/validate_lora/FLUX.1-dev-Controlnet-Union-alpha.py)|
+|[FLUX.1-dev-Controlnet-Upscaler](https://www.modelscope.cn/models/jasperai/Flux.1-dev-Controlnet-Upscaler)|`controlnet_inputs`|[code](./model_inference/FLUX.1-dev-Controlnet-Upscaler.py)|[code](./model_inference_low_vram/FLUX.1-dev-Controlnet-Upscaler.py)|[code](./model_training/full/FLUX.1-dev-Controlnet-Upscaler.sh)|[code](./model_training/validate_full/FLUX.1-dev-Controlnet-Upscaler.py)|[code](./model_training/lora/FLUX.1-dev-Controlnet-Upscaler.sh)|[code](./model_training/validate_lora/FLUX.1-dev-Controlnet-Upscaler.py)|
+|[FLUX.1-dev-IP-Adapter](https://www.modelscope.cn/models/InstantX/FLUX.1-dev-IP-Adapter)|`ipadapter_images`, `ipadapter_scale`|[code](./model_inference/FLUX.1-dev-IP-Adapter.py)|[code](./model_inference_low_vram/FLUX.1-dev-IP-Adapter.py)|[code](./model_training/full/FLUX.1-dev-IP-Adapter.sh)|[code](./model_training/validate_full/FLUX.1-dev-IP-Adapter.py)|[code](./model_training/lora/FLUX.1-dev-IP-Adapter.sh)|[code](./model_training/validate_lora/FLUX.1-dev-IP-Adapter.py)|
+|[FLUX.1-dev-InfiniteYou](https://www.modelscope.cn/models/ByteDance/InfiniteYou)|`infinityou_id_image`, `infinityou_guidance`, `controlnet_inputs`|[code](./model_inference/FLUX.1-dev-InfiniteYou.py)|[code](./model_inference_low_vram/FLUX.1-dev-InfiniteYou.py)|[code](./model_training/full/FLUX.1-dev-InfiniteYou.sh)|[code](./model_training/validate_full/FLUX.1-dev-InfiniteYou.py)|[code](./model_training/lora/FLUX.1-dev-InfiniteYou.sh)|[code](./model_training/validate_lora/FLUX.1-dev-InfiniteYou.py)|
+|[FLUX.1-dev-EliGen](https://www.modelscope.cn/models/DiffSynth-Studio/Eligen)|`eligen_entity_prompts`, `eligen_entity_masks`, `eligen_enable_on_negative`, `eligen_enable_inpaint`|[code](./model_inference/FLUX.1-dev-EliGen.py)|[code](./model_inference_low_vram/FLUX.1-dev-EliGen.py)|||||
+|[FLUX.1-dev-LoRA-Encoder](https://www.modelscope.cn/models/DiffSynth-Studio/LoRA-Encoder-FLUX.1-Dev)|`lora_encoder_inputs`, `lora_encoder_scale`|[code](./model_inference/FLUX.1-dev-LoRA-Encoder.py)|[code](./model_inference_low_vram/FLUX.1-dev-LoRA-Encoder.py)|[code](./model_training/full/FLUX.1-dev-LoRA-Encoder.sh)|[code](./model_training/validate_full/FLUX.1-dev-LoRA-Encoder.py)|-|-|
+|[Step1X-Edit](https://www.modelscope.cn/models/stepfun-ai/Step1X-Edit)|`step1x_reference_image`|[code](./model_inference/Step1X-Edit.py)|[code](./model_inference_low_vram/Step1X-Edit.py)|[code](./model_training/full/Step1X-Edit.sh)|[code](./model_training/validate_full/Step1X-Edit.py)|[code](./model_training/lora/Step1X-Edit.sh)|[code](./model_training/validate_lora/Step1X-Edit.py)|
+|[FLEX.2-preview](https://www.modelscope.cn/models/ostris/Flex.2-preview)|`flex_inpaint_image`, `flex_inpaint_mask`, `flex_control_image`, `flex_control_strength`, `flex_control_stop`|[code](./model_inference/FLEX.2-preview.py)|[code](./model_inference_low_vram/FLEX.2-preview.py)|[code](./model_training/full/FLEX.2-preview.sh)|[code](./model_training/validate_full/FLEX.2-preview.py)|[code](./model_training/lora/FLEX.2-preview.sh)|[code](./model_training/validate_lora/FLEX.2-preview.py)|
+|[Nexus-Gen](https://www.modelscope.cn/models/DiffSynth-Studio/Nexus-GenV2)||||||||
 
 ## Model Inference
 
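The "Extra Parameters" column in the table above lists the model-specific inputs each pipeline accepts on top of the common ones; they are passed as keyword arguments when calling the pipeline. A minimal sketch for FLUX.1-Kontext-dev's `kontext_images`, assuming a `pipe` already loaded with the Kontext weights as in the quick start; the prompt, the image path, and the single-image input format are illustrative placeholders, with the linked inference script as the authoritative reference:

```python
from PIL import Image

# `pipe` is a FluxImagePipeline loaded with FLUX.1-Kontext-dev weights (see quick start).
reference = Image.open("reference.jpg")  # placeholder reference image
image = pipe(
    prompt="turn the sky into a starry night",  # placeholder prompt
    seed=0,
    kontext_images=reference,  # model-specific extra parameter from the table above
)
image.save("image_kontext.jpg")
```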
@@ -59,6 +67,9 @@ image.save("image.jpg")
 Models are loaded via `from_pretrained`:
 
 ```python
+import torch
+from diffsynth.pipelines.flux_image_new import FluxImagePipeline, ModelConfig
+
 pipe = FluxImagePipeline.from_pretrained(
     torch_dtype=torch.bfloat16,
     device="cuda",
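With the two imports added above, the documented snippet becomes self-contained. As a reference, a sketch of the assembled flow (the four `ModelConfig` entries mirror the FP8 example in the next hunk, minus the quantization arguments; the prompt is a placeholder):

```python
import torch
from diffsynth.pipelines.flux_image_new import FluxImagePipeline, ModelConfig

# Load the FLUX.1-dev components: DiT, the two text encoders, and the VAE.
pipe = FluxImagePipeline.from_pretrained(
    torch_dtype=torch.bfloat16,
    device="cuda",
    model_configs=[
        ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="flux1-dev.safetensors"),
        ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="text_encoder/model.safetensors"),
        ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="text_encoder_2/"),
        ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="ae.safetensors"),
    ],
)
image = pipe(prompt="a photo of a cat", seed=0)  # placeholder prompt
image.save("image.jpg")
```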
@@ -123,9 +134,41 @@
 pipe.enable_vram_management()
 ```
 
-The `enable_vram_management` function provides the following parameters for controlling VRAM usage:
+FP8 quantization is also supported:
 
-* `vram_limit`: VRAM usage (GB). By default, the remaining VRAM on the device is used. Note that this is not a hard limit: if the configured amount is insufficient for inference but enough VRAM is actually available, inference runs in the minimal-VRAM-usage mode. Setting it to 0 achieves the theoretical minimum VRAM usage.
+```python
+pipe = FluxImagePipeline.from_pretrained(
+    torch_dtype=torch.bfloat16,
+    device="cuda",
+    model_configs=[
+        ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="flux1-dev.safetensors", offload_dtype=torch.float8_e4m3fn),
+        ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="text_encoder/model.safetensors", offload_dtype=torch.float8_e4m3fn),
+        ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="text_encoder_2/", offload_dtype=torch.float8_e4m3fn),
+        ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="ae.safetensors", offload_dtype=torch.float8_e4m3fn),
+    ],
+)
+pipe.enable_vram_management()
+```
+
+FP8 quantization and offloading can be enabled at the same time:
+
+```python
+pipe = FluxImagePipeline.from_pretrained(
+    torch_dtype=torch.bfloat16,
+    device="cuda",
+    model_configs=[
+        ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="flux1-dev.safetensors", offload_device="cpu", offload_dtype=torch.float8_e4m3fn),
+        ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="text_encoder/model.safetensors", offload_device="cpu", offload_dtype=torch.float8_e4m3fn),
+        ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="text_encoder_2/", offload_device="cpu", offload_dtype=torch.float8_e4m3fn),
+        ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="ae.safetensors", offload_device="cpu", offload_dtype=torch.float8_e4m3fn),
+    ],
+)
+pipe.enable_vram_management()
+```
+
+Once VRAM management is enabled, the framework automatically determines the VRAM management policy based on the remaining VRAM on the device. For most FLUX-series models, inference is possible with as little as 8GB of VRAM. The `enable_vram_management` function provides the following parameters for manually controlling the VRAM management policy:
+
+* `vram_limit`: VRAM usage limit (GB). By default, the remaining VRAM on the device is used. Note that this is not a hard limit: if the configured amount is insufficient for inference but enough VRAM is actually available, inference runs in the minimal-VRAM-usage mode. Setting it to 0 achieves the theoretical minimum VRAM usage.
 * `vram_buffer`: VRAM buffer size (GB), 0.5GB by default. A buffer is necessary because some larger neural network layers unpredictably occupy extra VRAM during the onload phase; the theoretical optimum is the VRAM occupied by the largest layer in the model.
 * `num_persistent_param_in_dit`: the number of DiT parameters kept resident in VRAM, unlimited by default. We will remove this parameter in the future; please do not rely on it.
 
@@ -163,6 +206,25 @@ The pipeline accepts the following input parameters at inference time:
 * `controlnet_inputs`: inputs to the ControlNet model.
 * `ipadapter_images`: input images for the IP-Adapter model.
 * `ipadapter_scale`: control strength of the IP-Adapter model.
+* `eligen_entity_prompts`: local entity prompts for the EliGen model.
+* `eligen_entity_masks`: control regions for the local prompts of the EliGen model, in one-to-one correspondence with `eligen_entity_prompts`.
+* `eligen_enable_on_negative`: whether to enable EliGen on the negative prompt side; only takes effect when `cfg_scale > 1`.
+* `eligen_enable_inpaint`: whether to enable EliGen local inpainting.
+* `infinityou_id_image`: face image for the InfiniteYou model.
+* `infinityou_guidance`: control strength of the InfiniteYou model.
+* `flex_inpaint_image`: image used by the FLEX model for inpainting.
+* `flex_inpaint_mask`: inpainting region used by the FLEX model.
+* `flex_control_image`: image used by the FLEX model for structural control.
+* `flex_control_strength`: strength of the FLEX model's structural control.
+* `flex_control_stop`: end point of the FLEX model's structural control; 1 means enabled throughout, 0.5 means enabled for the first half, and 0 means disabled.
+* `step1x_reference_image`: input image for image editing with the Step1X-Edit model.
+* `lora_encoder_inputs`: input to the LoRA encoder, given as a ModelConfig or a local path.
+* `lora_encoder_scale`: activation strength of the LoRA encoder, 1 by default; the smaller the value, the weaker the LoRA activation.
+* `tea_cache_l1_thresh`: TeaCache threshold; larger values run faster but degrade image quality. Note that with TeaCache enabled, inference speed is not uniform, so the remaining time shown on the progress bar becomes inaccurate.
+* `tiled`: whether to enable tiled VAE inference, `False` by default. Setting it to `True` significantly reduces VRAM usage during VAE encoding/decoding, at the cost of slight errors and slightly longer inference time.
+* `tile_size`: tile size in the VAE encoding/decoding phase, 128 by default; only takes effect when `tiled=True`.
+* `tile_stride`: tile stride in the VAE encoding/decoding phase, 64 by default; only takes effect when `tiled=True`, and must be less than or equal to `tile_size`.
+* `progress_bar_cmd`: the progress bar, `tqdm.tqdm` by default. Set it to `lambda x: x` to disable the progress bar.
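Several of the new inputs above are most useful together on small GPUs. A minimal sketch combining the documented VAE tiling parameters with a silenced progress bar; `pipe` is a loaded `FluxImagePipeline` and the prompt is a placeholder:

```python
image = pipe(
    prompt="a snowy mountain village at dusk",  # placeholder prompt
    seed=0,
    tiled=True,                    # tiled VAE encode/decode: less VRAM, slight error, slightly slower
    tile_size=128,                 # documented default; only effective when tiled=True
    tile_stride=64,                # must be <= tile_size
    progress_bar_cmd=lambda x: x,  # replaces tqdm.tqdm to disable the progress bar
)
image.save("image_tiled.jpg")
```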
@@ -190,7 +252,7 @@ FLUX-series models are trained through the unified [`./model_training/train.py`](./model_tra
 * `--model_id_with_origin_paths`: model IDs with original paths, e.g. black-forest-labs/FLUX.1-dev:flux1-dev.safetensors. Comma-separated.
 * Training
   * `--learning_rate`: learning rate.
-  * `--num_epochs`: number of epochs (Epoch) count.
+  * `--num_epochs`: number of epochs (Epoch).
   * `--output_path`: save path.
   * `--remove_prefix_in_ckpt`: prefix to remove in the ckpt.
 * Trainable modules
@@ -205,7 +267,7 @@ FLUX-series models are trained through the unified [`./model_training/train.py`](./model_tra
   * `--use_gradient_checkpointing_offload`: whether to offload gradient checkpointing to CPU memory.
   * `--gradient_accumulation_steps`: number of gradient accumulation steps.
 * Other
-  * `--align_to_opensource_format`: whether to align the FLUX DiT LoRA format with the open-source version; only takes effect for LoRA training of FLUX.1-dev and FLUX.1-Kontext-dev.
+  * `--align_to_opensource_format`: whether to align the FLUX DiT LoRA format with the open-source version; only takes effect for LoRA training.
 
 In addition, the training framework is built on [`accelerate`](https://huggingface.co/docs/accelerate/index); running `accelerate config` before training configures GPU-related settings. For some model training scripts (e.g., full model training), we provide recommended `accelerate` config files, which can be found in the corresponding training scripts.
diff --git a/examples/flux/model_inference/EliGen.py b/examples/flux/model_inference/FLUX.1-dev-EliGen.py
similarity index 100%
rename from examples/flux/model_inference/EliGen.py
rename to examples/flux/model_inference/FLUX.1-dev-EliGen.py
diff --git a/examples/flux/model_inference_low_vram/EliGen.py b/examples/flux/model_inference_low_vram/FLUX.1-dev-EliGen.py
similarity index 100%
rename from examples/flux/model_inference_low_vram/EliGen.py
rename to examples/flux/model_inference_low_vram/FLUX.1-dev-EliGen.py
diff --git a/examples/flux/model_training/full/FLUX.1-dev-LoRA-Encoder.sh b/examples/flux/model_training/full/FLUX.1-dev-LoRA-Encoder.sh
new file mode 100644
index 0000000..f0d4f97
--- /dev/null
+++ b/examples/flux/model_training/full/FLUX.1-dev-LoRA-Encoder.sh
@@ -0,0 +1,14 @@
+accelerate launch examples/flux/model_training/train.py \
+  --dataset_base_path data/example_image_dataset \
+  --dataset_metadata_path data/example_image_dataset/metadata_lora_encoder.csv \
+  --data_file_keys "image" \
+  --max_pixels 1048576 \
+  --dataset_repeat 100 \
+  --model_id_with_origin_paths "black-forest-labs/FLUX.1-dev:flux1-dev.safetensors,black-forest-labs/FLUX.1-dev:text_encoder/model.safetensors,black-forest-labs/FLUX.1-dev:text_encoder_2/,black-forest-labs/FLUX.1-dev:ae.safetensors,DiffSynth-Studio/LoRA-Encoder-FLUX.1-Dev:model.safetensors" \
+  --learning_rate 1e-5 \
+  --num_epochs 1 \
+  --remove_prefix_in_ckpt "pipe.lora_encoder." \
+  --output_path "./models/train/FLUX.1-dev-LoRA-Encoder_full" \
+  --trainable_models "lora_encoder" \
+  --extra_inputs "lora_encoder_inputs" \
+  --use_gradient_checkpointing
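A note on `--remove_prefix_in_ckpt "pipe.lora_encoder."` in this script: the trained module lives at `pipe.lora_encoder`, and stripping that prefix from the checkpoint keys lets the saved weights be loaded back with `pipe.lora_encoder.load_state_dict(...)`, as the validation script below does. Conceptually, the key rewrite amounts to the following hypothetical helper (a sketch, not the trainer's actual code):

```python
import torch

def remove_prefix_in_ckpt(state_dict: dict[str, torch.Tensor], prefix: str) -> dict[str, torch.Tensor]:
    # Keep only parameters under `prefix` and strip it from their keys,
    # e.g. "pipe.lora_encoder.proj.weight" -> "proj.weight".
    return {k[len(prefix):]: v for k, v in state_dict.items() if k.startswith(prefix)}
```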
diff --git a/examples/flux/model_training/validate_full/FLUX.1-dev-LoRA-Encoder.py b/examples/flux/model_training/validate_full/FLUX.1-dev-LoRA-Encoder.py
new file mode 100644
index 0000000..166f5a4
--- /dev/null
+++ b/examples/flux/model_training/validate_full/FLUX.1-dev-LoRA-Encoder.py
@@ -0,0 +1,25 @@
+import torch
+from diffsynth.pipelines.flux_image_new import FluxImagePipeline, ModelConfig
+from diffsynth import load_state_dict
+
+
+pipe = FluxImagePipeline.from_pretrained(
+    torch_dtype=torch.bfloat16,
+    device="cuda",
+    model_configs=[
+        ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="flux1-dev.safetensors"),
+        ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="text_encoder/model.safetensors"),
+        ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="text_encoder_2/"),
+        ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="ae.safetensors"),
+        ModelConfig(model_id="DiffSynth-Studio/LoRA-Encoder-FLUX.1-Dev", origin_file_pattern="model.safetensors"),
+    ],
+)
+pipe.enable_lora_magic()
+state_dict = load_state_dict("models/train/FLUX.1-dev-LoRA-Encoder_full/epoch-0.safetensors")
+pipe.lora_encoder.load_state_dict(state_dict)
+
+lora = ModelConfig(model_id="VoidOc/flux_animal_forest1", origin_file_pattern="20.safetensors")
+pipe.load_lora(pipe.dit, lora, hotload=True)  # Use `pipe.clear_lora()` to drop the loaded LoRA.
+
+image = pipe(prompt="", seed=0, lora_encoder_inputs=lora)
+image.save("image_FLUX.1-dev-LoRA-Encoder_full.jpg")
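Per the inline comment in this script, `hotload=True` pairs with `pipe.clear_lora()`, so LoRAs can be swapped between generations without rebuilding the pipeline. A sketch continuing the script above; the second LoRA's model ID and file are placeholders, and `lora_encoder_scale` is the documented input for weakening the encoder's activation:

```python
# Weaker LoRA-encoder activation for the first LoRA.
image = pipe(prompt="", seed=0, lora_encoder_inputs=lora, lora_encoder_scale=0.8)
image.save("image_lora_a.jpg")

# Drop the hot-loaded LoRA, then hot-load a different one.
pipe.clear_lora()
other_lora = ModelConfig(model_id="some-namespace/another-lora", origin_file_pattern="lora.safetensors")  # placeholder
pipe.load_lora(pipe.dit, other_lora, hotload=True)
image = pipe(prompt="", seed=0, lora_encoder_inputs=other_lora)
image.save("image_lora_b.jpg")
```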
diff --git a/examples/wanvideo/README_zh.md b/examples/wanvideo/README_zh.md
index 860ff83..d9cd43b 100644
--- a/examples/wanvideo/README_zh.md
+++ b/examples/wanvideo/README_zh.md
@@ -18,6 +18,8 @@ pip install -e .
 
 ## Quick Start
 
+You can quickly load the [Wan-AI/Wan2.1-T2V-1.3B](https://www.modelscope.cn/models/Wan-AI/Wan2.1-T2V-1.3B) model and run inference with the following code.
+
 ```python
 import torch
 from diffsynth import save_video
@@ -70,7 +72,6 @@ save_video(video, "video1.mp4", fps=15, quality=5)
 
 The following sections will help you understand our features and write inference code.
 
-
 Loading models
 
@@ -78,6 +79,9 @@ save_video(video, "video1.mp4", fps=15, quality=5)
 Models are loaded via `from_pretrained`:
 
 ```python
+import torch
+from diffsynth.pipelines.wan_video_new import WanVideoPipeline, ModelConfig
+
 pipe = WanVideoPipeline.from_pretrained(
     torch_dtype=torch.bfloat16,
     device="cuda",
@@ -178,9 +182,9 @@
 FP8 quantization greatly reduces VRAM usage but does not speed inference up; under FP8 quantization, some models produce blurry, torn, or distorted images due to insufficient precision, so use FP8 quantization with caution.
 
-The `enable_vram_management` function provides the following parameters for controlling VRAM usage:
+Once VRAM management is enabled, the framework automatically determines the VRAM management policy based on the remaining VRAM on the device. The `enable_vram_management` function provides the following parameters for manually controlling the VRAM management policy:
 
-* `vram_limit`: VRAM usage (GB). By default, the remaining VRAM on the device is used. Note that this is not a hard limit: if the configured amount is insufficient for inference but enough VRAM is actually available, inference runs in the minimal-VRAM-usage mode.
+* `vram_limit`: VRAM usage limit (GB). By default, the remaining VRAM on the device is used. Note that this is not a hard limit: if the configured amount is insufficient for inference but enough VRAM is actually available, inference runs in the minimal-VRAM-usage mode. Setting it to 0 achieves the theoretical minimum VRAM usage.
 * `vram_buffer`: VRAM buffer size (GB), 0.5GB by default. A buffer is necessary because some larger neural network layers unpredictably occupy extra VRAM during the onload phase; the theoretical optimum is the VRAM occupied by the largest layer in the model.
 * `num_persistent_param_in_dit`: the number of DiT parameters kept resident in VRAM, unlimited by default. We will remove this parameter in the future; please do not rely on it.
 
@@ -276,7 +280,7 @@ Wan-series models are trained through the unified [`./model_training/train.py`](./model_trai
 * `--model_id_with_origin_paths`: model IDs with original paths, e.g. Wan-AI/Wan2.1-T2V-1.3B:diffusion_pytorch_model*.safetensors. Comma-separated.
 * Training
   * `--learning_rate`: learning rate.
-  * `--num_epochs`: number of epochs (Epoch) count.
+  * `--num_epochs`: number of epochs (Epoch).
   * `--output_path`: save path.
   * `--remove_prefix_in_ckpt`: prefix to remove in the ckpt.
 * Trainable modules
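Given the FP8 precision caveat above, a cautious middle ground is to quantize only the text encoder while keeping the DiT and VAE in bf16, and to set the manual VRAM knobs explicitly. A sketch under the assumption that Wan2.1-T2V-1.3B's file layout matches the `--model_id_with_origin_paths` example above; the text-encoder and VAE file patterns, and the keyword form of the `enable_vram_management` parameters, are illustrative:

```python
import torch
from diffsynth.pipelines.wan_video_new import WanVideoPipeline, ModelConfig

pipe = WanVideoPipeline.from_pretrained(
    torch_dtype=torch.bfloat16,
    device="cuda",
    model_configs=[
        # DiT and VAE stay in bf16; only the text encoder is FP8-quantized while offloaded.
        ModelConfig(model_id="Wan-AI/Wan2.1-T2V-1.3B", origin_file_pattern="diffusion_pytorch_model*.safetensors", offload_device="cpu"),
        ModelConfig(model_id="Wan-AI/Wan2.1-T2V-1.3B", origin_file_pattern="models_t5_umt5-xxl-enc-bf16.pth", offload_device="cpu", offload_dtype=torch.float8_e4m3fn),
        ModelConfig(model_id="Wan-AI/Wan2.1-T2V-1.3B", origin_file_pattern="Wan2.1_VAE.pth", offload_device="cpu"),
    ],
)
pipe.enable_vram_management(vram_limit=8, vram_buffer=0.5)  # manual policy, per the parameter list above
```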