DiffSynth-Studio 2.0 major update

This commit is contained in:
root
2025-12-04 16:33:07 +08:00
parent afd101f345
commit 72af7122b3
758 changed files with 26462 additions and 2221398 deletions

View File

@@ -1,43 +0,0 @@
# FLUX Aesthetics Enhancement LoRA
## Introduction
This is a LoRA model trained for FLUX.1-dev, which enhances the aesthetic quality of images generated by the model. The improvements include, but are not limited to: rich details, beautiful lighting and shadows, aesthetic composition, and clear visuals. This model does not require any trigger words.
* Paper: https://arxiv.org/abs/2412.12888
* Github: https://github.com/modelscope/DiffSynth-Studio
* Model: [ModelScope](https://www.modelscope.cn/models/DiffSynth-Studio/ArtAug-lora-FLUX.1dev-v1), [HuggingFace](https://huggingface.co/ECNU-CILab/ArtAug-lora-FLUX.1dev-v1)
* Demo: [ModelScope](https://modelscope.cn/aigc/imageGeneration?tab=advanced&versionId=7228&modelType=LoRA&sdVersion=FLUX_1&modelUrl=modelscope%3A%2F%2FDiffSynth-Studio%2FArtAug-lora-FLUX.1dev-v1%3Frevision%3Dv1.0), HuggingFace (Coming soon)
## Methodology
![workflow](https://github.com/user-attachments/assets/cee969af-d49f-4480-911c-bedc1c095f9b)
The ArtAug project is inspired by reasoning approaches like GPT-o1, which rely on model interaction and self-correction. We developed a framework aimed at enhancing the capabilities of image generation models through interaction with image understanding models. The training process of ArtAug consists of the following steps:
1. **Synthesis-Understanding Interaction**: After generating an image using the image generation model, we employ a multimodal large language model (Qwen2-VL-72B) to analyze the image content and provide suggestions for modifications, which then lead to the regeneration of a higher quality image.
2. **Data Generation and Filtering**: Interactive generation involves long inference times and sometimes produce poor image content. Therefore, we generate a large batch of image pairs offline, filter them, and use them for subsequent training.
3. **Differential Training**: We apply differential training techniques to train a LoRA model, enabling it to learn the differences between images before and after enhancement, rather than directly training on the dataset of enhanced images.
4. **Iterative Enhancement**: The trained LoRA model is fused into the base model, and the entire process is repeated multiple times with the fused model until the interaction algorithm no longer provides significant enhancements. The LoRA models produced in each iteration are combined to produce this final model.
This model integrates the aesthetic understanding of Qwen2-VL-72B into FLUX.1[dev], leading to an improvement in the quality of generated images.
## Usage
Please see [./artaug_flux.py](./artaug_flux.py) for more details.
Since this model is encapsulated in the universal FLUX LoRA format, it can be loaded by most LoRA loaders, allowing you to integrate this LoRA model into your own workflow.
## Examples
|FLUX.1-dev|FLUX.1-dev + ArtAug LoRA|
|-|-|
|![image_1_base](https://github.com/user-attachments/assets/e1d5c505-b423-45fe-be01-25c2758f5417)|![image_1_enhance](https://github.com/user-attachments/assets/335908e3-d0bd-41c2-9d99-d10528a2d719)|
|![image_2_base](https://github.com/user-attachments/assets/7f38e8d4-3c62-492e-bd96-be60f0855037)|![image_2_enhance](https://github.com/user-attachments/assets/ae3a1daf-7a7c-44fd-bdbc-1d2a83bc3de3)|
|![image_3_base](https://github.com/user-attachments/assets/e2ae4879-9202-45d6-9df7-fbcbd2093d19)|![image_3_enhance](https://github.com/user-attachments/assets/4df6e5b9-65de-408b-88c6-51db39aad801)|
|![image_4_base](https://github.com/user-attachments/assets/dbc65387-60df-4a18-b1bb-45eaa5be5c1d)|![image_4_enhance](https://github.com/user-attachments/assets/fc19860d-3e28-468b-b013-8745255ac6db)|
|![image_5_base](https://github.com/user-attachments/assets/bb65c1ba-c0c6-4d3b-b3ef-bdbbb5f03a48)|![image_5_enhance](https://github.com/user-attachments/assets/03570c62-9a0b-428f-8c86-6e01c1421202)|
|![image_6_base](https://github.com/user-attachments/assets/18e9a4e7-2afd-4ca9-bc49-7736042c25dc)|![image_6_enhance](https://github.com/user-attachments/assets/aa73571f-098a-4e65-9eda-b9729ba379cd)|

View File

@@ -1,14 +0,0 @@
import torch
from diffsynth import ModelManager, FluxImagePipeline, download_customized_models
lora_path = download_customized_models(
model_id="DiffSynth-Studio/ArtAug-lora-FLUX.1dev-v1",
origin_file_path="merged_lora.safetensors",
local_dir="models/lora"
)[0]
model_manager = ModelManager(torch_dtype=torch.bfloat16, device="cuda", model_id_list=["FLUX.1-dev"])
model_manager.load_lora(lora_path, lora_alpha=1.0)
pipe = FluxImagePipeline.from_model_manager(model_manager)
image = pipe(prompt="a house", seed=0)
image.save("image_artaug.jpg")

View File

@@ -1,39 +0,0 @@
# CogVideoX
### Example: Text-to-Video using CogVideoX-5B (Experimental)
See [cogvideo_text_to_video.py](cogvideo_text_to_video.py).
First, we generate a video using prompt "an astronaut riding a horse on Mars".
https://github.com/user-attachments/assets/4c91c1cd-e4a0-471a-bd8d-24d761262941
Then, we convert the astronaut to a robot.
https://github.com/user-attachments/assets/225a00a4-2bc8-4740-8e86-a64b460a29ec
Upscale the video using the model itself.
https://github.com/user-attachments/assets/c02cb30c-de60-473c-8242-32c67b3155ad
Make the video look smoother by interpolating frames.
https://github.com/user-attachments/assets/f0e465b4-45df-4435-ab10-7a084ca2b0a0
Here is another example.
First, we generate a video using prompt "a dog is running".
https://github.com/user-attachments/assets/e3696297-99f5-4d0c-a5ca-1d1566db85b4
Then, we add a blue collar to the dog.
https://github.com/user-attachments/assets/7ff22be7-4390-4d33-ae6c-53f6f056e18d
Upscale the video using the model itself.
https://github.com/user-attachments/assets/a909c32c-0b7d-495c-a53c-d23a99a3d3e9
Make the video look smoother by interpolating frames.
https://github.com/user-attachments/assets/ea37c150-97a0-4858-8003-0c2e5eef3331

View File

@@ -1,73 +0,0 @@
from diffsynth import ModelManager, save_video, VideoData, download_models, CogVideoPipeline
from diffsynth.extensions.RIFE import RIFEInterpolater
import torch, os
os.environ["TOKENIZERS_PARALLELISM"] = "True"
def text_to_video(model_manager, prompt, seed, output_path):
pipe = CogVideoPipeline.from_model_manager(model_manager)
torch.manual_seed(seed)
video = pipe(
prompt=prompt,
height=480, width=720,
cfg_scale=7.0, num_inference_steps=200
)
save_video(video, output_path, fps=8, quality=5)
def edit_video(model_manager, prompt, seed, input_path, output_path):
pipe = CogVideoPipeline.from_model_manager(model_manager)
input_video = VideoData(video_file=input_path)
torch.manual_seed(seed)
video = pipe(
prompt=prompt,
height=480, width=720,
cfg_scale=7.0, num_inference_steps=200,
input_video=input_video, denoising_strength=0.7
)
save_video(video, output_path, fps=8, quality=5)
def self_upscale(model_manager, prompt, seed, input_path, output_path):
pipe = CogVideoPipeline.from_model_manager(model_manager)
input_video = VideoData(video_file=input_path, height=480*2, width=720*2).raw_data()
torch.manual_seed(seed)
video = pipe(
prompt=prompt,
height=480*2, width=720*2,
cfg_scale=7.0, num_inference_steps=30,
input_video=input_video, denoising_strength=0.4, tiled=True
)
save_video(video, output_path, fps=8, quality=7)
def interpolate_video(model_manager, input_path, output_path):
rife = RIFEInterpolater.from_model_manager(model_manager)
video = VideoData(video_file=input_path).raw_data()
video = rife.interpolate(video, num_iter=2)
save_video(video, output_path, fps=32, quality=5)
download_models(["CogVideoX-5B", "RIFE"])
model_manager = ModelManager(torch_dtype=torch.bfloat16)
model_manager.load_models([
"models/CogVideo/CogVideoX-5b/text_encoder",
"models/CogVideo/CogVideoX-5b/transformer",
"models/CogVideo/CogVideoX-5b/vae/diffusion_pytorch_model.safetensors",
"models/RIFE/flownet.pkl",
])
# Example 1
text_to_video(model_manager, "an astronaut riding a horse on Mars.", 0, "1_video_1.mp4")
edit_video(model_manager, "a white robot riding a horse on Mars.", 1, "1_video_1.mp4", "1_video_2.mp4")
self_upscale(model_manager, "a white robot riding a horse on Mars.", 2, "1_video_2.mp4", "1_video_3.mp4")
interpolate_video(model_manager, "1_video_3.mp4", "1_video_4.mp4")
# Example 2
text_to_video(model_manager, "a dog is running.", 1, "2_video_1.mp4")
edit_video(model_manager, "a dog with blue collar.", 2, "2_video_1.mp4", "2_video_2.mp4")
self_upscale(model_manager, "a dog with blue collar.", 3, "2_video_2.mp4", "2_video_3.mp4")
interpolate_video(model_manager, "2_video_3.mp4", "2_video_4.mp4")

View File

@@ -1,91 +0,0 @@
# ControlNet
We provide extensive ControlNet support. Taking the FLUX model as an example, we support many different ControlNet models that can be freely combined, even if their structures differ. Additionally, ControlNet models are compatible with high-resolution refinement and partition control techniques, enabling very powerful controllable image generation.
These examples are in [`flux_controlnet.py`](./flux_controlnet.py).
## Canny/Depth/Normal: Structure Control
Structural control is the most fundamental capability of the ControlNet model. By using Canny to extract edge information, or by utilizing depth maps and normal maps, we can extract the structure of an image, which can then serve as control information during the image generation process.
Model link: https://modelscope.cn/models/InstantX/FLUX.1-dev-Controlnet-Union-alpha
For example, if we generate an image of a cat and use a model like InstantX/FLUX.1-dev-Controlnet-Union-alpha that supports multiple control conditions, we can simultaneously enable both Canny and Depth controls to transform the environment into a twilight setting.
|![image_5](https://github.com/user-attachments/assets/19d2abc4-36ae-4163-a8da-df5732d1a737)|![image_6](https://github.com/user-attachments/assets/28378271-3782-484c-bd51-3d3311dd85c6)|
|-|-|
The control strength of ControlNet for structure can be adjusted. For example, in the case below, when we move the girl from summer to winter, we can appropriately lower the control strength of ControlNet so that the model will adapt to the content of the image and change her into warm clothes.
|![image_7](https://github.com/user-attachments/assets/a7b8555b-bfd9-4e92-aa77-16bca81b07e3)|![image_8](https://github.com/user-attachments/assets/a1bab36b-6cce-4f29-8233-4cb824b524a8)|
|-|-|
## Upscaler/Tile/Blur: High-Resolution Image Synthesis
There are many ControlNet models that support high definition, such as:
Model link: https://modelscope.cn/models/jasperai/Flux.1-dev-Controlnet-Upscaler, https://modelscope.cn/models/InstantX/FLUX.1-dev-Controlnet-Union-alpha, https://modelscope.cn/models/Shakker-Labs/FLUX.1-dev-ControlNet-Union-Pro
These models can transform blurry, noisy low-quality images into clear ones. In DiffSynth-Studio, the native high-resolution patch processing technology supported by the framework can overcome the resolution limitations of the models, enabling image generation at resolutions of 2048 or even higher, significantly enhancing the capabilities of these models. In the example below, we can see that in the high-definition image enlarged to 2048 resolution, the cat's fur is rendered in exquisite detail, and the skin texture of the characters is delicate and realistic.
|![image_1](https://github.com/user-attachments/assets/9038158a-118c-4ad7-ab01-22865f6a06fc)|![image_2](https://github.com/user-attachments/assets/88583a33-cd74-4cb9-8fd4-c6e14c0ada0c)|
|-|-|
|![image_3](https://github.com/user-attachments/assets/13061ecf-bb57-448a-82c6-7e4655c9cd85)|![image_4](https://github.com/user-attachments/assets/0b7ae80f-de58-4d1d-a49c-ad17e7631bdc)|
|-|-|
## Inpaint: Image Restoration
The Inpaint ControlNet model can repaint specific areas in an image. For example, we can put sunglasses on a cat.
Model link: https://modelscope.cn/models/alimama-creative/FLUX.1-dev-Controlnet-Inpainting-Beta
|![image_9](https://github.com/user-attachments/assets/babddad0-2d67-4624-b77a-c953250ebdab)|![mask_9](https://github.com/user-attachments/assets/d5bc2878-1817-457a-bdfa-200f955233d3)|![image_10](https://github.com/user-attachments/assets/e3197f2c-190b-4522-83ab-a2e0451b39f6)|
|-|-|-|
However, we noticed that the head movements of the cat have changed. If we want to preserve the original structural features, we can use the Canny, Depth, and Normal models. DiffSynth-Studio provides seamless support for ControlNet of different structures. By using a Normal ControlNet, we can ensure that the structure of the image remains unchanged during local redrawing.
Model link: https://modelscope.cn/models/jasperai/Flux.1-dev-Controlnet-Surface-Normals
|![image_11](https://github.com/user-attachments/assets/c028e6fc-5125-4cba-b35a-b6211c2e6600)|![mask_11](https://github.com/user-attachments/assets/1928ee9a-7594-4c6e-9c71-5bd0b043d8f4)|![image_12](https://github.com/user-attachments/assets/97b3b9e1-f821-405e-971b-9e1c31a209aa)|
|-|-|-|
## MultiControlNet+MultiDiffusion: Fine-Grained Control
DiffSynth-Studio not only supports the simultaneous activation of multiple ControlNet structures, but also allows for the partitioned control of content within an image using different prompts. Additionally, it supports the chunk processing of ultra-high-resolution large images, enabling us to achieve extremely detailed high-level control. Next, we will showcase the creative process behind a beautiful image.
First, use the prompt "a beautiful Asian woman and a cat on a bed. The woman wears a dress" to generate a cat and a young girl.
![image_13](https://github.com/user-attachments/assets/8da006e4-0e68-4fa5-b407-31ef5dbe8e5a)
Then, enable Inpaint ControlNet and Canny ControlNet.
Model link: https://modelscope.cn/models/alimama-creative/FLUX.1-dev-Controlnet-Inpainting-Beta, https://modelscope.cn/models/InstantX/FLUX.1-dev-Controlnet-Union-alpha
We control the image using two component.
|Prompt: an orange cat, highly detailed|Prompt: a girl wearing a red camisole|
|-|-|
|![mask_13_1](https://github.com/user-attachments/assets/188530a0-913c-48db-a7f1-62f0384bfdc3)|![mask_13_2](https://github.com/user-attachments/assets/99c4d0d5-8cc3-47a0-8e56-ceb37db4dfdc)|
Generate!
![image_14](https://github.com/user-attachments/assets/f5b9d3dd-a690-4597-91a8-a019c6fc2523)
The background is a bit blurry, so we use deblurring LoRA for image-to-image generation.
Model link: https://modelscope.cn/models/LiblibAI/FLUX.1-dev-LoRA-AntiBlur
![image_15](https://github.com/user-attachments/assets/32ed2667-2260-4d80-aaa9-4435d6920a2a)
The entire image is much clearer now. Next, let's use the high-definition model to increase the resolution to 4096*4096!
Model link: https://modelscope.cn/models/jasperai/Flux.1-dev-Controlnet-Upscaler
![image_17](https://github.com/user-attachments/assets/1a688a12-1544-4973-8aca-aa3a23cb34c1)
Zoom in to see details.
![image_17_cropped](https://github.com/user-attachments/assets/461a1fbc-9ffa-4da5-80fd-e1af9667c804)
Enjoy!

View File

@@ -1,299 +0,0 @@
from diffsynth import ModelManager, FluxImagePipeline, ControlNetConfigUnit, download_models, download_customized_models
import torch
from PIL import Image
import numpy as np
def example_1():
model_manager = ModelManager(torch_dtype=torch.bfloat16, model_id_list=["FLUX.1-dev", "jasperai/Flux.1-dev-Controlnet-Upscaler"])
pipe = FluxImagePipeline.from_model_manager(model_manager, controlnet_config_units=[
ControlNetConfigUnit(
processor_id="tile",
model_path="models/ControlNet/jasperai/Flux.1-dev-Controlnet-Upscaler/diffusion_pytorch_model.safetensors",
scale=0.7
),
])
image_1 = pipe(
prompt="a photo of a cat, highly detailed",
height=768, width=768,
seed=0
)
image_1.save("image_1.jpg")
image_2 = pipe(
prompt="a photo of a cat, highly detailed",
controlnet_image=image_1.resize((2048, 2048)),
input_image=image_1.resize((2048, 2048)), denoising_strength=0.99,
height=2048, width=2048, tiled=True,
seed=1
)
image_2.save("image_2.jpg")
def example_2():
model_manager = ModelManager(torch_dtype=torch.bfloat16, model_id_list=["FLUX.1-dev", "jasperai/Flux.1-dev-Controlnet-Upscaler"])
pipe = FluxImagePipeline.from_model_manager(model_manager, controlnet_config_units=[
ControlNetConfigUnit(
processor_id="tile",
model_path="models/ControlNet/jasperai/Flux.1-dev-Controlnet-Upscaler/diffusion_pytorch_model.safetensors",
scale=0.7
),
])
image_1 = pipe(
prompt="a beautiful Chinese girl, delicate skin texture",
height=768, width=768,
seed=2
)
image_1.save("image_3.jpg")
image_2 = pipe(
prompt="a beautiful Chinese girl, delicate skin texture",
controlnet_image=image_1.resize((2048, 2048)),
input_image=image_1.resize((2048, 2048)), denoising_strength=0.99,
height=2048, width=2048, tiled=True,
seed=3
)
image_2.save("image_4.jpg")
def example_3():
model_manager = ModelManager(torch_dtype=torch.bfloat16, model_id_list=["FLUX.1-dev", "InstantX/FLUX.1-dev-Controlnet-Union-alpha"])
pipe = FluxImagePipeline.from_model_manager(model_manager, controlnet_config_units=[
ControlNetConfigUnit(
processor_id="canny",
model_path="models/ControlNet/InstantX/FLUX.1-dev-Controlnet-Union-alpha/diffusion_pytorch_model.safetensors",
scale=0.3
),
ControlNetConfigUnit(
processor_id="depth",
model_path="models/ControlNet/InstantX/FLUX.1-dev-Controlnet-Union-alpha/diffusion_pytorch_model.safetensors",
scale=0.3
),
])
image_1 = pipe(
prompt="a cat is running",
height=1024, width=1024,
seed=4
)
image_1.save("image_5.jpg")
image_2 = pipe(
prompt="sunshine, a cat is running",
controlnet_image=image_1,
height=1024, width=1024,
seed=5
)
image_2.save("image_6.jpg")
def example_4():
model_manager = ModelManager(torch_dtype=torch.bfloat16, model_id_list=["FLUX.1-dev", "InstantX/FLUX.1-dev-Controlnet-Union-alpha"])
pipe = FluxImagePipeline.from_model_manager(model_manager, controlnet_config_units=[
ControlNetConfigUnit(
processor_id="canny",
model_path="models/ControlNet/InstantX/FLUX.1-dev-Controlnet-Union-alpha/diffusion_pytorch_model.safetensors",
scale=0.3
),
ControlNetConfigUnit(
processor_id="depth",
model_path="models/ControlNet/InstantX/FLUX.1-dev-Controlnet-Union-alpha/diffusion_pytorch_model.safetensors",
scale=0.3
),
])
image_1 = pipe(
prompt="a beautiful Asian girl, full body, red dress, summer",
height=1024, width=1024,
seed=6
)
image_1.save("image_7.jpg")
image_2 = pipe(
prompt="a beautiful Asian girl, full body, red dress, winter",
controlnet_image=image_1,
height=1024, width=1024,
seed=7
)
image_2.save("image_8.jpg")
def example_5():
model_manager = ModelManager(torch_dtype=torch.bfloat16, model_id_list=["FLUX.1-dev", "alimama-creative/FLUX.1-dev-Controlnet-Inpainting-Beta"])
pipe = FluxImagePipeline.from_model_manager(model_manager, controlnet_config_units=[
ControlNetConfigUnit(
processor_id="inpaint",
model_path="models/ControlNet/alimama-creative/FLUX.1-dev-Controlnet-Inpainting-Beta/diffusion_pytorch_model.safetensors",
scale=0.9
),
])
image_1 = pipe(
prompt="a cat sitting on a chair",
height=1024, width=1024,
seed=8
)
image_1.save("image_9.jpg")
mask = np.zeros((1024, 1024, 3), dtype=np.uint8)
mask[100:350, 350: -300] = 255
mask = Image.fromarray(mask)
mask.save("mask_9.jpg")
image_2 = pipe(
prompt="a cat sitting on a chair, wearing sunglasses",
controlnet_image=image_1, controlnet_inpaint_mask=mask,
height=1024, width=1024,
seed=9
)
image_2.save("image_10.jpg")
def example_6():
model_manager = ModelManager(torch_dtype=torch.bfloat16, model_id_list=[
"FLUX.1-dev",
"jasperai/Flux.1-dev-Controlnet-Surface-Normals",
"alimama-creative/FLUX.1-dev-Controlnet-Inpainting-Beta"
])
pipe = FluxImagePipeline.from_model_manager(model_manager, controlnet_config_units=[
ControlNetConfigUnit(
processor_id="inpaint",
model_path="models/ControlNet/alimama-creative/FLUX.1-dev-Controlnet-Inpainting-Beta/diffusion_pytorch_model.safetensors",
scale=0.9
),
ControlNetConfigUnit(
processor_id="normal",
model_path="models/ControlNet/jasperai/Flux.1-dev-Controlnet-Surface-Normals/diffusion_pytorch_model.safetensors",
scale=0.6
),
])
image_1 = pipe(
prompt="a beautiful Asian woman looking at the sky, wearing a blue t-shirt.",
height=1024, width=1024,
seed=10
)
image_1.save("image_11.jpg")
mask = np.zeros((1024, 1024, 3), dtype=np.uint8)
mask[-400:, 10:-40] = 255
mask = Image.fromarray(mask)
mask.save("mask_11.jpg")
image_2 = pipe(
prompt="a beautiful Asian woman looking at the sky, wearing a yellow t-shirt.",
controlnet_image=image_1, controlnet_inpaint_mask=mask,
height=1024, width=1024,
seed=11
)
image_2.save("image_12.jpg")
def example_7():
model_manager = ModelManager(torch_dtype=torch.bfloat16, model_id_list=[
"FLUX.1-dev",
"InstantX/FLUX.1-dev-Controlnet-Union-alpha",
"alimama-creative/FLUX.1-dev-Controlnet-Inpainting-Beta",
"jasperai/Flux.1-dev-Controlnet-Upscaler",
])
pipe = FluxImagePipeline.from_model_manager(model_manager, controlnet_config_units=[
ControlNetConfigUnit(
processor_id="inpaint",
model_path="models/ControlNet/alimama-creative/FLUX.1-dev-Controlnet-Inpainting-Beta/diffusion_pytorch_model.safetensors",
scale=0.9
),
ControlNetConfigUnit(
processor_id="canny",
model_path="models/ControlNet/InstantX/FLUX.1-dev-Controlnet-Union-alpha/diffusion_pytorch_model.safetensors",
scale=0.5
),
])
image_1 = pipe(
prompt="a beautiful Asian woman and a cat on a bed. The woman wears a dress.",
height=1024, width=1024,
seed=100
)
image_1.save("image_13.jpg")
mask_global = np.zeros((1024, 1024, 3), dtype=np.uint8)
mask_global = Image.fromarray(mask_global)
mask_global.save("mask_13_global.jpg")
mask_1 = np.zeros((1024, 1024, 3), dtype=np.uint8)
mask_1[300:-100, 30: 450] = 255
mask_1 = Image.fromarray(mask_1)
mask_1.save("mask_13_1.jpg")
mask_2 = np.zeros((1024, 1024, 3), dtype=np.uint8)
mask_2[500:-100, -400:] = 255
mask_2[-200:-100, -500:-400] = 255
mask_2 = Image.fromarray(mask_2)
mask_2.save("mask_13_2.jpg")
image_2 = pipe(
prompt="a beautiful Asian woman and a cat on a bed. The woman wears a dress.",
controlnet_image=image_1, controlnet_inpaint_mask=mask_global,
local_prompts=["an orange cat, highly detailed", "a girl wearing a red camisole"], masks=[mask_1, mask_2], mask_scales=[10.0, 10.0],
height=1024, width=1024,
seed=101
)
image_2.save("image_14.jpg")
model_manager.load_lora("models/lora/FLUX-dev-lora-AntiBlur.safetensors", lora_alpha=2)
image_3 = pipe(
prompt="a beautiful Asian woman wearing a red camisole and an orange cat on a bed. clear background.",
negative_prompt="blur, blurry",
input_image=image_2, denoising_strength=0.7,
height=1024, width=1024,
cfg_scale=2.0, num_inference_steps=50,
seed=102
)
image_3.save("image_15.jpg")
pipe = FluxImagePipeline.from_model_manager(model_manager, controlnet_config_units=[
ControlNetConfigUnit(
processor_id="tile",
model_path="models/ControlNet/jasperai/Flux.1-dev-Controlnet-Upscaler/diffusion_pytorch_model.safetensors",
scale=0.7
),
])
image_4 = pipe(
prompt="a beautiful Asian woman wearing a red camisole and an orange cat on a bed. highly detailed, delicate skin texture, clear background.",
controlnet_image=image_3.resize((2048, 2048)),
input_image=image_3.resize((2048, 2048)), denoising_strength=0.99,
height=2048, width=2048, tiled=True,
seed=103
)
image_4.save("image_16.jpg")
image_5 = pipe(
prompt="a beautiful Asian woman wearing a red camisole and an orange cat on a bed. highly detailed, delicate skin texture, clear background.",
controlnet_image=image_4.resize((4096, 4096)),
input_image=image_4.resize((4096, 4096)), denoising_strength=0.99,
height=4096, width=4096, tiled=True,
seed=104
)
image_5.save("image_17.jpg")
download_models(["Annotators:Depth", "Annotators:Normal"])
download_customized_models(
model_id="LiblibAI/FLUX.1-dev-LoRA-AntiBlur",
origin_file_path="FLUX-dev-lora-AntiBlur.safetensors",
local_dir="models/lora"
)
example_1()
example_2()
example_3()
example_4()
example_5()
example_6()
example_7()

View File

@@ -1,447 +0,0 @@
from diffsynth import ModelManager, FluxImagePipeline, ControlNetConfigUnit, download_models, download_customized_models
import torch
from PIL import Image
import numpy as np
def example_1():
download_models(["FLUX.1-dev", "jasperai/Flux.1-dev-Controlnet-Upscaler"])
model_manager = ModelManager(
torch_dtype=torch.bfloat16,
device="cpu"
)
model_manager.load_models([
"models/FLUX/FLUX.1-dev/text_encoder/model.safetensors",
"models/FLUX/FLUX.1-dev/text_encoder_2",
"models/FLUX/FLUX.1-dev/ae.safetensors",
])
model_manager.load_models(
["models/FLUX/FLUX.1-dev/flux1-dev.safetensors"],
torch_dtype=torch.float8_e4m3fn
)
model_manager.load_models(
["models/ControlNet/jasperai/Flux.1-dev-Controlnet-Upscaler/diffusion_pytorch_model.safetensors"],
torch_dtype=torch.float8_e4m3fn
)
pipe = FluxImagePipeline.from_model_manager(model_manager, controlnet_config_units=[
ControlNetConfigUnit(
processor_id="tile",
model_path="models/ControlNet/jasperai/Flux.1-dev-Controlnet-Upscaler/diffusion_pytorch_model.safetensors",
scale=0.7
),
],device="cuda")
pipe.enable_cpu_offload()
pipe.dit.quantize()
for model in pipe.controlnet.models:
model.quantize()
image_1 = pipe(
prompt="a photo of a cat, highly detailed",
height=768, width=768,
seed=0
)
image_1.save("image_1.jpg")
image_2 = pipe(
prompt="a photo of a cat, highly detailed",
controlnet_image=image_1.resize((2048, 2048)),
input_image=image_1.resize((2048, 2048)), denoising_strength=0.99,
height=2048, width=2048, tiled=True,
seed=1
)
image_2.save("image_2.jpg")
def example_2():
download_models(["FLUX.1-dev", "jasperai/Flux.1-dev-Controlnet-Upscaler"])
model_manager = ModelManager(
torch_dtype=torch.bfloat16,
device="cpu"
)
model_manager.load_models([
"models/FLUX/FLUX.1-dev/text_encoder/model.safetensors",
"models/FLUX/FLUX.1-dev/text_encoder_2",
"models/FLUX/FLUX.1-dev/ae.safetensors",
])
model_manager.load_models(
["models/FLUX/FLUX.1-dev/flux1-dev.safetensors"],
torch_dtype=torch.float8_e4m3fn
)
model_manager.load_models(
["models/ControlNet/jasperai/Flux.1-dev-Controlnet-Upscaler/diffusion_pytorch_model.safetensors"],
torch_dtype=torch.float8_e4m3fn
)
pipe = FluxImagePipeline.from_model_manager(model_manager, controlnet_config_units=[
ControlNetConfigUnit(
processor_id="tile",
model_path="models/ControlNet/jasperai/Flux.1-dev-Controlnet-Upscaler/diffusion_pytorch_model.safetensors",
scale=0.7
),
],device="cuda")
pipe.enable_cpu_offload()
pipe.dit.quantize()
for model in pipe.controlnet.models:
model.quantize()
image_1 = pipe(
prompt="a beautiful Chinese girl, delicate skin texture",
height=768, width=768,
seed=2
)
image_1.save("image_3.jpg")
image_2 = pipe(
prompt="a beautiful Chinese girl, delicate skin texture",
controlnet_image=image_1.resize((2048, 2048)),
input_image=image_1.resize((2048, 2048)), denoising_strength=0.99,
height=2048, width=2048, tiled=True,
seed=3
)
image_2.save("image_4.jpg")
def example_3():
download_models(["FLUX.1-dev", "InstantX/FLUX.1-dev-Controlnet-Union-alpha"])
model_manager = ModelManager(
torch_dtype=torch.bfloat16,
device="cpu"
)
model_manager.load_models([
"models/FLUX/FLUX.1-dev/text_encoder/model.safetensors",
"models/FLUX/FLUX.1-dev/text_encoder_2",
"models/FLUX/FLUX.1-dev/ae.safetensors",
])
model_manager.load_models(
["models/FLUX/FLUX.1-dev/flux1-dev.safetensors"],
torch_dtype=torch.float8_e4m3fn
)
model_manager.load_models(
["models/ControlNet/InstantX/FLUX.1-dev-Controlnet-Union-alpha/diffusion_pytorch_model.safetensors"],
torch_dtype=torch.float8_e4m3fn
)
pipe = FluxImagePipeline.from_model_manager(model_manager, controlnet_config_units=[
ControlNetConfigUnit(
processor_id="canny",
model_path="models/ControlNet/InstantX/FLUX.1-dev-Controlnet-Union-alpha/diffusion_pytorch_model.safetensors",
scale=0.3
),
ControlNetConfigUnit(
processor_id="depth",
model_path="models/ControlNet/InstantX/FLUX.1-dev-Controlnet-Union-alpha/diffusion_pytorch_model.safetensors",
scale=0.3
),
],device="cuda")
pipe.enable_cpu_offload()
pipe.dit.quantize()
for model in pipe.controlnet.models:
model.quantize()
image_1 = pipe(
prompt="a cat is running",
height=1024, width=1024,
seed=4
)
image_1.save("image_5.jpg")
image_2 = pipe(
prompt="sunshine, a cat is running",
controlnet_image=image_1,
height=1024, width=1024,
seed=5
)
image_2.save("image_6.jpg")
def example_4():
download_models(["FLUX.1-dev", "InstantX/FLUX.1-dev-Controlnet-Union-alpha"])
model_manager = ModelManager(
torch_dtype=torch.bfloat16,
device="cpu"
)
model_manager.load_models([
"models/FLUX/FLUX.1-dev/text_encoder/model.safetensors",
"models/FLUX/FLUX.1-dev/text_encoder_2",
"models/FLUX/FLUX.1-dev/ae.safetensors",
])
model_manager.load_models(
["models/FLUX/FLUX.1-dev/flux1-dev.safetensors"],
torch_dtype=torch.float8_e4m3fn
)
model_manager.load_models(
["models/ControlNet/InstantX/FLUX.1-dev-Controlnet-Union-alpha/diffusion_pytorch_model.safetensors"],
torch_dtype=torch.float8_e4m3fn
)
pipe = FluxImagePipeline.from_model_manager(model_manager, controlnet_config_units=[
ControlNetConfigUnit(
processor_id="canny",
model_path="models/ControlNet/InstantX/FLUX.1-dev-Controlnet-Union-alpha/diffusion_pytorch_model.safetensors",
scale=0.3
),
ControlNetConfigUnit(
processor_id="depth",
model_path="models/ControlNet/InstantX/FLUX.1-dev-Controlnet-Union-alpha/diffusion_pytorch_model.safetensors",
scale=0.3
),
],device="cuda")
pipe.enable_cpu_offload()
pipe.dit.quantize()
for model in pipe.controlnet.models:
model.quantize()
image_1 = pipe(
prompt="a beautiful Asian girl, full body, red dress, summer",
height=1024, width=1024,
seed=6
)
image_1.save("image_7.jpg")
image_2 = pipe(
prompt="a beautiful Asian girl, full body, red dress, winter",
controlnet_image=image_1,
height=1024, width=1024,
seed=7
)
image_2.save("image_8.jpg")
def example_5():
download_models(["FLUX.1-dev", "alimama-creative/FLUX.1-dev-Controlnet-Inpainting-Beta"])
model_manager = ModelManager(
torch_dtype=torch.bfloat16,
device="cpu"
)
model_manager.load_models([
"models/FLUX/FLUX.1-dev/text_encoder/model.safetensors",
"models/FLUX/FLUX.1-dev/text_encoder_2",
"models/FLUX/FLUX.1-dev/ae.safetensors",
])
model_manager.load_models(
["models/FLUX/FLUX.1-dev/flux1-dev.safetensors"],
torch_dtype=torch.float8_e4m3fn
)
model_manager.load_models(
["models/ControlNet/alimama-creative/FLUX.1-dev-Controlnet-Inpainting-Beta/diffusion_pytorch_model.safetensors"],
torch_dtype=torch.float8_e4m3fn
)
pipe = FluxImagePipeline.from_model_manager(model_manager, controlnet_config_units=[
ControlNetConfigUnit(
processor_id="inpaint",
model_path="models/ControlNet/alimama-creative/FLUX.1-dev-Controlnet-Inpainting-Beta/diffusion_pytorch_model.safetensors",
scale=0.9
),
],device="cuda")
pipe.enable_cpu_offload()
pipe.dit.quantize()
for model in pipe.controlnet.models:
model.quantize()
image_1 = pipe(
prompt="a cat sitting on a chair",
height=1024, width=1024,
seed=8
)
image_1.save("image_9.jpg")
mask = np.zeros((1024, 1024, 3), dtype=np.uint8)
mask[100:350, 350: -300] = 255
mask = Image.fromarray(mask)
mask.save("mask_9.jpg")
image_2 = pipe(
prompt="a cat sitting on a chair, wearing sunglasses",
controlnet_image=image_1, controlnet_inpaint_mask=mask,
height=1024, width=1024,
seed=9
)
image_2.save("image_10.jpg")
def example_6():
download_models([
"FLUX.1-dev",
"jasperai/Flux.1-dev-Controlnet-Surface-Normals",
"alimama-creative/FLUX.1-dev-Controlnet-Inpainting-Beta"
])
model_manager = ModelManager(
torch_dtype=torch.bfloat16,
device="cpu"
)
model_manager.load_models([
"models/FLUX/FLUX.1-dev/text_encoder/model.safetensors",
"models/FLUX/FLUX.1-dev/text_encoder_2",
"models/FLUX/FLUX.1-dev/ae.safetensors",
])
model_manager.load_models(
["models/FLUX/FLUX.1-dev/flux1-dev.safetensors"],
torch_dtype=torch.float8_e4m3fn
)
model_manager.load_models(
["models/ControlNet/alimama-creative/FLUX.1-dev-Controlnet-Inpainting-Beta/diffusion_pytorch_model.safetensors",
"models/ControlNet/jasperai/Flux.1-dev-Controlnet-Surface-Normals/diffusion_pytorch_model.safetensors"],
torch_dtype=torch.float8_e4m3fn
)
pipe = FluxImagePipeline.from_model_manager(model_manager, controlnet_config_units=[
ControlNetConfigUnit(
processor_id="inpaint",
model_path="models/ControlNet/alimama-creative/FLUX.1-dev-Controlnet-Inpainting-Beta/diffusion_pytorch_model.safetensors",
scale=0.9
),
ControlNetConfigUnit(
processor_id="normal",
model_path="models/ControlNet/jasperai/Flux.1-dev-Controlnet-Surface-Normals/diffusion_pytorch_model.safetensors",
scale=0.6
),
],device="cuda")
pipe.enable_cpu_offload()
pipe.dit.quantize()
for model in pipe.controlnet.models:
model.quantize()
image_1 = pipe(
prompt="a beautiful Asian woman looking at the sky, wearing a blue t-shirt.",
height=1024, width=1024,
seed=10
)
image_1.save("image_11.jpg")
mask = np.zeros((1024, 1024, 3), dtype=np.uint8)
mask[-400:, 10:-40] = 255
mask = Image.fromarray(mask)
mask.save("mask_11.jpg")
image_2 = pipe(
prompt="a beautiful Asian woman looking at the sky, wearing a yellow t-shirt.",
controlnet_image=image_1, controlnet_inpaint_mask=mask,
height=1024, width=1024,
seed=11
)
image_2.save("image_12.jpg")
def example_7():
download_models([
"FLUX.1-dev",
"InstantX/FLUX.1-dev-Controlnet-Union-alpha",
"alimama-creative/FLUX.1-dev-Controlnet-Inpainting-Beta",
"jasperai/Flux.1-dev-Controlnet-Upscaler",
])
model_manager = ModelManager(
torch_dtype=torch.bfloat16,
device="cpu"
)
model_manager.load_models([
"models/FLUX/FLUX.1-dev/text_encoder/model.safetensors",
"models/FLUX/FLUX.1-dev/text_encoder_2",
"models/FLUX/FLUX.1-dev/ae.safetensors",
])
model_manager.load_models(
["models/FLUX/FLUX.1-dev/flux1-dev.safetensors"],
torch_dtype=torch.float8_e4m3fn
)
model_manager.load_models(
["models/ControlNet/alimama-creative/FLUX.1-dev-Controlnet-Inpainting-Beta/diffusion_pytorch_model.safetensors",
"models/ControlNet/InstantX/FLUX.1-dev-Controlnet-Union-alpha/diffusion_pytorch_model.safetensors",
"models/ControlNet/jasperai/Flux.1-dev-Controlnet-Upscaler/diffusion_pytorch_model.safetensors"],
torch_dtype=torch.float8_e4m3fn
)
pipe = FluxImagePipeline.from_model_manager(model_manager, controlnet_config_units=[
ControlNetConfigUnit(
processor_id="inpaint",
model_path="models/ControlNet/alimama-creative/FLUX.1-dev-Controlnet-Inpainting-Beta/diffusion_pytorch_model.safetensors",
scale=0.9
),
ControlNetConfigUnit(
processor_id="canny",
model_path="models/ControlNet/InstantX/FLUX.1-dev-Controlnet-Union-alpha/diffusion_pytorch_model.safetensors",
scale=0.5
),
],device="cuda")
pipe.enable_cpu_offload()
pipe.dit.quantize()
for model in pipe.controlnet.models:
model.quantize()
image_1 = pipe(
prompt="a beautiful Asian woman and a cat on a bed. The woman wears a dress.",
height=1024, width=1024,
seed=100
)
image_1.save("image_13.jpg")
mask_global = np.zeros((1024, 1024, 3), dtype=np.uint8)
mask_global = Image.fromarray(mask_global)
mask_global.save("mask_13_global.jpg")
mask_1 = np.zeros((1024, 1024, 3), dtype=np.uint8)
mask_1[300:-100, 30: 450] = 255
mask_1 = Image.fromarray(mask_1)
mask_1.save("mask_13_1.jpg")
mask_2 = np.zeros((1024, 1024, 3), dtype=np.uint8)
mask_2[500:-100, -400:] = 255
mask_2[-200:-100, -500:-400] = 255
mask_2 = Image.fromarray(mask_2)
mask_2.save("mask_13_2.jpg")
image_2 = pipe(
prompt="a beautiful Asian woman and a cat on a bed. The woman wears a dress.",
controlnet_image=image_1, controlnet_inpaint_mask=mask_global,
local_prompts=["an orange cat, highly detailed", "a girl wearing a red camisole"], masks=[mask_1, mask_2], mask_scales=[10.0, 10.0],
height=1024, width=1024,
seed=101
)
image_2.save("image_14.jpg")
model_manager.load_lora("models/lora/FLUX-dev-lora-AntiBlur.safetensors", lora_alpha=2)
image_3 = pipe(
prompt="a beautiful Asian woman wearing a red camisole and an orange cat on a bed. clear background.",
negative_prompt="blur, blurry",
input_image=image_2, denoising_strength=0.7,
height=1024, width=1024,
cfg_scale=2.0, num_inference_steps=50,
seed=102
)
image_3.save("image_15.jpg")
pipe = FluxImagePipeline.from_model_manager(model_manager, controlnet_config_units=[
ControlNetConfigUnit(
processor_id="tile",
model_path="models/ControlNet/jasperai/Flux.1-dev-Controlnet-Upscaler/diffusion_pytorch_model.safetensors",
scale=0.7
),
],device="cuda")
pipe.enable_cpu_offload()
pipe.dit.quantize()
for model in pipe.controlnet.models:
model.quantize()
image_4 = pipe(
prompt="a beautiful Asian woman wearing a red camisole and an orange cat on a bed. highly detailed, delicate skin texture, clear background.",
controlnet_image=image_3.resize((2048, 2048)),
input_image=image_3.resize((2048, 2048)), denoising_strength=0.99,
height=2048, width=2048, tiled=True,
seed=103
)
image_4.save("image_16.jpg")
image_5 = pipe(
prompt="a beautiful Asian woman wearing a red camisole and an orange cat on a bed. highly detailed, delicate skin texture, clear background.",
controlnet_image=image_4.resize((4096, 4096)),
input_image=image_4.resize((4096, 4096)), denoising_strength=0.99,
height=4096, width=4096, tiled=True,
seed=104
)
image_5.save("image_17.jpg")
download_models(["Annotators:Depth", "Annotators:Normal"])
download_customized_models(
model_id="LiblibAI/FLUX.1-dev-LoRA-AntiBlur",
origin_file_path="FLUX-dev-lora-AntiBlur.safetensors",
local_dir="models/lora"
)
example_1()
example_2()
example_3()
example_4()
example_5()
example_6()
example_7()

View File

@@ -1,512 +0,0 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "8ObdI5jCB8xy"
},
"source": [
"# DiffSynth Studio\n",
"\n",
"Welcome to DiffSynth Studio! This is an example of Diffutoon."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "XSkKX7O2BwuM"
},
"source": [
"## Install"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "msCpt0pLnT8W",
"outputId": "35d93b35-451b-4760-d1ee-ef7ff190916e"
},
"outputs": [],
"source": [
"!git clone https://github.com/Artiprocher/DiffSynth-Studio.git\n",
"!pip install -q transformers controlnet-aux==0.0.7 streamlit streamlit-drawable-canvas imageio imageio[ffmpeg] safetensors einops cupy-cuda12x\n",
"%cd /content/DiffSynth-Studio"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "5eCu_rlKB3kK"
},
"source": [
"## Download Models"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "9znMkpVj3qZ1"
},
"outputs": [],
"source": [
"import requests\n",
"\n",
"\n",
"def download_model(url, file_path):\n",
" model_file = requests.get(url, allow_redirects=True)\n",
" with open(file_path, \"wb\") as f:\n",
" f.write(model_file.content)\n",
"\n",
"download_model(\"https://civitai.com/api/download/models/229575\", \"models/stable_diffusion/aingdiffusion_v12.safetensors\")\n",
"download_model(\"https://huggingface.co/guoyww/animatediff/resolve/main/mm_sd_v15_v2.ckpt\", \"models/AnimateDiff/mm_sd_v15_v2.ckpt\")\n",
"download_model(\"https://huggingface.co/lllyasviel/ControlNet-v1-1/resolve/main/control_v11p_sd15_lineart.pth\", \"models/ControlNet/control_v11p_sd15_lineart.pth\")\n",
"download_model(\"https://huggingface.co/lllyasviel/ControlNet-v1-1/resolve/main/control_v11f1e_sd15_tile.pth\", \"models/ControlNet/control_v11f1e_sd15_tile.pth\")\n",
"download_model(\"https://huggingface.co/lllyasviel/ControlNet-v1-1/resolve/main/control_v11f1p_sd15_depth.pth\", \"models/ControlNet/control_v11f1p_sd15_depth.pth\")\n",
"download_model(\"https://huggingface.co/lllyasviel/ControlNet-v1-1/resolve/main/control_v11p_sd15_softedge.pth\", \"models/ControlNet/control_v11p_sd15_softedge.pth\")\n",
"download_model(\"https://huggingface.co/lllyasviel/Annotators/resolve/main/dpt_hybrid-midas-501f0c75.pt\", \"models/Annotators/dpt_hybrid-midas-501f0c75.pt\")\n",
"download_model(\"https://huggingface.co/lllyasviel/Annotators/resolve/main/ControlNetHED.pth\", \"models/Annotators/ControlNetHED.pth\")\n",
"download_model(\"https://huggingface.co/lllyasviel/Annotators/resolve/main/sk_model.pth\", \"models/Annotators/sk_model.pth\")\n",
"download_model(\"https://huggingface.co/lllyasviel/Annotators/resolve/main/sk_model2.pth\", \"models/Annotators/sk_model2.pth\")\n",
"download_model(\"https://civitai.com/api/download/models/25820?type=Model&format=PickleTensor&size=full&fp=fp16\", \"models/textual_inversion/verybadimagenegative_v1.3.pt\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "iwOq2lWtKVYS"
},
"source": [
"## Run Diffutoon"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "tII_XRY-PJeo"
},
"source": [
"### Config Template"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "vsd2alA3PrGe"
},
"outputs": [],
"source": [
"config_stage_1_template = {\n",
" \"models\": {\n",
" \"model_list\": [\n",
" \"models/stable_diffusion/aingdiffusion_v12.safetensors\",\n",
" \"models/ControlNet/control_v11p_sd15_softedge.pth\",\n",
" \"models/ControlNet/control_v11f1p_sd15_depth.pth\"\n",
" ],\n",
" \"textual_inversion_folder\": \"models/textual_inversion\",\n",
" \"device\": \"cuda\",\n",
" \"lora_alphas\": [],\n",
" \"controlnet_units\": [\n",
" {\n",
" \"processor_id\": \"softedge\",\n",
" \"model_path\": \"models/ControlNet/control_v11p_sd15_softedge.pth\",\n",
" \"scale\": 0.5\n",
" },\n",
" {\n",
" \"processor_id\": \"depth\",\n",
" \"model_path\": \"models/ControlNet/control_v11f1p_sd15_depth.pth\",\n",
" \"scale\": 0.5\n",
" }\n",
" ]\n",
" },\n",
" \"data\": {\n",
" \"input_frames\": {\n",
" \"video_file\": \"/content/input_video.mp4\",\n",
" \"image_folder\": None,\n",
" \"height\": 512,\n",
" \"width\": 512,\n",
" \"start_frame_id\": 0,\n",
" \"end_frame_id\": 30\n",
" },\n",
" \"controlnet_frames\": [\n",
" {\n",
" \"video_file\": \"/content/input_video.mp4\",\n",
" \"image_folder\": None,\n",
" \"height\": 512,\n",
" \"width\": 512,\n",
" \"start_frame_id\": 0,\n",
" \"end_frame_id\": 30\n",
" },\n",
" {\n",
" \"video_file\": \"/content/input_video.mp4\",\n",
" \"image_folder\": None,\n",
" \"height\": 512,\n",
" \"width\": 512,\n",
" \"start_frame_id\": 0,\n",
" \"end_frame_id\": 30\n",
" }\n",
" ],\n",
" \"output_folder\": \"data/examples/diffutoon_edit/color_video\",\n",
" \"fps\": 25\n",
" },\n",
" \"smoother_configs\": [\n",
" {\n",
" \"processor_type\": \"FastBlend\",\n",
" \"config\": {}\n",
" }\n",
" ],\n",
" \"pipeline\": {\n",
" \"seed\": 0,\n",
" \"pipeline_inputs\": {\n",
" \"prompt\": \"best quality, perfect anime illustration, orange clothes, night, a girl is dancing, smile, solo, black silk stockings\",\n",
" \"negative_prompt\": \"verybadimagenegative_v1.3\",\n",
" \"cfg_scale\": 7.0,\n",
" \"clip_skip\": 1,\n",
" \"denoising_strength\": 0.9,\n",
" \"num_inference_steps\": 20,\n",
" \"animatediff_batch_size\": 8,\n",
" \"animatediff_stride\": 4,\n",
" \"unet_batch_size\": 8,\n",
" \"controlnet_batch_size\": 8,\n",
" \"cross_frame_attention\": True,\n",
" \"smoother_progress_ids\": [-1],\n",
" # The following parameters will be overwritten. You don't need to modify them.\n",
" \"input_frames\": [],\n",
" \"num_frames\": 30,\n",
" \"width\": 512,\n",
" \"height\": 512,\n",
" \"controlnet_frames\": []\n",
" }\n",
" }\n",
"}\n",
"\n",
"config_stage_2_template = {\n",
" \"models\": {\n",
" \"model_list\": [\n",
" \"models/stable_diffusion/aingdiffusion_v12.safetensors\",\n",
" \"models/AnimateDiff/mm_sd_v15_v2.ckpt\",\n",
" \"models/ControlNet/control_v11f1e_sd15_tile.pth\",\n",
" \"models/ControlNet/control_v11p_sd15_lineart.pth\"\n",
" ],\n",
" \"textual_inversion_folder\": \"models/textual_inversion\",\n",
" \"device\": \"cuda\",\n",
" \"lora_alphas\": [],\n",
" \"controlnet_units\": [\n",
" {\n",
" \"processor_id\": \"tile\",\n",
" \"model_path\": \"models/ControlNet/control_v11f1e_sd15_tile.pth\",\n",
" \"scale\": 0.5\n",
" },\n",
" {\n",
" \"processor_id\": \"lineart\",\n",
" \"model_path\": \"models/ControlNet/control_v11p_sd15_lineart.pth\",\n",
" \"scale\": 0.5\n",
" }\n",
" ]\n",
" },\n",
" \"data\": {\n",
" \"input_frames\": {\n",
" \"video_file\": \"/content/input_video.mp4\",\n",
" \"image_folder\": None,\n",
" \"height\": 1024,\n",
" \"width\": 1024,\n",
" \"start_frame_id\": 0,\n",
" \"end_frame_id\": 30\n",
" },\n",
" \"controlnet_frames\": [\n",
" {\n",
" \"video_file\": \"/content/input_video.mp4\",\n",
" \"image_folder\": None,\n",
" \"height\": 1024,\n",
" \"width\": 1024,\n",
" \"start_frame_id\": 0,\n",
" \"end_frame_id\": 30\n",
" },\n",
" {\n",
" \"video_file\": \"/content/input_video.mp4\",\n",
" \"image_folder\": None,\n",
" \"height\": 1024,\n",
" \"width\": 1024,\n",
" \"start_frame_id\": 0,\n",
" \"end_frame_id\": 30\n",
" }\n",
" ],\n",
" \"output_folder\": \"/content/output\",\n",
" \"fps\": 25\n",
" },\n",
" \"pipeline\": {\n",
" \"seed\": 0,\n",
" \"pipeline_inputs\": {\n",
" \"prompt\": \"best quality, perfect anime illustration, light, a girl is dancing, smile, solo\",\n",
" \"negative_prompt\": \"verybadimagenegative_v1.3\",\n",
" \"cfg_scale\": 7.0,\n",
" \"clip_skip\": 2,\n",
" \"denoising_strength\": 1.0,\n",
" \"num_inference_steps\": 10,\n",
" \"animatediff_batch_size\": 16,\n",
" \"animatediff_stride\": 8,\n",
" \"unet_batch_size\": 1,\n",
" \"controlnet_batch_size\": 1,\n",
" \"cross_frame_attention\": False,\n",
" # The following parameters will be overwritten. You don't need to modify them.\n",
" \"input_frames\": [],\n",
" \"num_frames\": 30,\n",
" \"width\": 1536,\n",
" \"height\": 1536,\n",
" \"controlnet_frames\": []\n",
" }\n",
" }\n",
"}"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "113QAmNHP6T_"
},
"source": [
"### Upload Input Video\n",
"\n",
"Before you run the following code, please upload your input video to `/content/input_video.mp4`."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "CyqAsj1o5U9B"
},
"source": [
"### Toon Shading\n",
"\n",
"Render your video in an anime style.\n",
"\n",
"We highly recommend you to use a higher resolution for better visual quality. The default resolution of Diffutoon is 1536x1536, which requires 22GB VRAM. If you don't have enough VRAM, 1024x1024 is also acceptable.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "761nbrgeKMvj",
"outputId": "c0d47d5f-16e9-4a65-e664-9bd5fc491111"
},
"outputs": [],
"source": [
"from diffsynth import SDVideoPipelineRunner\n",
"\n",
"\n",
"config = config_stage_2_template.copy()\n",
"config[\"data\"][\"input_frames\"] = {\n",
" \"video_file\": \"/content/input_video.mp4\",\n",
" \"image_folder\": None,\n",
" \"height\": 1024,\n",
" \"width\": 1024,\n",
" \"start_frame_id\": 0,\n",
" \"end_frame_id\": 30\n",
"}\n",
"config[\"data\"][\"controlnet_frames\"] = [config[\"data\"][\"input_frames\"], config[\"data\"][\"input_frames\"]]\n",
"config[\"data\"][\"output_folder\"] = \"/content/toon_video\"\n",
"config[\"data\"][\"fps\"] = 25\n",
"\n",
"runner = SDVideoPipelineRunner()\n",
"runner.run(config)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "9wujhGUmDIwY"
},
"source": [
"Let's see the video!"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 420
},
"id": "TBNAigacAq6h",
"outputId": "8f57c3b4-982b-4643-f3dc-53c51bd85a4b"
},
"outputs": [],
"source": [
"from IPython.display import HTML\n",
"from base64 import b64encode\n",
"\n",
"mp4 = open(\"/content/toon_video/video.mp4\", \"rb\").read()\n",
"data_url = \"data:video/mp4;base64,\" + b64encode(mp4).decode()\n",
"HTML(\"\"\"\n",
"<video width=400 controls>\n",
"<source src=\"%s\" type=\"video/mp4\">\n",
"</video>\n",
"\"\"\" % data_url)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "48hQfX--5YGi"
},
"source": [
"### Toon Shading with Editing Signals"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "bAQ9Zq-3-MH6"
},
"source": [
"In stage 1, input your prompt, and diffutoon will generate the editing signals in the format of low-resolution color video."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "BtDzYgIq5bgg",
"outputId": "bb27b7b9-7979-4409-f476-f25f0a164ef4"
},
"outputs": [],
"source": [
"from diffsynth import SDVideoPipelineRunner\n",
"\n",
"\n",
"config_stage_1 = config_stage_1_template.copy()\n",
"config_stage_1[\"data\"][\"input_frames\"] = {\n",
" \"video_file\": \"/content/input_video.mp4\",\n",
" \"image_folder\": None,\n",
" \"height\": 512,\n",
" \"width\": 512,\n",
" \"start_frame_id\": 0,\n",
" \"end_frame_id\": 30\n",
"}\n",
"config_stage_1[\"data\"][\"controlnet_frames\"] = [config_stage_1[\"data\"][\"input_frames\"], config_stage_1[\"data\"][\"input_frames\"]]\n",
"config_stage_1[\"data\"][\"output_folder\"] = \"/content/color_video\"\n",
"config_stage_1[\"data\"][\"fps\"] = 25\n",
"config_stage_1[\"pipeline\"][\"pipeline_inputs\"][\"prompt\"] = \"best quality, perfect anime illustration, orange clothes, night, a girl is dancing, smile, solo, black silk stockings\"\n",
"\n",
"runner = SDVideoPipelineRunner()\n",
"runner.run(config_stage_1)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "D9_AWwhi-pA9"
},
"source": [
"In stage 2, diffutoon will rerender the whole video according to the editing signals."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "JFysCk7y51i_",
"outputId": "475050d3-c72e-4e08-b55c-d59ed86b5497"
},
"outputs": [],
"source": [
"from diffsynth import SDVideoPipelineRunner\n",
"\n",
"\n",
"config_stage_2 = config_stage_2_template.copy()\n",
"config_stage_2[\"data\"][\"input_frames\"] = {\n",
" \"video_file\": \"/content/input_video.mp4\",\n",
" \"image_folder\": None,\n",
" \"height\": 1024,\n",
" \"width\": 1024,\n",
" \"start_frame_id\": 0,\n",
" \"end_frame_id\": 30\n",
"}\n",
"config_stage_2[\"data\"][\"controlnet_frames\"][0] = {\n",
" \"video_file\": \"/content/color_video/video.mp4\",\n",
" \"image_folder\": None,\n",
" \"height\": config_stage_2[\"data\"][\"input_frames\"][\"height\"],\n",
" \"width\": config_stage_2[\"data\"][\"input_frames\"][\"width\"],\n",
" \"start_frame_id\": None,\n",
" \"end_frame_id\": None\n",
"}\n",
"config_stage_2[\"data\"][\"controlnet_frames\"][1] = config[\"data\"][\"input_frames\"]\n",
"config_stage_2[\"data\"][\"output_folder\"] = \"/content/edit_video\"\n",
"config_stage_2[\"data\"][\"fps\"] = 25\n",
"\n",
"runner = SDVideoPipelineRunner()\n",
"runner.run(config)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "HIPrCAIS_Im0"
},
"source": [
"Let's see the video!"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 420
},
"id": "Y2nz7rew-7VI",
"outputId": "fbcbadc6-4045-4aac-dfb0-80bacec003bf"
},
"outputs": [],
"source": [
"from IPython.display import HTML\n",
"from base64 import b64encode\n",
"\n",
"mp4 = open(\"/content/edit_video/video.mp4\", \"rb\").read()\n",
"data_url = \"data:video/mp4;base64,\" + b64encode(mp4).decode()\n",
"HTML(\"\"\"\n",
"<video width=400 controls>\n",
"<source src=\"%s\" type=\"video/mp4\">\n",
"</video>\n",
"\"\"\" % data_url)"
]
}
],
"metadata": {
"accelerator": "GPU",
"colab": {
"collapsed_sections": [
"tII_XRY-PJeo"
],
"gpuType": "T4",
"provenance": [],
"toc_visible": true
},
"kernelspec": {
"display_name": "Python 3",
"name": "python3"
},
"language_info": {
"name": "python"
}
},
"nbformat": 4,
"nbformat_minor": 0
}

View File

@@ -1,21 +0,0 @@
# Diffutoon
[Diffutoon](https://arxiv.org/abs/2401.16224) is a toon shading approach. This approach is adept for rendering high-resoluton videos with rapid motion.
## Example: Toon Shading (Diffutoon)
Directly render realistic videos in a flatten style. In this example, you can easily modify the parameters in the config dict. See [`diffutoon_toon_shading.py`](./diffutoon_toon_shading.py). We also provide [an example on Colab](https://colab.research.google.com/github/Artiprocher/DiffSynth-Studio/blob/main/examples/Diffutoon.ipynb).
https://github.com/Artiprocher/DiffSynth-Studio/assets/35051019/b54c05c5-d747-4709-be5e-b39af82404dd
## Example: Toon Shading with Editing Signals (Diffutoon)
This example supports video editing signals. See [`diffutoon_toon_shading_with_editing_signals.py`](./diffutoon_toon_shading_with_editing_signals.py). The editing feature is also supported in the [Colab example](https://colab.research.google.com/github/Artiprocher/DiffSynth-Studio/blob/main/examples/Diffutoon/Diffutoon.ipynb).
https://github.com/Artiprocher/DiffSynth-Studio/assets/35051019/20528af5-5100-474a-8cdc-440b9efdd86c
## Example: Toon Shading (in native Python code)
This example is provided for developers. If you don't want to use the config to manage parameters, you can see [`sd_toon_shading.py`](./sd_toon_shading.py) to learn how to use it in native Python code.
https://github.com/Artiprocher/DiffSynth-Studio/assets/35051019/607c199b-6140-410b-a111-3e4ffb01142c

View File

@@ -1,100 +0,0 @@
from diffsynth import SDVideoPipelineRunner, download_models
# Download models (automatically)
# `models/stable_diffusion/aingdiffusion_v12.safetensors`: [link](https://civitai.com/api/download/models/229575)
# `models/AnimateDiff/mm_sd_v15_v2.ckpt`: [link](https://huggingface.co/guoyww/animatediff/resolve/main/mm_sd_v15_v2.ckpt)
# `models/ControlNet/control_v11p_sd15_lineart.pth`: [link](https://huggingface.co/lllyasviel/ControlNet-v1-1/resolve/main/control_v11p_sd15_lineart.pth)
# `models/ControlNet/control_v11f1e_sd15_tile.pth`: [link](https://huggingface.co/lllyasviel/ControlNet-v1-1/resolve/main/control_v11f1e_sd15_tile.pth)
# `models/Annotators/sk_model.pth`: [link](https://huggingface.co/lllyasviel/Annotators/resolve/main/sk_model.pth)
# `models/Annotators/sk_model2.pth`: [link](https://huggingface.co/lllyasviel/Annotators/resolve/main/sk_model2.pth)
# `models/textual_inversion/verybadimagenegative_v1.3.pt`: [link](https://civitai.com/api/download/models/25820?type=Model&format=PickleTensor&size=full&fp=fp16)
download_models([
"AingDiffusion_v12",
"AnimateDiff_v2",
"ControlNet_v11p_sd15_lineart",
"ControlNet_v11f1e_sd15_tile",
"TextualInversion_VeryBadImageNegative_v1.3"
])
# The original video in the example is https://www.bilibili.com/video/BV1iG411a7sQ/.
config = {
"models": {
"model_list": [
"models/stable_diffusion/aingdiffusion_v12.safetensors",
"models/AnimateDiff/mm_sd_v15_v2.ckpt",
"models/ControlNet/control_v11f1e_sd15_tile.pth",
"models/ControlNet/control_v11p_sd15_lineart.pth"
],
"textual_inversion_folder": "models/textual_inversion",
"device": "cuda",
"lora_alphas": [],
"controlnet_units": [
{
"processor_id": "tile",
"model_path": "models/ControlNet/control_v11f1e_sd15_tile.pth",
"scale": 0.5
},
{
"processor_id": "lineart",
"model_path": "models/ControlNet/control_v11p_sd15_lineart.pth",
"scale": 0.5
}
]
},
"data": {
"input_frames": {
"video_file": "data/examples/diffutoon/input_video.mp4",
"image_folder": None,
"height": 1536,
"width": 1536,
"start_frame_id": 0,
"end_frame_id": 30
},
"controlnet_frames": [
{
"video_file": "data/examples/diffutoon/input_video.mp4",
"image_folder": None,
"height": 1536,
"width": 1536,
"start_frame_id": 0,
"end_frame_id": 30
},
{
"video_file": "data/examples/diffutoon/input_video.mp4",
"image_folder": None,
"height": 1536,
"width": 1536,
"start_frame_id": 0,
"end_frame_id": 30
}
],
"output_folder": "output",
"fps": 30
},
"pipeline": {
"seed": 0,
"pipeline_inputs": {
"prompt": "best quality, perfect anime illustration, light, a girl is dancing, smile, solo",
"negative_prompt": "verybadimagenegative_v1.3",
"cfg_scale": 7.0,
"clip_skip": 2,
"denoising_strength": 1.0,
"num_inference_steps": 10,
"animatediff_batch_size": 16,
"animatediff_stride": 8,
"unet_batch_size": 1,
"controlnet_batch_size": 1,
"cross_frame_attention": False,
# The following parameters will be overwritten. You don't need to modify them.
"input_frames": [],
"num_frames": 30,
"width": 1536,
"height": 1536,
"controlnet_frames": []
}
}
}
runner = SDVideoPipelineRunner()
runner.run(config)

View File

@@ -1,204 +0,0 @@
from diffsynth import SDVideoPipelineRunner, download_models
import os
# Download models (automatically)
# `models/stable_diffusion/aingdiffusion_v12.safetensors`: [link](https://civitai.com/api/download/models/229575)
# `models/AnimateDiff/mm_sd_v15_v2.ckpt`: [link](https://huggingface.co/guoyww/animatediff/resolve/main/mm_sd_v15_v2.ckpt)
# `models/ControlNet/control_v11p_sd15_lineart.pth`: [link](https://huggingface.co/lllyasviel/ControlNet-v1-1/resolve/main/control_v11p_sd15_lineart.pth)
# `models/ControlNet/control_v11f1e_sd15_tile.pth`: [link](https://huggingface.co/lllyasviel/ControlNet-v1-1/resolve/main/control_v11f1e_sd15_tile.pth)
# `models/ControlNet/control_v11f1p_sd15_depth.pth`: [link](https://huggingface.co/lllyasviel/ControlNet-v1-1/resolve/main/control_v11f1p_sd15_depth.pth)
# `models/ControlNet/control_v11p_sd15_softedge.pth`: [link](https://huggingface.co/lllyasviel/ControlNet-v1-1/resolve/main/control_v11p_sd15_softedge.pth)
# `models/Annotators/dpt_hybrid-midas-501f0c75.pt`: [link](https://huggingface.co/lllyasviel/Annotators/resolve/main/dpt_hybrid-midas-501f0c75.pt)
# `models/Annotators/ControlNetHED.pth`: [link](https://huggingface.co/lllyasviel/Annotators/resolve/main/ControlNetHED.pth)
# `models/Annotators/sk_model.pth`: [link](https://huggingface.co/lllyasviel/Annotators/resolve/main/sk_model.pth)
# `models/Annotators/sk_model2.pth`: [link](https://huggingface.co/lllyasviel/Annotators/resolve/main/sk_model2.pth)
# `models/textual_inversion/verybadimagenegative_v1.3.pt`: [link](https://civitai.com/api/download/models/25820?type=Model&format=PickleTensor&size=full&fp=fp16)
download_models([
"AingDiffusion_v12",
"AnimateDiff_v2",
"ControlNet_v11p_sd15_lineart",
"ControlNet_v11f1e_sd15_tile",
"ControlNet_v11f1p_sd15_depth",
"ControlNet_v11p_sd15_softedge",
"TextualInversion_VeryBadImageNegative_v1.3"
])
# The original video in the example is https://www.bilibili.com/video/BV1zu4y1s7Ec/.
config_stage_1 = {
"models": {
"model_list": [
"models/stable_diffusion/aingdiffusion_v12.safetensors",
"models/ControlNet/control_v11p_sd15_softedge.pth",
"models/ControlNet/control_v11f1p_sd15_depth.pth"
],
"textual_inversion_folder": "models/textual_inversion",
"device": "cuda",
"lora_alphas": [],
"controlnet_units": [
{
"processor_id": "softedge",
"model_path": "models/ControlNet/control_v11p_sd15_softedge.pth",
"scale": 0.5
},
{
"processor_id": "depth",
"model_path": "models/ControlNet/control_v11f1p_sd15_depth.pth",
"scale": 0.5
}
]
},
"data": {
"input_frames": {
"video_file": "data/examples/diffutoon_edit/input_video.mp4",
"image_folder": None,
"height": 512,
"width": 512,
"start_frame_id": 0,
"end_frame_id": 30
},
"controlnet_frames": [
{
"video_file": "data/examples/diffutoon_edit/input_video.mp4",
"image_folder": None,
"height": 512,
"width": 512,
"start_frame_id": 0,
"end_frame_id": 30
},
{
"video_file": "data/examples/diffutoon_edit/input_video.mp4",
"image_folder": None,
"height": 512,
"width": 512,
"start_frame_id": 0,
"end_frame_id": 30
}
],
"output_folder": "output/color_video",
"fps": 25
},
"smoother_configs": [
{
"processor_type": "FastBlend",
"config": {}
}
],
"pipeline": {
"seed": 0,
"pipeline_inputs": {
"prompt": "best quality, perfect anime illustration, orange clothes, night, a girl is dancing, smile, solo, black silk stockings",
"negative_prompt": "verybadimagenegative_v1.3",
"cfg_scale": 7.0,
"clip_skip": 1,
"denoising_strength": 0.9,
"num_inference_steps": 20,
"animatediff_batch_size": 8,
"animatediff_stride": 4,
"unet_batch_size": 8,
"controlnet_batch_size": 8,
"cross_frame_attention": True,
"smoother_progress_ids": [-1],
# The following parameters will be overwritten. You don't need to modify them.
"input_frames": [],
"num_frames": 30,
"width": 512,
"height": 512,
"controlnet_frames": []
}
}
}
config_stage_2 = {
"models": {
"model_list": [
"models/stable_diffusion/aingdiffusion_v12.safetensors",
"models/AnimateDiff/mm_sd_v15_v2.ckpt",
"models/ControlNet/control_v11f1e_sd15_tile.pth",
"models/ControlNet/control_v11p_sd15_lineart.pth"
],
"textual_inversion_folder": "models/textual_inversion",
"device": "cuda",
"lora_alphas": [],
"controlnet_units": [
{
"processor_id": "tile",
"model_path": "models/ControlNet/control_v11f1e_sd15_tile.pth",
"scale": 0.5
},
{
"processor_id": "lineart",
"model_path": "models/ControlNet/control_v11p_sd15_lineart.pth",
"scale": 0.5
}
]
},
"data": {
"input_frames": {
"video_file": "data/examples/diffutoon_edit/input_video.mp4",
"image_folder": None,
"height": 1536,
"width": 1536,
"start_frame_id": 0,
"end_frame_id": 30
},
"controlnet_frames": [
{
"video_file": "data/examples/diffutoon_edit/input_video.mp4",
"image_folder": None,
"height": 1536,
"width": 1536,
"start_frame_id": 0,
"end_frame_id": 30
},
{
"video_file": "data/examples/diffutoon_edit/input_video.mp4",
"image_folder": None,
"height": 1536,
"width": 1536,
"start_frame_id": 0,
"end_frame_id": 30
}
],
"output_folder": "output/edited_video",
"fps": 30
},
"pipeline": {
"seed": 0,
"pipeline_inputs": {
"prompt": "best quality, perfect anime illustration, light, a girl is dancing, smile, solo",
"negative_prompt": "verybadimagenegative_v1.3",
"cfg_scale": 7.0,
"clip_skip": 2,
"denoising_strength": 1.0,
"num_inference_steps": 10,
"animatediff_batch_size": 16,
"animatediff_stride": 8,
"unet_batch_size": 1,
"controlnet_batch_size": 1,
"cross_frame_attention": False,
# The following parameters will be overwritten. You don't need to modify them.
"input_frames": [],
"num_frames": 30,
"width": 1536,
"height": 1536,
"controlnet_frames": []
}
}
}
runner = SDVideoPipelineRunner()
runner.run(config_stage_1)
# Replace the color video with the synthesized video
config_stage_2["data"]["controlnet_frames"][0] = {
"video_file": os.path.join(config_stage_1["data"]["output_folder"], "video.mp4"),
"image_folder": None,
"height": config_stage_2["data"]["input_frames"]["height"],
"width": config_stage_2["data"]["input_frames"]["width"],
"start_frame_id": None,
"end_frame_id": None
}
runner.run(config_stage_2)

View File

@@ -1,65 +0,0 @@
from diffsynth import ModelManager, SDVideoPipeline, ControlNetConfigUnit, VideoData, save_video, download_models
import torch
# Download models (automatically)
# `models/stable_diffusion/flat2DAnimerge_v45Sharp.safetensors`: [link](https://civitai.com/api/download/models/266360?type=Model&format=SafeTensor&size=pruned&fp=fp16)
# `models/AnimateDiff/mm_sd_v15_v2.ckpt`: [link](https://huggingface.co/guoyww/animatediff/resolve/main/mm_sd_v15_v2.ckpt)
# `models/ControlNet/control_v11p_sd15_lineart.pth`: [link](https://huggingface.co/lllyasviel/ControlNet-v1-1/resolve/main/control_v11p_sd15_lineart.pth)
# `models/ControlNet/control_v11f1e_sd15_tile.pth`: [link](https://huggingface.co/lllyasviel/ControlNet-v1-1/resolve/main/control_v11f1e_sd15_tile.pth)
# `models/Annotators/sk_model.pth`: [link](https://huggingface.co/lllyasviel/Annotators/resolve/main/sk_model.pth)
# `models/Annotators/sk_model2.pth`: [link](https://huggingface.co/lllyasviel/Annotators/resolve/main/sk_model2.pth)
# `models/textual_inversion/verybadimagenegative_v1.3.pt`: [link](https://civitai.com/api/download/models/25820?type=Model&format=PickleTensor&size=full&fp=fp16)
download_models([
"Flat2DAnimerge_v45Sharp",
"AnimateDiff_v2",
"ControlNet_v11p_sd15_lineart",
"ControlNet_v11f1e_sd15_tile",
"TextualInversion_VeryBadImageNegative_v1.3"
])
# Load models
model_manager = ModelManager(torch_dtype=torch.float16, device="cuda")
model_manager.load_models([
"models/stable_diffusion/flat2DAnimerge_v45Sharp.safetensors",
"models/AnimateDiff/mm_sd_v15_v2.ckpt",
"models/ControlNet/control_v11p_sd15_lineart.pth",
"models/ControlNet/control_v11f1e_sd15_tile.pth",
])
pipe = SDVideoPipeline.from_model_manager(
model_manager,
[
ControlNetConfigUnit(
processor_id="lineart",
model_path="models/ControlNet/control_v11p_sd15_lineart.pth",
scale=0.5
),
ControlNetConfigUnit(
processor_id="tile",
model_path="models/ControlNet/control_v11f1e_sd15_tile.pth",
scale=0.5
)
]
)
pipe.prompter.load_textual_inversions(["models/textual_inversion/verybadimagenegative_v1.3.pt"])
# Load video (we only use 60 frames for quick testing)
# The original video is here: https://www.bilibili.com/video/BV19w411A7YJ/
video = VideoData(
video_file="data/examples/bilibili/BV19w411A7YJ.mp4",
height=1024, width=1024)
input_video = [video[i] for i in range(40*60, 41*60)]
# Toon shading (20G VRAM)
torch.manual_seed(0)
output_video = pipe(
prompt="best quality, perfect anime illustration, light, a girl is dancing, smile, solo",
negative_prompt="verybadimagenegative_v1.3",
cfg_scale=3, clip_skip=2,
controlnet_frames=input_video, num_frames=len(input_video),
num_inference_steps=10, height=1024, width=1024,
animatediff_batch_size=32, animatediff_stride=16,
)
# Save video
save_video(output_video, "output_video.mp4", fps=60)

View File

@@ -1,90 +0,0 @@
# EliGen: Entity-Level Controlled Image Generation
## Introduction
We propose EliGen, a novel approach that leverages fine-grained entity-level information to enable precise and controllable text-to-image generation. EliGen excels in tasks such as entity-level controlled image generation and image inpainting, while its applicability is not limited to these areas. Additionally, it can be seamlessly integrated with existing community models, such as the IP-Adpater and In-Cotext LoRA.
* Paper: [EliGen: Entity-Level Controlled Image Generation with Regional Attention](https://arxiv.org/abs/2501.01097)
* Github: [DiffSynth-Studio](https://github.com/modelscope/DiffSynth-Studio)
* Model: [ModelScope](https://www.modelscope.cn/models/DiffSynth-Studio/Eligen), [HuggingFace](https://huggingface.co/modelscope/EliGen)
* Online Demo: [ModelScope EliGen Studio](https://www.modelscope.cn/studios/DiffSynth-Studio/EliGen)
* Training Dataset: [EliGen Train Set](https://www.modelscope.cn/datasets/DiffSynth-Studio/EliGenTrainSet)
## Methodology
![regional-attention](https://github.com/user-attachments/assets/bef5ae2b-cc03-404e-b9c8-0c037ac66190)
We introduce a regional attention mechanism within the DiT framework to effectively process the conditions of each entity. This mechanism enables the local prompt associated with each entity to semantically influence specific regions through regional attention. To further enhance the layout control capabilities of EliGen, we meticulously contribute an entity-annotated dataset and fine-tune the model using the LoRA framework.
1. **Regional Attention**: Regional attention is shown in above figure, which can be easily applied to other text-to-image models. Its core principle involves transforming the positional information of each entity into an attention mask, ensuring that the mechanism only affects the designated regions.
2. **Dataset with Entity Annotation**: To construct a dedicated entity control dataset, we start by randomly selecting captions from DiffusionDB and generating the corresponding source image using Flux. Next, we employ Qwen2-VL 72B, recognized for its advanced grounding capabilities among MLLMs, to randomly identify entities within the image. These entities are annotated with local prompts and bounding boxes for precise localization, forming the foundation of our dataset for further training.
3. **Training**: We utilize LoRA (Low-Rank Adaptation) and DeepSpeed to fine-tune regional attention mechanisms using a curated dataset, enabling our EliGen model to achieve effective entity-level control.
## Usage
1. **Entity-Level Controlled Image Generation**
EliGen achieves effective entity-level control results. See [./entity_control.py](./entity_control.py) for usage.
2. **Image Inpainting**
To apply EliGen to image inpainting task, we propose a inpainting fusion pipeline to preserve the non-painting areas while enabling precise, entity-level modifications over inpaining regions.
See [./entity_inpaint.py](./entity_inpaint.py) for usage.
3. **Styled Entity Control**
EliGen can be seamlessly integrated with existing community models. We have provided an example of how to integrate it with the IP-Adpater. See [./entity_control_ipadapter.py](./entity_control_ipadapter.py) for usage.
4. **Entity Transfer**
We have provided an example of how to integrate EliGen with In-Cotext LoRA, which achieves interesting entity transfer results. See [./entity_transfer.py](./entity_transfer.py) for usage.
5. **Play with EliGen using UI**
Run the following command to try interactive UI:
```bash
python apps/gradio/entity_level_control.py
```
## Examples
### Entity-Level Controlled Image Generation
1. The effect of generating images with continuously changing entity positions.
https://github.com/user-attachments/assets/54a048c8-b663-4262-8c40-43c87c266d4b
2. The image generation effect of complex Entity combinations, demonstrating the strong generalization of EliGen. See [./entity_control.py](./entity_control.py) `example_1-6` for generation prompts.
|Entity Conditions|Generated Image|
|-|-|
|![eligen_example_1_mask_0](https://github.com/user-attachments/assets/68cbedc0-32aa-4a8e-99d2-306dbb4620de)|![eligen_example_1_0](https://github.com/user-attachments/assets/c678c4b1-aa19-41df-b612-adc01b8b2009)|
|![eligen_example_2_mask_0](https://github.com/user-attachments/assets/1c6d9445-5022-4d91-ad2e-dc05321883d1)|![eligen_example_2_0](https://github.com/user-attachments/assets/86739945-cb07-4a49-b3b3-3bb65c90d14f)|
|![eligen_example_3_mask_27](https://github.com/user-attachments/assets/5ca4440d-d1db-45dd-b03c-0affefbd9ac3)|![eligen_example_3_27](https://github.com/user-attachments/assets/9160c22a-89ac-4d52-be1d-17ba2d8a67eb)|
|![eligen_example_4_mask_21](https://github.com/user-attachments/assets/26dfde2b-cc9a-4cb3-806a-7f7436d971a7)|![eligen_example_4_21](https://github.com/user-attachments/assets/1fff7346-6a8c-4eb6-986f-4ea848c6b363)|
|![eligen_example_5_mask_0](https://github.com/user-attachments/assets/8ca94e5f-f896-451d-a700-bcdc23689adb)|![eligen_example_5_0](https://github.com/user-attachments/assets/881a9395-6cc2-43e9-89b4-30b8f5437e6d)|
|![eligen_example_6_mask_8](https://github.com/user-attachments/assets/26c95abf-f2b1-44db-92c1-75d02c714c74)|![eligen_example_6_8](https://github.com/user-attachments/assets/8883abde-3fad-4a8b-ade0-ca5b977a290f)|
1. Demonstration of the robustness of EliGen. The following examples are generated using the same prompt but different seeds. Refer to [./entity_control.py](./entity_control.py) `example_7` for the prompts.
|Entity Conditions|Generated Image|
|-|-|
|![eligen_example_7_mask_5](https://github.com/user-attachments/assets/85630237-9d8b-41ea-9bd5-506652c61776)|![eligen_example_7_5](https://github.com/user-attachments/assets/d34b54d2-c59c-4c39-8ab4-c22f155283f1)|
|![eligen_example_7_mask_5](https://github.com/user-attachments/assets/85630237-9d8b-41ea-9bd5-506652c61776)|![eligen_example_7_6](https://github.com/user-attachments/assets/4050a3bf-a089-4f4f-81e0-e3b391cf7ceb)|
![eligen_example_7_mask_5](https://github.com/user-attachments/assets/85630237-9d8b-41ea-9bd5-506652c61776)|![eligen_example_7_7](https://github.com/user-attachments/assets/682feb5e-a27a-4ae4-a800-018b4e0e504c)|
|![eligen_example_7_mask_5](https://github.com/user-attachments/assets/85630237-9d8b-41ea-9bd5-506652c61776)|![eligen_example_7_8](https://github.com/user-attachments/assets/50266950-24b3-426a-ae74-c3ebadb853d9)|
### Image Inpainting
Demonstration of the inpainting mode of EliGen, see [./entity_inpaint.py](./entity_inpaint.py) for generation prompts.
|Inpainting Input|Inpainting Output|
|-|-|
|![inpaint_i1](https://github.com/user-attachments/assets/5ef499f3-3d8a-49cc-8ceb-86af7f5cb9f8)|![inpaint_o1](https://github.com/user-attachments/assets/88fc3bde-0984-4b3c-8ca9-d63de660855b)|
|![inpaint_i2](https://github.com/user-attachments/assets/5f74c710-bf30-4db1-ae40-a1e1995ccef6)|![inpaint_o2](https://github.com/user-attachments/assets/7c3b4857-b774-47ea-b163-34d49e7c976d)|
### Styled Entity Control
Demonstration of the styled entity control results with EliGen and IP-Adapter, see [./entity_control_ipadapter.py](./entity_control_ipadapter.py) for generation prompts.
|Style Reference|Entity Control Variance 1|Entity Control Variance 2|Entity Control Variance 3|
|-|-|-|-|
|![image_1_base](https://github.com/user-attachments/assets/5e2dd3ab-37d3-4f58-8e02-ee2f9b238604)|![result1](https://github.com/user-attachments/assets/0f6711a2-572a-41b3-938a-95deff6d732d)|![result2](https://github.com/user-attachments/assets/ce2e66e5-1fdf-44e8-bca7-555d805a50b1)|![result3](https://github.com/user-attachments/assets/ad2da233-2f7c-4065-ab57-b2d84dc2c0e2)|
We also provide a demo of the styled entity control results with EliGen and specific styled lora, see [./styled_entity_control.py](./styled_entity_control.py) for details. Here is the visualization of EliGen with [Lego dreambooth lora](https://huggingface.co/merve/flux-lego-lora-dreambooth).
|![image_1_base](https://github.com/user-attachments/assets/35fb60f5-48ef-4f22-95d8-f9e732a5f63f)|![result1](https://github.com/user-attachments/assets/441d700f-f0b1-40e0-8848-4db23520972c)|![result2](https://github.com/user-attachments/assets/c8fd4498-3c55-48ab-9abf-3a092a90c878)|![result3](https://github.com/user-attachments/assets/181ba2bb-62cf-41a8-9e3a-20ed8a7a672f)|
|-|-|-|-|
|![image_1_base](https://github.com/user-attachments/assets/70a3f578-8c7e-4b40-954d-8fc94d4f3ae9)|![result1](https://github.com/user-attachments/assets/65670717-6136-4594-84e5-2307fc20753d)|![result2](https://github.com/user-attachments/assets/5ec7a5bd-f2c9-4b2e-8a4e-d2655ec8036c)|![result3](https://github.com/user-attachments/assets/56f00192-9553-45a6-a971-511b9f5b1480)|
### Entity Transfer
Demonstration of the entity transfer results with EliGen and In-Context LoRA, see [./entity_transfer.py](./entity_transfer.py) for generation prompts.
|Entity to Transfer|Transfer Target Image|Transfer Example 1|Transfer Example 2|
|-|-|-|-|
|![source](https://github.com/user-attachments/assets/0d40ef22-0a09-420d-bd5a-bfb93120b60d)|![targe](https://github.com/user-attachments/assets/f6c58ef2-54c1-4d86-8429-dad2eb0e0685)|![result1](https://github.com/user-attachments/assets/05eed2e3-097d-40af-8aae-1e0c75051f32)|![result2](https://github.com/user-attachments/assets/54314d16-244b-411e-8a91-96c500efa5f5)|

View File

@@ -1,83 +0,0 @@
from diffsynth import ModelManager, FluxImagePipeline, download_customized_models
from modelscope import dataset_snapshot_download
from examples.EntityControl.utils import visualize_masks
from PIL import Image
import torch
def example(pipe, seeds, example_id, global_prompt, entity_prompts):
dataset_snapshot_download(dataset_id="DiffSynth-Studio/examples_in_diffsynth", local_dir="./", allow_file_pattern=f"data/examples/eligen/entity_control/example_{example_id}/*.png")
masks = [Image.open(f"./data/examples/eligen/entity_control/example_{example_id}/{i}.png").convert('RGB') for i in range(len(entity_prompts))]
negative_prompt = "worst quality, low quality, monochrome, zombie, interlocked fingers, Aissist, cleavage, nsfw,"
for seed in seeds:
# generate image
image = pipe(
prompt=global_prompt,
cfg_scale=3.0,
negative_prompt=negative_prompt,
num_inference_steps=50,
embedded_guidance=3.5,
seed=seed,
height=1024,
width=1024,
eligen_entity_prompts=entity_prompts,
eligen_entity_masks=masks,
)
image.save(f"eligen_example_{example_id}_{seed}.png")
visualize_masks(image, masks, entity_prompts, f"eligen_example_{example_id}_mask_{seed}.png")
# download and load model
model_manager = ModelManager(torch_dtype=torch.bfloat16, device="cuda", model_id_list=["FLUX.1-dev"])
# set download_from_modelscope = False if you want to download model from huggingface
download_from_modelscope = True
if download_from_modelscope:
model_id = "DiffSynth-Studio/Eligen"
downloading_priority = ["ModelScope"]
else:
model_id = "modelscope/EliGen"
downloading_priority = ["HuggingFace"]
model_manager.load_lora(
download_customized_models(
model_id=model_id,
origin_file_path="model_bf16.safetensors",
local_dir="models/lora/entity_control",
downloading_priority=downloading_priority
),
lora_alpha=1
)
pipe = FluxImagePipeline.from_model_manager(model_manager)
# example 1
global_prompt = "A breathtaking beauty of Raja Ampat by the late-night moonlight , one beautiful woman from behind wearing a pale blue long dress with soft glow, sitting at the top of a cliff looking towards the beach,pastell light colors, a group of small distant birds flying in far sky, a boat sailing on the sea, best quality, realistic, whimsical, fantastic, splash art, intricate detailed, hyperdetailed, maximalist style, photorealistic, concept art, sharp focus, harmony, serenity, tranquility, soft pastell colors,ambient occlusion, cozy ambient lighting, masterpiece, liiv1, linquivera, metix, mentixis, masterpiece, award winning, view from above\n"
entity_prompts = ["cliff", "sea", "moon", "sailing boat", "a seated beautiful woman", "pale blue long dress with soft glow"]
example(pipe, [0], 1, global_prompt, entity_prompts)
# example 2
global_prompt = "samurai girl wearing a kimono, she's holding a sword glowing with red flame, her long hair is flowing in the wind, she is looking at a small bird perched on the back of her hand. ultra realist style. maximum image detail. maximum realistic render."
entity_prompts = ["flowing hair", "sword glowing with red flame", "A cute bird", "blue belt"]
example(pipe, [0], 2, global_prompt, entity_prompts)
# example 3
global_prompt = "Image of a neverending staircase up to a mysterious palace in the sky, The ancient palace stood majestically atop a mist-shrouded mountain, sunrise, two traditional monk walk in the stair looking at the sunrise, fog,see-through, best quality, whimsical, fantastic, splash art, intricate detailed, hyperdetailed, photorealistic, concept art, harmony, serenity, tranquility, ambient occlusion, halation, cozy ambient lighting, dynamic lighting,masterpiece, liiv1, linquivera, metix, mentixis, masterpiece, award winning,"
entity_prompts = ["ancient palace", "stone staircase with railings", "a traditional monk", "a traditional monk"]
example(pipe, [27], 3, global_prompt, entity_prompts)
# example 4
global_prompt = "A beautiful girl wearing shirt and shorts in the street, holding a sign 'Entity Control'"
entity_prompts = ["A beautiful girl", "sign 'Entity Control'", "shorts", "shirt"]
example(pipe, [21], 4, global_prompt, entity_prompts)
# example 5
global_prompt = "A captivating, dramatic scene in a painting that exudes mystery and foreboding. A white sky, swirling blue clouds, and a crescent yellow moon illuminate a solitary woman standing near the water's edge. Her long dress flows in the wind, silhouetted against the eerie glow. The water mirrors the fiery sky and moonlight, amplifying the uneasy atmosphere."
entity_prompts = ["crescent yellow moon", "a solitary woman", "water", "swirling blue clouds"]
example(pipe, [0], 5, global_prompt, entity_prompts)
# example 6
global_prompt = "Snow White and the 6 Dwarfs."
entity_prompts = ["Dwarf 1", "Dwarf 2", "Dwarf 3", "Snow White", "Dwarf 4", "Dwarf 5", "Dwarf 6"]
example(pipe, [8], 6, global_prompt, entity_prompts)
# example 7, same prompt with different seeds
seeds = range(5, 9)
global_prompt = "A beautiful woman wearing white dress, holding a mirror, with a warm light background;"
entity_prompts = ["A beautiful woman", "mirror", "necklace", "glasses", "earring", "white dress", "jewelry headpiece"]
example(pipe, seeds, 7, global_prompt, entity_prompts)

View File

@@ -1,46 +0,0 @@
from diffsynth import ModelManager, FluxImagePipeline, download_customized_models
from modelscope import dataset_snapshot_download
from examples.EntityControl.utils import visualize_masks
from PIL import Image
import torch
# download and load model
model_manager = ModelManager(torch_dtype=torch.bfloat16, device="cuda", model_id_list=["FLUX.1-dev", "InstantX/FLUX.1-dev-IP-Adapter"])
model_manager.load_lora(
download_customized_models(
model_id="DiffSynth-Studio/Eligen",
origin_file_path="model_bf16.safetensors",
local_dir="models/lora/entity_control"
),
lora_alpha=1
)
pipe = FluxImagePipeline.from_model_manager(model_manager)
# download and load mask images
dataset_snapshot_download(dataset_id="DiffSynth-Studio/examples_in_diffsynth", local_dir="./", allow_file_pattern="data/examples/eligen/ipadapter/*")
masks = [Image.open(f"./data/examples/eligen/ipadapter/ipadapter_mask_{i}.png") for i in range(1, 4)]
entity_prompts = ['A girl', 'hat', 'sunset']
global_prompt = "A girl wearing a hat, looking at the sunset"
negative_prompt = "worst quality, low quality, monochrome, zombie, interlocked fingers, Aissist, cleavage, nsfw"
reference_img = Image.open("./data/examples/eligen/ipadapter/ipadapter_image.png")
# generate image
image = pipe(
prompt=global_prompt,
cfg_scale=3.0,
negative_prompt=negative_prompt,
num_inference_steps=50,
embedded_guidance=3.5,
seed=4,
height=1024,
width=1024,
eligen_entity_prompts=entity_prompts,
eligen_entity_masks=masks,
enable_eligen_on_negative=False,
ipadapter_images=[reference_img],
ipadapter_scale=0.7
)
image.save(f"styled_entity_control.png")
visualize_masks(image, masks, entity_prompts, f"styled_entity_control_with_mask.png")

View File

@@ -1,45 +0,0 @@
from diffsynth import ModelManager, FluxImagePipeline, download_customized_models
from modelscope import dataset_snapshot_download
from examples.EntityControl.utils import visualize_masks
from PIL import Image
import torch
# download and load model
model_manager = ModelManager(torch_dtype=torch.bfloat16, device="cuda", model_id_list=["FLUX.1-dev"])
model_manager.load_lora(
download_customized_models(
model_id="DiffSynth-Studio/Eligen",
origin_file_path="model_bf16.safetensors",
local_dir="models/lora/entity_control"
),
lora_alpha=1
)
pipe = FluxImagePipeline.from_model_manager(model_manager)
# download and load mask images
dataset_snapshot_download(dataset_id="DiffSynth-Studio/examples_in_diffsynth", local_dir="./", allow_file_pattern="data/examples/eligen/inpaint/*")
masks = [Image.open(f"./data/examples/eligen/inpaint/inpaint_mask_{i}.png") for i in range(1, 3)]
input_image = Image.open("./data/examples/eligen/inpaint/inpaint_image.jpg")
entity_prompts = ["A person wear red shirt", "Airplane"]
global_prompt = "A person walking on the path in front of a house; An airplane in the sky"
negative_prompt = "worst quality, low quality, monochrome, zombie, interlocked fingers, Aissist, cleavage, nsfw, blur"
# generate image
image = pipe(
prompt=global_prompt,
input_image=input_image,
cfg_scale=3.0,
negative_prompt=negative_prompt,
num_inference_steps=50,
embedded_guidance=3.5,
seed=0,
height=1024,
width=1024,
eligen_entity_prompts=entity_prompts,
eligen_entity_masks=masks,
enable_eligen_on_negative=False,
enable_eligen_inpaint=True,
)
image.save(f"entity_inpaint.png")
visualize_masks(image, masks, entity_prompts, f"entity_inpaint_with_mask.png")

View File

@@ -1,84 +0,0 @@
from diffsynth import ModelManager, FluxImagePipeline, download_customized_models
from modelscope import dataset_snapshot_download
from examples.EntityControl.utils import visualize_masks
from PIL import Image
import torch
def build_pipeline():
model_manager = ModelManager(torch_dtype=torch.bfloat16, device="cuda", model_id_list=["FLUX.1-dev"])
model_manager.load_lora(
download_customized_models(
model_id="DiffSynth-Studio/Eligen",
origin_file_path="model_bf16.safetensors",
local_dir="models/lora/entity_control"
),
lora_alpha=1
)
model_manager.load_lora(
download_customized_models(
model_id="iic/In-Context-LoRA",
origin_file_path="visual-identity-design.safetensors",
local_dir="models/lora/In-Context-LoRA"
),
lora_alpha=1
)
pipe = FluxImagePipeline.from_model_manager(model_manager)
return pipe
def generate(pipe: FluxImagePipeline, source_image, target_image, mask, height, width, prompt, entity_prompt, image_save_path, mask_save_path, seed=0):
input_mask = Image.new('RGB', (width * 2, height))
input_mask.paste(mask.resize((width, height), resample=Image.NEAREST).convert('RGB'), (width, 0))
input_image = Image.new('RGB', (width * 2, height))
input_image.paste(source_image.resize((width, height)).convert('RGB'), (0, 0))
input_image.paste(target_image.resize((width, height)).convert('RGB'), (width, 0))
image = pipe(
prompt=prompt,
input_image=input_image,
cfg_scale=3.0,
negative_prompt="",
num_inference_steps=50,
embedded_guidance=3.5,
seed=seed,
height=height,
width=width * 2,
eligen_entity_prompts=[entity_prompt],
eligen_entity_masks=[input_mask],
enable_eligen_on_negative=False,
enable_eligen_inpaint=True,
)
target_image = image.crop((width, 0, 2 * width, height))
target_image.save(image_save_path)
visualize_masks(target_image, [mask], [entity_prompt], mask_save_path)
return target_image
pipe = build_pipeline()
dataset_snapshot_download(dataset_id="DiffSynth-Studio/examples_in_diffsynth", local_dir="./", allow_file_pattern="data/examples/eligen/logo_transfer/*")
prompt="The two-panel image showcases the joyful identity, with the left panel showing a rabbit graphic; [LEFT] while the right panel translates the design onto a shopping tote with the rabbit logo in black, held by a person in a market setting, emphasizing the brand's approachable and eco-friendly vibe."
logo_prompt="a rabbit logo"
logo_image = Image.open("data/examples/eligen/logo_transfer/source_image.png")
target_image = Image.open("data/examples/eligen/logo_transfer/target_image.png")
mask = Image.open("data/examples/eligen/logo_transfer/mask_1.png")
generate(
pipe, logo_image, target_image, mask,
height=1024, width=1024,
prompt=prompt, entity_prompt=logo_prompt,
image_save_path="entity_transfer_1.png",
mask_save_path="entity_transfer_with_mask_1.png"
)
mask = Image.open("data/examples/eligen/logo_transfer/mask_2.png")
generate(
pipe, logo_image, target_image, mask,
height=1024, width=1024,
prompt=prompt, entity_prompt=logo_prompt,
image_save_path="entity_transfer_2.png",
mask_save_path="entity_transfer_with_mask_2.png"
)

View File

@@ -1,90 +0,0 @@
from diffsynth import ModelManager, FluxImagePipeline, download_customized_models
from modelscope import dataset_snapshot_download
from examples.EntityControl.utils import visualize_masks
from PIL import Image
import torch
def example(pipe, seeds, example_id, global_prompt, entity_prompts):
dataset_snapshot_download(dataset_id="DiffSynth-Studio/examples_in_diffsynth", local_dir="./", allow_file_pattern=f"data/examples/eligen/entity_control/example_{example_id}/*.png")
masks = [Image.open(f"./data/examples/eligen/entity_control/example_{example_id}/{i}.png").convert('RGB') for i in range(len(entity_prompts))]
negative_prompt = "worst quality, low quality, monochrome, zombie, interlocked fingers, Aissist, cleavage, nsfw,"
for seed in seeds:
# generate image
image = pipe(
prompt=global_prompt,
cfg_scale=3.0,
negative_prompt=negative_prompt,
num_inference_steps=50,
embedded_guidance=3.5,
seed=seed,
height=1024,
width=1024,
eligen_entity_prompts=entity_prompts,
eligen_entity_masks=masks,
)
image.save(f"styled_eligen_example_{example_id}_{seed}.png")
visualize_masks(image, masks, entity_prompts, f"styled_entity_control_example_{example_id}_mask_{seed}.png")
# download and load model
model_manager = ModelManager(torch_dtype=torch.bfloat16, device="cuda", model_id_list=["FLUX.1-dev"])
model_manager.load_lora(
download_customized_models(
model_id="FluxLora/merve-flux-lego-lora-dreambooth",
origin_file_path="pytorch_lora_weights.safetensors",
local_dir="models/lora/merve-flux-lego-lora-dreambooth"
),
lora_alpha=1
)
model_manager.load_lora(
download_customized_models(
model_id="DiffSynth-Studio/Eligen",
origin_file_path="model_bf16.safetensors",
local_dir="models/lora/entity_control"
),
lora_alpha=1
)
pipe = FluxImagePipeline.from_model_manager(model_manager)
# example 1
trigger_word = "lego set in style of TOK, "
global_prompt = "A breathtaking beauty of Raja Ampat by the late-night moonlight , one beautiful woman from behind wearing a pale blue long dress with soft glow, sitting at the top of a cliff looking towards the beach,pastell light colors, a group of small distant birds flying in far sky, a boat sailing on the sea, best quality, realistic, whimsical, fantastic, splash art, intricate detailed, hyperdetailed, maximalist style, photorealistic, concept art, sharp focus, harmony, serenity, tranquility, soft pastell colors,ambient occlusion, cozy ambient lighting, masterpiece, liiv1, linquivera, metix, mentixis, masterpiece, award winning, view from above\n"
global_prompt = trigger_word + global_prompt
entity_prompts = ["cliff", "sea", "moon", "sailing boat", "a seated beautiful woman", "pale blue long dress with soft glow"]
example(pipe, [0], 1, global_prompt, entity_prompts)
# example 2
global_prompt = "samurai girl wearing a kimono, she's holding a sword glowing with red flame, her long hair is flowing in the wind, she is looking at a small bird perched on the back of her hand. ultra realist style. maximum image detail. maximum realistic render."
global_prompt = trigger_word + global_prompt
entity_prompts = ["flowing hair", "sword glowing with red flame", "A cute bird", "blue belt"]
example(pipe, [0], 2, global_prompt, entity_prompts)
# example 3
global_prompt = "Image of a neverending staircase up to a mysterious palace in the sky, The ancient palace stood majestically atop a mist-shrouded mountain, sunrise, two traditional monk walk in the stair looking at the sunrise, fog,see-through, best quality, whimsical, fantastic, splash art, intricate detailed, hyperdetailed, photorealistic, concept art, harmony, serenity, tranquility, ambient occlusion, halation, cozy ambient lighting, dynamic lighting,masterpiece, liiv1, linquivera, metix, mentixis, masterpiece, award winning,"
global_prompt = trigger_word + global_prompt
entity_prompts = ["ancient palace", "stone staircase with railings", "a traditional monk", "a traditional monk"]
example(pipe, [27], 3, global_prompt, entity_prompts)
# example 4
global_prompt = "A beautiful girl wearing shirt and shorts in the street, holding a sign 'Entity Control'"
global_prompt = trigger_word + global_prompt
entity_prompts = ["A beautiful girl", "sign 'Entity Control'", "shorts", "shirt"]
example(pipe, [21], 4, global_prompt, entity_prompts)
# example 5
global_prompt = "A captivating, dramatic scene in a painting that exudes mystery and foreboding. A white sky, swirling blue clouds, and a crescent yellow moon illuminate a solitary woman standing near the water's edge. Her long dress flows in the wind, silhouetted against the eerie glow. The water mirrors the fiery sky and moonlight, amplifying the uneasy atmosphere."
global_prompt = trigger_word + global_prompt
entity_prompts = ["crescent yellow moon", "a solitary woman", "water", "swirling blue clouds"]
example(pipe, [0], 5, global_prompt, entity_prompts)
# example 6
global_prompt = "Snow White and the 6 Dwarfs."
global_prompt = trigger_word + global_prompt
entity_prompts = ["Dwarf 1", "Dwarf 2", "Dwarf 3", "Snow White", "Dwarf 4", "Dwarf 5", "Dwarf 6"]
example(pipe, [8], 6, global_prompt, entity_prompts)
# example 7, same prompt with different seeds
seeds = range(5, 9)
global_prompt = "A beautiful woman wearing white dress, holding a mirror, with a warm light background;"
global_prompt = trigger_word + global_prompt
entity_prompts = ["A beautiful woman", "mirror", "necklace", "glasses", "earring", "white dress", "jewelry headpiece"]
example(pipe, seeds, 7, global_prompt, entity_prompts)

View File

@@ -1,59 +0,0 @@
from PIL import Image, ImageDraw, ImageFont
import random
def visualize_masks(image, masks, mask_prompts, output_path, font_size=35, use_random_colors=False):
# Create a blank image for overlays
overlay = Image.new('RGBA', image.size, (0, 0, 0, 0))
colors = [
(165, 238, 173, 80),
(76, 102, 221, 80),
(221, 160, 77, 80),
(204, 93, 71, 80),
(145, 187, 149, 80),
(134, 141, 172, 80),
(157, 137, 109, 80),
(153, 104, 95, 80),
(165, 238, 173, 80),
(76, 102, 221, 80),
(221, 160, 77, 80),
(204, 93, 71, 80),
(145, 187, 149, 80),
(134, 141, 172, 80),
(157, 137, 109, 80),
(153, 104, 95, 80),
]
# Generate random colors for each mask
if use_random_colors:
colors = [(random.randint(0, 255), random.randint(0, 255), random.randint(0, 255), 80) for _ in range(len(masks))]
# Font settings
try:
font = ImageFont.truetype("arial", font_size) # Adjust as needed
except IOError:
font = ImageFont.load_default(font_size)
# Overlay each mask onto the overlay image
for mask, mask_prompt, color in zip(masks, mask_prompts, colors):
# Convert mask to RGBA mode
mask_rgba = mask.convert('RGBA')
mask_data = mask_rgba.getdata()
new_data = [(color if item[:3] == (255, 255, 255) else (0, 0, 0, 0)) for item in mask_data]
mask_rgba.putdata(new_data)
# Draw the mask prompt text on the mask
draw = ImageDraw.Draw(mask_rgba)
mask_bbox = mask.getbbox() # Get the bounding box of the mask
text_position = (mask_bbox[0] + 10, mask_bbox[1] + 10) # Adjust text position based on mask position
draw.text(text_position, mask_prompt, fill=(255, 255, 255, 255), font=font)
# Alpha composite the overlay with this mask
overlay = Image.alpha_composite(overlay, mask_rgba)
# Composite the overlay onto the original image
result = Image.alpha_composite(image.convert('RGBA'), overlay)
# Save or display the resulting image
result.save(output_path)
return result

View File

@@ -1,21 +0,0 @@
from diffsynth import ModelManager, CogVideoPipeline, save_video, download_models
import torch
download_models(["CogVideoX-5B", "ExVideo-CogVideoX-LoRA-129f-v1"])
model_manager = ModelManager(torch_dtype=torch.bfloat16)
model_manager.load_models([
"models/CogVideo/CogVideoX-5b/text_encoder",
"models/CogVideo/CogVideoX-5b/transformer",
"models/CogVideo/CogVideoX-5b/vae/diffusion_pytorch_model.safetensors",
])
model_manager.load_lora("models/lora/ExVideo-CogVideoX-LoRA-129f-v1.safetensors")
pipe = CogVideoPipeline.from_model_manager(model_manager)
torch.manual_seed(6)
video = pipe(
prompt="an astronaut riding a horse on Mars.",
height=480, width=720, num_frames=129,
cfg_scale=7.0, num_inference_steps=100,
)
save_video(video, "video_with_lora.mp4", fps=8, quality=5)

View File

@@ -1,64 +0,0 @@
import torch, os, argparse
from safetensors.torch import save_file
def load_pl_state_dict(file_path):
print(f"loading {file_path}")
state_dict = torch.load(file_path, map_location="cpu")
trainable_param_names = set(state_dict["trainable_param_names"])
if "module" in state_dict:
state_dict = state_dict["module"]
if "state_dict" in state_dict:
state_dict = state_dict["state_dict"]
state_dict_ = {}
for name, param in state_dict.items():
if name.startswith("_forward_module."):
name = name[len("_forward_module."):]
if name.startswith("unet."):
name = name[len("unet."):]
if name in trainable_param_names:
state_dict_[name] = param
return state_dict_
def ckpt_to_epochs(ckpt_name):
return int(ckpt_name.split("=")[1].split("-")[0])
def parse_args():
parser = argparse.ArgumentParser(description="Simple example of a training script.")
parser.add_argument(
"--output_path",
type=str,
default="./",
help="Path to save the model.",
)
parser.add_argument(
"--gamma",
type=float,
default=0.9,
help="Gamma in EMA.",
)
args = parser.parse_args()
return args
if __name__ == '__main__':
# args
args = parse_args()
folder = args.output_path
gamma = args.gamma
# EMA
ckpt_list = sorted([(ckpt_to_epochs(ckpt_name), ckpt_name) for ckpt_name in os.listdir(folder) if os.path.isdir(f"{folder}/{ckpt_name}")])
state_dict_ema = None
for epochs, ckpt_name in ckpt_list:
state_dict = load_pl_state_dict(f"{folder}/{ckpt_name}/checkpoint/mp_rank_00_model_states.pt")
if state_dict_ema is None:
state_dict_ema = {name: param.float() for name, param in state_dict.items()}
else:
for name, param in state_dict.items():
state_dict_ema[name] = state_dict_ema[name] * gamma + param.float() * (1 - gamma)
save_path = ckpt_name.replace(".ckpt", "-ema.safetensors")
print(f"save to {folder}/{save_path}")
save_file(state_dict_ema, f"{folder}/{save_path}")

View File

@@ -1,114 +0,0 @@
from diffsynth import save_video, ModelManager, SVDVideoPipeline, HunyuanDiTImagePipeline, download_models
from diffsynth import ModelManager
import torch, os
# The models will be downloaded automatically.
# You can also use the following urls to download them manually.
# Download models (from Huggingface)
# Text-to-image model:
# `models/HunyuanDiT/t2i/clip_text_encoder/pytorch_model.bin`: [link](https://huggingface.co/Tencent-Hunyuan/HunyuanDiT/resolve/main/t2i/clip_text_encoder/pytorch_model.bin)
# `models/HunyuanDiT/t2i/mt5/pytorch_model.bin`: [link](https://huggingface.co/Tencent-Hunyuan/HunyuanDiT/resolve/main/t2i/mt5/pytorch_model.bin)
# `models/HunyuanDiT/t2i/model/pytorch_model_ema.pt`: [link](https://huggingface.co/Tencent-Hunyuan/HunyuanDiT/resolve/main/t2i/model/pytorch_model_ema.pt)
# `models/HunyuanDiT/t2i/sdxl-vae-fp16-fix/diffusion_pytorch_model.bin`: [link](https://huggingface.co/Tencent-Hunyuan/HunyuanDiT/resolve/main/t2i/sdxl-vae-fp16-fix/diffusion_pytorch_model.bin)
# Stable Video Diffusion model:
# `models/stable_video_diffusion/svd_xt.safetensors`: [link](https://huggingface.co/stabilityai/stable-video-diffusion-img2vid-xt/resolve/main/svd_xt.safetensors)
# ExVideo extension blocks:
# `models/stable_video_diffusion/model.fp16.safetensors`: [link](https://huggingface.co/ECNU-CILab/ExVideo-SVD-128f-v1/resolve/main/model.fp16.safetensors)
# Download models (from Modelscope)
# Text-to-image model:
# `models/HunyuanDiT/t2i/clip_text_encoder/pytorch_model.bin`: [link](https://www.modelscope.cn/api/v1/models/modelscope/HunyuanDiT/repo?Revision=master&FilePath=t2i%2Fclip_text_encoder%2Fpytorch_model.bin)
# `models/HunyuanDiT/t2i/mt5/pytorch_model.bin`: [link](https://www.modelscope.cn/api/v1/models/modelscope/HunyuanDiT/repo?Revision=master&FilePath=t2i%2Fmt5%2Fpytorch_model.bin)
# `models/HunyuanDiT/t2i/model/pytorch_model_ema.pt`: [link](https://www.modelscope.cn/api/v1/models/modelscope/HunyuanDiT/repo?Revision=master&FilePath=t2i%2Fmodel%2Fpytorch_model_ema.pt)
# `models/HunyuanDiT/t2i/sdxl-vae-fp16-fix/diffusion_pytorch_model.bin`: [link](https://www.modelscope.cn/api/v1/models/modelscope/HunyuanDiT/repo?Revision=master&FilePath=t2i%2Fsdxl-vae-fp16-fix%2Fdiffusion_pytorch_model.bin)
# Stable Video Diffusion model:
# `models/stable_video_diffusion/svd_xt.safetensors`: [link](https://www.modelscope.cn/api/v1/models/AI-ModelScope/stable-video-diffusion-img2vid-xt/repo?Revision=master&FilePath=svd_xt.safetensors)
# ExVideo extension blocks:
# `models/stable_video_diffusion/model.fp16.safetensors`: [link](https://modelscope.cn/api/v1/models/ECNU-CILab/ExVideo-SVD-128f-v1/repo?Revision=master&FilePath=model.fp16.safetensors)
def generate_image():
# Load models
os.environ["TOKENIZERS_PARALLELISM"] = "True"
download_models(["HunyuanDiT"])
model_manager = ModelManager(torch_dtype=torch.float16, device="cuda",
file_path_list=[
"models/HunyuanDiT/t2i/clip_text_encoder/pytorch_model.bin",
"models/HunyuanDiT/t2i/mt5/pytorch_model.bin",
"models/HunyuanDiT/t2i/model/pytorch_model_ema.pt",
"models/HunyuanDiT/t2i/sdxl-vae-fp16-fix/diffusion_pytorch_model.bin",
])
pipe = HunyuanDiTImagePipeline.from_model_manager(model_manager)
# Generate an image
torch.manual_seed(0)
image = pipe(
prompt="bonfire, on the stone",
negative_prompt="错误的眼睛,糟糕的人脸,毁容,糟糕的艺术,变形,多余的肢体,模糊的颜色,模糊,重复,病态,残缺,",
num_inference_steps=50, height=1024, width=1024,
)
model_manager.to("cpu")
return image
def generate_video(image):
# Load models
download_models(["stable-video-diffusion-img2vid-xt", "ExVideo-SVD-128f-v1"])
model_manager = ModelManager(torch_dtype=torch.float16, device="cuda",
file_path_list=[
"models/stable_video_diffusion/svd_xt.safetensors",
"models/stable_video_diffusion/model.fp16.safetensors",
])
pipe = SVDVideoPipeline.from_model_manager(model_manager)
# Generate a video
torch.manual_seed(1)
video = pipe(
input_image=image.resize((512, 512)),
num_frames=128, fps=30, height=512, width=512,
motion_bucket_id=127,
num_inference_steps=50,
min_cfg_scale=2, max_cfg_scale=2, contrast_enhance_scale=1.2
)
model_manager.to("cpu")
return video
def upscale_video(image, video):
# Load models
download_models(["stable-video-diffusion-img2vid-xt", "ExVideo-SVD-128f-v1"])
model_manager = ModelManager(torch_dtype=torch.float16, device="cuda",
file_path_list=[
"models/stable_video_diffusion/svd_xt.safetensors",
"models/stable_video_diffusion/model.fp16.safetensors",
])
pipe = SVDVideoPipeline.from_model_manager(model_manager)
# Generate a video
torch.manual_seed(2)
video = pipe(
input_image=image.resize((1024, 1024)),
input_video=[frame.resize((1024, 1024)) for frame in video], denoising_strength=0.5,
num_frames=128, fps=30, height=1024, width=1024,
motion_bucket_id=127,
num_inference_steps=25,
min_cfg_scale=2, max_cfg_scale=2, contrast_enhance_scale=1.2
)
model_manager.to("cpu")
return video
# We use Hunyuan DiT to generate the first frame. 10GB VRAM is required.
# If you want to use your own image,
# please use `image = Image.open("your_image_file.png")` to replace the following code.
image = generate_image()
image.save("image.png")
# Now, generate a video with resolution of 512. 20GB VRAM is required.
video = generate_video(image)
save_video(video, "video_512.mp4", fps=30)
# Upscale the video. 52GB VRAM is required.
video = upscale_video(image, video)
save_video(video, "video_1024.mp4", fps=30)

View File

@@ -1,364 +0,0 @@
import torch, json, os, imageio, argparse
from torchvision.transforms import v2
import numpy as np
from einops import rearrange, repeat
import lightning as pl
from diffsynth import ModelManager, SVDImageEncoder, SVDUNet, SVDVAEEncoder, ContinuousODEScheduler, load_state_dict
from diffsynth.pipelines.svd_video import SVDCLIPImageProcessor
from diffsynth.models.svd_unet import TemporalAttentionBlock
class TextVideoDataset(torch.utils.data.Dataset):
def __init__(self, base_path, metadata_path, steps_per_epoch=10000, training_shapes=[(128, 1, 128, 512, 512)]):
with open(metadata_path, "r") as f:
metadata = json.load(f)
self.path = [os.path.join(base_path, i["path"]) for i in metadata]
self.steps_per_epoch = steps_per_epoch
self.training_shapes = training_shapes
self.frame_process = []
for max_num_frames, interval, num_frames, height, width in training_shapes:
self.frame_process.append(v2.Compose([
v2.Resize(size=max(height, width), antialias=True),
v2.CenterCrop(size=(height, width)),
v2.Normalize(mean=[127.5, 127.5, 127.5], std=[127.5, 127.5, 127.5]),
]))
def load_frames_using_imageio(self, file_path, max_num_frames, start_frame_id, interval, num_frames, frame_process):
reader = imageio.get_reader(file_path)
if reader.count_frames() < max_num_frames or reader.count_frames() - 1 < start_frame_id + (num_frames - 1) * interval:
reader.close()
return None
frames = []
for frame_id in range(num_frames):
frame = reader.get_data(start_frame_id + frame_id * interval)
frame = torch.tensor(frame, dtype=torch.float32)
frame = rearrange(frame, "H W C -> 1 C H W")
frame = frame_process(frame)
frames.append(frame)
reader.close()
frames = torch.concat(frames, dim=0)
frames = rearrange(frames, "T C H W -> C T H W")
return frames
def load_video(self, file_path, training_shape_id):
data = {}
max_num_frames, interval, num_frames, height, width = self.training_shapes[training_shape_id]
frame_process = self.frame_process[training_shape_id]
start_frame_id = torch.randint(0, max_num_frames - (num_frames - 1) * interval, (1,))[0]
frames = self.load_frames_using_imageio(file_path, max_num_frames, start_frame_id, interval, num_frames, frame_process)
if frames is None:
return None
else:
data[f"frames_{training_shape_id}"] = frames
return data
def __getitem__(self, index):
video_data = {}
for training_shape_id in range(len(self.training_shapes)):
while True:
data_id = torch.randint(0, len(self.path), (1,))[0]
data_id = (data_id + index) % len(self.path) # For fixed seed.
video_file = self.path[data_id]
try:
data = self.load_video(video_file, training_shape_id)
except:
data = None
if data is not None:
break
video_data.update(data)
return video_data
def __len__(self):
return self.steps_per_epoch
class MotionBucketManager:
def __init__(self):
self.thresholds = [
0.000000000, 0.012205946, 0.015117834, 0.018080613, 0.020614484, 0.021959992, 0.024088068, 0.026323952,
0.028277775, 0.029968588, 0.031836554, 0.033596724, 0.035121530, 0.037200287, 0.038914755, 0.040696491,
0.042368013, 0.044265781, 0.046311017, 0.048243891, 0.050294187, 0.052142400, 0.053634230, 0.055612389,
0.057594258, 0.059410289, 0.061283995, 0.063603796, 0.065192916, 0.067146860, 0.069066539, 0.070390493,
0.072588451, 0.073959745, 0.075889029, 0.077695683, 0.079783581, 0.082162730, 0.084092639, 0.085958421,
0.087700523, 0.089684933, 0.091688842, 0.093335517, 0.094987206, 0.096664011, 0.098314710, 0.100262381,
0.101984538, 0.103404313, 0.105280340, 0.106974818, 0.109028399, 0.111164779, 0.113065213, 0.114362158,
0.116407216, 0.118063427, 0.119524263, 0.121835820, 0.124242283, 0.126202747, 0.128989249, 0.131672353,
0.133417681, 0.135567948, 0.137313649, 0.139189199, 0.140912935, 0.143525436, 0.145718485, 0.148315132,
0.151039496, 0.153218940, 0.155252382, 0.157651082, 0.159966752, 0.162195817, 0.164811596, 0.167341709,
0.170251891, 0.172651157, 0.175550997, 0.178372145, 0.181039348, 0.183565900, 0.186599866, 0.190071866,
0.192574754, 0.195026234, 0.198099136, 0.200210452, 0.202522039, 0.205410406, 0.208610669, 0.211623028,
0.214723110, 0.218520239, 0.222194016, 0.225363150, 0.229384825, 0.233422622, 0.237012610, 0.240735114,
0.243622541, 0.247465774, 0.252190471, 0.257356376, 0.261856794, 0.266556412, 0.271076709, 0.277361482,
0.281250387, 0.286582440, 0.291158527, 0.296712339, 0.303008437, 0.311793238, 0.318485111, 0.326999635,
0.332138240, 0.341770738, 0.354188830, 0.365194678, 0.379234344, 0.401538879, 0.416078776, 0.440871328,
]
def get_motion_score(self, frames):
score = frames.std(dim=2).mean(dim=[1, 2, 3]).tolist()
return score
def get_bucket_id(self, motion_score):
for bucket_id in range(len(self.thresholds) - 1):
if self.thresholds[bucket_id + 1] > motion_score:
return bucket_id
return len(self.thresholds) - 1
def __call__(self, frames):
scores = self.get_motion_score(frames)
bucket_ids = [self.get_bucket_id(score) for score in scores]
return bucket_ids
class LightningModel(pl.LightningModule):
def __init__(self, learning_rate=1e-5, svd_ckpt_path=None, add_positional_conv=128, contrast_enhance_scale=1.01):
super().__init__()
state_dict = load_state_dict(svd_ckpt_path)
self.image_encoder = SVDImageEncoder().to(dtype=torch.float16, device=self.device)
self.image_encoder.load_state_dict(SVDImageEncoder.state_dict_converter().from_civitai(state_dict))
self.image_encoder.eval()
self.image_encoder.requires_grad_(False)
self.unet = SVDUNet(add_positional_conv=add_positional_conv).to(dtype=torch.float16, device=self.device)
self.unet.load_state_dict(SVDUNet.state_dict_converter().from_civitai(state_dict, add_positional_conv=add_positional_conv), strict=False)
self.unet.train()
self.unet.requires_grad_(False)
for block in self.unet.blocks:
if isinstance(block, TemporalAttentionBlock):
block.requires_grad_(True)
self.vae_encoder = SVDVAEEncoder().to(dtype=torch.float16, device=self.device)
self.vae_encoder.load_state_dict(SVDVAEEncoder.state_dict_converter().from_civitai(state_dict))
self.vae_encoder.eval()
self.vae_encoder.requires_grad_(False)
self.noise_scheduler = ContinuousODEScheduler(num_inference_steps=1000)
self.learning_rate = learning_rate
self.motion_bucket_manager = MotionBucketManager()
self.contrast_enhance_scale = contrast_enhance_scale
def encode_image_with_clip(self, image):
image = SVDCLIPImageProcessor().resize_with_antialiasing(image, (224, 224))
image = (image + 1.0) / 2.0
mean = torch.tensor([0.48145466, 0.4578275, 0.40821073]).reshape(1, 3, 1, 1).to(device=self.device, dtype=self.dtype)
std = torch.tensor([0.26862954, 0.26130258, 0.27577711]).reshape(1, 3, 1, 1).to(device=self.device, dtype=self.dtype)
image = (image - mean) / std
image_emb = self.image_encoder(image)
return image_emb
def encode_video_with_vae(self, video):
video = video.to(device=self.device, dtype=self.dtype)
video = video.unsqueeze(0)
latents = self.vae_encoder.encode_video(video)
latents = rearrange(latents[0], "C T H W -> T C H W")
return latents
def tensor2video(self, frames):
frames = rearrange(frames, "C T H W -> T H W C")
frames = ((frames.float() + 1) * 127.5).clip(0, 255).cpu().numpy().astype(np.uint8)
return frames
def calculate_loss(self, frames):
with torch.no_grad():
# Call video encoder
latents = self.encode_video_with_vae(frames)
image_emb_vae = repeat(latents[0] / self.vae_encoder.scaling_factor, "C H W -> T C H W", T=frames.shape[1])
image_emb_clip = self.encode_image_with_clip(frames[:,0].unsqueeze(0))
# Call scheduler
timestep = torch.randint(0, len(self.noise_scheduler.timesteps), (1,))[0]
timestep = self.noise_scheduler.timesteps[timestep]
noise = torch.randn_like(latents)
noisy_latents = self.noise_scheduler.add_noise(latents, noise, timestep)
# Prepare positional id
fps = 30
motion_bucket_id = self.motion_bucket_manager(frames.unsqueeze(0))[0]
noise_aug_strength = 0
add_time_id = torch.tensor([[fps-1, motion_bucket_id, noise_aug_strength]], device=self.device)
# Calculate loss
latents_input = torch.cat([noisy_latents, image_emb_vae], dim=1)
model_pred = self.unet(latents_input, timestep, image_emb_clip, add_time_id, use_gradient_checkpointing=True)
latents_output = self.noise_scheduler.step(model_pred.float(), timestep, noisy_latents.float(), to_final=True)
loss = torch.nn.functional.mse_loss(latents_output, latents.float() * self.contrast_enhance_scale, reduction="mean")
# Re-weighting
reweighted_loss = loss * self.noise_scheduler.training_weight(timestep)
return loss, reweighted_loss
def training_step(self, batch, batch_idx):
# Loss
frames = batch["frames_0"][0]
loss, reweighted_loss = self.calculate_loss(frames)
# Record log
self.log("train_loss", loss, prog_bar=True)
self.log("reweighted_train_loss", reweighted_loss, prog_bar=True)
return reweighted_loss
def configure_optimizers(self):
trainable_modules = []
for block in self.unet.blocks:
if isinstance(block, TemporalAttentionBlock):
trainable_modules += block.parameters()
optimizer = torch.optim.AdamW(trainable_modules, lr=self.learning_rate)
return optimizer
def on_save_checkpoint(self, checkpoint):
trainable_param_names = list(filter(lambda named_param: named_param[1].requires_grad, self.unet.named_parameters()))
trainable_param_names = [named_param[0] for named_param in trainable_param_names]
checkpoint["trainable_param_names"] = trainable_param_names
def parse_args():
parser = argparse.ArgumentParser(description="Simple example of a training script.")
parser.add_argument(
"--pretrained_path",
type=str,
default=None,
required=True,
help="Path to pretrained model. For example, `models/stable_video_diffusion/svd_xt.safetensors`.",
)
parser.add_argument(
"--resume_from_checkpoint",
type=str,
default=None,
required=False,
help="Path to checkpoint, in case your training program is stopped unexpectedly and you want to resume.",
)
parser.add_argument(
"--dataset_path",
type=str,
default=None,
required=True,
help="The path of the Dataset.",
)
parser.add_argument(
"--output_path",
type=str,
default="./",
help="Path to save the model.",
)
parser.add_argument(
"--steps_per_epoch",
type=int,
default=500,
help="Number of steps per epoch.",
)
parser.add_argument(
"--num_frames",
type=int,
default=128,
help="Number of frames.",
)
parser.add_argument(
"--height",
type=int,
default=512,
help="Image height.",
)
parser.add_argument(
"--width",
type=int,
default=512,
help="Image width.",
)
parser.add_argument(
"--dataloader_num_workers",
type=int,
default=2,
help="Number of subprocesses to use for data loading. 0 means that the data will be loaded in the main process.",
)
parser.add_argument(
"--learning_rate",
type=float,
default=1e-5,
help="Learning rate.",
)
parser.add_argument(
"--accumulate_grad_batches",
type=int,
default=1,
help="The number of batches in gradient accumulation.",
)
parser.add_argument(
"--max_epochs",
type=int,
default=1,
help="Number of epochs.",
)
parser.add_argument(
"--contrast_enhance_scale",
type=float,
default=1.01,
help="Avoid generating gray videos.",
)
args = parser.parse_args()
return args
if __name__ == '__main__':
# args
args = parse_args()
# dataset and data loader
dataset = TextVideoDataset(
args.dataset_path,
os.path.join(args.dataset_path, "metadata.json"),
training_shapes=[(args.num_frames, 1, args.num_frames, args.height, args.width)],
steps_per_epoch=args.steps_per_epoch,
)
train_loader = torch.utils.data.DataLoader(
dataset,
shuffle=True,
# We don't support batch_size > 1,
# because sometimes our GPU cannot process even one video.
batch_size=1,
num_workers=args.dataloader_num_workers
)
# model
model = LightningModel(
learning_rate=args.learning_rate,
svd_ckpt_path=args.pretrained_path,
add_positional_conv=args.num_frames,
contrast_enhance_scale=args.contrast_enhance_scale
)
# train
trainer = pl.Trainer(
max_epochs=args.max_epochs,
accelerator="gpu",
devices="auto",
strategy="deepspeed_stage_2",
precision="16-mixed",
default_root_dir=args.output_path,
accumulate_grad_batches=args.accumulate_grad_batches,
callbacks=[pl.pytorch.callbacks.ModelCheckpoint(save_top_k=-1)]
)
trainer.fit(
model=model,
train_dataloaders=train_loader,
ckpt_path=args.resume_from_checkpoint
)

View File

@@ -1,89 +0,0 @@
# ExVideo
ExVideo is a post-tuning technique aimed at enhancing the capability of video generation models. We have extended Stable Video Diffusion to achieve the generation of long videos up to 128 frames.
* [Project Page](https://ecnu-cilab.github.io/ExVideoProjectPage/)
* [Technical report](https://arxiv.org/abs/2406.14130)
* **[New]** Extended models (ExVideo-CogVideoX)
* [HuggingFace](https://huggingface.co/ECNU-CILab/ExVideo-CogVideoX-LoRA-129f-v1)
* [ModelScope](https://modelscope.cn/models/ECNU-CILab/ExVideo-CogVideoX-LoRA-129f-v1)
* Extended models (ExVideo-SVD)
* [HuggingFace](https://huggingface.co/ECNU-CILab/ExVideo-SVD-128f-v1)
* [ModelScope](https://modelscope.cn/models/ECNU-CILab/ExVideo-SVD-128f-v1)
## Example: Text-to-video via extended CogVideoX-5B
Generate a video using CogVideoX-5B and our extension module. See [ExVideo_cogvideox_test.py](./ExVideo_cogvideox_test.py).
https://github.com/user-attachments/assets/321ee04b-8c17-479e-8a95-8cbcf21f8d7e
## Example: Text-to-video via extended Stable Video Diffusion
Generate a video using a text-to-image model and our image-to-video model. See [ExVideo_svd_test.py](./ExVideo_svd_test.py).
https://github.com/modelscope/DiffSynth-Studio/assets/35051019/d97f6aa9-8064-4b5b-9d49-ed6001bb9acc
## Train
* Step 1: Install additional packages
```
pip install lightning deepspeed
```
* Step 2: Download base model (from [HuggingFace](https://huggingface.co/stabilityai/stable-video-diffusion-img2vid-xt/resolve/main/svd_xt.safetensors) or [ModelScope](https://www.modelscope.cn/api/v1/models/AI-ModelScope/stable-video-diffusion-img2vid-xt/repo?Revision=master&FilePath=svd_xt.safetensors)) to `models/stable_video_diffusion/svd_xt.safetensors`.
* Step 3: Prepare datasets
```
path/to/your/dataset
├── metadata.json
└── videos
├── video_1.mp4
├── video_2.mp4
└── video_3.mp4
```
where the `metadata.json` is
```
[
{
"path": "videos/video_1.mp4"
},
{
"path": "videos/video_2.mp4"
},
{
"path": "videos/video_3.mp4"
}
]
```
* Step 4: Run
```
CUDA_VISIBLE_DEVICES="0,1,2,3,4,5,6,7" python -u ExVideo_svd_train.py \
--pretrained_path "models/stable_video_diffusion/svd_xt.safetensors" \
--dataset_path "path/to/your/dataset" \
--output_path "path/to/save/models" \
--steps_per_epoch 8000 \
--num_frames 128 \
--height 512 \
--width 512 \
--dataloader_num_workers 2 \
--learning_rate 1e-5 \
--max_epochs 100
```
* Step 5: Post-process checkpoints
Calculate Exponential Moving Average (EMA) and package it using `safetensors`.
```
python ExVideo_ema.py --output_path "path/to/save/models/lightning_logs/version_xx" --gamma 0.9
```
* Step 6: Enjoy your model
The EMA model is at `path/to/save/models/lightning_logs/version_xx/checkpoints/epoch=xx-step=yyy-ema.safetensors`. Load it in [ExVideo_svd_test.py](./ExVideo_svd_test.py) and then enjoy your model.

View File

@@ -1,33 +0,0 @@
# HunyuanVideo
[HunyuanVideo](https://github.com/Tencent/HunyuanVideo) is a video generation model trained by Tencent. We provide advanced VRAM management for this model, including three stages:
|VRAM required|Example script|Frames|Resolution|Note|
|-|-|-|-|-|
|80G|[hunyuanvideo_80G.py](hunyuanvideo_80G.py)|129|720*1280|No VRAM management.|
|24G|[hunyuanvideo_24G.py](hunyuanvideo_24G.py)|129|720*1280|The video is consistent with the original implementation, but it requires 5%~10% more time than [hunyuanvideo_80G.py](hunyuanvideo_80G.py)|
|6G|[hunyuanvideo_6G.py](hunyuanvideo_6G.py)|129|512*384|The base model doesn't support low resolutions. We recommend users to use some LoRA ([example](https://civitai.com/models/1032126/walking-animation-hunyuan-video)) trained using low resolutions.|
[HunyuanVideo-I2V](https://github.com/Tencent/HunyuanVideo-I2V) is the image-to-video generation version of HunyuanVideo. We also provide advanced VRAM management for this model.
|VRAM required|Example script|Frames|Resolution|Note|
|-|-|-|-|-|
|80G|[hunyuanvideo_i2v_80G.py](hunyuanvideo_i2v_80G.py)|129|720p|No VRAM management.|
|24G|[hunyuanvideo_i2v_24G.py](hunyuanvideo_i2v_24G.py)|129|720p|The video is consistent with the original implementation, but it requires 5%~10% more time than [hunyuanvideo_80G.py](hunyuanvideo_80G.py)|
## Gallery
Video generated by [hunyuanvideo_80G.py](hunyuanvideo_80G.py) and [hunyuanvideo_24G.py](hunyuanvideo_24G.py):
https://github.com/user-attachments/assets/48dd24bb-0cc6-40d2-88c3-10feed3267e9
Video generated by [hunyuanvideo_6G.py](hunyuanvideo_6G.py) using [this LoRA](https://civitai.com/models/1032126/walking-animation-hunyuan-video):
https://github.com/user-attachments/assets/2997f107-d02d-4ecb-89bb-5ce1a7f93817
Video to video generated by [hunyuanvideo_v2v_6G.py](./hunyuanvideo_v2v_6G.py) using [this LoRA](https://civitai.com/models/1032126/walking-animation-hunyuan-video):
https://github.com/user-attachments/assets/4b89e52e-ce42-434e-aa57-08f09dfa2b10
Video generated by [hunyuanvideo_i2v_80G.py](hunyuanvideo_i2v_80G.py) and [hunyuanvideo_i2v_24G.py](hunyuanvideo_i2v_24G.py):
https://github.com/user-attachments/assets/494f252a-c9af-440d-84ba-a8ddcdcc538a

View File

@@ -1,42 +0,0 @@
import torch
torch.cuda.set_per_process_memory_fraction(1.0, 0)
from diffsynth import ModelManager, HunyuanVideoPipeline, download_models, save_video
download_models(["HunyuanVideo"])
model_manager = ModelManager()
# The DiT model is loaded in bfloat16.
model_manager.load_models(
[
"models/HunyuanVideo/transformers/mp_rank_00_model_states.pt"
],
torch_dtype=torch.bfloat16, # you can use torch_dtype=torch.float8_e4m3fn to enable quantization.
device="cpu"
)
# The other modules are loaded in float16.
model_manager.load_models(
[
"models/HunyuanVideo/text_encoder/model.safetensors",
"models/HunyuanVideo/text_encoder_2",
"models/HunyuanVideo/vae/pytorch_model.pt",
],
torch_dtype=torch.float16,
device="cpu"
)
# We support LoRA inference. You can use the following code to load your LoRA model.
# model_manager.load_lora("models/lora/xxx.safetensors", lora_alpha=1.0)
# The computation device is "cuda".
pipe = HunyuanVideoPipeline.from_model_manager(
model_manager,
torch_dtype=torch.bfloat16,
device="cuda"
)
# Enjoy!
prompt = "CG, masterpiece, best quality, solo, long hair, wavy hair, silver hair, blue eyes, blue dress, medium breasts, dress, underwater, air bubble, floating hair, refraction, portrait. The girl's flowing silver hair shimmers with every color of the rainbow and cascades down, merging with the floating flora around her."
video = pipe(prompt, seed=0)
save_video(video, "video_girl.mp4", fps=30, quality=6)

View File

@@ -1,52 +0,0 @@
import torch
torch.cuda.set_per_process_memory_fraction(1.0, 0)
from diffsynth import ModelManager, HunyuanVideoPipeline, download_models, save_video, FlowMatchScheduler, download_customized_models
download_models(["HunyuanVideo"])
model_manager = ModelManager()
# The DiT model is loaded in bfloat16.
model_manager.load_models(
[
"models/HunyuanVideo/transformers/mp_rank_00_model_states.pt"
],
torch_dtype=torch.bfloat16, # you can use torch_dtype=torch.float8_e4m3fn to enable quantization.
device="cpu"
)
# The other modules are loaded in float16.
model_manager.load_models(
[
"models/HunyuanVideo/text_encoder/model.safetensors",
"models/HunyuanVideo/text_encoder_2",
"models/HunyuanVideo/vae/pytorch_model.pt",
],
torch_dtype=torch.float16,
device="cpu"
)
# We support LoRA inference. You can use the following code to load your LoRA model.
# Example LoRA: https://civitai.com/models/1032126/walking-animation-hunyuan-video
download_customized_models(
model_id="AI-ModelScope/walking_animation_hunyuan_video",
origin_file_path="kxsr_walking_anim_v1-5.safetensors",
local_dir="models/lora"
)[0]
model_manager.load_lora("models/lora/kxsr_walking_anim_v1-5.safetensors", lora_alpha=1.0)
# The computation device is "cuda".
pipe = HunyuanVideoPipeline.from_model_manager(
model_manager,
torch_dtype=torch.bfloat16,
device="cuda"
)
# This LoRA requires shift=9.0.
pipe.scheduler = FlowMatchScheduler(shift=9.0, sigma_min=0.0, extra_one_step=True)
# Enjoy!
for clothes_up in ["white t-shirt", "black t-shirt", "orange t-shirt"]:
for clothes_down in ["blue sports skirt", "red sports skirt", "white sports skirt"]:
prompt = f"kxsr, full body, no crop, A 3D-rendered CG animation video featuring a Gorgeous, mature, curvaceous, fair-skinned female girl with long silver hair and blue eyes. She wears a {clothes_up} and a {clothes_down}, walking offering a sense of fluid movement and vivid animation."
video = pipe(prompt, seed=0, height=512, width=384, num_frames=129, num_inference_steps=18, tile_size=(17, 16, 16), tile_stride=(12, 12, 12))
save_video(video, f"video-{clothes_up}-{clothes_down}.mp4", fps=30, quality=6)

View File

@@ -1,45 +0,0 @@
import torch
torch.cuda.set_per_process_memory_fraction(1.0, 0)
from diffsynth import ModelManager, HunyuanVideoPipeline, download_models, save_video
download_models(["HunyuanVideo"])
model_manager = ModelManager()
# The DiT model is loaded in bfloat16.
model_manager.load_models(
[
"models/HunyuanVideo/transformers/mp_rank_00_model_states.pt"
],
torch_dtype=torch.bfloat16, # you can use torch_dtype=torch.float8_e4m3fn to enable quantization.
device="cuda"
)
# The other modules are loaded in float16.
model_manager.load_models(
[
"models/HunyuanVideo/text_encoder/model.safetensors",
"models/HunyuanVideo/text_encoder_2",
"models/HunyuanVideo/vae/pytorch_model.pt",
],
torch_dtype=torch.float16,
device="cuda"
)
# We support LoRA inference. You can use the following code to load your LoRA model.
# model_manager.load_lora("models/lora/xxx.safetensors", lora_alpha=1.0)
# The computation device is "cuda".
pipe = HunyuanVideoPipeline.from_model_manager(
model_manager,
torch_dtype=torch.bfloat16,
device="cuda",
enable_vram_management=False
)
# Although you have enough VRAM, we still recommend you to enable offload.
pipe.enable_cpu_offload()
# Enjoy!
prompt = "CG, masterpiece, best quality, solo, long hair, wavy hair, silver hair, blue eyes, blue dress, medium breasts, dress, underwater, air bubble, floating hair, refraction, portrait. The girl's flowing silver hair shimmers with every color of the rainbow and cascades down, merging with the floating flora around her."
video = pipe(prompt, seed=0)
save_video(video, "video.mp4", fps=30, quality=6)

View File

@@ -1,43 +0,0 @@
import torch
from diffsynth import ModelManager, HunyuanVideoPipeline, download_models, save_video
from modelscope import dataset_snapshot_download
from PIL import Image
download_models(["HunyuanVideoI2V"])
model_manager = ModelManager()
# The DiT model is loaded in bfloat16.
model_manager.load_models(
[
"models/HunyuanVideoI2V/transformers/mp_rank_00_model_states.pt"
],
torch_dtype=torch.bfloat16,
device="cpu"
)
# The other modules are loaded in float16.
model_manager.load_models(
[
"models/HunyuanVideoI2V/text_encoder/model.safetensors",
"models/HunyuanVideoI2V/text_encoder_2",
'models/HunyuanVideoI2V/vae/pytorch_model.pt'
],
torch_dtype=torch.float16,
device="cpu"
)
# The computation device is "cuda".
pipe = HunyuanVideoPipeline.from_model_manager(model_manager,
torch_dtype=torch.bfloat16,
device="cuda",
enable_vram_management=True)
dataset_snapshot_download(dataset_id="DiffSynth-Studio/examples_in_diffsynth",
local_dir="./",
allow_file_pattern=f"data/examples/hunyuanvideo/*")
i2v_resolution = "720p"
prompt = "An Asian man with short hair in black tactical uniform and white clothes waves a firework stick."
images = [Image.open("data/examples/hunyuanvideo/0.jpg").convert('RGB')]
video = pipe(prompt, input_images=images, num_inference_steps=50, seed=0, i2v_resolution=i2v_resolution)
save_video(video, f"video_{i2v_resolution}_low_vram.mp4", fps=30, quality=6)

View File

@@ -1,45 +0,0 @@
import torch
from diffsynth import ModelManager, HunyuanVideoPipeline, download_models, save_video
from modelscope import dataset_snapshot_download
from PIL import Image
download_models(["HunyuanVideoI2V"])
model_manager = ModelManager()
# The DiT model is loaded in bfloat16.
model_manager.load_models(
[
"models/HunyuanVideoI2V/transformers/mp_rank_00_model_states.pt"
],
torch_dtype=torch.bfloat16,
device="cuda"
)
# The other modules are loaded in float16.
model_manager.load_models(
[
"models/HunyuanVideoI2V/text_encoder/model.safetensors",
"models/HunyuanVideoI2V/text_encoder_2",
'models/HunyuanVideoI2V/vae/pytorch_model.pt'
],
torch_dtype=torch.float16,
device="cuda"
)
# The computation device is "cuda".
pipe = HunyuanVideoPipeline.from_model_manager(model_manager,
torch_dtype=torch.bfloat16,
device="cuda",
enable_vram_management=False)
# Although you have enough VRAM, we still recommend you to enable offload.
pipe.enable_cpu_offload()
dataset_snapshot_download(dataset_id="DiffSynth-Studio/examples_in_diffsynth",
local_dir="./",
allow_file_pattern=f"data/examples/hunyuanvideo/*")
i2v_resolution = "720p"
prompt = "An Asian man with short hair in black tactical uniform and white clothes waves a firework stick."
images = [Image.open("data/examples/hunyuanvideo/0.jpg").convert('RGB')]
video = pipe(prompt, input_images=images, num_inference_steps=50, seed=0, i2v_resolution=i2v_resolution)
save_video(video, f"video_{i2v_resolution}.mp4", fps=30, quality=6)

View File

@@ -1,55 +0,0 @@
import torch
torch.cuda.set_per_process_memory_fraction(1.0, 0)
from diffsynth import ModelManager, HunyuanVideoPipeline, download_models, save_video, FlowMatchScheduler, download_customized_models
download_models(["HunyuanVideo"])
model_manager = ModelManager()
# The DiT model is loaded in bfloat16.
model_manager.load_models(
[
"models/HunyuanVideo/transformers/mp_rank_00_model_states.pt"
],
torch_dtype=torch.bfloat16, # you can use torch_dtype=torch.float8_e4m3fn to enable quantization.
device="cpu"
)
# The other modules are loaded in float16.
model_manager.load_models(
[
"models/HunyuanVideo/text_encoder/model.safetensors",
"models/HunyuanVideo/text_encoder_2",
"models/HunyuanVideo/vae/pytorch_model.pt",
],
torch_dtype=torch.float16,
device="cpu"
)
# We support LoRA inference. You can use the following code to load your LoRA model.
# Example LoRA: https://civitai.com/models/1032126/walking-animation-hunyuan-video
download_customized_models(
model_id="AI-ModelScope/walking_animation_hunyuan_video",
origin_file_path="kxsr_walking_anim_v1-5.safetensors",
local_dir="models/lora"
)[0]
model_manager.load_lora("models/lora/kxsr_walking_anim_v1-5.safetensors", lora_alpha=1.0)
# The computation device is "cuda".
pipe = HunyuanVideoPipeline.from_model_manager(
model_manager,
torch_dtype=torch.bfloat16,
device="cuda"
)
# This LoRA requires shift=9.0.
pipe.scheduler = FlowMatchScheduler(shift=9.0, sigma_min=0.0, extra_one_step=True)
# Text-to-video
prompt = f"kxsr, full body, no crop. A girl is walking. CG, masterpiece, best quality, solo, long hair, wavy hair, silver hair, blue eyes, blue dress, medium breasts, dress, underwater, air bubble, floating hair, refraction, portrait. The girl's flowing silver hair shimmers with every color of the rainbow and cascades down, merging with the floating flora around her."
video = pipe(prompt, seed=1, height=512, width=384, num_frames=129, num_inference_steps=18, tile_size=(17, 16, 16), tile_stride=(12, 12, 12))
save_video(video, f"video.mp4", fps=30, quality=6)
# Video-to-video
prompt = f"kxsr, full body, no crop. A girl is walking. CG, masterpiece, best quality, solo, long hair, wavy hair, silver hair, blue eyes, purple dress, medium breasts, dress, underwater, air bubble, floating hair, refraction, portrait. The girl's flowing silver hair shimmers with every color of the rainbow and cascades down, merging with the floating flora around her."
video = pipe(prompt, seed=1, height=512, width=384, num_frames=129, num_inference_steps=18, tile_size=(17, 16, 16), tile_stride=(12, 12, 12), input_video=video, denoising_strength=0.85)
save_video(video, f"video_edited.mp4", fps=30, quality=6)

View File

@@ -1,7 +0,0 @@
# InfiniteYou: Flexible Photo Recrafting While Preserving Your Identity
We support the identity preserving feature of InfiniteYou. See [./infiniteyou.py](./infiniteyou.py) for example. The visualization of the result is shown below.
|Identity Image|Generated Image|
|-|-|
|![man_id](https://github.com/user-attachments/assets/bbc38a91-966e-49e8-a0d7-c5467582ad1f)|![man](https://github.com/user-attachments/assets/0decd5e1-5f65-437c-98fa-90991b6f23c1)|
|![woman_id](https://github.com/user-attachments/assets/b2894695-690e-465b-929c-61e5dc57feeb)|![woman](https://github.com/user-attachments/assets/67cc7496-c4d3-4de1-a8f1-9eb4991d95e8)|

View File

@@ -1,58 +0,0 @@
import importlib
import torch
from diffsynth import ModelManager, FluxImagePipeline, download_models, ControlNetConfigUnit
from modelscope import dataset_snapshot_download
from PIL import Image
if importlib.util.find_spec("facexlib") is None:
raise ImportError("You are using InifiniteYou. It depends on facexlib, which is not installed. Please install it with `pip install facexlib`.")
if importlib.util.find_spec("insightface") is None:
raise ImportError("You are using InifiniteYou. It depends on insightface, which is not installed. Please install it with `pip install insightface`.")
download_models(["InfiniteYou"])
model_manager = ModelManager(torch_dtype=torch.bfloat16, device="cuda", model_id_list=["FLUX.1-dev"])
model_manager.load_models([
[
"models/InfiniteYou/InfuseNetModel/diffusion_pytorch_model-00001-of-00002.safetensors",
"models/InfiniteYou/InfuseNetModel/diffusion_pytorch_model-00002-of-00002.safetensors"
],
"models/InfiniteYou/image_proj_model.bin",
])
pipe = FluxImagePipeline.from_model_manager(
model_manager,
controlnet_config_units=[
ControlNetConfigUnit(
processor_id="none",
model_path=[
'models/InfiniteYou/InfuseNetModel/diffusion_pytorch_model-00001-of-00002.safetensors',
'models/InfiniteYou/InfuseNetModel/diffusion_pytorch_model-00002-of-00002.safetensors'
],
scale=1.0
)
]
)
dataset_snapshot_download(dataset_id="DiffSynth-Studio/examples_in_diffsynth", local_dir="./", allow_file_pattern=f"data/examples/infiniteyou/*")
prompt = "A man, portrait, cinematic"
id_image = "data/examples/infiniteyou/man.jpg"
id_image = Image.open(id_image).convert('RGB')
image = pipe(
prompt=prompt, seed=1,
infinityou_id_image=id_image, infinityou_guidance=1.0,
num_inference_steps=50, embedded_guidance=3.5,
height=1024, width=1024,
)
image.save("man.jpg")
prompt = "A woman, portrait, cinematic"
id_image = "data/examples/infiniteyou/woman.jpg"
id_image = Image.open(id_image).convert('RGB')
image = pipe(
prompt=prompt, seed=1,
infinityou_id_image=id_image, infinityou_guidance=1.0,
num_inference_steps=50, embedded_guidance=3.5,
height=1024, width=1024,
)
image.save("woman.jpg")

View File

@@ -1,44 +0,0 @@
# IP-Adapter
IP-Adapter is a interesting model, which can adopt the content or style of another image to generate a new image.
## Example: Content Controlling in Stable Diffusion
Based on Stable Diffusion, we can transfer the object to another scene. See [`sd_ipadapter.py`](./sd_ipadapter.py).
|First, we generate a car. The prompt is "masterpiece, best quality, a car".|Next, utilizing IP-Adapter, we move the car to the road. The prompt is "masterpiece, best quality, a car running on the road".|
|-|-|
|![car](https://github.com/modelscope/DiffSynth-Studio/assets/35051019/8530a2f0-f610-4269-a22c-ac6c2f21fc18)|![car_on_the_road](https://github.com/modelscope/DiffSynth-Studio/assets/35051019/b8ccddb2-c423-46d8-bd1a-327fcc074a36)|
## Example: Content and Style Controlling in Stable Diffusion XL
The IP-Adapter model based on Stable Diffusion XL is more powerful. You have the option to use the content or style. See [`sdxl_ipadapter.py`](./sdxl_ipadapter.py).
* Content controlling (original usage of IP-Adapter)
|First, we generate a rabbit.|Next, enable IP-Adapter and let the rabbit jump.|For comparison, disable IP-Adapter to see the generated image.|
|-|-|-|
|![rabbit](https://github.com/modelscope/DiffSynth-Studio/assets/35051019/4b452634-ec57-414f-897a-f8c50c74a650)|![rabbit_to_jumping_rabbit](https://github.com/modelscope/DiffSynth-Studio/assets/35051019/b93c5495-0b77-4d97-bcd3-3942858288f2)|![rabbit_to_jumping_rabbit_without_ipa](https://github.com/modelscope/DiffSynth-Studio/assets/35051019/52f37195-65b3-4a38-8d9b-73df37311c15)|
* Style controlling (InstantStyle)
|First, we generate a rabbit.|Next, enable InstantStyle and convert the rabbit to a cat.|For comparison, disable IP-Adapter to see the generated image.|
|-|-|-|
|![rabbit](https://github.com/modelscope/DiffSynth-Studio/assets/35051019/4b452634-ec57-414f-897a-f8c50c74a650)|![rabbit_to_cat](https://github.com/modelscope/DiffSynth-Studio/assets/35051019/a006b281-f643-4ea9-b0da-712289c96059)|![rabbit_to_cat_without_ipa](https://github.com/modelscope/DiffSynth-Studio/assets/35051019/189bd11e-7a10-4c09-8554-0eebde9150fd)|
## Example: Image Fusing (Experimental)
Since IP-Adapter can control the content based on more than one image, we can do something interesting. See [`sdxl_ipadapter_multi_reference.py`](sdxl_ipadapter_multi_reference.py).
We have two pokemons here:
|Charizard|Pikachu|
|-|-|
|![](https://media.52poke.com/wiki/7/7e/006Charizard.png)|![](https://media.52poke.com/wiki/0/0d/025Pikachu.png)|
Fuse!
|Pikazard ???|
|-|
|![Pikazard](https://github.com/modelscope/DiffSynth-Studio/assets/35051019/807cdb31-94f5-4cc2-a978-3c6a7ffedc5b)|

View File

@@ -1,38 +0,0 @@
from diffsynth import ModelManager, download_models, FluxImagePipeline
import torch
# Download models (automatically)
# `models/IpAdapter/InstantX/FLUX.1-dev-IP-Adapter/ip-adapter.bin`: [link](https://huggingface.co/InstantX/FLUX.1-dev-IP-Adapter/blob/main/ip-adapter.bin)
# `models/IpAdapter/InstantX/FLUX.1-dev-IP-Adapter/image_encoder`: [link](https://huggingface.co/google/siglip-so400m-patch14-384)
download_models(["InstantX/FLUX.1-dev-IP-Adapter", "FLUX.1-dev"])
# Load models
model_manager = ModelManager(torch_dtype=torch.bfloat16, device="cuda")
model_manager.load_models([
"models/IpAdapter/InstantX/FLUX.1-dev-IP-Adapter/ip-adapter.bin",
"models/IpAdapter/InstantX/FLUX.1-dev-IP-Adapter/image_encoder",
"models/FLUX/FLUX.1-dev/text_encoder/model.safetensors",
"models/FLUX/FLUX.1-dev/text_encoder_2",
"models/FLUX/FLUX.1-dev/ae.safetensors",
"models/FLUX/FLUX.1-dev/flux1-dev.safetensors",
])
seed = 42
pipe = FluxImagePipeline.from_model_manager(model_manager)
torch.manual_seed(seed)
origin_prompt = "a rabbit in a garden, colorful flowers"
image = pipe(
prompt=origin_prompt,
cfg_scale=1.0, embedded_guidance=3.5,
height=1280, width=960, num_inference_steps=30
)
image.save("style image.jpg")
torch.manual_seed(seed)
image = pipe(
prompt="A piggy",
cfg_scale=1.0, embedded_guidance=3.5,
height=1280, width=960, num_inference_steps=30,
ipadapter_images=[image], ipadapter_scale=0.7
)
image.save("A piggy.jpg")

View File

@@ -1,38 +0,0 @@
from diffsynth import ModelManager, SDImagePipeline, download_models
import torch
# Download models (automatically)
# `models/stable_diffusion/aingdiffusion_v12.safetensors`: [link](https://civitai.com/api/download/models/229575?type=Model&format=SafeTensor&size=full&fp=fp16)
# `models/IpAdapter/stable_diffusion/image_encoder/model.safetensors`: [link](https://huggingface.co/h94/IP-Adapter/resolve/main/models/image_encoder/model.safetensors)
# `models/IpAdapter/stable_diffusion/ip-adapter_sd15.bin`: [link](https://huggingface.co/h94/IP-Adapter/resolve/main/models/ip-adapter_sd15.bin)
# `models/textual_inversion/verybadimagenegative_v1.3.pt`: [link](https://civitai.com/api/download/models/25820?type=Model&format=PickleTensor&size=full&fp=fp16)
download_models(["AingDiffusion_v12", "IP-Adapter-SD", "TextualInversion_VeryBadImageNegative_v1.3"])
# Load models
model_manager = ModelManager(torch_dtype=torch.float16, device="cuda")
model_manager.load_models([
"models/stable_diffusion/aingdiffusion_v12.safetensors",
"models/IpAdapter/stable_diffusion/image_encoder/model.safetensors",
"models/IpAdapter/stable_diffusion/ip-adapter_sd15.bin"
])
pipe = SDImagePipeline.from_model_manager(model_manager)
pipe.prompter.load_textual_inversions(["models/textual_inversion/verybadimagenegative_v1.3.pt"])
torch.manual_seed(1)
style_image = pipe(
prompt="masterpiece, best quality, a car",
negative_prompt="verybadimagenegative_v1.3",
cfg_scale=7, clip_skip=2,
height=512, width=512, num_inference_steps=50,
)
style_image.save("car.jpg")
image = pipe(
prompt="masterpiece, best quality, a car running on the road",
negative_prompt="verybadimagenegative_v1.3",
cfg_scale=7, clip_skip=2,
height=512, width=512, num_inference_steps=50,
ipadapter_images=[style_image], ipadapter_scale=1.0
)
image.save("car_on_the_road.jpg")

View File

@@ -1,61 +0,0 @@
from diffsynth import ModelManager, SDXLImagePipeline, download_models
import torch
# Download models (automatically)
# `models/stable_diffusion_xl/sd_xl_base_1.0.safetensors`: [link](https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0/resolve/main/sd_xl_base_1.0.safetensors)
# `models/IpAdapter/stable_diffusion_xl/image_encoder/model.safetensors`: [link](https://huggingface.co/h94/IP-Adapter/resolve/main/sdxl_models/image_encoder/model.safetensors)
# `models/IpAdapter/stable_diffusion_xl/ip-adapter_sdxl.bin`: [link](https://huggingface.co/h94/IP-Adapter/resolve/main/sdxl_models/ip-adapter_sdxl.safetensors)
download_models(["StableDiffusionXL_v1", "IP-Adapter-SDXL"])
# Load models
model_manager = ModelManager(torch_dtype=torch.float16, device="cuda")
model_manager.load_models([
"models/stable_diffusion_xl/sd_xl_base_1.0.safetensors",
"models/IpAdapter/stable_diffusion_xl/image_encoder/model.safetensors",
"models/IpAdapter/stable_diffusion_xl/ip-adapter_sdxl.bin"
])
pipe = SDXLImagePipeline.from_model_manager(model_manager)
torch.manual_seed(123456)
style_image = pipe(
prompt="a rabbit in a garden, colorful flowers",
negative_prompt="anime, cartoon, graphic, text, painting, crayon, graphite, abstract, glitch, deformed, mutated, ugly, disfigured",
cfg_scale=5,
height=1024, width=1024, num_inference_steps=50,
)
style_image.save("rabbit.jpg")
image = pipe(
prompt="a cat",
negative_prompt="",
cfg_scale=5,
height=1024, width=1024, num_inference_steps=50,
ipadapter_images=[style_image], ipadapter_use_instant_style=True
)
image.save("rabbit_to_cat.jpg")
image = pipe(
prompt="a rabbit is jumping",
negative_prompt="",
cfg_scale=5,
height=1024, width=1024, num_inference_steps=50,
ipadapter_images=[style_image], ipadapter_use_instant_style=False, ipadapter_scale=0.5
)
image.save("rabbit_to_jumping_rabbit.jpg")
image = pipe(
prompt="a cat",
negative_prompt="",
cfg_scale=5,
height=1024, width=1024, num_inference_steps=50,
)
image.save("rabbit_to_cat_without_ipa.jpg")
image = pipe(
prompt="a rabbit is jumping",
negative_prompt="",
cfg_scale=5,
height=1024, width=1024, num_inference_steps=50,
)
image.save("rabbit_to_jumping_rabbit_without_ipa.jpg")

View File

@@ -1,34 +0,0 @@
from diffsynth import ModelManager, SDXLImagePipeline, download_models
import torch, requests
from PIL import Image
# Download models (automatically)
# `models/stable_diffusion_xl/bluePencilXL_v200.safetensors`: [link](https://civitai.com/api/download/models/245614?type=Model&format=SafeTensor&size=pruned&fp=fp16)
# `models/IpAdapter/stable_diffusion_xl/image_encoder/model.safetensors`: [link](https://huggingface.co/h94/IP-Adapter/resolve/main/sdxl_models/image_encoder/model.safetensors)
# `models/IpAdapter/stable_diffusion_xl/ip-adapter_sdxl.bin`: [link](https://huggingface.co/h94/IP-Adapter/resolve/main/sdxl_models/ip-adapter_sdxl.safetensors)
download_models(["BluePencilXL_v200", "IP-Adapter-SDXL"])
# Load models
model_manager = ModelManager(torch_dtype=torch.float16, device="cuda")
model_manager.load_models([
"models/stable_diffusion_xl/bluePencilXL_v200.safetensors",
"models/IpAdapter/stable_diffusion_xl/image_encoder/model.safetensors",
"models/IpAdapter/stable_diffusion_xl/ip-adapter_sdxl.bin"
])
pipe = SDXLImagePipeline.from_model_manager(model_manager)
image_1 = Image.open(requests.get("https://media.52poke.com/wiki/7/7e/006Charizard.png", stream=True).raw).convert("RGB").resize((1024, 1024))
image_1.save("Charizard.jpg")
image_2 = Image.open(requests.get("https://media.52poke.com/wiki/0/0d/025Pikachu.png", stream=True).raw).convert("RGB").resize((1024, 1024))
image_2.save("Pikachu.jpg")
torch.manual_seed(0)
image = pipe(
prompt="a pokemon, maybe Charizard, maybe Pikachu",
negative_prompt="text, watermark, lowres, low quality, worst quality, deformed, glitch, low contrast, noisy, saturation, blurry",
cfg_scale=5,
height=1024, width=1024, num_inference_steps=50,
ipadapter_images=[image_1, image_2], ipadapter_use_instant_style=False, ipadapter_scale=0.7
)
image.save(f"Pikazard.jpg")

View File

@@ -1,34 +0,0 @@
# TeaCache
TeaCache ([Timestep Embedding Aware Cache](https://github.com/ali-vilab/TeaCache)) is a training-free caching approach that estimates and leverages the fluctuating differences among model outputs across timesteps, thereby accelerating the inference.
## Examples
### FLUX
Script: [./flux_teacache.py](./flux_teacache.py)
Model: FLUX.1-dev
Steps: 50
GPU: A100
|TeaCache is disabled|tea_cache_l1_thresh=0.2|tea_cache_l1_thresh=0.8|
|-|-|-|
|23s|13s|5s|
|![image_None](https://github.com/user-attachments/assets/2bf5187a-9693-44d3-9ebb-6c33cd15443f)|![image_0 2](https://github.com/user-attachments/assets/5532ba94-c7e2-446e-a9ba-1c68c0f63350)|![image_0 8](https://github.com/user-attachments/assets/d8cfdd74-8b45-4048-b1b7-ce480aa23fa1)
### Hunyuan Video
Script: [./hunyuanvideo_teacache.py](./hunyuanvideo_teacache.py)
Model: Hunyuan Video
Steps: 30
GPU: A100
The following video was generated using TeaCache. It is nearly identical to [the video without TeaCache enabled](https://github.com/user-attachments/assets/48dd24bb-0cc6-40d2-88c3-10feed3267e9), but with double the speed.
https://github.com/user-attachments/assets/cd9801c5-88ce-4efc-b055-2c7737166f34

View File

@@ -1,15 +0,0 @@
import torch
from diffsynth import ModelManager, FluxImagePipeline
model_manager = ModelManager(torch_dtype=torch.bfloat16, device="cuda", model_id_list=["FLUX.1-dev"])
pipe = FluxImagePipeline.from_model_manager(model_manager)
prompt = "CG, masterpiece, best quality, solo, long hair, wavy hair, silver hair, blue eyes, blue dress, medium breasts, dress, underwater, air bubble, floating hair, refraction, portrait. The girl's flowing silver hair shimmers with every color of the rainbow and cascades down, merging with the floating flora around her."
for tea_cache_l1_thresh in [None, 0.2, 0.4, 0.6, 0.8]:
image = pipe(
prompt=prompt, embedded_guidance=3.5, seed=0,
num_inference_steps=50, tea_cache_l1_thresh=tea_cache_l1_thresh
)
image.save(f"image_{tea_cache_l1_thresh}.png")

View File

@@ -1,42 +0,0 @@
import torch
torch.cuda.set_per_process_memory_fraction(1.0, 0)
from diffsynth import ModelManager, HunyuanVideoPipeline, download_models, save_video
download_models(["HunyuanVideo"])
model_manager = ModelManager()
# The DiT model is loaded in bfloat16.
model_manager.load_models(
[
"models/HunyuanVideo/transformers/mp_rank_00_model_states.pt"
],
torch_dtype=torch.bfloat16, # you can use torch_dtype=torch.float8_e4m3fn to enable quantization.
device="cpu"
)
# The other modules are loaded in float16.
model_manager.load_models(
[
"models/HunyuanVideo/text_encoder/model.safetensors",
"models/HunyuanVideo/text_encoder_2",
"models/HunyuanVideo/vae/pytorch_model.pt",
],
torch_dtype=torch.float16,
device="cpu"
)
# We support LoRA inference. You can use the following code to load your LoRA model.
# model_manager.load_lora("models/lora/xxx.safetensors", lora_alpha=1.0)
# The computation device is "cuda".
pipe = HunyuanVideoPipeline.from_model_manager(
model_manager,
torch_dtype=torch.bfloat16,
device="cuda"
)
# Enjoy!
prompt = "CG, masterpiece, best quality, solo, long hair, wavy hair, silver hair, blue eyes, blue dress, medium breasts, dress, underwater, air bubble, floating hair, refraction, portrait. The girl's flowing silver hair shimmers with every color of the rainbow and cascades down, merging with the floating flora around her."
video = pipe(prompt, seed=0, tea_cache_l1_thresh=0.15)
save_video(video, "video_girl.mp4", fps=30, quality=6)

View File

@@ -0,0 +1,43 @@
import re, os
def read_file(path):
with open(path, "r", encoding="utf-8-sig") as f:
context = f.read()
return context
def get_files(files, path):
if os.path.isdir(path):
for folder in os.listdir(path):
get_files(files, os.path.join(path, folder))
elif path.endswith(".md"):
files.append(path)
def fix_path(doc_root_path):
files = []
get_files(files, doc_root_path)
file_map = {}
for file in files:
name = file.split("/")[-1]
file_map[name] = "/" + file
pattern = re.compile(r'\]\([^)]*\.md')
for file in files:
context = read_file(file)
matches = pattern.findall(context)
edited = False
for match in matches:
target = "](" + file_map[match.split("/")[-1].replace("](", "")]
context = context.replace(match, target)
if target != match:
print(match, target)
edited = True
print(file, match, target)
if edited:
with open(file, "w", encoding="utf-8") as f:
f.write(context)
fix_path("doc/zh")
fix_path("doc/en")

View File

@@ -0,0 +1,114 @@
import os, shutil, multiprocessing, time
NUM_GPUS = 7
def script_is_processed(output_path, script):
return os.path.exists(os.path.join(output_path, script)) and "log.txt" in os.listdir(os.path.join(output_path, script))
def filter_unprocessed_tasks(script_path):
tasks = []
output_path = os.path.join("data", script_path)
for script in sorted(os.listdir(script_path)):
if not script.endswith(".sh") and not script.endswith(".py"):
continue
if script_is_processed(output_path, script):
continue
tasks.append(script)
return tasks
def run_inference(script_path):
tasks = filter_unprocessed_tasks(script_path)
output_path = os.path.join("data", script_path)
for script in tasks:
source_path = os.path.join(script_path, script)
target_path = os.path.join(output_path, script)
os.makedirs(target_path, exist_ok=True)
cmd = f"python {source_path} > {target_path}/log.txt 2>&1"
print(cmd, flush=True)
os.system(cmd)
for file_name in os.listdir("./"):
if file_name.endswith(".jpg") or file_name.endswith(".png") or file_name.endswith(".mp4"):
shutil.move(file_name, os.path.join(target_path, file_name))
def run_tasks_on_single_GPU(script_path, tasks, gpu_id, num_gpu):
output_path = os.path.join("data", script_path)
for script_id, script in enumerate(tasks):
if script_id % num_gpu != gpu_id:
continue
source_path = os.path.join(script_path, script)
target_path = os.path.join(output_path, script)
os.makedirs(target_path, exist_ok=True)
if script.endswith(".sh"):
cmd = f"CUDA_VISIBLE_DEVICES={gpu_id} bash {source_path} > {target_path}/log.txt 2>&1"
elif script.endswith(".py"):
cmd = f"CUDA_VISIBLE_DEVICES={gpu_id} python {source_path} > {target_path}/log.txt 2>&1"
print(cmd, flush=True)
os.system(cmd)
def run_train_multi_GPU(script_path):
tasks = filter_unprocessed_tasks(script_path)
output_path = os.path.join("data", script_path)
for script in tasks:
source_path = os.path.join(script_path, script)
target_path = os.path.join(output_path, script)
os.makedirs(target_path, exist_ok=True)
cmd = f"bash {source_path} > {target_path}/log.txt 2>&1"
print(cmd, flush=True)
os.system(cmd)
time.sleep(1)
def run_train_single_GPU(script_path):
tasks = filter_unprocessed_tasks(script_path)
processes = [multiprocessing.Process(target=run_tasks_on_single_GPU, args=(script_path, tasks, i, NUM_GPUS)) for i in range(NUM_GPUS)]
for p in processes:
p.start()
for p in processes:
p.join()
def move_files(prefix, target_folder):
os.makedirs(target_folder, exist_ok=True)
os.system(f"cp -r {prefix}* {target_folder}")
os.system(f"rm -rf {prefix}*")
def test_qwen_image():
run_inference("examples/qwen_image/model_inference")
run_inference("examples/qwen_image/model_inference_low_vram")
run_train_multi_GPU("examples/qwen_image/model_training/full")
run_inference("examples/qwen_image/model_training/validate_full")
run_train_single_GPU("examples/qwen_image/model_training/lora")
run_inference("examples/qwen_image/model_training/validate_lora")
def test_wan():
run_train_single_GPU("examples/wanvideo/model_inference")
move_files("video_", "data/output/model_inference")
run_train_single_GPU("examples/wanvideo/model_inference_low_vram")
move_files("video_", "data/output/model_inference_low_vram")
run_train_multi_GPU("examples/wanvideo/model_training/full")
run_train_single_GPU("examples/wanvideo/model_training/validate_full")
move_files("video_", "data/output/validate_full")
run_train_single_GPU("examples/wanvideo/model_training/lora")
run_train_single_GPU("examples/wanvideo/model_training/validate_lora")
move_files("video_", "data/output/validate_lora")
def test_flux():
run_inference("examples/flux/model_inference")
run_inference("examples/flux/model_inference_low_vram")
run_train_multi_GPU("examples/flux/model_training/full")
run_inference("examples/flux/model_training/validate_full")
run_train_single_GPU("examples/flux/model_training/lora")
run_inference("examples/flux/model_training/validate_lora")
if __name__ == "__main__":
test_qwen_image()
test_flux()
test_wan()

View File

@@ -1,7 +0,0 @@
# DiffSynth
DiffSynth is the initial version of our video synthesis framework. In this framework, you can apply video deflickering algorithms to the latent space of diffusion models. You can refer to the [original repo](https://github.com/alibaba/EasyNLP/tree/master/diffusion/DiffSynth) for more details.
We provide an example for video stylization. In this pipeline, the rendered video is completely different from the original video, thus we need a powerful deflickering algorithm. We use FastBlend to implement the deflickering module. Please see [`sd_video_rerender.py`](./sd_video_rerender.py).
https://github.com/Artiprocher/DiffSynth-Studio/assets/35051019/59fb2f7b-8de0-4481-b79f-0c3a7361a1ea

View File

@@ -1,64 +0,0 @@
from diffsynth import ModelManager, SDVideoPipeline, ControlNetConfigUnit, VideoData, save_video, download_models
from diffsynth.processors.FastBlend import FastBlendSmoother
from diffsynth.processors.PILEditor import ContrastEditor, SharpnessEditor
from diffsynth.processors.sequencial_processor import SequencialProcessor
import torch
# Download models (automatically)
# `models/stable_diffusion/dreamshaper_8.safetensors`: [link](https://civitai.com/api/download/models/128713?type=Model&format=SafeTensor&size=pruned&fp=fp16)
# `models/ControlNet/control_v11f1p_sd15_depth.pth`: [link](https://huggingface.co/lllyasviel/ControlNet-v1-1/resolve/main/control_v11f1p_sd15_depth.pth)
# `models/ControlNet/control_v11p_sd15_softedge.pth`: [link](https://huggingface.co/lllyasviel/ControlNet-v1-1/resolve/main/control_v11p_sd15_softedge.pth)
# `models/Annotators/dpt_hybrid-midas-501f0c75.pt`: [link](https://huggingface.co/lllyasviel/Annotators/resolve/main/dpt_hybrid-midas-501f0c75.pt)
# `models/Annotators/ControlNetHED.pth`: [link](https://huggingface.co/lllyasviel/Annotators/resolve/main/ControlNetHED.pth)
download_models([
"ControlNet_v11f1p_sd15_depth",
"ControlNet_v11p_sd15_softedge",
"DreamShaper_8"
])
# Load models
model_manager = ModelManager(
torch_dtype=torch.float16, device="cuda",
file_path_list=[
"models/stable_diffusion/dreamshaper_8.safetensors",
"models/ControlNet/control_v11f1p_sd15_depth.pth",
"models/ControlNet/control_v11p_sd15_softedge.pth",
]
)
pipe = SDVideoPipeline.from_model_manager(
model_manager,
[
ControlNetConfigUnit(
processor_id="depth",
model_path=rf"models/ControlNet/control_v11f1p_sd15_depth.pth",
scale=0.5
),
ControlNetConfigUnit(
processor_id="softedge",
model_path=rf"models/ControlNet/control_v11p_sd15_softedge.pth",
scale=0.5
)
]
)
smoother = SequencialProcessor([FastBlendSmoother(), ContrastEditor(rate=1.1), SharpnessEditor(rate=1.1)])
# Load video
# Original video: https://pixabay.com/videos/flow-rocks-water-fluent-stones-159627/
video = VideoData(video_file="data/examples/pixabay100/159627 (1080p).mp4", height=512, width=768)
input_video = [video[i] for i in range(128)]
# Rerender
torch.manual_seed(0)
output_video = pipe(
prompt="winter, ice, snow, water, river",
negative_prompt="", cfg_scale=7,
input_frames=input_video, controlnet_frames=input_video, num_frames=len(input_video),
num_inference_steps=20, height=512, width=768,
animatediff_batch_size=8, animatediff_stride=4, unet_batch_size=8,
cross_frame_attention=True,
smoother=smoother, smoother_progress_ids=[4, 9, 14, 19]
)
# Save images and video
save_video(output_video, "output_video.mp4", fps=30)

View File

@@ -1,395 +0,0 @@
# FLUX
[切换到中文](./README_zh.md)
FLUX is a series of image generation models open-sourced by Black-Forest-Labs.
**DiffSynth-Studio has introduced a new inference and training framework. If you need to use the old version, please click [here](https://github.com/modelscope/DiffSynth-Studio/tree/3edf3583b1f08944cee837b94d9f84d669c2729c).**
## Installation
Before using these models, please install DiffSynth-Studio from source code:
```shell
git clone https://github.com/modelscope/DiffSynth-Studio.git
cd DiffSynth-Studio
pip install -e .
```
## Quick Start
You can quickly load the [black-forest-labs/FLUX.1-dev](https://www.modelscope.cn/models/black-forest-labs/FLUX.1-dev ) model and run inference by executing the code below.
```python
import torch
from diffsynth.pipelines.flux_image_new import FluxImagePipeline, ModelConfig
pipe = FluxImagePipeline.from_pretrained(
torch_dtype=torch.bfloat16,
device="cuda",
model_configs=[
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="flux1-dev.safetensors"),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="text_encoder/model.safetensors"),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="text_encoder_2/"),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="ae.safetensors"),
],
)
image = pipe(prompt="a cat", seed=0)
image.save("image.jpg")
```
## Model Overview
|Model ID|Extra Args|Inference|Low VRAM Inference|Full Training|Validation after Full Training|LoRA Training|Validation after LoRA Training|
|-|-|-|-|-|-|-|-|
|[FLUX.1-dev](https://www.modelscope.cn/models/black-forest-labs/FLUX.1-dev)||[code](./model_inference/FLUX.1-dev.py)|[code](./model_inference_low_vram/FLUX.1-dev.py)|[code](./model_training/full/FLUX.1-dev.sh)|[code](./model_training/validate_full/FLUX.1-dev.py)|[code](./model_training/lora/FLUX.1-dev.sh)|[code](./model_training/validate_lora/FLUX.1-dev.py)|
|[FLUX.1-Krea-dev](https://www.modelscope.cn/models/black-forest-labs/FLUX.1-Krea-dev)||[code](./model_inference/FLUX.1-Krea-dev.py)|[code](./model_inference_low_vram/FLUX.1-Krea-dev.py)|[code](./model_training/full/FLUX.1-Krea-dev.sh)|[code](./model_training/validate_full/FLUX.1-Krea-dev.py)|[code](./model_training/lora/FLUX.1-Krea-dev.sh)|[code](./model_training/validate_lora/FLUX.1-Krea-dev.py)|
|[FLUX.1-Kontext-dev](https://www.modelscope.cn/models/black-forest-labs/FLUX.1-Kontext-dev)|`kontext_images`|[code](./model_inference/FLUX.1-Kontext-dev.py)|[code](./model_inference_low_vram/FLUX.1-Kontext-dev.py)|[code](./model_training/full/FLUX.1-Kontext-dev.sh)|[code](./model_training/validate_full/FLUX.1-Kontext-dev.py)|[code](./model_training/lora/FLUX.1-Kontext-dev.sh)|[code](./model_training/validate_lora/FLUX.1-Kontext-dev.py)|
|[FLUX.1-dev-Controlnet-Inpainting-Beta](https://www.modelscope.cn/models/alimama-creative/FLUX.1-dev-Controlnet-Inpainting-Beta)|`controlnet_inputs`|[code](./model_inference/FLUX.1-dev-Controlnet-Inpainting-Beta.py)|[code](./model_inference_low_vram/FLUX.1-dev-Controlnet-Inpainting-Beta.py)|[code](./model_training/full/FLUX.1-dev-Controlnet-Inpainting-Beta.sh)|[code](./model_training/validate_full/FLUX.1-dev-Controlnet-Inpainting-Beta.py)|[code](./model_training/lora/FLUX.1-dev-Controlnet-Inpainting-Beta.sh)|[code](./model_training/validate_lora/FLUX.1-dev-Controlnet-Inpainting-Beta.py)|
|[FLUX.1-dev-Controlnet-Union-alpha](https://www.modelscope.cn/models/InstantX/FLUX.1-dev-Controlnet-Union-alpha)|`controlnet_inputs`|[code](./model_inference/FLUX.1-dev-Controlnet-Union-alpha.py)|[code](./model_inference_low_vram/FLUX.1-dev-Controlnet-Union-alpha.py)|[code](./model_training/full/FLUX.1-dev-Controlnet-Union-alpha.sh)|[code](./model_training/validate_full/FLUX.1-dev-Controlnet-Union-alpha.py)|[code](./model_training/lora/FLUX.1-dev-Controlnet-Union-alpha.sh)|[code](./model_training/validate_lora/FLUX.1-dev-Controlnet-Union-alpha.py)|
|[FLUX.1-dev-Controlnet-Upscaler](https://www.modelscope.cn/models/jasperai/Flux.1-dev-Controlnet-Upscaler)|`controlnet_inputs`|[code](./model_inference/FLUX.1-dev-Controlnet-Upscaler.py)|[code](./model_inference_low_vram/FLUX.1-dev-Controlnet-Upscaler.py)|[code](./model_training/full/FLUX.1-dev-Controlnet-Upscaler.sh)|[code](./model_training/validate_full/FLUX.1-dev-Controlnet-Upscaler.py)|[code](./model_training/lora/FLUX.1-dev-Controlnet-Upscaler.sh)|[code](./model_training/validate_lora/FLUX.1-dev-Controlnet-Upscaler.py)|
|[FLUX.1-dev-IP-Adapter](https://www.modelscope.cn/models/InstantX/FLUX.1-dev-IP-Adapter)|`ipadapter_images`, `ipadapter_scale`|[code](./model_inference/FLUX.1-dev-IP-Adapter.py)|[code](./model_inference_low_vram/FLUX.1-dev-IP-Adapter.py)|[code](./model_training/full/FLUX.1-dev-IP-Adapter.sh)|[code](./model_training/validate_full/FLUX.1-dev-IP-Adapter.py)|[code](./model_training/lora/FLUX.1-dev-IP-Adapter.sh)|[code](./model_training/validate_lora/FLUX.1-dev-IP-Adapter.py)|
|[FLUX.1-dev-InfiniteYou](https://www.modelscope.cn/models/ByteDance/InfiniteYou)|`infinityou_id_image`, `infinityou_guidance`, `controlnet_inputs`|[code](./model_inference/FLUX.1-dev-InfiniteYou.py)|[code](./model_inference_low_vram/FLUX.1-dev-InfiniteYou.py)|[code](./model_training/full/FLUX.1-dev-InfiniteYou.sh)|[code](./model_training/validate_full/FLUX.1-dev-InfiniteYou.py)|[code](./model_training/lora/FLUX.1-dev-InfiniteYou.sh)|[code](./model_training/validate_lora/FLUX.1-dev-InfiniteYou.py)|
|[FLUX.1-dev-EliGen](https://www.modelscope.cn/models/DiffSynth-Studio/Eligen)|`eligen_entity_prompts`, `eligen_entity_masks`, `eligen_enable_on_negative`, `eligen_enable_inpaint`|[code](./model_inference/FLUX.1-dev-EliGen.py)|[code](./model_inference_low_vram/FLUX.1-dev-EliGen.py)|-|-|[code](./model_training/lora/FLUX.1-dev-EliGen.sh)|[code](./model_training/validate_lora/FLUX.1-dev-EliGen.py)|
|[FLUX.1-dev-LoRA-Encoder](https://www.modelscope.cn/models/DiffSynth-Studio/LoRA-Encoder-FLUX.1-Dev)|`lora_encoder_inputs`, `lora_encoder_scale`|[code](./model_inference/FLUX.1-dev-LoRA-Encoder.py)|[code](./model_inference_low_vram/FLUX.1-dev-LoRA-Encoder.py)|[code](./model_training/full/FLUX.1-dev-LoRA-Encoder.sh)|[code](./model_training/validate_full/FLUX.1-dev-LoRA-Encoder.py)|-|-|
|[FLUX.1-dev-LoRA-Fusion-Preview](https://modelscope.cn/models/DiffSynth-Studio/LoRAFusion-preview-FLUX.1-dev)||[code](./model_inference/FLUX.1-dev-LoRA-Fusion.py)|-|-|-|-|-|
|[Step1X-Edit](https://www.modelscope.cn/models/stepfun-ai/Step1X-Edit)|`step1x_reference_image`|[code](./model_inference/Step1X-Edit.py)|[code](./model_inference_low_vram/Step1X-Edit.py)|[code](./model_training/full/Step1X-Edit.sh)|[code](./model_training/validate_full/Step1X-Edit.py)|[code](./model_training/lora/Step1X-Edit.sh)|[code](./model_training/validate_lora/Step1X-Edit.py)|
|[FLEX.2-preview](https://www.modelscope.cn/models/ostris/Flex.2-preview)|`flex_inpaint_image`, `flex_inpaint_mask`, `flex_control_image`, `flex_control_strength`, `flex_control_stop`|[code](./model_inference/FLEX.2-preview.py)|[code](./model_inference_low_vram/FLEX.2-preview.py)|[code](./model_training/full/FLEX.2-preview.sh)|[code](./model_training/validate_full/FLEX.2-preview.py)|[code](./model_training/lora/FLEX.2-preview.sh)|[code](./model_training/validate_lora/FLEX.2-preview.py)|
|[Nexus-Gen](https://www.modelscope.cn/models/DiffSynth-Studio/Nexus-GenV2)|`nexus_gen_reference_image`|[code](./model_inference/Nexus-Gen-Editing.py)|[code](./model_inference_low_vram/Nexus-Gen-Editing.py)|[code](./model_training/full/Nexus-Gen.sh)|[code](./model_training/validate_full/Nexus-Gen.py)|[code](./model_training/lora/Nexus-Gen.sh)|[code](./model_training/validate_lora/Nexus-Gen.py)|
## Model Inference
The following sections will help you understand our features and write inference code.
<details>
<summary>Load Model</summary>
The model is loaded using `from_pretrained`:
```python
import torch
from diffsynth.pipelines.flux_image_new import FluxImagePipeline, ModelConfig
pipe = FluxImagePipeline.from_pretrained(
torch_dtype=torch.bfloat16,
device="cuda",
model_configs=[
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="flux1-dev.safetensors"),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="text_encoder/model.safetensors"),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="text_encoder_2/"),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="ae.safetensors"),
],
)
```
Here, `torch_dtype` and `device` set the computation precision and device. The `model_configs` can be used in different ways to specify model paths:
* Download the model from [ModelScope](https://modelscope.cn/ ) and load it. In this case, fill in `model_id` and `origin_file_pattern`, for example:
```python
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="flux1-dev.safetensors")
```
* Load the model from a local file path. In this case, fill in `path`, for example:
```python
ModelConfig(path="models/black-forest-labs/FLUX.1-dev/flux1-dev.safetensors")
```
For a single model that loads from multiple files, use a list, for example:
```python
ModelConfig(path=[
"models/xxx/diffusion_pytorch_model-00001-of-00003.safetensors",
"models/xxx/diffusion_pytorch_model-00002-of-00003.safetensors",
"models/xxx/diffusion_pytorch_model-00003-of-00003.safetensors",
])
```
The `ModelConfig` method also provides extra arguments to control model loading behavior:
* `local_model_path`: Path to save downloaded models. Default is `"./models"`.
* `skip_download`: Whether to skip downloading. Default is `False`. If your network cannot access [ModelScope](https://modelscope.cn/ ), download the required files manually and set this to `True`.
</details>
<details>
<summary>VRAM Management</summary>
DiffSynth-Studio provides fine-grained VRAM management for the FLUX model. This allows the model to run on devices with low VRAM. You can enable the offload feature using the code below. It moves some modules to CPU memory when GPU memory is limited.
```python
pipe = FluxImagePipeline.from_pretrained(
torch_dtype=torch.bfloat16,
device="cuda",
model_configs=[
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="flux1-dev.safetensors", offload_device="cpu"),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="text_encoder/model.safetensors", offload_device="cpu"),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="text_encoder_2/", offload_device="cpu"),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="ae.safetensors", offload_device="cpu"),
],
)
pipe.enable_vram_management()
```
FP8 quantization is also supported:
```python
pipe = FluxImagePipeline.from_pretrained(
torch_dtype=torch.bfloat16,
device="cuda",
model_configs=[
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="flux1-dev.safetensors", offload_dtype=torch.float8_e4m3fn),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="text_encoder/model.safetensors", offload_dtype=torch.float8_e4m3fn),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="text_encoder_2/", offload_dtype=torch.float8_e4m3fn),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="ae.safetensors", offload_dtype=torch.float8_e4m3fn),
],
)
pipe.enable_vram_management()
```
You can use FP8 quantization and offload at the same time:
```python
pipe = FluxImagePipeline.from_pretrained(
torch_dtype=torch.bfloat16,
device="cuda",
model_configs=[
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="flux1-dev.safetensors", offload_device="cpu", offload_dtype=torch.float8_e4m3fn),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="text_encoder/model.safetensors", offload_device="cpu", offload_dtype=torch.float8_e4m3fn),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="text_encoder_2/", offload_device="cpu", offload_dtype=torch.float8_e4m3fn),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="ae.safetensors", offload_device="cpu", offload_dtype=torch.float8_e4m3fn),
],
)
pipe.enable_vram_management()
```
After enabling VRAM management, the framework will automatically decide the VRAM strategy based on available GPU memory. For most FLUX models, inference can run with as little as 8GB of VRAM. The `enable_vram_management` function has the following parameters to manually control the VRAM strategy:
* `vram_limit`: VRAM usage limit in GB. By default, it uses all free VRAM on the device. Note that this is not an absolute limit. If the set VRAM is not enough but more VRAM is actually available, the model will run with minimal VRAM usage. Setting it to 0 achieves the theoretical minimum VRAM usage.
* `vram_buffer`: VRAM buffer size in GB. Default is 0.5GB. A buffer is needed because larger neural network layers may use more VRAM than expected during loading. The optimal value is the VRAM used by the largest layer in the model.
* `num_persistent_param_in_dit`: Number of parameters in the DiT model that stay in VRAM. Default is no limit. We plan to remove this parameter in the future. Do not rely on it.
</details>
<details>
<summary>Inference Acceleration</summary>
* TeaCache: Acceleration technique [TeaCache](https://github.com/ali-vilab/TeaCache ). Please refer to the [example code](./acceleration/teacache.py).
</details>
<details>
<summary>Input Parameters</summary>
The pipeline supports the following input parameters during inference:
* `prompt`: Text prompt describing what should appear in the image.
* `negative_prompt`: Negative prompt describing what should not appear in the image. Default is `""`.
* `cfg_scale`: Parameter for classifier-free guidance. Default is 1. Takes effect when set to a value greater than 1.
* `embedded_guidance`: Built-in guidance parameter for FLUX-dev. Default is 3.5.
* `t5_sequence_length`: Sequence length of text embeddings from the T5 model. Default is 512.
* `input_image`: Input image used for image-to-image generation. Used together with `denoising_strength`.
* `denoising_strength`: Denoising strength, range from 0 to 1. Default is 1. When close to 0, the output image is similar to the input. When close to 1, the output differs more from the input. Do not set it to values other than 1 if `input_image` is not provided.
* `height`: Image height. Must be a multiple of 16.
* `width`: Image width. Must be a multiple of 16.
* `seed`: Random seed. Default is `None`, meaning fully random.
* `rand_device`: Device for generating random Gaussian noise. Default is `"cpu"`. Setting it to `"cuda"` may lead to different results on different GPUs.
* `sigma_shift`: Parameter from Rectified Flow theory. Default is 3. A larger value means the model spends more steps at the start of denoising. Increasing this can improve image quality, but may cause differences between generated images and training data due to inconsistency with training.
* `num_inference_steps`: Number of inference steps. Default is 30.
* `kontext_images`: Input images for the Kontext model.
* `controlnet_inputs`: Inputs for the ControlNet model.
* `ipadapter_images`: Input images for the IP-Adapter model.
* `ipadapter_scale`: Control strength for the IP-Adapter model.
* `eligen_entity_prompts`: Local prompts for the EliGen model.
* `eligen_entity_masks`: Mask regions for local prompts in the EliGen model. Matches one-to-one with `eligen_entity_prompts`.
* `eligen_enable_on_negative`: Whether to enable EliGen on the negative prompt side. Only works when `cfg_scale > 1`.
* `eligen_enable_inpaint`: Whether to enable EliGen for local inpainting.
* `infinityou_id_image`: Face image for the InfiniteYou model.
* `infinityou_guidance`: Control strength for the InfiniteYou model.
* `flex_inpaint_image`: Image for FLEX model's inpainting.
* `flex_inpaint_mask`: Mask region for FLEX model's inpainting.
* `flex_control_image`: Image for FLEX model's structural control.
* `flex_control_strength`: Strength for FLEX model's structural control.
* `flex_control_stop`: End point for FLEX model's structural control. 1 means enabled throughout, 0.5 means enabled in the first half, 0 means disabled.
* `step1x_reference_image`: Input image for Step1x-Edit model's image editing.
* `lora_encoder_inputs`: Inputs for LoRA encoder. Can be ModelConfig or local path.
* `lora_encoder_scale`: Activation strength for LoRA encoder. Default is 1. Smaller values mean weaker LoRA activation.
* `tea_cache_l1_thresh`: Threshold for TeaCache. Larger values mean faster speed but lower image quality. Note that after enabling TeaCache, inference speed is not uniform, so the remaining time shown in the progress bar will be inaccurate.
* `tiled`: Whether to enable tiled VAE inference. Default is `False`. Setting to `True` reduces VRAM usage during VAE encoding/decoding, with slight error and slightly longer inference time.
* `tile_size`: Tile size during VAE encoding/decoding. Default is 128. Only takes effect when `tiled=True`.
* `tile_stride`: Tile stride during VAE encoding/decoding. Default is 64. Only takes effect when `tiled=True`. Must be less than or equal to `tile_size`.
* `progress_bar_cmd`: Progress bar display. Default is `tqdm.tqdm`. Set to `lambda x:x` to disable the progress bar.
</details>
## Model Training
Training for the FLUX series models is done using a unified script [`./model_training/train.py`](./model_training/train.py).
<details>
<summary>Script Parameters</summary>
The script includes the following parameters:
* Dataset
* `--dataset_base_path`: Root path of the dataset.
* `--dataset_metadata_path`: Path to the dataset metadata file.
* `--max_pixels`: Maximum pixel area. Default is 1024*1024. When dynamic resolution is enabled, any image with resolution higher than this will be downscaled.
* `--height`: Height of the image or video. Leave `height` and `width` empty to enable dynamic resolution.
* `--width`: Width of the image or video. Leave `height` and `width` empty to enable dynamic resolution.
* `--data_file_keys`: Data file keys in the metadata. Separate with commas.
* `--dataset_repeat`: Number of times the dataset repeats per epoch.
* `--dataset_num_workers`: Number of workers for data loading.
* Model
* `--model_paths`: Paths to load models. In JSON format.
* `--model_id_with_origin_paths`: Model ID with original paths, e.g., black-forest-labs/FLUX.1-dev:flux1-dev.safetensors. Separate with commas.
* Training
* `--learning_rate`: Learning rate.
* `--weight_decay`: Weight decay.
* `--num_epochs`: Number of epochs.
* `--output_path`: Save path.
* `--remove_prefix_in_ckpt`: Remove prefix in checkpoint.
* `--save_steps`: Number of checkpoint saving invervals. If None, checkpoints will be saved every epoch.
* `--find_unused_parameters`: Whether to find unused parameters in DDP.
* Trainable Modules
* `--trainable_models`: Models that can be trained, e.g., dit, vae, text_encoder.
* `--lora_base_model`: Which model to add LoRA to.
* `--lora_target_modules`: Which layers to add LoRA to.
* `--lora_rank`: Rank of LoRA.
* `--lora_checkpoint`: Path to the LoRA checkpoint. If provided, LoRA will be loaded from this checkpoint.
* Extra Model Inputs
* `--extra_inputs`: Extra model inputs, separated by commas.
* VRAM Management
* `--use_gradient_checkpointing`: Whether to enable gradient checkpointing.
* `--use_gradient_checkpointing_offload`: Whether to offload gradient checkpointing to CPU memory.
* `--gradient_accumulation_steps`: Number of gradient accumulation steps.
* Others
* `--align_to_opensource_format`: Whether to align the FLUX DiT LoRA format with the open-source version. Only works for LoRA training.
In addition, the training framework is built on [`accelerate`](https://huggingface.co/docs/accelerate/index ). Run `accelerate config` before training to set GPU-related parameters. For some training scripts (e.g., full model training), we provide suggested `accelerate` config files. You can find them in the corresponding training scripts.
</details>
<details>
<summary>Step 1: Prepare Dataset</summary>
A dataset contains a series of files. We suggest organizing your dataset like this:
```
data/example_image_dataset/
├── metadata.csv
├── image1.jpg
└── image2.jpg
```
Here, `image1.jpg` and `image2.jpg` are training images, and `metadata.csv` is the metadata list, for example:
```
image,prompt
image1.jpg,"a cat is sleeping"
image2.jpg,"a dog is running"
```
We have built a sample image dataset to help you test. You can download it with the following command:
```shell
modelscope download --dataset DiffSynth-Studio/example_image_dataset --local_dir ./data/example_image_dataset
```
The dataset supports multiple image formats: `"jpg", "jpeg", "png", "webp"`.
Image size can be controlled by script arguments `--height` and `--width`. When `--height` and `--width` are left empty, dynamic resolution is enabled. The model will train using each image's actual width and height from the dataset.
**We strongly recommend using fixed resolution for training, because there can be load balancing issues in multi-GPU training.**
When the model needs extra inputs, for example, `kontext_images` required by controllable models like [`black-forest-labs/FLUX.1-Kontext-dev`](https://modelscope.cn/models/black-forest-labs/FLUX.1-Kontext-dev ), add the corresponding column to your dataset, for example:
```
image,prompt,kontext_images
image1.jpg,"a cat is sleeping",image1_reference.jpg
```
If an extra input includes image files, you must specify the column name in the `--data_file_keys` argument. Add column names as needed, for example `--data_file_keys "image,kontext_images"`, and also enable `--extra_inputs "kontext_images"`.
</details>
<details>
<summary>Step 2: Load Model</summary>
Similar to model loading during inference, you can configure which models to load directly using model IDs. For example, during inference we load the model with this setting:
```python
model_configs=[
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="flux1-dev.safetensors"),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="text_encoder/model.safetensors"),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="text_encoder_2/"),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="ae.safetensors"),
]
```
Then, during training, use the following parameter to load the same models:
```shell
--model_id_with_origin_paths "black-forest-labs/FLUX.1-dev:flux1-dev.safetensors,black-forest-labs/FLUX.1-dev:text_encoder/model.safetensors,black-forest-labs/FLUX.1-dev:text_encoder_2/,black-forest-labs/FLUX.1-dev:ae.safetensors"
```
If you want to load models from local files, for example, during inference:
```python
model_configs=[
ModelConfig(path="models/black-forest-labs/FLUX.1-dev/flux1-dev.safetensors"),
ModelConfig(path="models/black-forest-labs/FLUX.1-dev/text_encoder/model.safetensors"),
ModelConfig(path="models/black-forest-labs/FLUX.1-dev/text_encoder_2/"),
ModelConfig(path="models/black-forest-labs/FLUX.1-dev/ae.safetensors"),
]
```
Then during training, set it as:
```shell
--model_paths '[
"models/black-forest-labs/FLUX.1-dev/flux1-dev.safetensors",
"models/black-forest-labs/FLUX.1-dev/text_encoder/model.safetensors",
"models/black-forest-labs/FLUX.1-dev/text_encoder_2/",
"models/black-forest-labs/FLUX.1-dev/ae.safetensors"
]' \
```
</details>
<details>
<summary>Step 3: Set Trainable Modules</summary>
The training framework supports training base models or LoRA models. Here are some examples:
* Full training of the DiT part: `--trainable_models dit`
* Training a LoRA model on the DiT part: `--lora_base_model dit --lora_target_modules "a_to_qkv,b_to_qkv,ff_a.0,ff_a.2,ff_b.0,ff_b.2,a_to_out,b_to_out,proj_out,norm.linear,norm1_a.linear,norm1_b.linear,to_qkv_mlp" --lora_rank 32`
Also, because the training script loads multiple modules (text encoder, dit, vae), you need to remove prefixes when saving model files. For example, when fully training the DiT part or training a LoRA model on the DiT part, set `--remove_prefix_in_ckpt pipe.dit.`
</details>
<details>
<summary>Step 4: Start Training</summary>
We have written training commands for each model. Please refer to the table at the beginning of this document.
</details>

View File

@@ -1,396 +0,0 @@
# FLUX
[Switch to English](./README.md)
FLUX 是由 Black-Forest-Labs 开源的一系列图像生成模型。
**DiffSynth-Studio 启用了新的推理和训练框架,如需使用旧版本,请点击[这里](https://github.com/modelscope/DiffSynth-Studio/tree/3edf3583b1f08944cee837b94d9f84d669c2729c)。**
## 安装
在使用本系列模型之前,请通过源码安装 DiffSynth-Studio。
```shell
git clone https://github.com/modelscope/DiffSynth-Studio.git
cd DiffSynth-Studio
pip install -e .
```
## 快速开始
通过运行以下代码可以快速加载 [black-forest-labs/FLUX.1-dev](https://www.modelscope.cn/models/black-forest-labs/FLUX.1-dev) 模型并进行推理。
```python
import torch
from diffsynth.pipelines.flux_image_new import FluxImagePipeline, ModelConfig
pipe = FluxImagePipeline.from_pretrained(
torch_dtype=torch.bfloat16,
device="cuda",
model_configs=[
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="flux1-dev.safetensors"),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="text_encoder/model.safetensors"),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="text_encoder_2/"),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="ae.safetensors"),
],
)
image = pipe(prompt="a cat", seed=0)
image.save("image.jpg")
```
## 模型总览
|模型 ID|额外参数|推理|低显存推理|全量训练|全量训练后验证|LoRA 训练|LoRA 训练后验证|
|-|-|-|-|-|-|-|-|
|[FLUX.1-dev](https://www.modelscope.cn/models/black-forest-labs/FLUX.1-dev)||[code](./model_inference/FLUX.1-dev.py)|[code](./model_inference_low_vram/FLUX.1-dev.py)|[code](./model_training/full/FLUX.1-dev.sh)|[code](./model_training/validate_full/FLUX.1-dev.py)|[code](./model_training/lora/FLUX.1-dev.sh)|[code](./model_training/validate_lora/FLUX.1-dev.py)|
|[FLUX.1-Krea-dev](https://www.modelscope.cn/models/black-forest-labs/FLUX.1-Krea-dev)||[code](./model_inference/FLUX.1-Krea-dev.py)|[code](./model_inference_low_vram/FLUX.1-Krea-dev.py)|[code](./model_training/full/FLUX.1-Krea-dev.sh)|[code](./model_training/validate_full/FLUX.1-Krea-dev.py)|[code](./model_training/lora/FLUX.1-Krea-dev.sh)|[code](./model_training/validate_lora/FLUX.1-Krea-dev.py)|
|[FLUX.1-Kontext-dev](https://www.modelscope.cn/models/black-forest-labs/FLUX.1-Kontext-dev)|`kontext_images`|[code](./model_inference/FLUX.1-Kontext-dev.py)|[code](./model_inference_low_vram/FLUX.1-Kontext-dev.py)|[code](./model_training/full/FLUX.1-Kontext-dev.sh)|[code](./model_training/validate_full/FLUX.1-Kontext-dev.py)|[code](./model_training/lora/FLUX.1-Kontext-dev.sh)|[code](./model_training/validate_lora/FLUX.1-Kontext-dev.py)|
|[FLUX.1-dev-Controlnet-Inpainting-Beta](https://www.modelscope.cn/models/alimama-creative/FLUX.1-dev-Controlnet-Inpainting-Beta)|`controlnet_inputs`|[code](./model_inference/FLUX.1-dev-Controlnet-Inpainting-Beta.py)|[code](./model_inference_low_vram/FLUX.1-dev-Controlnet-Inpainting-Beta.py)|[code](./model_training/full/FLUX.1-dev-Controlnet-Inpainting-Beta.sh)|[code](./model_training/validate_full/FLUX.1-dev-Controlnet-Inpainting-Beta.py)|[code](./model_training/lora/FLUX.1-dev-Controlnet-Inpainting-Beta.sh)|[code](./model_training/validate_lora/FLUX.1-dev-Controlnet-Inpainting-Beta.py)|
|[FLUX.1-dev-Controlnet-Union-alpha](https://www.modelscope.cn/models/InstantX/FLUX.1-dev-Controlnet-Union-alpha)|`controlnet_inputs`|[code](./model_inference/FLUX.1-dev-Controlnet-Union-alpha.py)|[code](./model_inference_low_vram/FLUX.1-dev-Controlnet-Union-alpha.py)|[code](./model_training/full/FLUX.1-dev-Controlnet-Union-alpha.sh)|[code](./model_training/validate_full/FLUX.1-dev-Controlnet-Union-alpha.py)|[code](./model_training/lora/FLUX.1-dev-Controlnet-Union-alpha.sh)|[code](./model_training/validate_lora/FLUX.1-dev-Controlnet-Union-alpha.py)|
|[FLUX.1-dev-Controlnet-Upscaler](https://www.modelscope.cn/models/jasperai/Flux.1-dev-Controlnet-Upscaler)|`controlnet_inputs`|[code](./model_inference/FLUX.1-dev-Controlnet-Upscaler.py)|[code](./model_inference_low_vram/FLUX.1-dev-Controlnet-Upscaler.py)|[code](./model_training/full/FLUX.1-dev-Controlnet-Upscaler.sh)|[code](./model_training/validate_full/FLUX.1-dev-Controlnet-Upscaler.py)|[code](./model_training/lora/FLUX.1-dev-Controlnet-Upscaler.sh)|[code](./model_training/validate_lora/FLUX.1-dev-Controlnet-Upscaler.py)|
|[FLUX.1-dev-IP-Adapter](https://www.modelscope.cn/models/InstantX/FLUX.1-dev-IP-Adapter)|`ipadapter_images`, `ipadapter_scale`|[code](./model_inference/FLUX.1-dev-IP-Adapter.py)|[code](./model_inference_low_vram/FLUX.1-dev-IP-Adapter.py)|[code](./model_training/full/FLUX.1-dev-IP-Adapter.sh)|[code](./model_training/validate_full/FLUX.1-dev-IP-Adapter.py)|[code](./model_training/lora/FLUX.1-dev-IP-Adapter.sh)|[code](./model_training/validate_lora/FLUX.1-dev-IP-Adapter.py)|
|[FLUX.1-dev-InfiniteYou](https://www.modelscope.cn/models/ByteDance/InfiniteYou)|`infinityou_id_image`, `infinityou_guidance`, `controlnet_inputs`|[code](./model_inference/FLUX.1-dev-InfiniteYou.py)|[code](./model_inference_low_vram/FLUX.1-dev-InfiniteYou.py)|[code](./model_training/full/FLUX.1-dev-InfiniteYou.sh)|[code](./model_training/validate_full/FLUX.1-dev-InfiniteYou.py)|[code](./model_training/lora/FLUX.1-dev-InfiniteYou.sh)|[code](./model_training/validate_lora/FLUX.1-dev-InfiniteYou.py)|
|[FLUX.1-dev-EliGen](https://www.modelscope.cn/models/DiffSynth-Studio/Eligen)|`eligen_entity_prompts`, `eligen_entity_masks`, `eligen_enable_on_negative`, `eligen_enable_inpaint`|[code](./model_inference/FLUX.1-dev-EliGen.py)|[code](./model_inference_low_vram/FLUX.1-dev-EliGen.py)|-|-|[code](./model_training/lora/FLUX.1-dev-EliGen.sh)|[code](./model_training/validate_lora/FLUX.1-dev-EliGen.py)|
|[FLUX.1-dev-LoRA-Encoder](https://www.modelscope.cn/models/DiffSynth-Studio/LoRA-Encoder-FLUX.1-Dev)|`lora_encoder_inputs`, `lora_encoder_scale`|[code](./model_inference/FLUX.1-dev-LoRA-Encoder.py)|[code](./model_inference_low_vram/FLUX.1-dev-LoRA-Encoder.py)|[code](./model_training/full/FLUX.1-dev-LoRA-Encoder.sh)|[code](./model_training/validate_full/FLUX.1-dev-LoRA-Encoder.py)|-|-|
|[FLUX.1-dev-LoRA-Fusion-Preview](https://modelscope.cn/models/DiffSynth-Studio/LoRAFusion-preview-FLUX.1-dev)||[code](./model_inference/FLUX.1-dev-LoRA-Fusion.py)|-|-|-|-|-|
|[Step1X-Edit](https://www.modelscope.cn/models/stepfun-ai/Step1X-Edit)|`step1x_reference_image`|[code](./model_inference/Step1X-Edit.py)|[code](./model_inference_low_vram/Step1X-Edit.py)|[code](./model_training/full/Step1X-Edit.sh)|[code](./model_training/validate_full/Step1X-Edit.py)|[code](./model_training/lora/Step1X-Edit.sh)|[code](./model_training/validate_lora/Step1X-Edit.py)|
|[FLEX.2-preview](https://www.modelscope.cn/models/ostris/Flex.2-preview)|`flex_inpaint_image`, `flex_inpaint_mask`, `flex_control_image`, `flex_control_strength`, `flex_control_stop`|[code](./model_inference/FLEX.2-preview.py)|[code](./model_inference_low_vram/FLEX.2-preview.py)|[code](./model_training/full/FLEX.2-preview.sh)|[code](./model_training/validate_full/FLEX.2-preview.py)|[code](./model_training/lora/FLEX.2-preview.sh)|[code](./model_training/validate_lora/FLEX.2-preview.py)|
|[Nexus-Gen](https://www.modelscope.cn/models/DiffSynth-Studio/Nexus-GenV2)|`nexus_gen_reference_image`|[code](./model_inference/Nexus-Gen-Editing.py)|[code](./model_inference_low_vram/Nexus-Gen-Editing.py)|[code](./model_training/full/Nexus-Gen.sh)|[code](./model_training/validate_full/Nexus-Gen.py)|[code](./model_training/lora/Nexus-Gen.sh)|[code](./model_training/validate_lora/Nexus-Gen.py)|
## 模型推理
以下部分将会帮助您理解我们的功能并编写推理代码。
<details>
<summary>加载模型</summary>
模型通过 `from_pretrained` 加载:
```python
import torch
from diffsynth.pipelines.flux_image_new import FluxImagePipeline, ModelConfig
pipe = FluxImagePipeline.from_pretrained(
torch_dtype=torch.bfloat16,
device="cuda",
model_configs=[
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="flux1-dev.safetensors"),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="text_encoder/model.safetensors"),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="text_encoder_2/"),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="ae.safetensors"),
],
)
```
其中 `torch_dtype``device` 是计算精度和计算设备。`model_configs` 可通过多种方式配置模型路径:
* 从[魔搭社区](https://modelscope.cn/)下载模型并加载。此时需要填写 `model_id``origin_file_pattern`,例如
```python
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="flux1-dev.safetensors")
```
* 从本地文件路径加载模型。此时需要填写 `path`,例如
```python
ModelConfig(path="models/black-forest-labs/FLUX.1-dev/flux1-dev.safetensors")
```
对于从多个文件加载的单一模型,使用列表即可,例如
```python
ModelConfig(path=[
"models/xxx/diffusion_pytorch_model-00001-of-00003.safetensors",
"models/xxx/diffusion_pytorch_model-00002-of-00003.safetensors",
"models/xxx/diffusion_pytorch_model-00003-of-00003.safetensors",
])
```
`ModelConfig` 还提供了额外的参数用于控制模型加载时的行为:
* `local_model_path`: 用于保存下载模型的路径,默认值为 `"./models"`
* `skip_download`: 是否跳过下载,默认值为 `False`。当您的网络无法访问[魔搭社区](https://modelscope.cn/)时,请手动下载必要的文件,并将其设置为 `True`
</details>
<details>
<summary>显存管理</summary>
DiffSynth-Studio 为 FLUX 模型提供了细粒度的显存管理,让模型能够在低显存设备上进行推理,可通过以下代码开启 offload 功能,在显存有限的设备上将部分模块 offload 到内存中。
```python
pipe = FluxImagePipeline.from_pretrained(
torch_dtype=torch.bfloat16,
device="cuda",
model_configs=[
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="flux1-dev.safetensors", offload_device="cpu"),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="text_encoder/model.safetensors", offload_device="cpu"),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="text_encoder_2/", offload_device="cpu"),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="ae.safetensors", offload_device="cpu"),
],
)
pipe.enable_vram_management()
```
FP8 量化功能也是支持的:
```python
pipe = FluxImagePipeline.from_pretrained(
torch_dtype=torch.bfloat16,
device="cuda",
model_configs=[
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="flux1-dev.safetensors", offload_dtype=torch.float8_e4m3fn),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="text_encoder/model.safetensors", offload_dtype=torch.float8_e4m3fn),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="text_encoder_2/", offload_dtype=torch.float8_e4m3fn),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="ae.safetensors", offload_dtype=torch.float8_e4m3fn),
],
)
pipe.enable_vram_management()
```
FP8 量化和 offload 可同时开启:
```python
pipe = FluxImagePipeline.from_pretrained(
torch_dtype=torch.bfloat16,
device="cuda",
model_configs=[
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="flux1-dev.safetensors", offload_device="cpu", offload_dtype=torch.float8_e4m3fn),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="text_encoder/model.safetensors", offload_device="cpu", offload_dtype=torch.float8_e4m3fn),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="text_encoder_2/", offload_device="cpu", offload_dtype=torch.float8_e4m3fn),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="ae.safetensors", offload_device="cpu", offload_dtype=torch.float8_e4m3fn),
],
)
pipe.enable_vram_management()
```
开启显存管理后,框架会自动根据设备上的剩余显存确定显存管理策略。对于大多数 FLUX 系列模型,最低 8GB 显存即可进行推理。`enable_vram_management` 函数提供了以下参数,用于手动控制显存管理策略:
* `vram_limit`: 显存占用量限制GB默认占用设备上的剩余显存。注意这不是一个绝对限制当设置的显存不足以支持模型进行推理但实际可用显存足够时将会以最小化显存占用的形式进行推理。将其设置为0时将会实现理论最小显存占用。
* `vram_buffer`: 显存缓冲区大小GB默认为 0.5GB。由于部分较大的神经网络层在 onload 阶段会不可控地占用更多显存,因此一个显存缓冲区是必要的,理论上的最优值为模型中最大的层所占的显存。
* `num_persistent_param_in_dit`: DiT 模型中常驻显存的参数数量(个),默认为无限制。我们将会在未来删除这个参数,请不要依赖这个参数。
</details>
<details>
<summary>推理加速</summary>
* TeaCache加速技术 [TeaCache](https://github.com/ali-vilab/TeaCache),请参考[示例代码](./acceleration/teacache.py)。
</details>
<details>
<summary>输入参数</summary>
Pipeline 在推理阶段能够接收以下输入参数:
* `prompt`: 提示词,描述画面中出现的内容。
* `negative_prompt`: 负向提示词,描述画面中不应该出现的内容,默认值为 `""`
* `cfg_scale`: Classifier-free guidance 的参数,默认值为 1当设置为大于1的数值时生效。
* `embedded_guidance`: FLUX-dev 的内嵌引导参数,默认值为 3.5。
* `t5_sequence_length`: T5 模型的文本向量序列长度,默认值为 512。
* `input_image`: 输入图像,用于图生图,该参数与 `denoising_strength` 配合使用。
* `denoising_strength`: 去噪强度,范围是 01默认值为 1当数值接近 0 时,生成图像与输入图像相似;当数值接近 1 时,生成图像与输入图像相差更大。在不输入 `input_image` 参数时,请不要将其设置为非 1 的数值。
* `height`: 图像高度,需保证高度为 16 的倍数。
* `width`: 图像宽度,需保证宽度为 16 的倍数。
* `seed`: 随机种子。默认为 `None`,即完全随机。
* `rand_device`: 生成随机高斯噪声矩阵的计算设备,默认为 `"cpu"`。当设置为 `cuda` 时,在不同 GPU 上会导致不同的生成结果。
* `sigma_shift`: Rectified Flow 理论中的参数,默认为 3。数值越大模型在去噪的开始阶段停留的步骤数越多可适当调大这个参数来提高画面质量但会因生成过程与训练过程不一致导致生成的图像内容与训练数据存在差异。
* `num_inference_steps`: 推理次数,默认值为 30。
* `kontext_images`: Kontext 模型的输入图像。
* `controlnet_inputs`: ControlNet 模型的输入。
* `ipadapter_images`: IP-Adapter 模型的输入图像。
* `ipadapter_scale`: IP-Adapter 模型的控制强度。
* `eligen_entity_prompts`: EliGen 模型的图像局部提示词。
* `eligen_entity_masks`: EliGen 模型的局部提示词控制区域,与 `eligen_entity_prompts` 一一对应。
* `eligen_enable_on_negative`: 是否在负向提示词一侧启用 EliGen仅在 `cfg_scale > 1` 时生效。
* `eligen_enable_inpaint`: 是否启用 EliGen 局部重绘。
* `infinityou_id_image`: InfiniteYou 模型的人脸图像。
* `infinityou_guidance`: InfiniteYou 模型的控制强度。
* `flex_inpaint_image`: FLEX 模型用于局部重绘的图像。
* `flex_inpaint_mask`: FLEX 模型用于局部重绘的区域。
* `flex_control_image`: FLEX 模型用于结构控制的图像。
* `flex_control_strength`: FLEX 模型用于结构控制的强度。
* `flex_control_stop`: FLEX 模型结构控制的结束点1表示全程启用0.5表示在前半段启用0表示不启用。
* `step1x_reference_image`: Step1x-Edit 模型用于图像编辑的输入图像。
* `lora_encoder_inputs`: LoRA 编码器的输入,格式为 ModelConfig 或本地路径。
* `lora_encoder_scale`: LoRA 编码器的激活强度默认值为1数值越小LoRA 激活越弱。
* `tea_cache_l1_thresh`: TeaCache 的阈值,数值越大,速度越快,画面质量越差。请注意,开启 TeaCache 后推理速度并非均匀,因此进度条上显示的剩余时间将会变得不准确。
* `tiled`: 是否启用 VAE 分块推理,默认为 `False`。设置为 `True` 时可显著减少 VAE 编解码阶段的显存占用,会产生少许误差,以及少量推理时间延长。
* `tile_size`: VAE 编解码阶段的分块大小,默认为 128仅在 `tiled=True` 时生效。
* `tile_stride`: VAE 编解码阶段的分块步长,默认为 64仅在 `tiled=True` 时生效,需保证其数值小于或等于 `tile_size`
* `progress_bar_cmd`: 进度条,默认为 `tqdm.tqdm`。可通过设置为 `lambda x:x` 来屏蔽进度条。
</details>
## 模型训练
FLUX 系列模型训练通过统一的 [`./model_training/train.py`](./model_training/train.py) 脚本进行。
<details>
<summary>脚本参数</summary>
脚本包含以下参数:
* 数据集
* `--dataset_base_path`: 数据集的根路径。
* `--dataset_metadata_path`: 数据集的元数据文件路径。
* `--max_pixels`: 最大像素面积,默认为 1024*1024当启用动态分辨率时任何分辨率大于这个数值的图片都会被缩小。
* `--height`: 图像或视频的高度。将 `height``width` 留空以启用动态分辨率。
* `--width`: 图像或视频的宽度。将 `height``width` 留空以启用动态分辨率。
* `--data_file_keys`: 元数据中的数据文件键。用逗号分隔。
* `--dataset_repeat`: 每个 epoch 中数据集重复的次数。
* `--dataset_num_workers`: 每个 Dataloder 的进程数量。
* 模型
* `--model_paths`: 要加载的模型路径。JSON 格式。
* `--model_id_with_origin_paths`: 带原始路径的模型 ID例如 black-forest-labs/FLUX.1-dev:flux1-dev.safetensors。用逗号分隔。
* 训练
* `--learning_rate`: 学习率。
* `--weight_decay`:权重衰减大小。
* `--num_epochs`: 轮数Epoch
* `--output_path`: 保存路径。
* `--remove_prefix_in_ckpt`: 在 ckpt 中移除前缀。
* `--save_steps`: 保存模型的间隔 step 数量,如果设置为 None ,则每个 epoch 保存一次
* `--find_unused_parameters`: DDP 训练中是否存在未使用的参数
* 可训练模块
* `--trainable_models`: 可训练的模型,例如 dit、vae、text_encoder。
* `--lora_base_model`: LoRA 添加到哪个模型上。
* `--lora_target_modules`: LoRA 添加到哪一层上。
* `--lora_rank`: LoRA 的秩Rank
* `--lora_checkpoint`: LoRA 检查点的路径。如果提供此路径LoRA 将从此检查点加载。
* 额外模型输入
* `--extra_inputs`: 额外的模型输入,以逗号分隔。
* 显存管理
* `--use_gradient_checkpointing`: 是否启用 gradient checkpointing。
* `--use_gradient_checkpointing_offload`: 是否将 gradient checkpointing 卸载到内存中。
* `--gradient_accumulation_steps`: 梯度累积步数。
* 其他
* `--align_to_opensource_format`: 是否将 FLUX DiT LoRA 的格式与开源版本对齐,仅对 LoRA 训练生效。
此外,训练框架基于 [`accelerate`](https://huggingface.co/docs/accelerate/index) 构建,在开始训练前运行 `accelerate config` 可配置 GPU 的相关参数。对于部分模型训练(例如模型的全量训练)脚本,我们提供了建议的 `accelerate` 配置文件,可在对应的训练脚本中查看。
</details>
<details>
<summary>Step 1: 准备数据集</summary>
数据集包含一系列文件,我们建议您这样组织数据集文件:
```
data/example_image_dataset/
├── metadata.csv
├── image1.jpg
└── image2.jpg
```
其中 `image1.jpg``image2.jpg` 为训练用图像数据,`metadata.csv` 为元数据列表,例如
```
image,prompt
image1.jpg,"a cat is sleeping"
image2.jpg,"a dog is running"
```
我们构建了一个样例图像数据集,以方便您进行测试,通过以下命令可以下载这个数据集:
```shell
modelscope download --dataset DiffSynth-Studio/example_image_dataset --local_dir ./data/example_image_dataset
```
数据集支持多种图片格式,`"jpg", "jpeg", "png", "webp"`
图片的尺寸可通过脚本参数 `--height``--width` 控制。当 `--height``--width` 为空时将会开启动态分辨率,按照数据集中每个图像的实际宽高训练。
**我们强烈建议使用固定分辨率训练,因为在多卡训练中存在负载均衡问题。**
当模型需要额外输入时,例如具备控制能力的模型 [`black-forest-labs/FLUX.1-Kontext-dev`](https://modelscope.cn/models/black-forest-labs/FLUX.1-Kontext-dev) 所需的 `kontext_images`,请在数据集中补充相应的列,例如:
```
image,prompt,kontext_images
image1.jpg,"a cat is sleeping",image1_reference.jpg
```
额外输入若包含图像文件,则需要在 `--data_file_keys` 参数中指定要解析的列名。可根据额外输入增加相应的列名,例如 `--data_file_keys "image,kontext_images"`,同时启用 `--extra_inputs "kontext_images"`
</details>
<details>
<summary>Step 2: 加载模型</summary>
类似于推理时的模型加载逻辑,可直接通过模型 ID 配置要加载的模型。例如,推理时我们通过以下设置加载模型
```python
model_configs=[
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="flux1-dev.safetensors"),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="text_encoder/model.safetensors"),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="text_encoder_2/"),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="ae.safetensors"),
]
```
那么在训练时,填入以下参数即可加载对应的模型。
```shell
--model_id_with_origin_paths "black-forest-labs/FLUX.1-dev:flux1-dev.safetensors,black-forest-labs/FLUX.1-dev:text_encoder/model.safetensors,black-forest-labs/FLUX.1-dev:text_encoder_2/,black-forest-labs/FLUX.1-dev:ae.safetensors"
```
如果您希望从本地文件加载模型,例如推理时
```python
model_configs=[
ModelConfig(path="models/black-forest-labs/FLUX.1-dev/flux1-dev.safetensors"),
ModelConfig(path="models/black-forest-labs/FLUX.1-dev/text_encoder/model.safetensors"),
ModelConfig(path="models/black-forest-labs/FLUX.1-dev/text_encoder_2/"),
ModelConfig(path="models/black-forest-labs/FLUX.1-dev/ae.safetensors"),
]
```
那么训练时需设置为
```shell
--model_paths '[
"models/black-forest-labs/FLUX.1-dev/flux1-dev.safetensors",
"models/black-forest-labs/FLUX.1-dev/text_encoder/model.safetensors",
"models/black-forest-labs/FLUX.1-dev/text_encoder_2/",
"models/black-forest-labs/FLUX.1-dev/ae.safetensors"
]' \
```
</details>
<details>
<summary>Step 3: 设置可训练模块</summary>
训练框架支持训练基础模型,或 LoRA 模型。以下是几个例子:
* 全量训练 DiT 部分:`--trainable_models dit`
* 训练 DiT 部分的 LoRA 模型:`--lora_base_model dit --lora_target_modules "a_to_qkv,b_to_qkv,ff_a.0,ff_a.2,ff_b.0,ff_b.2,a_to_out,b_to_out,proj_out,norm.linear,norm1_a.linear,norm1_b.linear,to_qkv_mlp" --lora_rank 32`
此外由于训练脚本中加载了多个模块text encoder、dit、vae保存模型文件时需要移除前缀例如在全量训练 DiT 部分或者训练 DiT 部分的 LoRA 模型时,请设置 `--remove_prefix_in_ckpt pipe.dit.`
</details>
<details>
<summary>Step 4: 启动训练程序</summary>
我们为每一个模型编写了训练命令,请参考本文档开头的表格。
</details>

View File

@@ -1,24 +0,0 @@
import torch
from diffsynth.pipelines.flux_image_new import FluxImagePipeline, ModelConfig
pipe = FluxImagePipeline.from_pretrained(
torch_dtype=torch.bfloat16,
device="cuda",
model_configs=[
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="flux1-dev.safetensors"),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="text_encoder/model.safetensors"),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="text_encoder_2/"),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="ae.safetensors"),
],
)
prompt = "CG, masterpiece, best quality, solo, long hair, wavy hair, silver hair, blue eyes, blue dress, medium breasts, dress, underwater, air bubble, floating hair, refraction, portrait. The girl's flowing silver hair shimmers with every color of the rainbow and cascades down, merging with the floating flora around her."
for tea_cache_l1_thresh in [None, 0.2, 0.4, 0.6, 0.8]:
image = pipe(
prompt=prompt, embedded_guidance=3.5, seed=0,
num_inference_steps=50, tea_cache_l1_thresh=tea_cache_l1_thresh
)
image.save(f"image_{tea_cache_l1_thresh}.png")

View File

@@ -1,6 +1,6 @@
import torch
from diffsynth.pipelines.flux_image_new import FluxImagePipeline, ModelConfig
from diffsynth.controlnets.processors import Annotator
from diffsynth.pipelines.flux_image import FluxImagePipeline, ModelConfig
from diffsynth.utils.controlnet import Annotator
import numpy as np
from PIL import Image
@@ -11,7 +11,7 @@ pipe = FluxImagePipeline.from_pretrained(
model_configs=[
ModelConfig(model_id="ostris/Flex.2-preview", origin_file_pattern="Flex.2-preview.safetensors"),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="text_encoder/model.safetensors"),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="text_encoder_2/"),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="text_encoder_2/*.safetensors"),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="ae.safetensors"),
],
)
@@ -21,12 +21,12 @@ image = pipe(
num_inference_steps=50, embedded_guidance=3.5,
seed=0
)
image.save(f"image_1.jpg")
image.save("image_1.jpg")
mask = np.zeros((1024, 1024, 3), dtype=np.uint8)
mask[200:400, 400:700] = 255
mask = Image.fromarray(mask)
mask.save(f"image_mask.jpg")
mask.save("image_mask.jpg")
inpaint_image = image
@@ -36,7 +36,7 @@ image = pipe(
flex_inpaint_image=inpaint_image, flex_inpaint_mask=mask,
seed=4
)
image.save(f"image_2_new.jpg")
image.save("image_2.jpg")
control_image = Annotator("canny")(image)
control_image.save("image_control.jpg")
@@ -47,4 +47,4 @@ image = pipe(
flex_control_image=control_image,
seed=4
)
image.save(f"image_3_new.jpg")
image.save("image_3.jpg")

View File

@@ -1,5 +1,5 @@
import torch
from diffsynth.pipelines.flux_image_new import FluxImagePipeline, ModelConfig
from diffsynth.pipelines.flux_image import FluxImagePipeline, ModelConfig
from PIL import Image
@@ -9,7 +9,7 @@ pipe = FluxImagePipeline.from_pretrained(
model_configs=[
ModelConfig(model_id="black-forest-labs/FLUX.1-Kontext-dev", origin_file_pattern="flux1-kontext-dev.safetensors"),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="text_encoder/model.safetensors"),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="text_encoder_2/"),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="text_encoder_2/*.safetensors"),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="ae.safetensors"),
],
)

View File

@@ -1,5 +1,5 @@
import torch
from diffsynth.pipelines.flux_image_new import FluxImagePipeline, ModelConfig
from diffsynth.pipelines.flux_image import FluxImagePipeline, ModelConfig
pipe = FluxImagePipeline.from_pretrained(
@@ -8,7 +8,7 @@ pipe = FluxImagePipeline.from_pretrained(
model_configs=[
ModelConfig(model_id="black-forest-labs/FLUX.1-Krea-dev", origin_file_pattern="flux1-krea-dev.safetensors"),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="text_encoder/model.safetensors"),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="text_encoder_2/"),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="text_encoder_2/*.safetensors"),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="ae.safetensors"),
],
)

View File

@@ -1,5 +1,5 @@
import torch
from diffsynth.pipelines.flux_image_new import FluxImagePipeline, ModelConfig
from diffsynth.pipelines.flux_image import FluxImagePipeline, ModelConfig
pipe = FluxImagePipeline.from_pretrained(
@@ -8,7 +8,7 @@ pipe = FluxImagePipeline.from_pretrained(
model_configs=[
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="flux1-dev.safetensors"),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="text_encoder/model.safetensors"),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="text_encoder_2/"),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="text_encoder_2/*.safetensors"),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="ae.safetensors"),
ModelConfig(model_id="DiffSynth-Studio/AttriCtrl-FLUX.1-Dev", origin_file_pattern="models/brightness.safetensors")
],

View File

@@ -1,5 +1,5 @@
import torch
from diffsynth.pipelines.flux_image_new import FluxImagePipeline, ModelConfig, ControlNetInput
from diffsynth.pipelines.flux_image import FluxImagePipeline, ModelConfig, ControlNetInput
import numpy as np
from PIL import Image
@@ -10,7 +10,7 @@ pipe = FluxImagePipeline.from_pretrained(
model_configs=[
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="flux1-dev.safetensors"),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="text_encoder/model.safetensors"),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="text_encoder_2/"),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="text_encoder_2/*.safetensors"),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="ae.safetensors"),
ModelConfig(model_id="alimama-creative/FLUX.1-dev-Controlnet-Inpainting-Beta", origin_file_pattern="diffusion_pytorch_model.safetensors"),
],

View File

@@ -1,18 +1,18 @@
import torch
from diffsynth.pipelines.flux_image_new import FluxImagePipeline, ModelConfig, ControlNetInput
from diffsynth.controlnets.processors import Annotator
from diffsynth import download_models
from diffsynth.pipelines.flux_image import FluxImagePipeline, ModelConfig, ControlNetInput
from diffsynth.utils.controlnet import Annotator
from modelscope import snapshot_download
download_models(["Annotators:Depth"])
snapshot_download("sd_lora/Annotators", allow_file_pattern="dpt_hybrid-midas-501f0c75.pt", local_dir="models/Annotators")
pipe = FluxImagePipeline.from_pretrained(
torch_dtype=torch.bfloat16,
device="cuda",
model_configs=[
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="flux1-dev.safetensors"),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="text_encoder/model.safetensors"),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="text_encoder_2/"),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="text_encoder_2/*.safetensors"),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="ae.safetensors"),
ModelConfig(model_id="InstantX/FLUX.1-dev-Controlnet-Union-alpha", origin_file_pattern="diffusion_pytorch_model.safetensors"),
],

View File

@@ -1,5 +1,5 @@
import torch
from diffsynth.pipelines.flux_image_new import FluxImagePipeline, ModelConfig, ControlNetInput
from diffsynth.pipelines.flux_image import FluxImagePipeline, ModelConfig, ControlNetInput
pipe = FluxImagePipeline.from_pretrained(
@@ -8,7 +8,7 @@ pipe = FluxImagePipeline.from_pretrained(
model_configs=[
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="flux1-dev.safetensors"),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="text_encoder/model.safetensors"),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="text_encoder_2/"),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="text_encoder_2/*.safetensors"),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="ae.safetensors"),
ModelConfig(model_id="jasperai/Flux.1-dev-Controlnet-Upscaler", origin_file_pattern="diffusion_pytorch_model.safetensors"),
],

View File

@@ -1,8 +1,7 @@
import random
import torch
from PIL import Image, ImageDraw, ImageFont
from diffsynth import download_customized_models
from diffsynth.pipelines.flux_image_new import FluxImagePipeline, ModelConfig
from diffsynth.pipelines.flux_image import FluxImagePipeline, ModelConfig
from modelscope import dataset_snapshot_download
@@ -91,24 +90,11 @@ pipe = FluxImagePipeline.from_pretrained(
model_configs=[
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="flux1-dev.safetensors"),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="text_encoder/model.safetensors"),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="text_encoder_2/"),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="text_encoder_2/*.safetensors"),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="ae.safetensors"),
],
)
download_from_modelscope = True
if download_from_modelscope:
model_id = "DiffSynth-Studio/Eligen"
downloading_priority = ["ModelScope"]
else:
model_id = "modelscope/EliGen"
downloading_priority = ["HuggingFace"]
EliGen_path = download_customized_models(
model_id=model_id,
origin_file_path="model_bf16.safetensors",
local_dir="models/lora/entity_control",
downloading_priority=downloading_priority)[0]
pipe.load_lora(pipe.dit, EliGen_path, alpha=1)
pipe.load_lora(pipe.dit, ModelConfig(model_id="DiffSynth-Studio/Eligen", origin_file_pattern="model_bf16.safetensors"), alpha=1)
# example 1
global_prompt = "A breathtaking beauty of Raja Ampat by the late-night moonlight , one beautiful woman from behind wearing a pale blue long dress with soft glow, sitting at the top of a cliff looking towards the beach,pastell light colors, a group of small distant birds flying in far sky, a boat sailing on the sea, best quality, realistic, whimsical, fantastic, splash art, intricate detailed, hyperdetailed, maximalist style, photorealistic, concept art, sharp focus, harmony, serenity, tranquility, soft pastell colors,ambient occlusion, cozy ambient lighting, masterpiece, liiv1, linquivera, metix, mentixis, masterpiece, award winning, view from above\n"

View File

@@ -1,5 +1,5 @@
import torch
from diffsynth.pipelines.flux_image_new import FluxImagePipeline, ModelConfig
from diffsynth.pipelines.flux_image import FluxImagePipeline, ModelConfig
pipe = FluxImagePipeline.from_pretrained(
@@ -8,10 +8,10 @@ pipe = FluxImagePipeline.from_pretrained(
model_configs=[
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="flux1-dev.safetensors"),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="text_encoder/model.safetensors"),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="text_encoder_2/"),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="text_encoder_2/*.safetensors"),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="ae.safetensors"),
ModelConfig(model_id="InstantX/FLUX.1-dev-IP-Adapter", origin_file_pattern="ip-adapter.bin"),
ModelConfig(model_id="google/siglip-so400m-patch14-384"),
ModelConfig(model_id="google/siglip-so400m-patch14-384", origin_file_pattern="model.safetensors"),
],
)

View File

@@ -1,11 +1,13 @@
import torch
from diffsynth.pipelines.flux_image_new import FluxImagePipeline, ModelConfig, ControlNetInput
from diffsynth.pipelines.flux_image import FluxImagePipeline, ModelConfig, ControlNetInput
from modelscope import dataset_snapshot_download
from modelscope import snapshot_download
from PIL import Image
import numpy as np
# This model has additional requirements.
# Please install the following packages.
# pip install facexlib insightface onnxruntime
snapshot_download(
"ByteDance/InfiniteYou",
allow_file_pattern="supports/insightface/models/antelopev2/*",
@@ -17,7 +19,7 @@ pipe = FluxImagePipeline.from_pretrained(
model_configs=[
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="flux1-dev.safetensors"),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="text_encoder/model.safetensors"),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="text_encoder_2/"),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="text_encoder_2/*.safetensors"),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="ae.safetensors"),
ModelConfig(model_id="ByteDance/InfiniteYou", origin_file_pattern="infu_flux_v1.0/aes_stage2/image_proj_model.bin"),
ModelConfig(model_id="ByteDance/InfiniteYou", origin_file_pattern="infu_flux_v1.0/aes_stage2/InfuseNetModel/*.safetensors"),

View File

@@ -1,5 +1,5 @@
import torch
from diffsynth.pipelines.flux_image_new import FluxImagePipeline, ModelConfig
from diffsynth.pipelines.flux_image import FluxImagePipeline, ModelConfig
pipe = FluxImagePipeline.from_pretrained(
@@ -8,15 +8,13 @@ pipe = FluxImagePipeline.from_pretrained(
model_configs=[
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="flux1-dev.safetensors"),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="text_encoder/model.safetensors"),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="text_encoder_2/"),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="text_encoder_2/*.safetensors"),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="ae.safetensors"),
ModelConfig(model_id="DiffSynth-Studio/LoRA-Encoder-FLUX.1-Dev", origin_file_pattern="model.safetensors"),
],
)
pipe.enable_lora_magic()
lora = ModelConfig(model_id="VoidOc/flux_animal_forest1", origin_file_pattern="20.safetensors")
pipe.load_lora(pipe.dit, lora, hotload=True) # Use `pipe.clear_lora()` to drop the loaded LoRA.
pipe.load_lora(pipe.dit, lora) # Use `pipe.clear_lora()` to drop the loaded LoRA.
# Empty prompt can automatically activate LoRA capabilities.
image = pipe(prompt="", seed=0, lora_encoder_inputs=lora)

View File

@@ -1,29 +1,38 @@
import torch
from diffsynth.pipelines.flux_image_new import FluxImagePipeline, ModelConfig
from diffsynth.pipelines.flux_image import FluxImagePipeline, ModelConfig
vram_config = {
# Enable lora hotloading
"offload_dtype": torch.bfloat16,
"offload_device": "cuda",
"onload_dtype": torch.bfloat16,
"onload_device": "cuda",
"preparing_dtype": torch.bfloat16,
"preparing_device": "cuda",
"computation_dtype": torch.bfloat16,
"computation_device": "cuda",
}
pipe = FluxImagePipeline.from_pretrained(
torch_dtype=torch.bfloat16,
device="cuda",
model_configs=[
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="flux1-dev.safetensors"),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="flux1-dev.safetensors", **vram_config),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="text_encoder/model.safetensors"),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="text_encoder_2/"),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="text_encoder_2/*.safetensors"),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="ae.safetensors"),
ModelConfig(model_id="DiffSynth-Studio/LoRAFusion-preview-FLUX.1-dev", origin_file_pattern="model.safetensors"),
],
)
pipe.enable_lora_magic()
pipe.enable_lora_merger()
pipe.load_lora(
pipe.dit,
ModelConfig(model_id="cancel13/cxsk", origin_file_pattern="30.safetensors"),
hotload=True,
)
pipe.load_lora(
pipe.dit,
ModelConfig(model_id="DiffSynth-Studio/ArtAug-lora-FLUX.1dev-v1", origin_file_pattern="merged_lora.safetensors"),
hotload=True,
)
image = pipe(prompt="a cat", seed=0)
image.save("image_fused.jpg")

View File

@@ -1,5 +1,5 @@
import torch
from diffsynth.pipelines.flux_image_new import FluxImagePipeline, ModelConfig
from diffsynth.pipelines.flux_image import FluxImagePipeline, ModelConfig
pipe = FluxImagePipeline.from_pretrained(
@@ -8,7 +8,7 @@ pipe = FluxImagePipeline.from_pretrained(
model_configs=[
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="flux1-dev.safetensors"),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="text_encoder/model.safetensors"),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="text_encoder_2/"),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="text_encoder_2/*.safetensors"),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="ae.safetensors"),
],
)

View File

@@ -1,7 +1,7 @@
import importlib
import torch
from PIL import Image
from diffsynth.pipelines.flux_image_new import FluxImagePipeline, ModelConfig
from diffsynth.pipelines.flux_image import FluxImagePipeline, ModelConfig
from modelscope import dataset_snapshot_download
@@ -19,7 +19,7 @@ pipe = FluxImagePipeline.from_pretrained(
ModelConfig(model_id="DiffSynth-Studio/Nexus-GenV2", origin_file_pattern="model*.safetensors"),
ModelConfig(model_id="DiffSynth-Studio/Nexus-GenV2", origin_file_pattern="edit_decoder.bin"),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="text_encoder/model.safetensors"),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="text_encoder_2/"),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="text_encoder_2/*.safetensors"),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="ae.safetensors"),
],
nexus_gen_processor_config=ModelConfig(model_id="DiffSynth-Studio/Nexus-GenV2", origin_file_pattern="processor/"),

View File

@@ -1,6 +1,6 @@
import importlib
import torch
from diffsynth.pipelines.flux_image_new import FluxImagePipeline, ModelConfig
from diffsynth.pipelines.flux_image import FluxImagePipeline, ModelConfig
if importlib.util.find_spec("transformers") is None:
@@ -17,7 +17,7 @@ pipe = FluxImagePipeline.from_pretrained(
ModelConfig(model_id="DiffSynth-Studio/Nexus-GenV2", origin_file_pattern="model*.safetensors"),
ModelConfig(model_id="DiffSynth-Studio/Nexus-GenV2", origin_file_pattern="generation_decoder.bin"),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="text_encoder/model.safetensors"),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="text_encoder_2/"),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="text_encoder_2/*.safetensors"),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="ae.safetensors"),
],
nexus_gen_processor_config=ModelConfig("DiffSynth-Studio/Nexus-GenV2", origin_file_pattern="processor"),

View File

@@ -1,5 +1,5 @@
import torch
from diffsynth.pipelines.flux_image_new import FluxImagePipeline, ModelConfig
from diffsynth.pipelines.flux_image import FluxImagePipeline, ModelConfig
from PIL import Image
import numpy as np
@@ -8,7 +8,7 @@ pipe = FluxImagePipeline.from_pretrained(
torch_dtype=torch.bfloat16,
device="cuda",
model_configs=[
ModelConfig(model_id="Qwen/Qwen2.5-VL-7B-Instruct"),
ModelConfig(model_id="Qwen/Qwen2.5-VL-7B-Instruct", origin_file_pattern="model-*.safetensors"),
ModelConfig(model_id="stepfun-ai/Step1X-Edit", origin_file_pattern="step1x-edit-i1258.safetensors"),
ModelConfig(model_id="stepfun-ai/Step1X-Edit", origin_file_pattern="vae.safetensors"),
],

View File

@@ -1,33 +1,43 @@
import torch
from diffsynth.pipelines.flux_image_new import FluxImagePipeline, ModelConfig
from diffsynth.controlnets.processors import Annotator
from diffsynth.pipelines.flux_image import FluxImagePipeline, ModelConfig
from diffsynth.utils.controlnet import Annotator
import numpy as np
from PIL import Image
vram_config = {
"offload_dtype": torch.float8_e4m3fn,
"offload_device": "cpu",
"onload_dtype": torch.float8_e4m3fn,
"onload_device": "cpu",
"preparing_dtype": torch.float8_e4m3fn,
"preparing_device": "cuda",
"computation_dtype": torch.bfloat16,
"computation_device": "cuda",
}
pipe = FluxImagePipeline.from_pretrained(
torch_dtype=torch.bfloat16,
device="cuda",
model_configs=[
ModelConfig(model_id="ostris/Flex.2-preview", origin_file_pattern="Flex.2-preview.safetensors", offload_device="cpu", offload_dtype=torch.float8_e4m3fn),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="text_encoder/model.safetensors", offload_device="cpu", offload_dtype=torch.float8_e4m3fn),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="text_encoder_2/", offload_device="cpu", offload_dtype=torch.float8_e4m3fn),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="ae.safetensors", offload_device="cpu", offload_dtype=torch.float8_e4m3fn),
ModelConfig(model_id="ostris/Flex.2-preview", origin_file_pattern="Flex.2-preview.safetensors", **vram_config),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="text_encoder/model.safetensors", **vram_config),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="text_encoder_2/*.safetensors", **vram_config),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="ae.safetensors", **vram_config),
],
vram_limit=torch.cuda.mem_get_info("cuda")[1] / (1024 ** 3) - 0.5,
)
pipe.enable_vram_management()
image = pipe(
prompt="portrait of a beautiful Asian girl, long hair, red t-shirt, sunshine, beach",
num_inference_steps=50, embedded_guidance=3.5,
seed=0
)
image.save(f"image_1.jpg")
image.save("image_1.jpg")
mask = np.zeros((1024, 1024, 3), dtype=np.uint8)
mask[200:400, 400:700] = 255
mask = Image.fromarray(mask)
mask.save(f"image_mask.jpg")
mask.save("image_mask.jpg")
inpaint_image = image
@@ -37,7 +47,7 @@ image = pipe(
flex_inpaint_image=inpaint_image, flex_inpaint_mask=mask,
seed=4
)
image.save(f"image_2_new.jpg")
image.save("image_2.jpg")
control_image = Annotator("canny")(image)
control_image.save("image_control.jpg")
@@ -48,4 +58,4 @@ image = pipe(
flex_control_image=control_image,
seed=4
)
image.save(f"image_3_new.jpg")
image.save("image_3.jpg")

View File

@@ -1,19 +1,29 @@
import torch
from diffsynth.pipelines.flux_image_new import FluxImagePipeline, ModelConfig
from diffsynth.pipelines.flux_image import FluxImagePipeline, ModelConfig
from PIL import Image
vram_config = {
"offload_dtype": torch.float8_e4m3fn,
"offload_device": "cpu",
"onload_dtype": torch.float8_e4m3fn,
"onload_device": "cpu",
"preparing_dtype": torch.float8_e4m3fn,
"preparing_device": "cuda",
"computation_dtype": torch.bfloat16,
"computation_device": "cuda",
}
pipe = FluxImagePipeline.from_pretrained(
torch_dtype=torch.bfloat16,
device="cuda",
model_configs=[
ModelConfig(model_id="black-forest-labs/FLUX.1-Kontext-dev", origin_file_pattern="flux1-kontext-dev.safetensors", offload_device="cpu", offload_dtype=torch.float8_e4m3fn),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="text_encoder/model.safetensors", offload_device="cpu", offload_dtype=torch.float8_e4m3fn),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="text_encoder_2/", offload_device="cpu", offload_dtype=torch.float8_e4m3fn),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="ae.safetensors", offload_device="cpu", offload_dtype=torch.float8_e4m3fn),
ModelConfig(model_id="black-forest-labs/FLUX.1-Kontext-dev", origin_file_pattern="flux1-kontext-dev.safetensors", **vram_config),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="text_encoder/model.safetensors", **vram_config),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="text_encoder_2/*.safetensors", **vram_config),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="ae.safetensors", **vram_config),
],
vram_limit=torch.cuda.mem_get_info("cuda")[1] / (1024 ** 3) - 0.5,
)
pipe.enable_vram_management()
image_1 = pipe(
prompt="a beautiful Asian long-haired female college student.",

View File

@@ -1,18 +1,28 @@
import torch
from diffsynth.pipelines.flux_image_new import FluxImagePipeline, ModelConfig
from diffsynth.pipelines.flux_image import FluxImagePipeline, ModelConfig
vram_config = {
"offload_dtype": torch.float8_e4m3fn,
"offload_device": "cpu",
"onload_dtype": torch.float8_e4m3fn,
"onload_device": "cpu",
"preparing_dtype": torch.float8_e4m3fn,
"preparing_device": "cuda",
"computation_dtype": torch.bfloat16,
"computation_device": "cuda",
}
pipe = FluxImagePipeline.from_pretrained(
torch_dtype=torch.bfloat16,
device="cuda",
model_configs=[
ModelConfig(model_id="black-forest-labs/FLUX.1-Krea-dev", origin_file_pattern="flux1-krea-dev.safetensors", offload_device="cpu", offload_dtype=torch.float8_e4m3fn),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="text_encoder/model.safetensors", offload_device="cpu", offload_dtype=torch.float8_e4m3fn),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="text_encoder_2/", offload_device="cpu", offload_dtype=torch.float8_e4m3fn),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="ae.safetensors", offload_device="cpu", offload_dtype=torch.float8_e4m3fn),
ModelConfig(model_id="black-forest-labs/FLUX.1-Krea-dev", origin_file_pattern="flux1-krea-dev.safetensors", **vram_config),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="text_encoder/model.safetensors", **vram_config),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="text_encoder_2/*.safetensors", **vram_config),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="ae.safetensors", **vram_config),
],
vram_limit=torch.cuda.mem_get_info("cuda")[1] / (1024 ** 3) - 0.5,
)
pipe.enable_vram_management()
prompt = "An beautiful woman is riding a bicycle in a park, wearing a red dress"
negative_prompt = "worst quality, low quality, monochrome, zombie, interlocked fingers, Aissist, cleavage, nsfw,"

View File

@@ -1,19 +1,29 @@
import torch
from diffsynth.pipelines.flux_image_new import FluxImagePipeline, ModelConfig
from diffsynth.pipelines.flux_image import FluxImagePipeline, ModelConfig
vram_config = {
"offload_dtype": torch.float8_e4m3fn,
"offload_device": "cpu",
"onload_dtype": torch.float8_e4m3fn,
"onload_device": "cpu",
"preparing_dtype": torch.float8_e4m3fn,
"preparing_device": "cuda",
"computation_dtype": torch.bfloat16,
"computation_device": "cuda",
}
pipe = FluxImagePipeline.from_pretrained(
torch_dtype=torch.bfloat16,
device="cuda",
model_configs=[
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="flux1-dev.safetensors", offload_device="cpu", offload_dtype=torch.float8_e4m3fn),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="text_encoder/model.safetensors", offload_device="cpu", offload_dtype=torch.float8_e4m3fn),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="text_encoder_2/", offload_device="cpu", offload_dtype=torch.float8_e4m3fn),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="ae.safetensors", offload_device="cpu", offload_dtype=torch.float8_e4m3fn),
ModelConfig(model_id="DiffSynth-Studio/AttriCtrl-FLUX.1-Dev", origin_file_pattern="models/brightness.safetensors", offload_device="cpu", offload_dtype=torch.float8_e4m3fn)
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="flux1-dev.safetensors", **vram_config),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="text_encoder/model.safetensors", **vram_config),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="text_encoder_2/*.safetensors", **vram_config),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="ae.safetensors", **vram_config),
ModelConfig(model_id="DiffSynth-Studio/AttriCtrl-FLUX.1-Dev", origin_file_pattern="models/brightness.safetensors", **vram_config)
],
vram_limit=torch.cuda.mem_get_info("cuda")[1] / (1024 ** 3) - 0.5,
)
pipe.enable_vram_management()
for i in [0.1, 0.3, 0.5, 0.7, 0.9]:
image = pipe(prompt="a cat on the beach", seed=2, value_controller_inputs=[i])

View File

@@ -1,21 +1,31 @@
import torch
from diffsynth.pipelines.flux_image_new import FluxImagePipeline, ModelConfig, ControlNetInput
from diffsynth.pipelines.flux_image import FluxImagePipeline, ModelConfig, ControlNetInput
import numpy as np
from PIL import Image
vram_config = {
"offload_dtype": torch.float8_e4m3fn,
"offload_device": "cpu",
"onload_dtype": torch.float8_e4m3fn,
"onload_device": "cpu",
"preparing_dtype": torch.float8_e4m3fn,
"preparing_device": "cuda",
"computation_dtype": torch.bfloat16,
"computation_device": "cuda",
}
pipe = FluxImagePipeline.from_pretrained(
torch_dtype=torch.bfloat16,
device="cuda",
model_configs=[
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="flux1-dev.safetensors", offload_device="cpu", offload_dtype=torch.float8_e4m3fn),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="text_encoder/model.safetensors", offload_device="cpu", offload_dtype=torch.float8_e4m3fn),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="text_encoder_2/", offload_device="cpu", offload_dtype=torch.float8_e4m3fn),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="ae.safetensors", offload_device="cpu", offload_dtype=torch.float8_e4m3fn),
ModelConfig(model_id="alimama-creative/FLUX.1-dev-Controlnet-Inpainting-Beta", origin_file_pattern="diffusion_pytorch_model.safetensors", offload_device="cpu", offload_dtype=torch.float8_e4m3fn),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="flux1-dev.safetensors", **vram_config),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="text_encoder/model.safetensors", **vram_config),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="text_encoder_2/*.safetensors", **vram_config),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="ae.safetensors", **vram_config),
ModelConfig(model_id="alimama-creative/FLUX.1-dev-Controlnet-Inpainting-Beta", origin_file_pattern="diffusion_pytorch_model.safetensors", **vram_config),
],
vram_limit=torch.cuda.mem_get_info("cuda")[1] / (1024 ** 3) - 0.5,
)
pipe.enable_vram_management()
image_1 = pipe(
prompt="a cat sitting on a chair",

View File

@@ -1,23 +1,32 @@
import torch
from diffsynth.pipelines.flux_image_new import FluxImagePipeline, ModelConfig, ControlNetInput
from diffsynth.controlnets.processors import Annotator
from diffsynth import download_models
from diffsynth.pipelines.flux_image import FluxImagePipeline, ModelConfig, ControlNetInput
from diffsynth.utils.controlnet import Annotator
from modelscope import snapshot_download
download_models(["Annotators:Depth"])
vram_config = {
"offload_dtype": torch.float8_e4m3fn,
"offload_device": "cpu",
"onload_dtype": torch.float8_e4m3fn,
"onload_device": "cpu",
"preparing_dtype": torch.float8_e4m3fn,
"preparing_device": "cuda",
"computation_dtype": torch.bfloat16,
"computation_device": "cuda",
}
snapshot_download("sd_lora/Annotators", allow_file_pattern="dpt_hybrid-midas-501f0c75.pt", local_dir="models/Annotators")
pipe = FluxImagePipeline.from_pretrained(
torch_dtype=torch.bfloat16,
device="cuda",
model_configs=[
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="flux1-dev.safetensors", offload_device="cpu", offload_dtype=torch.float8_e4m3fn),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="text_encoder/model.safetensors", offload_device="cpu", offload_dtype=torch.float8_e4m3fn),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="text_encoder_2/", offload_device="cpu", offload_dtype=torch.float8_e4m3fn),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="ae.safetensors", offload_device="cpu", offload_dtype=torch.float8_e4m3fn),
ModelConfig(model_id="InstantX/FLUX.1-dev-Controlnet-Union-alpha", origin_file_pattern="diffusion_pytorch_model.safetensors", offload_device="cpu", offload_dtype=torch.float8_e4m3fn),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="flux1-dev.safetensors", **vram_config),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="text_encoder/model.safetensors", **vram_config),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="text_encoder_2/*.safetensors", **vram_config),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="ae.safetensors", **vram_config),
ModelConfig(model_id="InstantX/FLUX.1-dev-Controlnet-Union-alpha", origin_file_pattern="diffusion_pytorch_model.safetensors", **vram_config),
],
vram_limit=torch.cuda.mem_get_info("cuda")[1] / (1024 ** 3) - 0.5,
)
pipe.enable_vram_management()
image_1 = pipe(
prompt="a beautiful Asian girl, full body, red dress, summer",

View File

@@ -1,19 +1,29 @@
import torch
from diffsynth.pipelines.flux_image_new import FluxImagePipeline, ModelConfig, ControlNetInput
from diffsynth.pipelines.flux_image import FluxImagePipeline, ModelConfig, ControlNetInput
vram_config = {
"offload_dtype": torch.float8_e4m3fn,
"offload_device": "cpu",
"onload_dtype": torch.float8_e4m3fn,
"onload_device": "cpu",
"preparing_dtype": torch.float8_e4m3fn,
"preparing_device": "cuda",
"computation_dtype": torch.bfloat16,
"computation_device": "cuda",
}
pipe = FluxImagePipeline.from_pretrained(
torch_dtype=torch.bfloat16,
device="cuda",
model_configs=[
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="flux1-dev.safetensors", offload_device="cpu", offload_dtype=torch.float8_e4m3fn),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="text_encoder/model.safetensors", offload_device="cpu", offload_dtype=torch.float8_e4m3fn),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="text_encoder_2/", offload_device="cpu", offload_dtype=torch.float8_e4m3fn),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="ae.safetensors", offload_device="cpu", offload_dtype=torch.float8_e4m3fn),
ModelConfig(model_id="jasperai/Flux.1-dev-Controlnet-Upscaler", origin_file_pattern="diffusion_pytorch_model.safetensors", offload_device="cpu", offload_dtype=torch.float8_e4m3fn),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="flux1-dev.safetensors", **vram_config),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="text_encoder/model.safetensors", **vram_config),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="text_encoder_2/*.safetensors", **vram_config),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="ae.safetensors", **vram_config),
ModelConfig(model_id="jasperai/Flux.1-dev-Controlnet-Upscaler", origin_file_pattern="diffusion_pytorch_model.safetensors", **vram_config),
],
vram_limit=torch.cuda.mem_get_info("cuda")[1] / (1024 ** 3) - 0.5,
)
pipe.enable_vram_management()
image_1 = pipe(
prompt="a photo of a cat, highly detailed",

View File

@@ -1,11 +1,20 @@
import random
import torch
from PIL import Image, ImageDraw, ImageFont
from diffsynth import download_customized_models
from diffsynth.pipelines.flux_image_new import FluxImagePipeline, ModelConfig
from diffsynth.pipelines.flux_image import FluxImagePipeline, ModelConfig
from modelscope import dataset_snapshot_download
vram_config = {
"offload_dtype": torch.float8_e4m3fn,
"offload_device": "cpu",
"onload_dtype": torch.float8_e4m3fn,
"onload_device": "cpu",
"preparing_dtype": torch.float8_e4m3fn,
"preparing_device": "cuda",
"computation_dtype": torch.bfloat16,
"computation_device": "cuda",
}
def visualize_masks(image, masks, mask_prompts, output_path, font_size=35, use_random_colors=False):
# Create a blank image for overlays
overlay = Image.new('RGBA', image.size, (0, 0, 0, 0))
@@ -89,27 +98,14 @@ pipe = FluxImagePipeline.from_pretrained(
torch_dtype=torch.bfloat16,
device="cuda",
model_configs=[
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="flux1-dev.safetensors", offload_device="cpu", offload_dtype=torch.float8_e4m3fn),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="text_encoder/model.safetensors", offload_device="cpu", offload_dtype=torch.float8_e4m3fn),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="text_encoder_2/", offload_device="cpu", offload_dtype=torch.float8_e4m3fn),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="ae.safetensors", offload_device="cpu", offload_dtype=torch.float8_e4m3fn),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="flux1-dev.safetensors", **vram_config),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="text_encoder/model.safetensors", **vram_config),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="text_encoder_2/*.safetensors", **vram_config),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="ae.safetensors", **vram_config),
],
vram_limit=torch.cuda.mem_get_info("cuda")[1] / (1024 ** 3) - 0.5,
)
pipe.enable_vram_management()
download_from_modelscope = True
if download_from_modelscope:
model_id = "DiffSynth-Studio/Eligen"
downloading_priority = ["ModelScope"]
else:
model_id = "modelscope/EliGen"
downloading_priority = ["HuggingFace"]
EliGen_path = download_customized_models(
model_id=model_id,
origin_file_path="model_bf16.safetensors",
local_dir="models/lora/entity_control",
downloading_priority=downloading_priority)[0]
pipe.load_lora(pipe.dit, EliGen_path, alpha=1)
pipe.load_lora(pipe.dit, ModelConfig(model_id="DiffSynth-Studio/Eligen", origin_file_pattern="model_bf16.safetensors"), alpha=1)
# example 1
global_prompt = "A breathtaking beauty of Raja Ampat by the late-night moonlight , one beautiful woman from behind wearing a pale blue long dress with soft glow, sitting at the top of a cliff looking towards the beach,pastell light colors, a group of small distant birds flying in far sky, a boat sailing on the sea, best quality, realistic, whimsical, fantastic, splash art, intricate detailed, hyperdetailed, maximalist style, photorealistic, concept art, sharp focus, harmony, serenity, tranquility, soft pastell colors,ambient occlusion, cozy ambient lighting, masterpiece, liiv1, linquivera, metix, mentixis, masterpiece, award winning, view from above\n"

View File

@@ -1,20 +1,30 @@
import torch
from diffsynth.pipelines.flux_image_new import FluxImagePipeline, ModelConfig
from diffsynth.pipelines.flux_image import FluxImagePipeline, ModelConfig
vram_config = {
"offload_dtype": torch.float8_e4m3fn,
"offload_device": "cpu",
"onload_dtype": torch.float8_e4m3fn,
"onload_device": "cpu",
"preparing_dtype": torch.float8_e4m3fn,
"preparing_device": "cuda",
"computation_dtype": torch.bfloat16,
"computation_device": "cuda",
}
pipe = FluxImagePipeline.from_pretrained(
torch_dtype=torch.bfloat16,
device="cuda",
model_configs=[
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="flux1-dev.safetensors", offload_device="cpu", offload_dtype=torch.float8_e4m3fn),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="text_encoder/model.safetensors", offload_device="cpu", offload_dtype=torch.float8_e4m3fn),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="text_encoder_2/", offload_device="cpu", offload_dtype=torch.float8_e4m3fn),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="ae.safetensors", offload_device="cpu", offload_dtype=torch.float8_e4m3fn),
ModelConfig(model_id="InstantX/FLUX.1-dev-IP-Adapter", origin_file_pattern="ip-adapter.bin", offload_device="cpu", offload_dtype=torch.float8_e4m3fn),
ModelConfig(model_id="google/siglip-so400m-patch14-384", offload_device="cpu", offload_dtype=torch.float8_e4m3fn),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="flux1-dev.safetensors", **vram_config),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="text_encoder/model.safetensors", **vram_config),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="text_encoder_2/*.safetensors", **vram_config),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="ae.safetensors", **vram_config),
ModelConfig(model_id="InstantX/FLUX.1-dev-IP-Adapter", origin_file_pattern="ip-adapter.bin", **vram_config),
ModelConfig(model_id="google/siglip-so400m-patch14-384", origin_file_pattern="model.safetensors", **vram_config),
],
vram_limit=torch.cuda.mem_get_info("cuda")[1] / (1024 ** 3) - 0.5,
)
pipe.enable_vram_management()
origin_prompt = "a rabbit in a garden, colorful flowers"
image = pipe(prompt=origin_prompt, height=1280, width=960, seed=42)

View File

@@ -1,11 +1,24 @@
import torch
from diffsynth.pipelines.flux_image_new import FluxImagePipeline, ModelConfig, ControlNetInput
from diffsynth.pipelines.flux_image import FluxImagePipeline, ModelConfig, ControlNetInput
from modelscope import dataset_snapshot_download
from modelscope import snapshot_download
from PIL import Image
import numpy as np
# This model has additional requirements.
# Please install the following packages.
# pip install facexlib insightface onnxruntime
vram_config = {
"offload_dtype": torch.float8_e4m3fn,
"offload_device": "cpu",
"onload_dtype": torch.float8_e4m3fn,
"onload_device": "cpu",
"preparing_dtype": torch.float8_e4m3fn,
"preparing_device": "cuda",
"computation_dtype": torch.bfloat16,
"computation_device": "cuda",
}
snapshot_download(
"ByteDance/InfiniteYou",
allow_file_pattern="supports/insightface/models/antelopev2/*",
@@ -15,15 +28,15 @@ pipe = FluxImagePipeline.from_pretrained(
torch_dtype=torch.bfloat16,
device="cuda",
model_configs=[
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="flux1-dev.safetensors", offload_device="cpu", offload_dtype=torch.float8_e4m3fn),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="text_encoder/model.safetensors", offload_device="cpu", offload_dtype=torch.float8_e4m3fn),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="text_encoder_2/", offload_device="cpu", offload_dtype=torch.float8_e4m3fn),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="ae.safetensors", offload_device="cpu", offload_dtype=torch.float8_e4m3fn),
ModelConfig(model_id="ByteDance/InfiniteYou", origin_file_pattern="infu_flux_v1.0/aes_stage2/image_proj_model.bin", offload_device="cpu", offload_dtype=torch.float8_e4m3fn),
ModelConfig(model_id="ByteDance/InfiniteYou", origin_file_pattern="infu_flux_v1.0/aes_stage2/InfuseNetModel/*.safetensors", offload_device="cpu", offload_dtype=torch.float8_e4m3fn),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="flux1-dev.safetensors", **vram_config),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="text_encoder/model.safetensors", **vram_config),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="text_encoder_2/*.safetensors", **vram_config),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="ae.safetensors", **vram_config),
ModelConfig(model_id="ByteDance/InfiniteYou", origin_file_pattern="infu_flux_v1.0/aes_stage2/image_proj_model.bin", **vram_config),
ModelConfig(model_id="ByteDance/InfiniteYou", origin_file_pattern="infu_flux_v1.0/aes_stage2/InfuseNetModel/*.safetensors", **vram_config),
],
vram_limit=torch.cuda.mem_get_info("cuda")[1] / (1024 ** 3) - 0.5,
)
pipe.enable_vram_management()
dataset_snapshot_download(
dataset_id="DiffSynth-Studio/examples_in_diffsynth",

View File

@@ -1,23 +1,31 @@
import torch
from diffsynth.pipelines.flux_image_new import FluxImagePipeline, ModelConfig
from diffsynth.pipelines.flux_image import FluxImagePipeline, ModelConfig
vram_config = {
"offload_dtype": torch.float8_e4m3fn,
"offload_device": "cpu",
"onload_dtype": torch.float8_e4m3fn,
"onload_device": "cpu",
"preparing_dtype": torch.float8_e4m3fn,
"preparing_device": "cuda",
"computation_dtype": torch.bfloat16,
"computation_device": "cuda",
}
pipe = FluxImagePipeline.from_pretrained(
torch_dtype=torch.bfloat16,
device="cuda",
model_configs=[
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="flux1-dev.safetensors", offload_device="cpu", offload_dtype=torch.float8_e4m3fn),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="text_encoder/model.safetensors", offload_device="cpu", offload_dtype=torch.float8_e4m3fn),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="text_encoder_2/", offload_device="cpu", offload_dtype=torch.float8_e4m3fn),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="ae.safetensors", offload_device="cpu", offload_dtype=torch.float8_e4m3fn),
ModelConfig(model_id="DiffSynth-Studio/LoRA-Encoder-FLUX.1-Dev", origin_file_pattern="model.safetensors", offload_device="cpu", offload_dtype=torch.float8_e4m3fn),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="flux1-dev.safetensors", **vram_config),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="text_encoder/model.safetensors", **vram_config),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="text_encoder_2/*.safetensors", **vram_config),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="ae.safetensors", **vram_config),
ModelConfig(model_id="DiffSynth-Studio/LoRA-Encoder-FLUX.1-Dev", origin_file_pattern="model.safetensors", **vram_config),
],
vram_limit=torch.cuda.mem_get_info("cuda")[1] / (1024 ** 3) - 0.5,
)
pipe.enable_vram_management()
pipe.enable_lora_magic()
lora = ModelConfig(model_id="VoidOc/flux_animal_forest1", origin_file_pattern="20.safetensors")
pipe.load_lora(pipe.dit, lora, hotload=True) # Use `pipe.clear_lora()` to drop the loaded LoRA.
pipe.load_lora(pipe.dit, lora) # Use `pipe.clear_lora()` to drop the loaded LoRA.
# Empty prompt can automatically activate LoRA capabilities.
image = pipe(prompt="", seed=0, lora_encoder_inputs=lora)

View File

@@ -0,0 +1,38 @@
import torch
from diffsynth.pipelines.flux_image import FluxImagePipeline, ModelConfig
vram_config = {
"offload_dtype": torch.float8_e4m3fn,
"offload_device": "cpu",
"onload_dtype": torch.float8_e4m3fn,
"onload_device": "cpu",
"preparing_dtype": torch.float8_e4m3fn,
"preparing_device": "cuda",
"computation_dtype": torch.bfloat16,
"computation_device": "cuda",
}
pipe = FluxImagePipeline.from_pretrained(
torch_dtype=torch.bfloat16,
device="cuda",
model_configs=[
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="flux1-dev.safetensors", **vram_config),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="text_encoder/model.safetensors", **vram_config),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="text_encoder_2/*.safetensors", **vram_config),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="ae.safetensors", **vram_config),
ModelConfig(model_id="DiffSynth-Studio/LoRAFusion-preview-FLUX.1-dev", origin_file_pattern="model.safetensors", **vram_config),
],
vram_limit=torch.cuda.mem_get_info("cuda")[1] / (1024 ** 3) - 0.5,
)
pipe.enable_lora_merger()
pipe.load_lora(
pipe.dit,
ModelConfig(model_id="cancel13/cxsk", origin_file_pattern="30.safetensors"),
)
pipe.load_lora(
pipe.dit,
ModelConfig(model_id="DiffSynth-Studio/ArtAug-lora-FLUX.1dev-v1", origin_file_pattern="merged_lora.safetensors"),
)
image = pipe(prompt="a cat", seed=0)
image.save("image_fused.jpg")

View File

@@ -1,35 +0,0 @@
import torch
from diffsynth.pipelines.flux_image_new import FluxImagePipeline, ModelConfig
pipe = FluxImagePipeline.from_pretrained(
torch_dtype=torch.bfloat16,
device="cuda",
model_configs=[
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="flux1-dev.safetensors", offload_device="cpu", offload_dtype=torch.float8_e4m3fn),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="text_encoder/model.safetensors", offload_device="cpu", offload_dtype=torch.float8_e4m3fn),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="text_encoder_2/", offload_device="cpu", offload_dtype=torch.float8_e4m3fn),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="ae.safetensors", offload_device="cpu", offload_dtype=torch.float8_e4m3fn),
ModelConfig(model_id="DiffSynth-Studio/FLUX.1-dev-LoRAFusion", origin_file_pattern="model.safetensors", offload_device="cpu", offload_dtype=torch.float8_e4m3fn)
],
)
pipe.enable_vram_management()
pipe.enable_lora_patcher()
pipe.load_lora(
pipe.dit,
ModelConfig(model_id="yangyufeng/fgao", origin_file_pattern="30.safetensors"),
hotload=True
)
pipe.load_lora(
pipe.dit,
ModelConfig(model_id="bobooblue/LoRA-bling-mai", origin_file_pattern="10.safetensors"),
hotload=True
)
pipe.load_lora(
pipe.dit,
ModelConfig(model_id="JIETANGAB/E", origin_file_pattern="17.safetensors"),
hotload=True
)
image = pipe(prompt="This is a digital painting in a soft, ethereal style. a beautiful Asian girl Shine like a diamond. Everywhere is shining with bling bling luster.The background is a textured blue with visible brushstrokes, giving the image an impressionistic style reminiscent of Vincent van Gogh's work", seed=0)
image.save("flux.jpg")

View File

@@ -1,18 +1,28 @@
import torch
from diffsynth.pipelines.flux_image_new import FluxImagePipeline, ModelConfig
from diffsynth.pipelines.flux_image import FluxImagePipeline, ModelConfig
vram_config = {
"offload_dtype": torch.float8_e4m3fn,
"offload_device": "cpu",
"onload_dtype": torch.float8_e4m3fn,
"onload_device": "cpu",
"preparing_dtype": torch.float8_e4m3fn,
"preparing_device": "cuda",
"computation_dtype": torch.bfloat16,
"computation_device": "cuda",
}
pipe = FluxImagePipeline.from_pretrained(
torch_dtype=torch.bfloat16,
device="cuda",
model_configs=[
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="flux1-dev.safetensors", offload_device="cpu", offload_dtype=torch.float8_e4m3fn),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="text_encoder/model.safetensors", offload_device="cpu", offload_dtype=torch.float8_e4m3fn),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="text_encoder_2/", offload_device="cpu", offload_dtype=torch.float8_e4m3fn),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="ae.safetensors", offload_device="cpu", offload_dtype=torch.float8_e4m3fn),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="flux1-dev.safetensors", **vram_config),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="text_encoder/model.safetensors", **vram_config),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="text_encoder_2/*.safetensors", **vram_config),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="ae.safetensors", **vram_config),
],
vram_limit=torch.cuda.mem_get_info("cuda")[1] / (1024 ** 3) - 0.5,
)
pipe.enable_vram_management()
prompt = "CG, masterpiece, best quality, solo, long hair, wavy hair, silver hair, blue eyes, blue dress, medium breasts, dress, underwater, air bubble, floating hair, refraction, portrait. The girl's flowing silver hair shimmers with every color of the rainbow and cascades down, merging with the floating flora around her."
negative_prompt = "worst quality, low quality, monochrome, zombie, interlocked fingers, Aissist, cleavage, nsfw,"

View File

@@ -1,7 +1,7 @@
import importlib
import torch
from PIL import Image
from diffsynth.pipelines.flux_image_new import FluxImagePipeline, ModelConfig
from diffsynth.pipelines.flux_image import FluxImagePipeline, ModelConfig
from modelscope import dataset_snapshot_download
@@ -12,19 +12,29 @@ else:
assert transformers.__version__ == "4.49.0", "Nexus-GenV2 requires transformers==4.49.0, please install it with `pip install transformers==4.49.0`."
vram_config = {
"offload_dtype": torch.float8_e4m3fn,
"offload_device": "cpu",
"onload_dtype": torch.float8_e4m3fn,
"onload_device": "cpu",
"preparing_dtype": torch.float8_e4m3fn,
"preparing_device": "cuda",
"computation_dtype": torch.bfloat16,
"computation_device": "cuda",
}
pipe = FluxImagePipeline.from_pretrained(
torch_dtype=torch.bfloat16,
device="cuda",
model_configs=[
ModelConfig(model_id="DiffSynth-Studio/Nexus-GenV2", origin_file_pattern="model*.safetensors", offload_device="cpu"),
ModelConfig(model_id="DiffSynth-Studio/Nexus-GenV2", origin_file_pattern="edit_decoder.bin", offload_device="cpu"),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="text_encoder/model.safetensors", offload_device="cpu"),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="text_encoder_2/", offload_device="cpu"),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="ae.safetensors", offload_device="cpu"),
ModelConfig(model_id="DiffSynth-Studio/Nexus-GenV2", origin_file_pattern="model*.safetensors", **vram_config),
ModelConfig(model_id="DiffSynth-Studio/Nexus-GenV2", origin_file_pattern="edit_decoder.bin", **vram_config),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="text_encoder/model.safetensors", **vram_config),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="text_encoder_2/*.safetensors", **vram_config),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="ae.safetensors", **vram_config),
],
nexus_gen_processor_config=ModelConfig(model_id="DiffSynth-Studio/Nexus-GenV2", origin_file_pattern="processor/"),
vram_limit=torch.cuda.mem_get_info("cuda")[1] / (1024 ** 3) - 0.5,
)
pipe.enable_vram_management()
dataset_snapshot_download(dataset_id="DiffSynth-Studio/examples_in_diffsynth", local_dir="./", allow_file_pattern=f"data/examples/nexusgen/cat.jpg")
ref_image = Image.open("data/examples/nexusgen/cat.jpg").convert("RGB")

View File

@@ -1,6 +1,6 @@
import importlib
import torch
from diffsynth.pipelines.flux_image_new import FluxImagePipeline, ModelConfig
from diffsynth.pipelines.flux_image import FluxImagePipeline, ModelConfig
if importlib.util.find_spec("transformers") is None:
@@ -10,19 +10,29 @@ else:
assert transformers.__version__ == "4.49.0", "Nexus-GenV2 requires transformers==4.49.0, please install it with `pip install transformers==4.49.0`."
vram_config = {
"offload_dtype": torch.float8_e4m3fn,
"offload_device": "cpu",
"onload_dtype": torch.float8_e4m3fn,
"onload_device": "cpu",
"preparing_dtype": torch.float8_e4m3fn,
"preparing_device": "cuda",
"computation_dtype": torch.bfloat16,
"computation_device": "cuda",
}
pipe = FluxImagePipeline.from_pretrained(
torch_dtype=torch.bfloat16,
device="cuda",
model_configs=[
ModelConfig(model_id="DiffSynth-Studio/Nexus-GenV2", origin_file_pattern="model*.safetensors", offload_device="cpu"),
ModelConfig(model_id="DiffSynth-Studio/Nexus-GenV2", origin_file_pattern="generation_decoder.bin", offload_device="cpu"),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="text_encoder/model.safetensors", offload_device="cpu"),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="text_encoder_2/", offload_device="cpu"),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="ae.safetensors", offload_device="cpu"),
ModelConfig(model_id="DiffSynth-Studio/Nexus-GenV2", origin_file_pattern="model*.safetensors", **vram_config),
ModelConfig(model_id="DiffSynth-Studio/Nexus-GenV2", origin_file_pattern="generation_decoder.bin", **vram_config),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="text_encoder/model.safetensors", **vram_config),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="text_encoder_2/*.safetensors", **vram_config),
ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="ae.safetensors", **vram_config),
],
nexus_gen_processor_config=ModelConfig(model_id="DiffSynth-Studio/Nexus-GenV2", origin_file_pattern="processor/"),
nexus_gen_processor_config=ModelConfig("DiffSynth-Studio/Nexus-GenV2", origin_file_pattern="processor"),
vram_limit=torch.cuda.mem_get_info("cuda")[1] / (1024 ** 3) - 0.5,
)
pipe.enable_vram_management()
prompt = "一只可爱的猫咪"
image = pipe(

View File

@@ -1,19 +1,29 @@
import torch
from diffsynth.pipelines.flux_image_new import FluxImagePipeline, ModelConfig
from diffsynth.pipelines.flux_image import FluxImagePipeline, ModelConfig
from PIL import Image
import numpy as np
vram_config = {
"offload_dtype": torch.float8_e4m3fn,
"offload_device": "cpu",
"onload_dtype": torch.float8_e4m3fn,
"onload_device": "cpu",
"preparing_dtype": torch.float8_e4m3fn,
"preparing_device": "cuda",
"computation_dtype": torch.bfloat16,
"computation_device": "cuda",
}
pipe = FluxImagePipeline.from_pretrained(
torch_dtype=torch.bfloat16,
device="cuda",
model_configs=[
ModelConfig(model_id="Qwen/Qwen2.5-VL-7B-Instruct", offload_device="cpu", offload_dtype=torch.float8_e4m3fn),
ModelConfig(model_id="stepfun-ai/Step1X-Edit", origin_file_pattern="step1x-edit-i1258.safetensors", offload_device="cpu", offload_dtype=torch.float8_e4m3fn),
ModelConfig(model_id="stepfun-ai/Step1X-Edit", origin_file_pattern="vae.safetensors", offload_device="cpu", offload_dtype=torch.float8_e4m3fn),
ModelConfig(model_id="Qwen/Qwen2.5-VL-7B-Instruct", origin_file_pattern="model-*.safetensors", **vram_config),
ModelConfig(model_id="stepfun-ai/Step1X-Edit", origin_file_pattern="step1x-edit-i1258.safetensors", **vram_config),
ModelConfig(model_id="stepfun-ai/Step1X-Edit", origin_file_pattern="vae.safetensors", **vram_config),
],
vram_limit=torch.cuda.mem_get_info("cuda")[1] / (1024 ** 3) - 0.5,
)
pipe.enable_vram_management()
image = Image.fromarray(np.zeros((1248, 832, 3), dtype=np.uint8) + 255)
image = pipe(

View File

@@ -3,7 +3,7 @@ accelerate launch --config_file examples/flux/model_training/full/accelerate_con
--dataset_metadata_path data/example_image_dataset/metadata.csv \
--max_pixels 1048576 \
--dataset_repeat 200 \
--model_id_with_origin_paths "ostris/Flex.2-preview:Flex.2-preview.safetensors,black-forest-labs/FLUX.1-dev:text_encoder/model.safetensors,black-forest-labs/FLUX.1-dev:text_encoder_2/,black-forest-labs/FLUX.1-dev:ae.safetensors" \
--model_id_with_origin_paths "ostris/Flex.2-preview:Flex.2-preview.safetensors,black-forest-labs/FLUX.1-dev:text_encoder/model.safetensors,black-forest-labs/FLUX.1-dev:text_encoder_2/*.safetensors,black-forest-labs/FLUX.1-dev:ae.safetensors" \
--learning_rate 1e-5 \
--num_epochs 1 \
--remove_prefix_in_ckpt "pipe.dit." \

View File

@@ -4,7 +4,7 @@ accelerate launch --config_file examples/flux/model_training/full/accelerate_con
--data_file_keys "image,kontext_images" \
--max_pixels 1048576 \
--dataset_repeat 400 \
--model_id_with_origin_paths "black-forest-labs/FLUX.1-Kontext-dev:flux1-kontext-dev.safetensors,black-forest-labs/FLUX.1-dev:text_encoder/model.safetensors,black-forest-labs/FLUX.1-dev:text_encoder_2/,black-forest-labs/FLUX.1-dev:ae.safetensors" \
--model_id_with_origin_paths "black-forest-labs/FLUX.1-Kontext-dev:flux1-kontext-dev.safetensors,black-forest-labs/FLUX.1-dev:text_encoder/model.safetensors,black-forest-labs/FLUX.1-dev:text_encoder_2/*.safetensors,black-forest-labs/FLUX.1-dev:ae.safetensors" \
--learning_rate 1e-5 \
--num_epochs 1 \
--remove_prefix_in_ckpt "pipe.dit." \

View File

@@ -3,7 +3,7 @@ accelerate launch --config_file examples/flux/model_training/full/accelerate_con
--dataset_metadata_path data/example_image_dataset/metadata.csv \
--max_pixels 1048576 \
--dataset_repeat 400 \
--model_id_with_origin_paths "black-forest-labs/FLUX.1-Krea-dev:flux1-krea-dev.safetensors,black-forest-labs/FLUX.1-dev:text_encoder/model.safetensors,black-forest-labs/FLUX.1-dev:text_encoder_2/,black-forest-labs/FLUX.1-dev:ae.safetensors" \
--model_id_with_origin_paths "black-forest-labs/FLUX.1-Krea-dev:flux1-krea-dev.safetensors,black-forest-labs/FLUX.1-dev:text_encoder/model.safetensors,black-forest-labs/FLUX.1-dev:text_encoder_2/*.safetensors,black-forest-labs/FLUX.1-dev:ae.safetensors" \
--learning_rate 1e-5 \
--num_epochs 1 \
--remove_prefix_in_ckpt "pipe.dit." \

View File

@@ -4,7 +4,7 @@ accelerate launch examples/flux/model_training/train.py \
--data_file_keys "image" \
--max_pixels 1048576 \
--dataset_repeat 100 \
--model_id_with_origin_paths "black-forest-labs/FLUX.1-dev:flux1-dev.safetensors,black-forest-labs/FLUX.1-dev:text_encoder/model.safetensors,black-forest-labs/FLUX.1-dev:text_encoder_2/,black-forest-labs/FLUX.1-dev:ae.safetensors,DiffSynth-Studio/AttriCtrl-FLUX.1-Dev:models/brightness.safetensors" \
--model_id_with_origin_paths "black-forest-labs/FLUX.1-dev:flux1-dev.safetensors,black-forest-labs/FLUX.1-dev:text_encoder/model.safetensors,black-forest-labs/FLUX.1-dev:text_encoder_2/*.safetensors,black-forest-labs/FLUX.1-dev:ae.safetensors,DiffSynth-Studio/AttriCtrl-FLUX.1-Dev:models/brightness.safetensors" \
--learning_rate 1e-5 \
--num_epochs 1 \
--remove_prefix_in_ckpt "pipe.value_controller.encoders.0." \

View File

@@ -4,7 +4,7 @@ accelerate launch --config_file examples/flux/model_training/full/accelerate_con
--data_file_keys "image,controlnet_image,controlnet_inpaint_mask" \
--max_pixels 1048576 \
--dataset_repeat 400 \
--model_id_with_origin_paths "black-forest-labs/FLUX.1-dev:flux1-dev.safetensors,black-forest-labs/FLUX.1-dev:text_encoder/model.safetensors,black-forest-labs/FLUX.1-dev:text_encoder_2/,black-forest-labs/FLUX.1-dev:ae.safetensors,alimama-creative/FLUX.1-dev-Controlnet-Inpainting-Beta:diffusion_pytorch_model.safetensors" \
--model_id_with_origin_paths "black-forest-labs/FLUX.1-dev:flux1-dev.safetensors,black-forest-labs/FLUX.1-dev:text_encoder/model.safetensors,black-forest-labs/FLUX.1-dev:text_encoder_2/*.safetensors,black-forest-labs/FLUX.1-dev:ae.safetensors,alimama-creative/FLUX.1-dev-Controlnet-Inpainting-Beta:diffusion_pytorch_model.safetensors" \
--learning_rate 1e-5 \
--num_epochs 1 \
--remove_prefix_in_ckpt "pipe.controlnet.models.0." \

View File

@@ -4,7 +4,7 @@ accelerate launch --config_file examples/flux/model_training/full/accelerate_con
--data_file_keys "image,controlnet_image" \
--max_pixels 1048576 \
--dataset_repeat 400 \
--model_id_with_origin_paths "black-forest-labs/FLUX.1-dev:flux1-dev.safetensors,black-forest-labs/FLUX.1-dev:text_encoder/model.safetensors,black-forest-labs/FLUX.1-dev:text_encoder_2/,black-forest-labs/FLUX.1-dev:ae.safetensors,InstantX/FLUX.1-dev-Controlnet-Union-alpha:diffusion_pytorch_model.safetensors" \
--model_id_with_origin_paths "black-forest-labs/FLUX.1-dev:flux1-dev.safetensors,black-forest-labs/FLUX.1-dev:text_encoder/model.safetensors,black-forest-labs/FLUX.1-dev:text_encoder_2/*.safetensors,black-forest-labs/FLUX.1-dev:ae.safetensors,InstantX/FLUX.1-dev-Controlnet-Union-alpha:diffusion_pytorch_model.safetensors" \
--learning_rate 1e-5 \
--num_epochs 1 \
--remove_prefix_in_ckpt "pipe.controlnet.models.0." \

View File

@@ -4,7 +4,7 @@ accelerate launch --config_file examples/flux/model_training/full/accelerate_con
--data_file_keys "image,controlnet_image" \
--max_pixels 1048576 \
--dataset_repeat 400 \
--model_id_with_origin_paths "black-forest-labs/FLUX.1-dev:flux1-dev.safetensors,black-forest-labs/FLUX.1-dev:text_encoder/model.safetensors,black-forest-labs/FLUX.1-dev:text_encoder_2/,black-forest-labs/FLUX.1-dev:ae.safetensors,jasperai/Flux.1-dev-Controlnet-Upscaler:diffusion_pytorch_model.safetensors" \
--model_id_with_origin_paths "black-forest-labs/FLUX.1-dev:flux1-dev.safetensors,black-forest-labs/FLUX.1-dev:text_encoder/model.safetensors,black-forest-labs/FLUX.1-dev:text_encoder_2/*.safetensors,black-forest-labs/FLUX.1-dev:ae.safetensors,jasperai/Flux.1-dev-Controlnet-Upscaler:diffusion_pytorch_model.safetensors" \
--learning_rate 1e-5 \
--num_epochs 1 \
--remove_prefix_in_ckpt "pipe.controlnet.models.0." \

View File

@@ -4,7 +4,7 @@ accelerate launch examples/flux/model_training/train.py \
--data_file_keys "image,ipadapter_images" \
--max_pixels 1048576 \
--dataset_repeat 100 \
--model_id_with_origin_paths "black-forest-labs/FLUX.1-dev:flux1-dev.safetensors,black-forest-labs/FLUX.1-dev:text_encoder/model.safetensors,black-forest-labs/FLUX.1-dev:text_encoder_2/,black-forest-labs/FLUX.1-dev:ae.safetensors,InstantX/FLUX.1-dev-IP-Adapter:ip-adapter.bin,google/siglip-so400m-patch14-384:" \
--model_id_with_origin_paths "black-forest-labs/FLUX.1-dev:flux1-dev.safetensors,black-forest-labs/FLUX.1-dev:text_encoder/model.safetensors,black-forest-labs/FLUX.1-dev:text_encoder_2/*.safetensors,black-forest-labs/FLUX.1-dev:ae.safetensors,InstantX/FLUX.1-dev-IP-Adapter:ip-adapter.bin,google/siglip-so400m-patch14-384:model.safetensors" \
--learning_rate 1e-5 \
--num_epochs 1 \
--remove_prefix_in_ckpt "pipe.ipadapter." \

View File

@@ -4,7 +4,7 @@ accelerate launch --config_file examples/flux/model_training/full/accelerate_con
--data_file_keys "image,controlnet_image,infinityou_id_image" \
--max_pixels 1048576 \
--dataset_repeat 400 \
--model_id_with_origin_paths "black-forest-labs/FLUX.1-dev:flux1-dev.safetensors,black-forest-labs/FLUX.1-dev:text_encoder/model.safetensors,black-forest-labs/FLUX.1-dev:text_encoder_2/,black-forest-labs/FLUX.1-dev:ae.safetensors,ByteDance/InfiniteYou:infu_flux_v1.0/aes_stage2/image_proj_model.bin,ByteDance/InfiniteYou:infu_flux_v1.0/aes_stage2/InfuseNetModel/*.safetensors" \
--model_id_with_origin_paths "black-forest-labs/FLUX.1-dev:flux1-dev.safetensors,black-forest-labs/FLUX.1-dev:text_encoder/model.safetensors,black-forest-labs/FLUX.1-dev:text_encoder_2/*.safetensors,black-forest-labs/FLUX.1-dev:ae.safetensors,ByteDance/InfiniteYou:infu_flux_v1.0/aes_stage2/image_proj_model.bin,ByteDance/InfiniteYou:infu_flux_v1.0/aes_stage2/InfuseNetModel/*.safetensors" \
--learning_rate 1e-5 \
--num_epochs 1 \
--remove_prefix_in_ckpt "pipe." \

View File

@@ -4,7 +4,7 @@ accelerate launch examples/flux/model_training/train.py \
--data_file_keys "image" \
--max_pixels 1048576 \
--dataset_repeat 100 \
--model_id_with_origin_paths "black-forest-labs/FLUX.1-dev:flux1-dev.safetensors,black-forest-labs/FLUX.1-dev:text_encoder/model.safetensors,black-forest-labs/FLUX.1-dev:text_encoder_2/,black-forest-labs/FLUX.1-dev:ae.safetensors,DiffSynth-Studio/LoRA-Encoder-FLUX.1-Dev:model.safetensors" \
--model_id_with_origin_paths "black-forest-labs/FLUX.1-dev:flux1-dev.safetensors,black-forest-labs/FLUX.1-dev:text_encoder/model.safetensors,black-forest-labs/FLUX.1-dev:text_encoder_2/*.safetensors,black-forest-labs/FLUX.1-dev:ae.safetensors,DiffSynth-Studio/LoRA-Encoder-FLUX.1-Dev:model.safetensors" \
--learning_rate 1e-5 \
--num_epochs 1 \
--remove_prefix_in_ckpt "pipe.lora_encoder." \

View File

@@ -3,7 +3,7 @@ accelerate launch --config_file examples/flux/model_training/full/accelerate_con
--dataset_metadata_path data/example_image_dataset/metadata.csv \
--max_pixels 1048576 \
--dataset_repeat 400 \
--model_id_with_origin_paths "black-forest-labs/FLUX.1-dev:flux1-dev.safetensors,black-forest-labs/FLUX.1-dev:text_encoder/model.safetensors,black-forest-labs/FLUX.1-dev:text_encoder_2/,black-forest-labs/FLUX.1-dev:ae.safetensors" \
--model_id_with_origin_paths "black-forest-labs/FLUX.1-dev:flux1-dev.safetensors,black-forest-labs/FLUX.1-dev:text_encoder/model.safetensors,black-forest-labs/FLUX.1-dev:text_encoder_2/*.safetensors,black-forest-labs/FLUX.1-dev:ae.safetensors" \
--learning_rate 1e-5 \
--num_epochs 1 \
--remove_prefix_in_ckpt "pipe.dit." \

View File

@@ -4,7 +4,7 @@ accelerate launch --config_file examples/flux/model_training/full/accelerate_con
--data_file_keys "image,nexus_gen_reference_image" \
--max_pixels 262144 \
--dataset_repeat 400 \
--model_id_with_origin_paths "DiffSynth-Studio/Nexus-GenV2:model*.safetensors,DiffSynth-Studio/Nexus-GenV2:edit_decoder.bin,black-forest-labs/FLUX.1-dev:text_encoder/model.safetensors,black-forest-labs/FLUX.1-dev:text_encoder_2/,black-forest-labs/FLUX.1-dev:ae.safetensors" \
--model_id_with_origin_paths "DiffSynth-Studio/Nexus-GenV2:model*.safetensors,DiffSynth-Studio/Nexus-GenV2:edit_decoder.bin,black-forest-labs/FLUX.1-dev:text_encoder/model.safetensors,black-forest-labs/FLUX.1-dev:text_encoder_2/*.safetensors,black-forest-labs/FLUX.1-dev:ae.safetensors" \
--learning_rate 1e-5 \
--num_epochs 1 \
--remove_prefix_in_ckpt "pipe.dit." \

View File

@@ -4,7 +4,7 @@ accelerate launch --config_file examples/flux/model_training/full/accelerate_con
--data_file_keys "image,step1x_reference_image" \
--max_pixels 1048576 \
--dataset_repeat 400 \
--model_id_with_origin_paths "Qwen/Qwen2.5-VL-7B-Instruct:,stepfun-ai/Step1X-Edit:step1x-edit-i1258.safetensors,stepfun-ai/Step1X-Edit:vae.safetensors" \
--model_id_with_origin_paths "Qwen/Qwen2.5-VL-7B-Instruct:model-*.safetensors,stepfun-ai/Step1X-Edit:step1x-edit-i1258.safetensors,stepfun-ai/Step1X-Edit:vae.safetensors" \
--learning_rate 1e-5 \
--num_epochs 1 \
--remove_prefix_in_ckpt "pipe.dit." \

View File

@@ -3,7 +3,7 @@ accelerate launch examples/flux/model_training/train.py \
--dataset_metadata_path data/example_image_dataset/metadata.csv \
--max_pixels 1048576 \
--dataset_repeat 50 \
--model_id_with_origin_paths "ostris/Flex.2-preview:Flex.2-preview.safetensors,black-forest-labs/FLUX.1-dev:text_encoder/model.safetensors,black-forest-labs/FLUX.1-dev:text_encoder_2/,black-forest-labs/FLUX.1-dev:ae.safetensors" \
--model_id_with_origin_paths "ostris/Flex.2-preview:Flex.2-preview.safetensors,black-forest-labs/FLUX.1-dev:text_encoder/model.safetensors,black-forest-labs/FLUX.1-dev:text_encoder_2/*.safetensors,black-forest-labs/FLUX.1-dev:ae.safetensors" \
--learning_rate 1e-4 \
--num_epochs 5 \
--remove_prefix_in_ckpt "pipe.dit." \

View File

@@ -4,7 +4,7 @@ accelerate launch examples/flux/model_training/train.py \
--data_file_keys "image,kontext_images" \
--max_pixels 1048576 \
--dataset_repeat 400 \
--model_id_with_origin_paths "black-forest-labs/FLUX.1-Kontext-dev:flux1-kontext-dev.safetensors,black-forest-labs/FLUX.1-dev:text_encoder/model.safetensors,black-forest-labs/FLUX.1-dev:text_encoder_2/,black-forest-labs/FLUX.1-dev:ae.safetensors" \
--model_id_with_origin_paths "black-forest-labs/FLUX.1-Kontext-dev:flux1-kontext-dev.safetensors,black-forest-labs/FLUX.1-dev:text_encoder/model.safetensors,black-forest-labs/FLUX.1-dev:text_encoder_2/*.safetensors,black-forest-labs/FLUX.1-dev:ae.safetensors" \
--learning_rate 1e-4 \
--num_epochs 5 \
--remove_prefix_in_ckpt "pipe.dit." \

View File

@@ -3,7 +3,7 @@ accelerate launch examples/flux/model_training/train.py \
--dataset_metadata_path data/example_image_dataset/metadata.csv \
--max_pixels 1048576 \
--dataset_repeat 50 \
--model_id_with_origin_paths "black-forest-labs/FLUX.1-Krea-dev:flux1-krea-dev.safetensors,black-forest-labs/FLUX.1-dev:text_encoder/model.safetensors,black-forest-labs/FLUX.1-dev:text_encoder_2/,black-forest-labs/FLUX.1-dev:ae.safetensors" \
--model_id_with_origin_paths "black-forest-labs/FLUX.1-Krea-dev:flux1-krea-dev.safetensors,black-forest-labs/FLUX.1-dev:text_encoder/model.safetensors,black-forest-labs/FLUX.1-dev:text_encoder_2/*.safetensors,black-forest-labs/FLUX.1-dev:ae.safetensors" \
--learning_rate 1e-4 \
--num_epochs 5 \
--remove_prefix_in_ckpt "pipe.dit." \

View File

@@ -4,7 +4,7 @@ accelerate launch examples/flux/model_training/train.py \
--data_file_keys "image" \
--max_pixels 1048576 \
--dataset_repeat 100 \
--model_id_with_origin_paths "black-forest-labs/FLUX.1-dev:flux1-dev.safetensors,black-forest-labs/FLUX.1-dev:text_encoder/model.safetensors,black-forest-labs/FLUX.1-dev:text_encoder_2/,black-forest-labs/FLUX.1-dev:ae.safetensors,DiffSynth-Studio/AttriCtrl-FLUX.1-Dev:models/brightness.safetensors" \
--model_id_with_origin_paths "black-forest-labs/FLUX.1-dev:flux1-dev.safetensors,black-forest-labs/FLUX.1-dev:text_encoder/model.safetensors,black-forest-labs/FLUX.1-dev:text_encoder_2/*.safetensors,black-forest-labs/FLUX.1-dev:ae.safetensors,DiffSynth-Studio/AttriCtrl-FLUX.1-Dev:models/brightness.safetensors" \
--learning_rate 1e-4 \
--num_epochs 5 \
--remove_prefix_in_ckpt "pipe.dit." \

View File

@@ -4,7 +4,7 @@ accelerate launch examples/flux/model_training/train.py \
--data_file_keys "image,controlnet_image,controlnet_inpaint_mask" \
--max_pixels 1048576 \
--dataset_repeat 100 \
--model_id_with_origin_paths "black-forest-labs/FLUX.1-dev:flux1-dev.safetensors,black-forest-labs/FLUX.1-dev:text_encoder/model.safetensors,black-forest-labs/FLUX.1-dev:text_encoder_2/,black-forest-labs/FLUX.1-dev:ae.safetensors,alimama-creative/FLUX.1-dev-Controlnet-Inpainting-Beta:diffusion_pytorch_model.safetensors" \
--model_id_with_origin_paths "black-forest-labs/FLUX.1-dev:flux1-dev.safetensors,black-forest-labs/FLUX.1-dev:text_encoder/model.safetensors,black-forest-labs/FLUX.1-dev:text_encoder_2/*.safetensors,black-forest-labs/FLUX.1-dev:ae.safetensors,alimama-creative/FLUX.1-dev-Controlnet-Inpainting-Beta:diffusion_pytorch_model.safetensors" \
--learning_rate 1e-4 \
--num_epochs 5 \
--remove_prefix_in_ckpt "pipe.dit." \

View File

@@ -4,7 +4,7 @@ accelerate launch examples/flux/model_training/train.py \
--data_file_keys "image,controlnet_image" \
--max_pixels 1048576 \
--dataset_repeat 100 \
--model_id_with_origin_paths "black-forest-labs/FLUX.1-dev:flux1-dev.safetensors,black-forest-labs/FLUX.1-dev:text_encoder/model.safetensors,black-forest-labs/FLUX.1-dev:text_encoder_2/,black-forest-labs/FLUX.1-dev:ae.safetensors,InstantX/FLUX.1-dev-Controlnet-Union-alpha:diffusion_pytorch_model.safetensors" \
--model_id_with_origin_paths "black-forest-labs/FLUX.1-dev:flux1-dev.safetensors,black-forest-labs/FLUX.1-dev:text_encoder/model.safetensors,black-forest-labs/FLUX.1-dev:text_encoder_2/*.safetensors,black-forest-labs/FLUX.1-dev:ae.safetensors,InstantX/FLUX.1-dev-Controlnet-Union-alpha:diffusion_pytorch_model.safetensors" \
--learning_rate 1e-4 \
--num_epochs 5 \
--remove_prefix_in_ckpt "pipe.dit." \

Some files were not shown because too many files have changed in this diff Show More