mirror of
https://github.com/modelscope/DiffSynth-Studio.git
synced 2026-03-19 14:58:12 +00:00
25
docs/source_en/.readthedocs.yaml
Normal file
25
docs/source_en/.readthedocs.yaml
Normal file
@@ -0,0 +1,25 @@
|
||||
# .readthedocs.yaml
|
||||
# Read the Docs configuration file
|
||||
# See https://docs.readthedocs.io/en/stable/config-file/v2.html for details
|
||||
|
||||
# Required
|
||||
version: 2
|
||||
|
||||
# Set the version of Python and other tools you might need
|
||||
build:
|
||||
os: ubuntu-22.04
|
||||
tools:
|
||||
python: "3.11"
|
||||
|
||||
python:
|
||||
install:
|
||||
- requirements: docs/source_en/requirement.txt
|
||||
# Build documentation in the docs/ directory with Sphinx
|
||||
sphinx:
|
||||
configuration: docs/source_en/conf.py
|
||||
|
||||
# We recommend specifying your dependencies to enable reproducible builds:
|
||||
# https://docs.readthedocs.io/en/stable/guides/reproducible-builds.html
|
||||
# python:
|
||||
# install:
|
||||
# - requirements: docs/requirements.txt
|
||||
82
docs/source_en/GetStarted/A_simple_example.md
Normal file
82
docs/source_en/GetStarted/A_simple_example.md
Normal file
@@ -0,0 +1,82 @@
|
||||
|
||||
# A Simple Example: Text-to-Image Synthesis with Flux
|
||||
|
||||
The following example shows how to use the FLUX.1 model for text-to-image tasks. The script provides a simple setup for generating images from text descriptions. It covers downloading the necessary models, configuring the pipeline, and generating images with and without classifier-free guidance.
|
||||
|
||||
For other models supported by DiffSynth, see [Models.md](Models.md).
|
||||
|
||||
## Setup
|
||||
|
||||
First, ensure you have the necessary models downloaded and configured:
|
||||
|
||||
```python
|
||||
import torch
|
||||
from diffsynth import ModelManager, FluxImagePipeline, download_models
|
||||
|
||||
# Download the FLUX.1-dev model files
|
||||
download_models(["FLUX.1-dev"])
|
||||
```
|
||||
|
||||
For instructions on downloading models, see [Download_models.md](Download_models.md).
|
||||
|
||||
## Loading Models
|
||||
Initialize the model manager with your device and data type:
|
||||
|
||||
```python
|
||||
model_manager = ModelManager(torch_dtype=torch.bfloat16, device="cuda")
|
||||
model_manager.load_models([
|
||||
"models/FLUX/FLUX.1-dev/text_encoder/model.safetensors",
|
||||
"models/FLUX/FLUX.1-dev/text_encoder_2",
|
||||
"models/FLUX/FLUX.1-dev/ae.safetensors",
|
||||
"models/FLUX/FLUX.1-dev/flux1-dev.safetensors"
|
||||
])
|
||||
```
|
||||
|
||||
For instructions on loading models, see [ModelManager.md](ModelManager.md).
|
||||
|
||||
## Creating the Pipeline
|
||||
Create an instance of the FluxImagePipeline from the loaded model manager:
|
||||
|
||||
|
||||
```python
|
||||
pipe = FluxImagePipeline.from_model_manager(model_manager)
|
||||
```
|
||||
|
||||
For instructions on using the Pipeline, see [Pipeline.md](Pipeline.md).
|
||||
## Text-to-Image Synthesis
|
||||
Generate an image using a short prompt. Below are examples of generating images with and without classifier-free guidance.
|
||||
|
||||
### Basic Generation
|
||||
```python
|
||||
prompt = "A cute little turtle"
|
||||
negative_prompt = ""
|
||||
|
||||
torch.manual_seed(6)
|
||||
image = pipe(
|
||||
prompt=prompt,
|
||||
num_inference_steps=30, embedded_guidance=3.5
|
||||
)
|
||||
image.save("image_1024.jpg")
|
||||
```
|
||||
|
||||
### Generation with Classifier-Free Guidance
|
||||
```python
|
||||
torch.manual_seed(6)
|
||||
image = pipe(
|
||||
prompt=prompt, negative_prompt=negative_prompt,
|
||||
num_inference_steps=30, cfg_scale=2.0, embedded_guidance=3.5
|
||||
)
|
||||
image.save("image_1024_cfg.jpg")
|
||||
```
|
||||
|
||||
### High-Resolution Fix
|
||||
```python
|
||||
torch.manual_seed(7)
|
||||
image = pipe(
|
||||
prompt=prompt,
|
||||
num_inference_steps=30, embedded_guidance=3.5,
|
||||
input_image=image.resize((2048, 2048)), height=2048, width=2048, denoising_strength=0.6, tiled=True
|
||||
)
|
||||
image.save("image_2048_highres.jpg")
|
||||
```
|
||||
|
||||
20
docs/source_en/GetStarted/Download_models.md
Normal file
20
docs/source_en/GetStarted/Download_models.md
Normal file
@@ -0,0 +1,20 @@
|
||||
# Download Models
|
||||
|
||||
Download the pre-set models. Model IDs can be found in [config file](/diffsynth/configs/model_config.py).
|
||||
|
||||
```python
|
||||
from diffsynth import download_models
|
||||
|
||||
download_models(["FLUX.1-dev", "Kolors"])
|
||||
```
|
||||
|
||||
To download non-pre-set models, you can choose models from either the [ModelScope](https://modelscope.cn/models) or [HuggingFace](https://huggingface.co/models) sources.
|
||||
|
||||
```python
|
||||
from diffsynth.models.downloader import download_from_huggingface, download_from_modelscope
|
||||
|
||||
# From Modelscope (recommended)
|
||||
download_from_modelscope("Kwai-Kolors/Kolors", "vae/diffusion_pytorch_model.fp16.bin", "models/kolors/Kolors/vae")
|
||||
# From Huggingface
|
||||
download_from_huggingface("Kwai-Kolors/Kolors", "vae/diffusion_pytorch_model.fp16.safetensors", "models/kolors/Kolors/vae")
|
||||
```
|
||||
10
docs/source_en/GetStarted/Extensions.md
Normal file
10
docs/source_en/GetStarted/Extensions.md
Normal file
@@ -0,0 +1,10 @@
|
||||
# Extensions
|
||||
|
||||
This document introduces some relevant techniques beyond the diffusion models implemented in DiffSynth, which have significant application potential in image and video processing.
|
||||
|
||||
- **[RIFE](https://github.com/hzwer/ECCV2022-RIFE)**: FIRE (Real-Time Intermediate Flow Estimation Algorithm) is a frame interpolation (VFI) method based on real-time intermediate flow estimation. It includes an end-to-end efficient intermediate flow estimation network called IFNet, as well as an optical flow supervision framework based on privileged distillation. RIFE supports inserting frames at any moment between two frames, achieving state-of-the-art performance across multiple datasets without relying on any pre-trained models.
|
||||
|
||||
- **[ESRGAN](https://github.com/xinntao/ESRGAN)**: ESRGAN (Enhanced Super Resolution Generative Adversarial Network) is an improved method based on SRGAN, aimed at enhancing the visual quality of single image super-resolution. This approach significantly improves the realism of generated images by optimizing three key components of SRGAN: network architecture, adversarial loss, and perceptual loss.
|
||||
|
||||
- **[FastBlend](https://arxiv.org/abs/2311.09265)**: FastBlend is a model-free toolkit designed for smoothing videos, integrated with Diffusion models to create a powerful video processing workflow. This tool effectively eliminates flickering in videos, performs interpolation on keyframe sequences, and can process complete videos based on a single image.
|
||||
|
||||
426
docs/source_en/GetStarted/Fine-tuning.md
Normal file
426
docs/source_en/GetStarted/Fine-tuning.md
Normal file
@@ -0,0 +1,426 @@
|
||||
# Fine-Tuning
|
||||
|
||||
We have implemented a training framework for text-to-image Diffusion models, enabling users to easily train LoRA models using our framework. Our provided scripts come with the following advantages:
|
||||
|
||||
* **Comprehensive Functionality & User-Friendliness**: Our training framework supports multi-GPU and multi-machine setups, facilitates the use of DeepSpeed for acceleration, and includes gradient checkpointing optimizations for models with excessive memory demands.
|
||||
* **Code Conciseness & Researcher Accessibility**: We avoid large blocks of complicated code. General-purpose modules are implemented in `diffsynth/trainers/text_to_image.py`, while model-specific training scripts contain only minimal code pertinent to the model architecture, making it researcher-friendly.
|
||||
* **Modular Design & Developer Flexibility**: Built on the universal Pytorch-Lightning framework, our training framework is decoupled in terms of functionality, allowing developers to easily introduce additional training techniques by modifying our scripts to suit their needs.
|
||||
|
||||
Image Examples of fine-tuned LoRA. The prompt is "一只小狗蹦蹦跳跳,周围是姹紫嫣红的鲜花,远处是山脉" (for Chinese models) or "a dog is jumping, flowers around the dog, the background is mountains and clouds" (for English models).
|
||||
|
||||
||Kolors|Stable Diffusion 3|Hunyuan-DiT|
|
||||
|-|-|-|-|
|
||||
|Without LoRA||||
|
||||
|With LoRA||||
|
||||
|
||||
## Install additional packages
|
||||
|
||||
```bash
|
||||
pip install peft lightning
|
||||
```
|
||||
|
||||
## Prepare your dataset
|
||||
|
||||
We provide an example dataset [here](https://modelscope.cn/datasets/buptwq/lora-stable-diffusion-finetune/files). You need to manage the training images as follows:
|
||||
|
||||
```
|
||||
data/dog/
|
||||
└── train
|
||||
├── 00.jpg
|
||||
├── 01.jpg
|
||||
├── 02.jpg
|
||||
├── 03.jpg
|
||||
├── 04.jpg
|
||||
└── metadata.csv
|
||||
```
|
||||
|
||||
`metadata.csv`:
|
||||
|
||||
```
|
||||
file_name,text
|
||||
00.jpg,a dog
|
||||
01.jpg,a dog
|
||||
02.jpg,a dog
|
||||
03.jpg,a dog
|
||||
04.jpg,a dog
|
||||
```
|
||||
|
||||
Note that if the model is Chinese model (for example, Hunyuan-DiT and Kolors), we recommand to use Chinese texts in the dataset. For example
|
||||
|
||||
```
|
||||
file_name,text
|
||||
00.jpg,一只小狗
|
||||
01.jpg,一只小狗
|
||||
02.jpg,一只小狗
|
||||
03.jpg,一只小狗
|
||||
04.jpg,一只小狗
|
||||
```
|
||||
|
||||
## Train a LoRA model
|
||||
|
||||
General options:
|
||||
|
||||
```
|
||||
--lora_target_modules LORA_TARGET_MODULES
|
||||
Layers with LoRA modules.
|
||||
--dataset_path DATASET_PATH
|
||||
The path of the Dataset.
|
||||
--output_path OUTPUT_PATH
|
||||
Path to save the model.
|
||||
--steps_per_epoch STEPS_PER_EPOCH
|
||||
Number of steps per epoch.
|
||||
--height HEIGHT Image height.
|
||||
--width WIDTH Image width.
|
||||
--center_crop Whether to center crop the input images to the resolution. If not set, the images will be randomly cropped. The images will be resized to the resolution first before cropping.
|
||||
--random_flip Whether to randomly flip images horizontally
|
||||
--batch_size BATCH_SIZE
|
||||
Batch size (per device) for the training dataloader.
|
||||
--dataloader_num_workers DATALOADER_NUM_WORKERS
|
||||
Number of subprocesses to use for data loading. 0 means that the data will be loaded in the main process.
|
||||
--precision {32,16,16-mixed}
|
||||
Training precision
|
||||
--learning_rate LEARNING_RATE
|
||||
Learning rate.
|
||||
--lora_rank LORA_RANK
|
||||
The dimension of the LoRA update matrices.
|
||||
--lora_alpha LORA_ALPHA
|
||||
The weight of the LoRA update matrices.
|
||||
--use_gradient_checkpointing
|
||||
Whether to use gradient checkpointing.
|
||||
--accumulate_grad_batches ACCUMULATE_GRAD_BATCHES
|
||||
The number of batches in gradient accumulation.
|
||||
--training_strategy {auto,deepspeed_stage_1,deepspeed_stage_2,deepspeed_stage_3}
|
||||
Training strategy
|
||||
--max_epochs MAX_EPOCHS
|
||||
Number of epochs.
|
||||
--modelscope_model_id MODELSCOPE_MODEL_ID
|
||||
Model ID on ModelScope (https://www.modelscope.cn/). The model will be uploaded to ModelScope automatically if you provide a Model ID.
|
||||
--modelscope_access_token MODELSCOPE_ACCESS_TOKEN
|
||||
Access key on ModelScope (https://www.modelscope.cn/). Required if you want to upload the model to ModelScope.
|
||||
```
|
||||
|
||||
### Kolors
|
||||
|
||||
The following files will be used for constructing Kolors. You can download Kolors from [huggingface](https://huggingface.co/Kwai-Kolors/Kolors) or [modelscope](https://modelscope.cn/models/Kwai-Kolors/Kolors). Due to precision overflow issues, we need to download an additional VAE model (from [huggingface](https://huggingface.co/madebyollin/sdxl-vae-fp16-fix) or [modelscope](https://modelscope.cn/models/AI-ModelScope/sdxl-vae-fp16-fix)). You can use the following code to download these files:
|
||||
|
||||
```python
|
||||
from diffsynth import download_models
|
||||
|
||||
download_models(["Kolors", "SDXL-vae-fp16-fix"])
|
||||
```
|
||||
|
||||
```
|
||||
models
|
||||
├── kolors
|
||||
│ └── Kolors
|
||||
│ ├── text_encoder
|
||||
│ │ ├── config.json
|
||||
│ │ ├── pytorch_model-00001-of-00007.bin
|
||||
│ │ ├── pytorch_model-00002-of-00007.bin
|
||||
│ │ ├── pytorch_model-00003-of-00007.bin
|
||||
│ │ ├── pytorch_model-00004-of-00007.bin
|
||||
│ │ ├── pytorch_model-00005-of-00007.bin
|
||||
│ │ ├── pytorch_model-00006-of-00007.bin
|
||||
│ │ ├── pytorch_model-00007-of-00007.bin
|
||||
│ │ └── pytorch_model.bin.index.json
|
||||
│ ├── unet
|
||||
│ │ └── diffusion_pytorch_model.safetensors
|
||||
│ └── vae
|
||||
│ └── diffusion_pytorch_model.safetensors
|
||||
└── sdxl-vae-fp16-fix
|
||||
└── diffusion_pytorch_model.safetensors
|
||||
```
|
||||
|
||||
Launch the training task using the following command:
|
||||
|
||||
```
|
||||
CUDA_VISIBLE_DEVICES="0" python examples/train/kolors/train_kolors_lora.py \
|
||||
--pretrained_unet_path models/kolors/Kolors/unet/diffusion_pytorch_model.safetensors \
|
||||
--pretrained_text_encoder_path models/kolors/Kolors/text_encoder \
|
||||
--pretrained_fp16_vae_path models/sdxl-vae-fp16-fix/diffusion_pytorch_model.safetensors \
|
||||
--dataset_path data/dog \
|
||||
--output_path ./models \
|
||||
--max_epochs 1 \
|
||||
--steps_per_epoch 500 \
|
||||
--height 1024 \
|
||||
--width 1024 \
|
||||
--center_crop \
|
||||
--precision "16-mixed" \
|
||||
--learning_rate 1e-4 \
|
||||
--lora_rank 4 \
|
||||
--lora_alpha 4 \
|
||||
--use_gradient_checkpointing
|
||||
```
|
||||
|
||||
For more information about the parameters, please use `python examples/train/kolors/train_kolors_lora.py -h` to see the details.
|
||||
|
||||
After training, use `model_manager.load_lora` to load the LoRA for inference.
|
||||
|
||||
```python
|
||||
from diffsynth import ModelManager, SDXLImagePipeline
|
||||
import torch
|
||||
|
||||
model_manager = ModelManager(torch_dtype=torch.float16, device="cuda",
|
||||
file_path_list=[
|
||||
"models/kolors/Kolors/text_encoder",
|
||||
"models/kolors/Kolors/unet/diffusion_pytorch_model.safetensors",
|
||||
"models/sdxl-vae-fp16-fix/diffusion_pytorch_model.safetensors"
|
||||
])
|
||||
model_manager.load_lora("models/lightning_logs/version_0/checkpoints/epoch=0-step=500.ckpt", lora_alpha=1.0)
|
||||
pipe = SDXLImagePipeline.from_model_manager(model_manager)
|
||||
|
||||
torch.manual_seed(0)
|
||||
image = pipe(
|
||||
prompt="一只小狗蹦蹦跳跳,周围是姹紫嫣红的鲜花,远处是山脉",
|
||||
negative_prompt="",
|
||||
cfg_scale=7.5,
|
||||
num_inference_steps=100, width=1024, height=1024,
|
||||
)
|
||||
image.save("image_with_lora.jpg")
|
||||
```
|
||||
|
||||
### Stable Diffusion 3
|
||||
|
||||
Only one file is required in the training script. You can use [`sd3_medium_incl_clips.safetensors`](https://huggingface.co/stabilityai/stable-diffusion-3-medium/resolve/main/sd3_medium_incl_clips.safetensors) (without T5 encoder) or [`sd3_medium_incl_clips_t5xxlfp16.safetensors`](https://huggingface.co/stabilityai/stable-diffusion-3-medium/resolve/main/sd3_medium_incl_clips_t5xxlfp16.safetensors) (with T5 encoder). Please use the following code to download these files:
|
||||
|
||||
```python
|
||||
from diffsynth import download_models
|
||||
|
||||
download_models(["StableDiffusion3", "StableDiffusion3_without_T5"])
|
||||
```
|
||||
|
||||
```
|
||||
models/stable_diffusion_3/
|
||||
├── Put Stable Diffusion 3 checkpoints here.txt
|
||||
├── sd3_medium_incl_clips.safetensors
|
||||
└── sd3_medium_incl_clips_t5xxlfp16.safetensors
|
||||
```
|
||||
|
||||
Launch the training task using the following command:
|
||||
|
||||
```
|
||||
CUDA_VISIBLE_DEVICES="0" python examples/train/stable_diffusion_3/train_sd3_lora.py \
|
||||
--pretrained_path models/stable_diffusion_3/sd3_medium_incl_clips.safetensors \
|
||||
--dataset_path data/dog \
|
||||
--output_path ./models \
|
||||
--max_epochs 1 \
|
||||
--steps_per_epoch 500 \
|
||||
--height 1024 \
|
||||
--width 1024 \
|
||||
--center_crop \
|
||||
--precision "16-mixed" \
|
||||
--learning_rate 1e-4 \
|
||||
--lora_rank 4 \
|
||||
--lora_alpha 4 \
|
||||
--use_gradient_checkpointing
|
||||
```
|
||||
|
||||
For more information about the parameters, please use `python examples/train/stable_diffusion_3/train_sd3_lora.py -h` to see the details.
|
||||
|
||||
After training, use `model_manager.load_lora` to load the LoRA for inference.
|
||||
|
||||
```python
|
||||
from diffsynth import ModelManager, SD3ImagePipeline
|
||||
import torch
|
||||
|
||||
model_manager = ModelManager(torch_dtype=torch.float16, device="cuda",
|
||||
file_path_list=["models/stable_diffusion_3/sd3_medium_incl_clips.safetensors"])
|
||||
model_manager.load_lora("models/lightning_logs/version_0/checkpoints/epoch=0-step=500.ckpt", lora_alpha=1.0)
|
||||
pipe = SD3ImagePipeline.from_model_manager(model_manager)
|
||||
|
||||
torch.manual_seed(0)
|
||||
image = pipe(
|
||||
prompt="a dog is jumping, flowers around the dog, the background is mountains and clouds",
|
||||
negative_prompt="bad quality, poor quality, doll, disfigured, jpg, toy, bad anatomy, missing limbs, missing fingers, 3d, cgi, extra tails",
|
||||
cfg_scale=7.5,
|
||||
num_inference_steps=100, width=1024, height=1024,
|
||||
)
|
||||
image.save("image_with_lora.jpg")
|
||||
```
|
||||
|
||||
### Hunyuan-DiT
|
||||
|
||||
Four files will be used for constructing Hunyuan DiT. You can download them from [huggingface](https://huggingface.co/Tencent-Hunyuan/HunyuanDiT) or [modelscope](https://www.modelscope.cn/models/modelscope/HunyuanDiT/summary). You can use the following code to download these files:
|
||||
|
||||
```python
|
||||
from diffsynth import download_models
|
||||
|
||||
download_models(["HunyuanDiT"])
|
||||
```
|
||||
|
||||
```
|
||||
models/HunyuanDiT/
|
||||
├── Put Hunyuan DiT checkpoints here.txt
|
||||
└── t2i
|
||||
├── clip_text_encoder
|
||||
│ └── pytorch_model.bin
|
||||
├── model
|
||||
│ └── pytorch_model_ema.pt
|
||||
├── mt5
|
||||
│ └── pytorch_model.bin
|
||||
└── sdxl-vae-fp16-fix
|
||||
└── diffusion_pytorch_model.bin
|
||||
```
|
||||
|
||||
Launch the training task using the following command:
|
||||
|
||||
```
|
||||
CUDA_VISIBLE_DEVICES="0" python examples/train/hunyuan_dit/train_hunyuan_dit_lora.py \
|
||||
--pretrained_path models/HunyuanDiT/t2i \
|
||||
--dataset_path data/dog \
|
||||
--output_path ./models \
|
||||
--max_epochs 1 \
|
||||
--steps_per_epoch 500 \
|
||||
--height 1024 \
|
||||
--width 1024 \
|
||||
--center_crop \
|
||||
--precision "16-mixed" \
|
||||
--learning_rate 1e-4 \
|
||||
--lora_rank 4 \
|
||||
--lora_alpha 4 \
|
||||
--use_gradient_checkpointing
|
||||
```
|
||||
|
||||
For more information about the parameters, please use `python examples/train/hunyuan_dit/train_hunyuan_dit_lora.py -h` to see the details.
|
||||
|
||||
After training, use `model_manager.load_lora` to load the LoRA for inference.
|
||||
|
||||
```python
|
||||
from diffsynth import ModelManager, HunyuanDiTImagePipeline
|
||||
import torch
|
||||
|
||||
model_manager = ModelManager(torch_dtype=torch.float16, device="cuda",
|
||||
file_path_list=[
|
||||
"models/HunyuanDiT/t2i/clip_text_encoder/pytorch_model.bin",
|
||||
"models/HunyuanDiT/t2i/model/pytorch_model_ema.pt",
|
||||
"models/HunyuanDiT/t2i/mt5/pytorch_model.bin",
|
||||
"models/HunyuanDiT/t2i/sdxl-vae-fp16-fix/diffusion_pytorch_model.bin"
|
||||
])
|
||||
model_manager.load_lora("models/lightning_logs/version_0/checkpoints/epoch=0-step=500.ckpt", lora_alpha=1.0)
|
||||
pipe = HunyuanDiTImagePipeline.from_model_manager(model_manager)
|
||||
|
||||
torch.manual_seed(0)
|
||||
image = pipe(
|
||||
prompt="一只小狗蹦蹦跳跳,周围是姹紫嫣红的鲜花,远处是山脉",
|
||||
negative_prompt="",
|
||||
cfg_scale=7.5,
|
||||
num_inference_steps=100, width=1024, height=1024,
|
||||
)
|
||||
image.save("image_with_lora.jpg")
|
||||
```
|
||||
|
||||
### Stable Diffusion
|
||||
|
||||
Only one file is required in the training script. We support the mainstream checkpoints in [CivitAI](https://civitai.com/). By default, we use the base Stable Diffusion v1.5. You can download it from [huggingface](https://huggingface.co/runwayml/stable-diffusion-v1-5/resolve/main/v1-5-pruned-emaonly.safetensors) or [modelscope](https://www.modelscope.cn/models/AI-ModelScope/stable-diffusion-v1-5/resolve/master/v1-5-pruned-emaonly.safetensors). You can use the following code to download this file:
|
||||
|
||||
```python
|
||||
from diffsynth import download_models
|
||||
|
||||
download_models(["StableDiffusion_v15"])
|
||||
```
|
||||
|
||||
```
|
||||
models/stable_diffusion
|
||||
├── Put Stable Diffusion checkpoints here.txt
|
||||
└── v1-5-pruned-emaonly.safetensors
|
||||
```
|
||||
|
||||
Launch the training task using the following command:
|
||||
|
||||
```
|
||||
CUDA_VISIBLE_DEVICES="0" python examples/train/stable_diffusion/train_sd_lora.py \
|
||||
--pretrained_path models/stable_diffusion/v1-5-pruned-emaonly.safetensors \
|
||||
--dataset_path data/dog \
|
||||
--output_path ./models \
|
||||
--max_epochs 1 \
|
||||
--steps_per_epoch 500 \
|
||||
--height 512 \
|
||||
--width 512 \
|
||||
--center_crop \
|
||||
--precision "16-mixed" \
|
||||
--learning_rate 1e-4 \
|
||||
--lora_rank 4 \
|
||||
--lora_alpha 4 \
|
||||
--use_gradient_checkpointing
|
||||
```
|
||||
|
||||
For more information about the parameters, please use `python examples/train/stable_diffusion/train_sd_lora.py -h` to see the details.
|
||||
|
||||
After training, use `model_manager.load_lora` to load the LoRA for inference.
|
||||
|
||||
```python
|
||||
from diffsynth import ModelManager, SDImagePipeline
|
||||
import torch
|
||||
|
||||
model_manager = ModelManager(torch_dtype=torch.float16, device="cuda",
|
||||
file_path_list=["models/stable_diffusion/v1-5-pruned-emaonly.safetensors"])
|
||||
model_manager.load_lora("models/lightning_logs/version_0/checkpoints/epoch=0-step=500.ckpt", lora_alpha=1.0)
|
||||
pipe = SDImagePipeline.from_model_manager(model_manager)
|
||||
|
||||
torch.manual_seed(0)
|
||||
image = pipe(
|
||||
prompt="a dog is jumping, flowers around the dog, the background is mountains and clouds",
|
||||
negative_prompt="bad quality, poor quality, doll, disfigured, jpg, toy, bad anatomy, missing limbs, missing fingers, 3d, cgi, extra tails",
|
||||
cfg_scale=7.5,
|
||||
num_inference_steps=100, width=512, height=512,
|
||||
)
|
||||
image.save("image_with_lora.jpg")
|
||||
```
|
||||
|
||||
### Stable Diffusion XL
|
||||
|
||||
Only one file is required in the training script. We support the mainstream checkpoints in [CivitAI](https://civitai.com/). By default, we use the base Stable Diffusion XL. You can download it from [huggingface](https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0/resolve/main/sd_xl_base_1.0.safetensors) or [modelscope](https://www.modelscope.cn/models/AI-ModelScope/stable-diffusion-xl-base-1.0/resolve/master/sd_xl_base_1.0.safetensors). You can use the following code to download this file:
|
||||
|
||||
```python
|
||||
from diffsynth import download_models
|
||||
|
||||
download_models(["StableDiffusionXL_v1"])
|
||||
```
|
||||
|
||||
```
|
||||
models/stable_diffusion_xl
|
||||
├── Put Stable Diffusion XL checkpoints here.txt
|
||||
└── sd_xl_base_1.0.safetensors
|
||||
```
|
||||
|
||||
We observed that Stable Diffusion XL is not float16-safe, thus we recommand users to use float32.
|
||||
|
||||
```
|
||||
CUDA_VISIBLE_DEVICES="0" python examples/train/stable_diffusion_xl/train_sdxl_lora.py \
|
||||
--pretrained_path models/stable_diffusion_xl/sd_xl_base_1.0.safetensors \
|
||||
--dataset_path data/dog \
|
||||
--output_path ./models \
|
||||
--max_epochs 1 \
|
||||
--steps_per_epoch 500 \
|
||||
--height 1024 \
|
||||
--width 1024 \
|
||||
--center_crop \
|
||||
--precision "32" \
|
||||
--learning_rate 1e-4 \
|
||||
--lora_rank 4 \
|
||||
--lora_alpha 4 \
|
||||
--use_gradient_checkpointing
|
||||
```
|
||||
|
||||
For more information about the parameters, please use `python examples/train/stable_diffusion_xl/train_sdxl_lora.py -h` to see the details.
|
||||
|
||||
After training, use `model_manager.load_lora` to load the LoRA for inference.
|
||||
|
||||
```python
|
||||
from diffsynth import ModelManager, SDXLImagePipeline
|
||||
import torch
|
||||
|
||||
model_manager = ModelManager(torch_dtype=torch.float16, device="cuda",
|
||||
file_path_list=["models/stable_diffusion_xl/sd_xl_base_1.0.safetensors"])
|
||||
model_manager.load_lora("models/lightning_logs/version_0/checkpoints/epoch=0-step=500.ckpt", lora_alpha=1.0)
|
||||
pipe = SDXLImagePipeline.from_model_manager(model_manager)
|
||||
|
||||
torch.manual_seed(0)
|
||||
image = pipe(
|
||||
prompt="a dog is jumping, flowers around the dog, the background is mountains and clouds",
|
||||
negative_prompt="bad quality, poor quality, doll, disfigured, jpg, toy, bad anatomy, missing limbs, missing fingers, 3d, cgi, extra tails",
|
||||
cfg_scale=7.5,
|
||||
num_inference_steps=100, width=1024, height=1024,
|
||||
)
|
||||
image.save("image_with_lora.jpg")
|
||||
```
|
||||
24
docs/source_en/GetStarted/Installation.md
Normal file
24
docs/source_en/GetStarted/Installation.md
Normal file
@@ -0,0 +1,24 @@
|
||||
# Installation
|
||||
|
||||
## From Source
|
||||
|
||||
1. Clone the source repository:
|
||||
|
||||
```bash
|
||||
git clone https://github.com/modelscope/DiffSynth-Studio.git
|
||||
```
|
||||
|
||||
2. Navigate to the project directory and install:
|
||||
|
||||
```bash
|
||||
cd DiffSynth-Studio
|
||||
pip install -e .
|
||||
```
|
||||
|
||||
## From PyPI
|
||||
|
||||
Install directly via PyPI:
|
||||
|
||||
```bash
|
||||
pip install diffsynth
|
||||
```
|
||||
0
docs/source_en/GetStarted/ModelManager.md
Normal file
0
docs/source_en/GetStarted/ModelManager.md
Normal file
17
docs/source_en/GetStarted/Models.md
Normal file
17
docs/source_en/GetStarted/Models.md
Normal file
@@ -0,0 +1,17 @@
|
||||
# Models
|
||||
|
||||
Until now, DiffSynth Studio has supported the following models:
|
||||
|
||||
* [FLUX](https://huggingface.co/black-forest-labs/FLUX.1-dev)
|
||||
* [ExVideo](https://huggingface.co/ECNU-CILab/ExVideo-SVD-128f-v1)
|
||||
* [Kolors](https://huggingface.co/Kwai-Kolors/Kolors)
|
||||
* [Stable Diffusion 3](https://huggingface.co/stabilityai/stable-diffusion-3-medium)
|
||||
* [Stable Video Diffusion](https://huggingface.co/stabilityai/stable-video-diffusion-img2vid-xt)
|
||||
* [Hunyuan-DiT](https://github.com/Tencent/HunyuanDiT)
|
||||
* [RIFE](https://github.com/hzwer/ECCV2022-RIFE)
|
||||
* [ESRGAN](https://github.com/xinntao/ESRGAN)
|
||||
* [Ip-Adapter](https://github.com/tencent-ailab/IP-Adapter)
|
||||
* [AnimateDiff](https://github.com/guoyww/animatediff/)
|
||||
* [ControlNet](https://github.com/lllyasviel/ControlNet)
|
||||
* [Stable Diffusion XL](https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0)
|
||||
* [Stable Diffusion](https://huggingface.co/runwayml/stable-diffusion-v1-5)
|
||||
27
docs/source_en/GetStarted/Pipelines.md
Normal file
27
docs/source_en/GetStarted/Pipelines.md
Normal file
@@ -0,0 +1,27 @@
|
||||
# Pipelines
|
||||
|
||||
So far, the following table lists our pipelines and the models supported by each pipeline.
|
||||
|
||||
## Image Pipelines
|
||||
|
||||
Pipelines for generating images from text descriptions. Each pipeline relies on specific encoder and decoder models.
|
||||
|
||||
| Pipeline | Models |
|
||||
|----------------------------|----------------------------------------------------------------|
|
||||
| HunyuanDiTImagePipeline | text_encoder: HunyuanDiTCLIPTextEncoder<br>text_encoder_t5: HunyuanDiTT5TextEncoder<br>dit: HunyuanDiT<br>vae_decoder: SDVAEDecoder<br>vae_encoder: SDVAEEncoder |
|
||||
| SDImagePipeline | text_encoder: SDTextEncoder<br>unet: SDUNet<br>vae_decoder: SDVAEDecoder<br>vae_encoder: SDVAEEncoder<br>controlnet: MultiControlNetManager<br>ipadapter_image_encoder: IpAdapterCLIPImageEmbedder<br>ipadapter: SDIpAdapter |
|
||||
| SD3ImagePipeline | text_encoder_1: SD3TextEncoder1<br>text_encoder_2: SD3TextEncoder2<br>text_encoder_3: SD3TextEncoder3<br>dit: SD3DiT<br>vae_decoder: SD3VAEDecoder<br>vae_encoder: SD3VAEEncoder |
|
||||
| SDXLImagePipeline | text_encoder: SDXLTextEncoder<br>text_encoder_2: SDXLTextEncoder2<br>text_encoder_kolors: ChatGLMModel<br>unet: SDXLUNet<br>vae_decoder: SDXLVAEDecoder<br>vae_encoder: SDXLVAEEncoder<br>controlnet: MultiControlNetManager<br>ipadapter_image_encoder: IpAdapterXLCLIPImageEmbedder<br>ipadapter: SDXLIpAdapter |
|
||||
|
||||
## Video Pipelines
|
||||
|
||||
Pipelines for generating videos from text descriptions. In addition to the models required for image generation, they include models for handling motion modules.
|
||||
|
||||
| Pipeline | Models |
|
||||
|----------------------------|----------------------------------------------------------------|
|
||||
| SDVideoPipeline | text_encoder: SDTextEncoder<br>unet: SDUNet<br>vae_decoder: SDVAEDecoder<br>vae_encoder: SDVAEEncoder<br>controlnet: MultiControlNetManager<br>ipadapter_image_encoder: IpAdapterCLIPImageEmbedder<br>ipadapter: SDIpAdapter<br>motion_modules: SDMotionModel |
|
||||
| SDXLVideoPipeline | text_encoder: SDXLTextEncoder<br>text_encoder_2: SDXLTextEncoder2<br>text_encoder_kolors: ChatGLMModel<br>unet: SDXLUNet<br>vae_decoder: SDXLVAEDecoder<br>vae_encoder: SDXLVAEEncoder<br>ipadapter_image_encoder: IpAdapterXLCLIPImageEmbedder<br>ipadapter: SDXLIpAdapter<br>motion_modules: SDXLMotionModel |
|
||||
| SVDVideoPipeline | image_encoder: SVDImageEncoder<br>unet: SVDUNet<br>vae_encoder: SVDVAEEncoder<br>vae_decoder: SVDVAEDecoder |
|
||||
|
||||
|
||||
|
||||
35
docs/source_en/GetStarted/PromptProcessing.md
Normal file
35
docs/source_en/GetStarted/PromptProcessing.md
Normal file
@@ -0,0 +1,35 @@
|
||||
# Prompt Processing
|
||||
|
||||
DiffSynth includes prompt processing functionality, which is divided into:
|
||||
|
||||
- **Prompt Refiners (`prompt_refiner_classes`)**: Includes prompt refinement, prompt translation from Chinese to English, and both refinement and translation of prompts. Available parameters are as follows:
|
||||
|
||||
- **English Prompt Refinement**: 'BeautifulPrompt', using the model [pai-bloom-1b1-text2prompt-sd](https://modelscope.cn/models/AI-ModelScope/pai-bloom-1b1-text2prompt-sd).
|
||||
|
||||
- **Prompt Translation from Chinese to English**: 'Translator', using the model [opus-mt-zh-e](https://modelscope.cn/models/moxying/opus-mt-zh-en).
|
||||
|
||||
- **Prompt Translation and Refinement**: 'QwenPrompt', using the model [Qwen2-1.5B-Instruct](https://modelscope.cn/models/qwen/Qwen2-1.5B-Instruct).
|
||||
|
||||
- **Prompt Extenders (`prompt_extender_classes`)**: Based on Omost's prompt partition control expansion. Available parameter is:
|
||||
|
||||
- **Prompt Partition Expansion**: 'OmostPromter'.
|
||||
|
||||
## Usage Instructions
|
||||
|
||||
### Prompt Refiners
|
||||
|
||||
When loading the model pipeline, you can specify the desired prompt refiner functionality using the `prompt_refiner_classes` parameter. For example code, refer to [sd_prompt_refining.py](examples/image_synthesis/sd_prompt_refining.py).
|
||||
|
||||
Available `prompt_refiner_classes` parameters include: Translator, BeautifulPrompt, QwenPrompt.
|
||||
|
||||
```python
|
||||
pipe = SDXLImagePipeline.from_model_manager(model_manager, prompt_refiner_classes=[Translator, BeautifulPrompt])
|
||||
```
|
||||
|
||||
### Prompt Extenders
|
||||
|
||||
When loading the model pipeline, you can specify the desired prompt extender using the prompt_extender_classes parameter. For example code, refer to [omost_flux_text_to_image.py](examples/image_synthesis/omost_flux_text_to_image.py).
|
||||
|
||||
```python
|
||||
pipe = FluxImagePipeline.from_model_manager(model_manager, prompt_extender_classes=[OmostPromter])
|
||||
```
|
||||
11
docs/source_en/GetStarted/Schedulers.md
Normal file
11
docs/source_en/GetStarted/Schedulers.md
Normal file
@@ -0,0 +1,11 @@
|
||||
# Schedulers
|
||||
|
||||
Schedulers control the entire denoising (or sampling) process of the model. When loading the Pipeline, DiffSynth automatically selects the most suitable schedulers for the current Pipeline, requiring no additional configuration.
|
||||
|
||||
The supported schedulers are:
|
||||
|
||||
- **EnhancedDDIMScheduler**: Extends the denoising process introduced in the Denoising Diffusion Probabilistic Models (DDPM) with non-Markovian guidance.
|
||||
|
||||
- **FlowMatchScheduler**: Implements the flow matching sampling method introduced in Stable Diffusion 3.
|
||||
|
||||
- **ContinuousODEScheduler**: A scheduler based on Ordinary Differential Equations (ODE).
|
||||
0
docs/source_en/GetStarted/WebUI.md
Normal file
0
docs/source_en/GetStarted/WebUI.md
Normal file
50
docs/source_en/conf.py
Normal file
50
docs/source_en/conf.py
Normal file
@@ -0,0 +1,50 @@
|
||||
# Configuration file for the Sphinx documentation builder.
|
||||
#
|
||||
# For the full list of built-in configuration values, see the documentation:
|
||||
# https://www.sphinx-doc.org/en/master/usage/configuration.html
|
||||
|
||||
# -- Project information -----------------------------------------------------
|
||||
# https://www.sphinx-doc.org/en/master/usage/configuration.html#project-information
|
||||
|
||||
|
||||
import os
|
||||
import sys
|
||||
sys.path.insert(0, os.path.abspath('../../diffsynth'))
|
||||
|
||||
project = 'DiffSynth-Studio'
|
||||
copyright = '2024, ModelScope'
|
||||
author = 'ModelScope'
|
||||
release = '0.1.0'
|
||||
|
||||
|
||||
# -- General configuration ---------------------------------------------------
|
||||
# https://www.sphinx-doc.org/en/master/usage/configuration.html#general-configuration
|
||||
|
||||
extensions = [
|
||||
'sphinx.ext.autodoc',
|
||||
'sphinx.ext.napoleon',
|
||||
'sphinx.ext.doctest',
|
||||
'sphinx.ext.intersphinx',
|
||||
'sphinx.ext.todo',
|
||||
'sphinx.ext.coverage',
|
||||
'sphinx.ext.imgmath',
|
||||
'sphinx.ext.viewcode',
|
||||
'recommonmark',
|
||||
'sphinx_markdown_tables'
|
||||
]
|
||||
|
||||
templates_path = ['_templates']
|
||||
exclude_patterns = []
|
||||
|
||||
|
||||
source_suffix = ['.rst', '.md']
|
||||
# -- Options for HTML output -------------------------------------------------
|
||||
# https://www.sphinx-doc.org/en/master/usage/configuration.html#options-for-html-output
|
||||
|
||||
html_theme = 'sphinx_rtd_theme'
|
||||
html_static_path = ['_static']
|
||||
# multi-language docs
|
||||
language = 'en'
|
||||
locale_dirs = ['../locales/'] # path is example but recommended.
|
||||
gettext_compact = False # optional.
|
||||
gettext_uuid = True # optional.
|
||||
32
docs/source_en/index.rst
Normal file
32
docs/source_en/index.rst
Normal file
@@ -0,0 +1,32 @@
|
||||
.. DiffSynth-Studio documentation master file, created by
|
||||
sphinx-quickstart on Thu Sep 5 16:39:24 2024.
|
||||
You can adapt this file completely to your liking, but it should at least
|
||||
contain the root `toctree` directive.
|
||||
|
||||
DiffSynth-Studio documentation
|
||||
==============================
|
||||
|
||||
Add your content using ``reStructuredText`` syntax. See the
|
||||
`reStructuredText <https://www.sphinx-doc.org/en/master/usage/restructuredtext/index.html>`_
|
||||
documentation for details.
|
||||
|
||||
|
||||
.. toctree::
|
||||
:maxdepth: 1
|
||||
:caption: Contents:
|
||||
|
||||
GetStarted/A_simple_example.md
|
||||
GetStarted/Download_models.md
|
||||
GetStarted/ModelManager.md
|
||||
GetStarted/Models.md
|
||||
GetStarted/Pipelines.md
|
||||
GetStarted/PromptProcessing.md
|
||||
GetStarted/Schedulers.md
|
||||
GetStarted/Fine-tuning.md
|
||||
GetStarted/Extensions.md
|
||||
GetStarted/WebUI.md
|
||||
|
||||
|
||||
.. toctree::
|
||||
:maxdepth: 1
|
||||
:caption: API Docs
|
||||
4
docs/source_en/requirement.txt
Normal file
4
docs/source_en/requirement.txt
Normal file
@@ -0,0 +1,4 @@
|
||||
recommonmark
|
||||
sphinx_rtd_theme
|
||||
myst-parser
|
||||
sphinx-markdown-tables
|
||||
Reference in New Issue
Block a user