mirror of
https://github.com/modelscope/DiffSynth-Studio.git
synced 2026-04-13 13:05:45 +00:00
Support ERNIE-Image (#1389)
* ernie-image pipeline * ernie-image inference and training * style fix * ernie docs * lowvram * final style fix * pr-review * pr-fix round2 * set uniform training weight * fix * update lowvram docs
This commit is contained in:
61
README.md
61
README.md
@@ -33,6 +33,7 @@ We believe that a well-developed open-source code framework can lower the thresh
|
||||
> DiffSynth-Studio has undergone major version updates, and some old features are no longer maintained. If you need to use old features, please switch to the [last historical version](https://github.com/modelscope/DiffSynth-Studio/tree/afd101f3452c9ecae0c87b79adfa2e22d65ffdc3) before the major version update.
|
||||
|
||||
> Currently, the development personnel of this project are limited, with most of the work handled by [Artiprocher](https://github.com/Artiprocher) and [mi804](https://github.com/mi804). Therefore, the progress of new feature development will be relatively slow, and the speed of responding to and resolving issues is limited. We apologize for this and ask developers to understand.
|
||||
|
||||
- **March 19, 2026**: Added support for [openmoss/MOVA-720p](https://modelscope.cn/models/openmoss/MOVA-720p) and [openmoss/MOVA-360p](https://modelscope.cn/models/openmoss/MOVA-360p) models, including training and inference capabilities. [Documentation](/docs/en/Model_Details/Wan.md) and [example code](/examples/mova/) are now available.
|
||||
|
||||
- **March 12, 2026**: We have added support for the [LTX-2.3](https://modelscope.cn/models/Lightricks/LTX-2.3) audio-video generation model. The features includes text-to-audio/video, image-to-audio/video, IC-LoRA control, audio-to-video, and audio-video inpainting. We have supported the complete inference and training functionalities. For details, please refer to the [documentation](/docs/en/Model_Details/LTX-2.md) and [code](/examples/ltx2/).
|
||||
@@ -876,6 +877,66 @@ Example code for Wan is available at: [/examples/wanvideo/](/examples/wanvideo/)
|
||||
|
||||
</details>
|
||||
|
||||
#### ERNIE-Image: [/docs/en/Model_Details/ERNIE-Image.md](/docs/en/Model_Details/ERNIE-Image.md)
|
||||
|
||||
<details>
|
||||
|
||||
<summary>Quick Start</summary>
|
||||
|
||||
Running the following code will quickly load the [baidu/ERNIE-Image](https://www.modelscope.cn/models/baidu/ERNIE-Image) model and perform inference. VRAM management is enabled, and the framework will automatically control the loading of model parameters based on available VRAM. The model can run with a minimum of 3GB VRAM.
|
||||
|
||||
```python
|
||||
from diffsynth.pipelines.ernie_image import ErnieImagePipeline, ModelConfig
|
||||
import torch
|
||||
|
||||
vram_config = {
|
||||
"offload_dtype": torch.bfloat16,
|
||||
"offload_device": "cpu",
|
||||
"onload_dtype": torch.bfloat16,
|
||||
"onload_device": "cpu",
|
||||
"preparing_dtype": torch.bfloat16,
|
||||
"preparing_device": "cuda",
|
||||
"computation_dtype": torch.bfloat16,
|
||||
"computation_device": "cuda",
|
||||
}
|
||||
pipe = ErnieImagePipeline.from_pretrained(
|
||||
torch_dtype=torch.bfloat16,
|
||||
device='cuda',
|
||||
model_configs=[
|
||||
ModelConfig(model_id="baidu/ERNIE-Image", origin_file_pattern="transformer/diffusion_pytorch_model*.safetensors", **vram_config),
|
||||
ModelConfig(model_id="baidu/ERNIE-Image", origin_file_pattern="text_encoder/model.safetensors", **vram_config),
|
||||
ModelConfig(model_id="baidu/ERNIE-Image", origin_file_pattern="vae/diffusion_pytorch_model.safetensors", **vram_config),
|
||||
],
|
||||
tokenizer_config=ModelConfig(model_id="baidu/ERNIE-Image", origin_file_pattern="tokenizer/"),
|
||||
vram_limit=torch.cuda.mem_get_info("cuda")[1] / (1024 ** 3) - 0.5,
|
||||
)
|
||||
|
||||
image = pipe(
|
||||
prompt="一只黑白相间的中华田园犬",
|
||||
negative_prompt="",
|
||||
height=1024,
|
||||
width=1024,
|
||||
seed=42,
|
||||
num_inference_steps=50,
|
||||
cfg_scale=4.0,
|
||||
)
|
||||
image.save("output.jpg")
|
||||
```
|
||||
|
||||
</details>
|
||||
|
||||
<details>
|
||||
|
||||
<summary>Examples</summary>
|
||||
|
||||
Example code for ERNIE-Image is available at: [/examples/ernie_image/](/examples/ernie_image/)
|
||||
|
||||
| Model ID | Inference | Low VRAM Inference | Full Training | Full Training Validation | LoRA Training | LoRA Training Validation |
|
||||
|-|-|-|-|-|-|-|
|
||||
|[baidu/ERNIE-Image: T2I](https://www.modelscope.cn/models/baidu/ERNIE-Image)|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/ernie_image/model_inference/Ernie-Image-T2I.py)|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/ernie_image/model_inference_low_vram/Ernie-Image-T2I.py)|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/ernie_image/model_training/full/Ernie-Image-T2I.sh)|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/ernie_image/model_training/validate_full/Ernie-Image-T2I.py)|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/ernie_image/model_training/lora/Ernie-Image-T2I.sh)|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/ernie_image/model_training/validate_lora/Ernie-Image-T2I.py)|
|
||||
|
||||
</details>
|
||||
|
||||
## Innovative Achievements
|
||||
|
||||
DiffSynth-Studio is not just an engineered model framework, but also an incubator for innovative achievements.
|
||||
|
||||
60
README_zh.md
60
README_zh.md
@@ -877,6 +877,66 @@ Wan 的示例代码位于:[/examples/wanvideo/](/examples/wanvideo/)
|
||||
|
||||
</details>
|
||||
|
||||
#### ERNIE-Image: [/docs/zh/Model_Details/ERNIE-Image.md](/docs/zh/Model_Details/ERNIE-Image.md)
|
||||
|
||||
<details>
|
||||
|
||||
<summary>快速开始</summary>
|
||||
|
||||
运行以下代码可以快速加载 [baidu/ERNIE-Image](https://www.modelscope.cn/models/baidu/ERNIE-Image) 模型并进行推理。显存管理已启动,框架会自动根据剩余显存控制模型参数的加载,最低 3G 显存即可运行。
|
||||
|
||||
```python
|
||||
from diffsynth.pipelines.ernie_image import ErnieImagePipeline, ModelConfig
|
||||
import torch
|
||||
|
||||
vram_config = {
|
||||
"offload_dtype": torch.bfloat16,
|
||||
"offload_device": "cpu",
|
||||
"onload_dtype": torch.bfloat16,
|
||||
"onload_device": "cpu",
|
||||
"preparing_dtype": torch.bfloat16,
|
||||
"preparing_device": "cuda",
|
||||
"computation_dtype": torch.bfloat16,
|
||||
"computation_device": "cuda",
|
||||
}
|
||||
pipe = ErnieImagePipeline.from_pretrained(
|
||||
torch_dtype=torch.bfloat16,
|
||||
device='cuda',
|
||||
model_configs=[
|
||||
ModelConfig(model_id="baidu/ERNIE-Image", origin_file_pattern="transformer/diffusion_pytorch_model*.safetensors", **vram_config),
|
||||
ModelConfig(model_id="baidu/ERNIE-Image", origin_file_pattern="text_encoder/model.safetensors", **vram_config),
|
||||
ModelConfig(model_id="baidu/ERNIE-Image", origin_file_pattern="vae/diffusion_pytorch_model.safetensors", **vram_config),
|
||||
],
|
||||
tokenizer_config=ModelConfig(model_id="baidu/ERNIE-Image", origin_file_pattern="tokenizer/"),
|
||||
vram_limit=torch.cuda.mem_get_info("cuda")[1] / (1024 ** 3) - 0.5,
|
||||
)
|
||||
|
||||
image = pipe(
|
||||
prompt="一只黑白相间的中华田园犬",
|
||||
negative_prompt="",
|
||||
height=1024,
|
||||
width=1024,
|
||||
seed=42,
|
||||
num_inference_steps=50,
|
||||
cfg_scale=4.0,
|
||||
)
|
||||
image.save("output.jpg")
|
||||
```
|
||||
|
||||
</details>
|
||||
|
||||
<details>
|
||||
|
||||
<summary>示例代码</summary>
|
||||
|
||||
ERNIE-Image 的示例代码位于:[/examples/ernie_image/](/examples/ernie_image/)
|
||||
|
||||
| 模型 ID | 推理 | 低显存推理 | 全量训练 | 全量训练后验证 | LoRA 训练 | LoRA 训练后验证 |
|
||||
|-|-|-|-|-|-|-|
|
||||
|[baidu/ERNIE-Image: T2I](https://www.modelscope.cn/models/baidu/ERNIE-Image)|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/ernie_image/model_inference/Ernie-Image-T2I.py)|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/ernie_image/model_inference_low_vram/Ernie-Image-T2I.py)|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/ernie_image/model_training/full/Ernie-Image-T2I.sh)|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/ernie_image/model_training/validate_full/Ernie-Image-T2I.py)|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/ernie_image/model_training/lora/Ernie-Image-T2I.sh)|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/ernie_image/model_training/validate_lora/Ernie-Image-T2I.py)|
|
||||
|
||||
</details>
|
||||
|
||||
## 创新成果
|
||||
|
||||
DiffSynth-Studio 不仅仅是一个工程化的模型框架,更是创新成果的孵化器。
|
||||
|
||||
@@ -541,6 +541,22 @@ flux2_series = [
|
||||
},
|
||||
]
|
||||
|
||||
ernie_image_series = [
|
||||
{
|
||||
# Example: ModelConfig(model_id="baidu/ERNIE-Image", origin_file_pattern="transformer/diffusion_pytorch_model*.safetensors")
|
||||
"model_hash": "584c13713849f1af4e03d5f1858b8b7b",
|
||||
"model_name": "ernie_image_dit",
|
||||
"model_class": "diffsynth.models.ernie_image_dit.ErnieImageDiT",
|
||||
},
|
||||
{
|
||||
# Example: ModelConfig(model_id="baidu/ERNIE-Image", origin_file_pattern="text_encoder/model.safetensors")
|
||||
"model_hash": "404ed9f40796a38dd34c1620f1920207",
|
||||
"model_name": "ernie_image_text_encoder",
|
||||
"model_class": "diffsynth.models.ernie_image_text_encoder.ErnieImageTextEncoder",
|
||||
"state_dict_converter": "diffsynth.utils.state_dict_converters.ernie_image_text_encoder.ErnieImageTextEncoderStateDictConverter",
|
||||
},
|
||||
]
|
||||
|
||||
z_image_series = [
|
||||
{
|
||||
# Example: ModelConfig(model_id="Tongyi-MAI/Z-Image-Turbo", origin_file_pattern="transformer/*.safetensors")
|
||||
@@ -884,4 +900,4 @@ mova_series = [
|
||||
"model_class": "diffsynth.models.mova_dual_tower_bridge.DualTowerConditionalBridge",
|
||||
},
|
||||
]
|
||||
MODEL_CONFIGS = qwen_image_series + wan_series + flux_series + flux2_series + z_image_series + ltx2_series + anima_series + mova_series
|
||||
MODEL_CONFIGS = qwen_image_series + wan_series + flux_series + flux2_series + ernie_image_series + z_image_series + ltx2_series + anima_series + mova_series
|
||||
|
||||
@@ -267,6 +267,18 @@ VRAM_MANAGEMENT_MODULE_MAPS = {
|
||||
"torch.nn.Conv1d": "diffsynth.core.vram.layers.AutoWrappedModule",
|
||||
"torch.nn.ConvTranspose1d": "diffsynth.core.vram.layers.AutoWrappedModule",
|
||||
},
|
||||
"diffsynth.models.ernie_image_dit.ErnieImageDiT": {
|
||||
"diffsynth.models.ernie_image_dit.ErnieImageRMSNorm": "diffsynth.core.vram.layers.AutoWrappedModule",
|
||||
"torch.nn.Linear": "diffsynth.core.vram.layers.AutoWrappedLinear",
|
||||
"torch.nn.Conv2d": "diffsynth.core.vram.layers.AutoWrappedModule",
|
||||
"torch.nn.LayerNorm": "diffsynth.core.vram.layers.AutoWrappedModule",
|
||||
"torch.nn.RMSNorm": "diffsynth.core.vram.layers.AutoWrappedModule",
|
||||
},
|
||||
"diffsynth.models.ernie_image_text_encoder.ErnieImageTextEncoder": {
|
||||
"torch.nn.Linear": "diffsynth.core.vram.layers.AutoWrappedLinear",
|
||||
"torch.nn.Embedding": "diffsynth.core.vram.layers.AutoWrappedModule",
|
||||
"transformers.models.ministral3.modeling_ministral3.Ministral3RMSNorm": "diffsynth.core.vram.layers.AutoWrappedModule",
|
||||
},
|
||||
}
|
||||
|
||||
def QwenImageTextEncoder_Module_Map_Updater():
|
||||
|
||||
@@ -4,7 +4,7 @@ from typing_extensions import Literal
|
||||
|
||||
class FlowMatchScheduler():
|
||||
|
||||
def __init__(self, template: Literal["FLUX.1", "Wan", "Qwen-Image", "FLUX.2", "Z-Image", "LTX-2", "Qwen-Image-Lightning"] = "FLUX.1"):
|
||||
def __init__(self, template: Literal["FLUX.1", "Wan", "Qwen-Image", "FLUX.2", "Z-Image", "LTX-2", "Qwen-Image-Lightning", "ERNIE-Image"] = "FLUX.1"):
|
||||
self.set_timesteps_fn = {
|
||||
"FLUX.1": FlowMatchScheduler.set_timesteps_flux,
|
||||
"Wan": FlowMatchScheduler.set_timesteps_wan,
|
||||
@@ -13,6 +13,7 @@ class FlowMatchScheduler():
|
||||
"Z-Image": FlowMatchScheduler.set_timesteps_z_image,
|
||||
"LTX-2": FlowMatchScheduler.set_timesteps_ltx2,
|
||||
"Qwen-Image-Lightning": FlowMatchScheduler.set_timesteps_qwen_image_lightning,
|
||||
"ERNIE-Image": FlowMatchScheduler.set_timesteps_ernie_image,
|
||||
}.get(template, FlowMatchScheduler.set_timesteps_flux)
|
||||
self.num_train_timesteps = 1000
|
||||
|
||||
@@ -129,6 +130,15 @@ class FlowMatchScheduler():
|
||||
timesteps = sigmas * num_train_timesteps
|
||||
return sigmas, timesteps
|
||||
|
||||
@staticmethod
|
||||
def set_timesteps_ernie_image(num_inference_steps=50, denoising_strength=1.0):
|
||||
"""ERNIE-Image scheduler: pure linear sigmas from 1.0 to 0.0, no shift."""
|
||||
num_train_timesteps = 1000
|
||||
sigma_start = denoising_strength
|
||||
sigmas = torch.linspace(sigma_start, 0.0, num_inference_steps + 1)[:-1]
|
||||
timesteps = sigmas * num_train_timesteps
|
||||
return sigmas, timesteps
|
||||
|
||||
@staticmethod
|
||||
def set_timesteps_z_image(num_inference_steps=100, denoising_strength=1.0, shift=None, target_timesteps=None):
|
||||
sigma_min = 0.0
|
||||
@@ -175,6 +185,9 @@ class FlowMatchScheduler():
|
||||
return sigmas, timesteps
|
||||
|
||||
def set_training_weight(self):
|
||||
if self.set_timesteps_fn == FlowMatchScheduler.set_timesteps_ernie_image:
|
||||
self.set_uniform_training_weight()
|
||||
return
|
||||
steps = 1000
|
||||
x = self.timesteps
|
||||
y = torch.exp(-2 * ((x - steps / 2) / steps) ** 2)
|
||||
@@ -185,6 +198,13 @@ class FlowMatchScheduler():
|
||||
bsmntw_weighing = bsmntw_weighing * (len(self.timesteps) / steps)
|
||||
bsmntw_weighing = bsmntw_weighing + bsmntw_weighing[1]
|
||||
self.linear_timesteps_weights = bsmntw_weighing
|
||||
|
||||
def set_uniform_training_weight(self):
|
||||
"""Assign equal weight to every timestep, suitable for linear schedulers like ERNIE-Image."""
|
||||
steps = 1000
|
||||
num_steps = len(self.timesteps)
|
||||
uniform_weight = torch.full((num_steps,), steps / num_steps, dtype=self.timesteps.dtype)
|
||||
self.linear_timesteps_weights = uniform_weight
|
||||
|
||||
def set_timesteps(self, num_inference_steps=100, denoising_strength=1.0, training=False, **kwargs):
|
||||
self.sigmas, self.timesteps = self.set_timesteps_fn(
|
||||
|
||||
362
diffsynth/models/ernie_image_dit.py
Normal file
362
diffsynth/models/ernie_image_dit.py
Normal file
@@ -0,0 +1,362 @@
|
||||
"""
|
||||
Ernie-Image DiT for DiffSynth-Studio.
|
||||
|
||||
Refactored from diffusers ErnieImageTransformer2DModel to use DiffSynth core modules.
|
||||
Default parameters from actual checkpoint config.json (baidu/ERNIE-Image transformer).
|
||||
"""
|
||||
|
||||
import torch
|
||||
import torch.nn as nn
|
||||
import torch.nn.functional as F
|
||||
from typing import Optional, Tuple
|
||||
|
||||
from ..core.attention import attention_forward
|
||||
from ..core.gradient import gradient_checkpoint_forward
|
||||
from .flux2_dit import Timesteps, TimestepEmbedding
|
||||
|
||||
|
||||
def rope(pos: torch.Tensor, dim: int, theta: int) -> torch.Tensor:
|
||||
assert dim % 2 == 0
|
||||
scale = torch.arange(0, dim, 2, dtype=torch.float64, device=pos.device) / dim
|
||||
omega = 1.0 / (theta ** scale)
|
||||
out = torch.einsum("...n,d->...nd", pos, omega)
|
||||
return out.float()
|
||||
|
||||
|
||||
class ErnieImageEmbedND3(nn.Module):
|
||||
def __init__(self, dim: int, theta: int, axes_dim: Tuple[int, int, int]):
|
||||
super().__init__()
|
||||
self.dim = dim
|
||||
self.theta = theta
|
||||
self.axes_dim = list(axes_dim)
|
||||
|
||||
def forward(self, ids: torch.Tensor) -> torch.Tensor:
|
||||
emb = torch.cat([rope(ids[..., i], self.axes_dim[i], self.theta) for i in range(3)], dim=-1)
|
||||
emb = emb.unsqueeze(2)
|
||||
return torch.stack([emb, emb], dim=-1).reshape(*emb.shape[:-1], -1)
|
||||
|
||||
|
||||
class ErnieImagePatchEmbedDynamic(nn.Module):
|
||||
def __init__(self, in_channels: int, embed_dim: int, patch_size: int):
|
||||
super().__init__()
|
||||
self.patch_size = patch_size
|
||||
self.proj = nn.Conv2d(in_channels, embed_dim, kernel_size=patch_size, stride=patch_size, bias=True)
|
||||
|
||||
def forward(self, x: torch.Tensor) -> torch.Tensor:
|
||||
x = self.proj(x)
|
||||
batch_size, dim, height, width = x.shape
|
||||
return x.reshape(batch_size, dim, height * width).transpose(1, 2).contiguous()
|
||||
|
||||
|
||||
class ErnieImageSingleStreamAttnProcessor:
|
||||
def __call__(
|
||||
self,
|
||||
attn: "ErnieImageAttention",
|
||||
hidden_states: torch.Tensor,
|
||||
attention_mask: Optional[torch.Tensor] = None,
|
||||
freqs_cis: Optional[torch.Tensor] = None,
|
||||
) -> torch.Tensor:
|
||||
query = attn.to_q(hidden_states)
|
||||
key = attn.to_k(hidden_states)
|
||||
value = attn.to_v(hidden_states)
|
||||
|
||||
query = query.unflatten(-1, (attn.heads, -1))
|
||||
key = key.unflatten(-1, (attn.heads, -1))
|
||||
value = value.unflatten(-1, (attn.heads, -1))
|
||||
|
||||
if attn.norm_q is not None:
|
||||
query = attn.norm_q(query)
|
||||
if attn.norm_k is not None:
|
||||
key = attn.norm_k(key)
|
||||
|
||||
def apply_rotary_emb(x_in: torch.Tensor, freqs_cis: torch.Tensor) -> torch.Tensor:
|
||||
rot_dim = freqs_cis.shape[-1]
|
||||
x, x_pass = x_in[..., :rot_dim], x_in[..., rot_dim:]
|
||||
cos_ = torch.cos(freqs_cis).to(x.dtype)
|
||||
sin_ = torch.sin(freqs_cis).to(x.dtype)
|
||||
x1, x2 = x.chunk(2, dim=-1)
|
||||
x_rotated = torch.cat((-x2, x1), dim=-1)
|
||||
return torch.cat((x * cos_ + x_rotated * sin_, x_pass), dim=-1)
|
||||
|
||||
if freqs_cis is not None:
|
||||
query = apply_rotary_emb(query, freqs_cis)
|
||||
key = apply_rotary_emb(key, freqs_cis)
|
||||
|
||||
if attention_mask is not None and attention_mask.ndim == 2:
|
||||
attention_mask = attention_mask[:, None, None, :]
|
||||
|
||||
hidden_states = attention_forward(
|
||||
query, key, value,
|
||||
q_pattern="b s n d",
|
||||
k_pattern="b s n d",
|
||||
v_pattern="b s n d",
|
||||
out_pattern="b s n d",
|
||||
attn_mask=attention_mask,
|
||||
)
|
||||
|
||||
hidden_states = hidden_states.flatten(2, 3)
|
||||
hidden_states = hidden_states.to(query.dtype)
|
||||
output = attn.to_out[0](hidden_states)
|
||||
|
||||
return output
|
||||
|
||||
|
||||
class ErnieImageAttention(nn.Module):
|
||||
def __init__(
|
||||
self,
|
||||
query_dim: int,
|
||||
heads: int = 8,
|
||||
dim_head: int = 64,
|
||||
dropout: float = 0.0,
|
||||
bias: bool = False,
|
||||
qk_norm: str = "rms_norm",
|
||||
out_bias: bool = True,
|
||||
eps: float = 1e-5,
|
||||
out_dim: int = None,
|
||||
elementwise_affine: bool = True,
|
||||
):
|
||||
super().__init__()
|
||||
|
||||
self.head_dim = dim_head
|
||||
self.inner_dim = out_dim if out_dim is not None else dim_head * heads
|
||||
self.query_dim = query_dim
|
||||
self.out_dim = out_dim if out_dim is not None else query_dim
|
||||
self.heads = out_dim // dim_head if out_dim is not None else heads
|
||||
|
||||
self.use_bias = bias
|
||||
self.dropout = dropout
|
||||
|
||||
self.to_q = nn.Linear(query_dim, self.inner_dim, bias=bias)
|
||||
self.to_k = nn.Linear(query_dim, self.inner_dim, bias=bias)
|
||||
self.to_v = nn.Linear(query_dim, self.inner_dim, bias=bias)
|
||||
|
||||
if qk_norm == "layer_norm":
|
||||
self.norm_q = nn.LayerNorm(dim_head, eps=eps, elementwise_affine=elementwise_affine)
|
||||
self.norm_k = nn.LayerNorm(dim_head, eps=eps, elementwise_affine=elementwise_affine)
|
||||
elif qk_norm == "rms_norm":
|
||||
self.norm_q = nn.RMSNorm(dim_head, eps=eps, elementwise_affine=elementwise_affine)
|
||||
self.norm_k = nn.RMSNorm(dim_head, eps=eps, elementwise_affine=elementwise_affine)
|
||||
else:
|
||||
raise ValueError(
|
||||
f"unknown qk_norm: {qk_norm}. Should be one of None, 'layer_norm', 'rms_norm'."
|
||||
)
|
||||
|
||||
self.to_out = nn.ModuleList([])
|
||||
self.to_out.append(nn.Linear(self.inner_dim, self.out_dim, bias=out_bias))
|
||||
|
||||
self.processor = ErnieImageSingleStreamAttnProcessor()
|
||||
|
||||
def forward(
|
||||
self,
|
||||
hidden_states: torch.Tensor,
|
||||
attention_mask: Optional[torch.Tensor] = None,
|
||||
image_rotary_emb: Optional[torch.Tensor] = None,
|
||||
) -> torch.Tensor:
|
||||
return self.processor(self, hidden_states, attention_mask, image_rotary_emb)
|
||||
|
||||
|
||||
class ErnieImageFeedForward(nn.Module):
|
||||
def __init__(self, hidden_size: int, ffn_hidden_size: int):
|
||||
super().__init__()
|
||||
self.gate_proj = nn.Linear(hidden_size, ffn_hidden_size, bias=False)
|
||||
self.up_proj = nn.Linear(hidden_size, ffn_hidden_size, bias=False)
|
||||
self.linear_fc2 = nn.Linear(ffn_hidden_size, hidden_size, bias=False)
|
||||
|
||||
def forward(self, x: torch.Tensor) -> torch.Tensor:
|
||||
return self.linear_fc2(self.up_proj(x) * F.gelu(self.gate_proj(x)))
|
||||
|
||||
|
||||
class ErnieImageRMSNorm(nn.Module):
|
||||
def __init__(self, dim: int, eps: float = 1e-6):
|
||||
super().__init__()
|
||||
self.eps = eps
|
||||
self.weight = nn.Parameter(torch.ones(dim))
|
||||
|
||||
def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
|
||||
input_dtype = hidden_states.dtype
|
||||
variance = hidden_states.to(torch.float32).pow(2).mean(-1, keepdim=True)
|
||||
hidden_states = hidden_states * torch.rsqrt(variance + self.eps)
|
||||
hidden_states = hidden_states * self.weight
|
||||
return hidden_states.to(input_dtype)
|
||||
|
||||
|
||||
class ErnieImageSharedAdaLNBlock(nn.Module):
|
||||
def __init__(
|
||||
self,
|
||||
hidden_size: int,
|
||||
num_heads: int,
|
||||
ffn_hidden_size: int,
|
||||
eps: float = 1e-6,
|
||||
qk_layernorm: bool = True,
|
||||
):
|
||||
super().__init__()
|
||||
self.adaLN_sa_ln = ErnieImageRMSNorm(hidden_size, eps=eps)
|
||||
self.self_attention = ErnieImageAttention(
|
||||
query_dim=hidden_size,
|
||||
dim_head=hidden_size // num_heads,
|
||||
heads=num_heads,
|
||||
qk_norm="rms_norm" if qk_layernorm else None,
|
||||
eps=eps,
|
||||
bias=False,
|
||||
out_bias=False,
|
||||
)
|
||||
self.adaLN_mlp_ln = ErnieImageRMSNorm(hidden_size, eps=eps)
|
||||
self.mlp = ErnieImageFeedForward(hidden_size, ffn_hidden_size)
|
||||
|
||||
def forward(
|
||||
self,
|
||||
x: torch.Tensor,
|
||||
rotary_pos_emb: torch.Tensor,
|
||||
temb: Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor],
|
||||
attention_mask: Optional[torch.Tensor] = None,
|
||||
) -> torch.Tensor:
|
||||
shift_msa, scale_msa, gate_msa, shift_mlp, scale_mlp, gate_mlp = temb
|
||||
residual = x
|
||||
x = self.adaLN_sa_ln(x)
|
||||
x = (x.float() * (1 + scale_msa.float()) + shift_msa.float()).to(x.dtype)
|
||||
x_bsh = x.permute(1, 0, 2)
|
||||
attn_out = self.self_attention(x_bsh, attention_mask=attention_mask, image_rotary_emb=rotary_pos_emb)
|
||||
attn_out = attn_out.permute(1, 0, 2)
|
||||
x = residual + (gate_msa.float() * attn_out.float()).to(x.dtype)
|
||||
residual = x
|
||||
x = self.adaLN_mlp_ln(x)
|
||||
x = (x.float() * (1 + scale_mlp.float()) + shift_mlp.float()).to(x.dtype)
|
||||
return residual + (gate_mlp.float() * self.mlp(x).float()).to(x.dtype)
|
||||
|
||||
|
||||
class ErnieImageAdaLNContinuous(nn.Module):
|
||||
def __init__(self, hidden_size: int, eps: float = 1e-6):
|
||||
super().__init__()
|
||||
self.norm = nn.LayerNorm(hidden_size, elementwise_affine=False, eps=eps)
|
||||
self.linear = nn.Linear(hidden_size, hidden_size * 2)
|
||||
|
||||
def forward(self, x: torch.Tensor, conditioning: torch.Tensor) -> torch.Tensor:
|
||||
scale, shift = self.linear(conditioning).chunk(2, dim=-1)
|
||||
x = self.norm(x)
|
||||
x = x * (1 + scale.unsqueeze(0)) + shift.unsqueeze(0)
|
||||
return x
|
||||
|
||||
|
||||
class ErnieImageDiT(nn.Module):
|
||||
"""
|
||||
Ernie-Image DiT model for DiffSynth-Studio.
|
||||
|
||||
Architecture: SharedAdaLN + RoPE 3D + Joint Image-Text Attention.
|
||||
Internal format: [S, B, H] for transformer blocks, [B, S, H] for attention.
|
||||
"""
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
hidden_size: int = 4096,
|
||||
num_attention_heads: int = 32,
|
||||
num_layers: int = 36,
|
||||
ffn_hidden_size: int = 12288,
|
||||
in_channels: int = 128,
|
||||
out_channels: int = 128,
|
||||
patch_size: int = 1,
|
||||
text_in_dim: int = 3072,
|
||||
rope_theta: int = 256,
|
||||
rope_axes_dim: Tuple[int, int, int] = (32, 48, 48),
|
||||
eps: float = 1e-6,
|
||||
qk_layernorm: bool = True,
|
||||
):
|
||||
super().__init__()
|
||||
self.hidden_size = hidden_size
|
||||
self.num_heads = num_attention_heads
|
||||
self.head_dim = hidden_size // num_attention_heads
|
||||
self.num_layers = num_layers
|
||||
self.patch_size = patch_size
|
||||
self.in_channels = in_channels
|
||||
self.out_channels = out_channels
|
||||
self.text_in_dim = text_in_dim
|
||||
|
||||
self.x_embedder = ErnieImagePatchEmbedDynamic(in_channels, hidden_size, patch_size)
|
||||
self.text_proj = nn.Linear(text_in_dim, hidden_size, bias=False) if text_in_dim != hidden_size else None
|
||||
self.time_proj = Timesteps(hidden_size, flip_sin_to_cos=False, downscale_freq_shift=0)
|
||||
self.time_embedding = TimestepEmbedding(hidden_size, hidden_size)
|
||||
self.pos_embed = ErnieImageEmbedND3(dim=self.head_dim, theta=rope_theta, axes_dim=rope_axes_dim)
|
||||
self.adaLN_modulation = nn.Sequential(nn.SiLU(), nn.Linear(hidden_size, 6 * hidden_size))
|
||||
nn.init.zeros_(self.adaLN_modulation[-1].weight)
|
||||
nn.init.zeros_(self.adaLN_modulation[-1].bias)
|
||||
self.layers = nn.ModuleList([
|
||||
ErnieImageSharedAdaLNBlock(hidden_size, num_attention_heads, ffn_hidden_size, eps, qk_layernorm=qk_layernorm)
|
||||
for _ in range(num_layers)
|
||||
])
|
||||
self.final_norm = ErnieImageAdaLNContinuous(hidden_size, eps)
|
||||
self.final_linear = nn.Linear(hidden_size, patch_size * patch_size * out_channels)
|
||||
nn.init.zeros_(self.final_linear.weight)
|
||||
nn.init.zeros_(self.final_linear.bias)
|
||||
|
||||
def forward(
|
||||
self,
|
||||
hidden_states: torch.Tensor,
|
||||
timestep: torch.Tensor,
|
||||
text_bth: torch.Tensor,
|
||||
text_lens: torch.Tensor,
|
||||
use_gradient_checkpointing: bool = False,
|
||||
use_gradient_checkpointing_offload: bool = False,
|
||||
) -> torch.Tensor:
|
||||
device, dtype = hidden_states.device, hidden_states.dtype
|
||||
B, C, H, W = hidden_states.shape
|
||||
p, Hp, Wp = self.patch_size, H // self.patch_size, W // self.patch_size
|
||||
N_img = Hp * Wp
|
||||
|
||||
img_sbh = self.x_embedder(hidden_states).transpose(0, 1).contiguous()
|
||||
|
||||
if self.text_proj is not None and text_bth.numel() > 0:
|
||||
text_bth = self.text_proj(text_bth)
|
||||
Tmax = text_bth.shape[1]
|
||||
text_sbh = text_bth.transpose(0, 1).contiguous()
|
||||
|
||||
x = torch.cat([img_sbh, text_sbh], dim=0)
|
||||
S = x.shape[0]
|
||||
|
||||
text_ids = torch.cat([
|
||||
torch.arange(Tmax, device=device, dtype=torch.float32).view(1, Tmax, 1).expand(B, -1, -1),
|
||||
torch.zeros((B, Tmax, 2), device=device)
|
||||
], dim=-1) if Tmax > 0 else torch.zeros((B, 0, 3), device=device)
|
||||
grid_yx = torch.stack(
|
||||
torch.meshgrid(torch.arange(Hp, device=device, dtype=torch.float32),
|
||||
torch.arange(Wp, device=device, dtype=torch.float32), indexing="ij"),
|
||||
dim=-1
|
||||
).reshape(-1, 2)
|
||||
image_ids = torch.cat([
|
||||
text_lens.float().view(B, 1, 1).expand(-1, N_img, -1),
|
||||
grid_yx.view(1, N_img, 2).expand(B, -1, -1)
|
||||
], dim=-1)
|
||||
rotary_pos_emb = self.pos_embed(torch.cat([image_ids, text_ids], dim=1))
|
||||
|
||||
valid_text = torch.arange(Tmax, device=device).view(1, Tmax) < text_lens.view(B, 1) if Tmax > 0 else torch.zeros((B, 0), device=device, dtype=torch.bool)
|
||||
attention_mask = torch.cat([
|
||||
torch.ones((B, N_img), device=device, dtype=torch.bool),
|
||||
valid_text
|
||||
], dim=1)[:, None, None, :]
|
||||
|
||||
sample = self.time_proj(timestep.to(dtype))
|
||||
sample = sample.to(self.time_embedding.linear_1.weight.dtype)
|
||||
c = self.time_embedding(sample)
|
||||
shift_msa, scale_msa, gate_msa, shift_mlp, scale_mlp, gate_mlp = [
|
||||
t.unsqueeze(0).expand(S, -1, -1).contiguous()
|
||||
for t in self.adaLN_modulation(c).chunk(6, dim=-1)
|
||||
]
|
||||
|
||||
for layer in self.layers:
|
||||
temb = [shift_msa, scale_msa, gate_msa, shift_mlp, scale_mlp, gate_mlp]
|
||||
if torch.is_grad_enabled() and use_gradient_checkpointing:
|
||||
x = gradient_checkpoint_forward(
|
||||
layer,
|
||||
use_gradient_checkpointing,
|
||||
use_gradient_checkpointing_offload,
|
||||
x,
|
||||
rotary_pos_emb,
|
||||
temb,
|
||||
attention_mask,
|
||||
)
|
||||
else:
|
||||
x = layer(x, rotary_pos_emb, temb, attention_mask)
|
||||
|
||||
x = self.final_norm(x, c).type_as(x)
|
||||
patches = self.final_linear(x)[:N_img].transpose(0, 1).contiguous()
|
||||
output = patches.view(B, Hp, Wp, p, p, self.out_channels).permute(0, 5, 1, 3, 2, 4).contiguous().view(B, self.out_channels, H, W)
|
||||
|
||||
return output
|
||||
76
diffsynth/models/ernie_image_text_encoder.py
Normal file
76
diffsynth/models/ernie_image_text_encoder.py
Normal file
@@ -0,0 +1,76 @@
|
||||
"""
|
||||
Ernie-Image TextEncoder for DiffSynth-Studio.
|
||||
|
||||
Wraps transformers Ministral3Model to output text embeddings.
|
||||
Pattern: lazy import + manual config dict + torch.nn.Module wrapper.
|
||||
Only loads the text (language) model, ignoring vision components.
|
||||
"""
|
||||
|
||||
import torch
|
||||
|
||||
|
||||
class ErnieImageTextEncoder(torch.nn.Module):
|
||||
"""
|
||||
Text encoder using Ministral3Model (transformers).
|
||||
Only the text_config portion of the full Mistral3Model checkpoint.
|
||||
Uses the base model (no lm_head) since the checkpoint only has embeddings.
|
||||
"""
|
||||
|
||||
def __init__(self):
|
||||
super().__init__()
|
||||
from transformers import Ministral3Config, Ministral3Model
|
||||
|
||||
text_config = {
|
||||
"attention_dropout": 0.0,
|
||||
"bos_token_id": 1,
|
||||
"dtype": "bfloat16",
|
||||
"eos_token_id": 2,
|
||||
"head_dim": 128,
|
||||
"hidden_act": "silu",
|
||||
"hidden_size": 3072,
|
||||
"initializer_range": 0.02,
|
||||
"intermediate_size": 9216,
|
||||
"max_position_embeddings": 262144,
|
||||
"model_type": "ministral3",
|
||||
"num_attention_heads": 32,
|
||||
"num_hidden_layers": 26,
|
||||
"num_key_value_heads": 8,
|
||||
"pad_token_id": 11,
|
||||
"rms_norm_eps": 1e-05,
|
||||
"rope_parameters": {
|
||||
"beta_fast": 32.0,
|
||||
"beta_slow": 1.0,
|
||||
"factor": 16.0,
|
||||
"llama_4_scaling_beta": 0.1,
|
||||
"mscale": 1.0,
|
||||
"mscale_all_dim": 1.0,
|
||||
"original_max_position_embeddings": 16384,
|
||||
"rope_theta": 1000000.0,
|
||||
"rope_type": "yarn",
|
||||
"type": "yarn",
|
||||
},
|
||||
"sliding_window": None,
|
||||
"tie_word_embeddings": True,
|
||||
"use_cache": True,
|
||||
"vocab_size": 131072,
|
||||
}
|
||||
config = Ministral3Config(**text_config)
|
||||
self.model = Ministral3Model(config)
|
||||
self.config = config
|
||||
|
||||
def forward(
|
||||
self,
|
||||
input_ids=None,
|
||||
attention_mask=None,
|
||||
position_ids=None,
|
||||
**kwargs,
|
||||
):
|
||||
outputs = self.model(
|
||||
input_ids=input_ids,
|
||||
attention_mask=attention_mask,
|
||||
position_ids=position_ids,
|
||||
output_hidden_states=True,
|
||||
return_dict=True,
|
||||
**kwargs,
|
||||
)
|
||||
return (outputs.hidden_states,)
|
||||
265
diffsynth/pipelines/ernie_image.py
Normal file
265
diffsynth/pipelines/ernie_image.py
Normal file
@@ -0,0 +1,265 @@
|
||||
"""
|
||||
ERNIE-Image Text-to-Image Pipeline for DiffSynth-Studio.
|
||||
|
||||
Architecture: SharedAdaLN DiT + RoPE 3D + Joint Image-Text Attention.
|
||||
"""
|
||||
|
||||
import torch
|
||||
from typing import Union, Optional
|
||||
from tqdm import tqdm
|
||||
from transformers import AutoTokenizer
|
||||
|
||||
from ..core.device.npu_compatible_device import get_device_type
|
||||
from ..diffusion import FlowMatchScheduler
|
||||
from ..core import ModelConfig
|
||||
from ..diffusion.base_pipeline import BasePipeline, PipelineUnit
|
||||
from ..models.ernie_image_text_encoder import ErnieImageTextEncoder
|
||||
from ..models.ernie_image_dit import ErnieImageDiT
|
||||
from ..models.flux2_vae import Flux2VAE
|
||||
|
||||
|
||||
class ErnieImagePipeline(BasePipeline):
|
||||
|
||||
def __init__(self, device=get_device_type(), torch_dtype=torch.bfloat16):
|
||||
super().__init__(
|
||||
device=device, torch_dtype=torch_dtype,
|
||||
height_division_factor=16, width_division_factor=16,
|
||||
)
|
||||
self.scheduler = FlowMatchScheduler("ERNIE-Image")
|
||||
self.text_encoder: ErnieImageTextEncoder = None
|
||||
self.dit: ErnieImageDiT = None
|
||||
self.vae: Flux2VAE = None
|
||||
self.tokenizer: AutoTokenizer = None
|
||||
|
||||
self.in_iteration_models = ("dit",)
|
||||
self.units = [
|
||||
ErnieImageUnit_ShapeChecker(),
|
||||
ErnieImageUnit_PromptEmbedder(),
|
||||
ErnieImageUnit_NoiseInitializer(),
|
||||
ErnieImageUnit_InputImageEmbedder(),
|
||||
]
|
||||
self.model_fn = model_fn_ernie_image
|
||||
self.compilable_models = ["dit"]
|
||||
|
||||
@staticmethod
|
||||
def from_pretrained(
|
||||
torch_dtype: torch.dtype = torch.bfloat16,
|
||||
device: Union[str, torch.device] = get_device_type(),
|
||||
model_configs: list[ModelConfig] = [],
|
||||
tokenizer_config: ModelConfig = ModelConfig(model_id="baidu/ERNIE-Image", origin_file_pattern="tokenizer/"),
|
||||
vram_limit: float = None,
|
||||
):
|
||||
pipe = ErnieImagePipeline(device=device, torch_dtype=torch_dtype)
|
||||
model_pool = pipe.download_and_load_models(model_configs, vram_limit)
|
||||
|
||||
pipe.text_encoder = model_pool.fetch_model("ernie_image_text_encoder")
|
||||
pipe.dit = model_pool.fetch_model("ernie_image_dit")
|
||||
pipe.vae = model_pool.fetch_model("flux2_vae")
|
||||
|
||||
if tokenizer_config is not None:
|
||||
tokenizer_config.download_if_necessary()
|
||||
pipe.tokenizer = AutoTokenizer.from_pretrained(tokenizer_config.path)
|
||||
|
||||
pipe.vram_management_enabled = pipe.check_vram_management_state()
|
||||
return pipe
|
||||
|
||||
@torch.no_grad()
|
||||
def __call__(
|
||||
self,
|
||||
# Prompt
|
||||
prompt: str,
|
||||
negative_prompt: str = "",
|
||||
cfg_scale: float = 4.0,
|
||||
# Shape
|
||||
height: int = 1024,
|
||||
width: int = 1024,
|
||||
# Randomness
|
||||
seed: int = None,
|
||||
rand_device: str = "cuda",
|
||||
# Steps
|
||||
num_inference_steps: int = 50,
|
||||
# Progress bar
|
||||
progress_bar_cmd=tqdm,
|
||||
):
|
||||
# Scheduler
|
||||
self.scheduler.set_timesteps(num_inference_steps=num_inference_steps)
|
||||
|
||||
# Parameters
|
||||
inputs_posi = {"prompt": prompt}
|
||||
inputs_nega = {"negative_prompt": negative_prompt}
|
||||
inputs_shared = {
|
||||
"height": height, "width": width, "seed": seed,
|
||||
"cfg_scale": cfg_scale, "num_inference_steps": num_inference_steps,
|
||||
"rand_device": rand_device,
|
||||
}
|
||||
for unit in self.units:
|
||||
inputs_shared, inputs_posi, inputs_nega = self.unit_runner(unit, self, inputs_shared, inputs_posi, inputs_nega)
|
||||
|
||||
# Denoise
|
||||
self.load_models_to_device(self.in_iteration_models)
|
||||
models = {name: getattr(self, name) for name in self.in_iteration_models}
|
||||
for progress_id, timestep in enumerate(progress_bar_cmd(self.scheduler.timesteps)):
|
||||
timestep = timestep.unsqueeze(0).to(dtype=self.torch_dtype, device=self.device)
|
||||
noise_pred = self.cfg_guided_model_fn(
|
||||
self.model_fn, cfg_scale,
|
||||
inputs_shared, inputs_posi, inputs_nega,
|
||||
**models, timestep=timestep, progress_id=progress_id
|
||||
)
|
||||
inputs_shared["latents"] = self.step(self.scheduler, progress_id=progress_id, noise_pred=noise_pred, **inputs_shared)
|
||||
|
||||
# Decode
|
||||
self.load_models_to_device(['vae'])
|
||||
latents = inputs_shared["latents"]
|
||||
image = self.vae.decode(latents)
|
||||
image = self.vae_output_to_image(image)
|
||||
self.load_models_to_device([])
|
||||
return image
|
||||
|
||||
|
||||
class ErnieImageUnit_ShapeChecker(PipelineUnit):
|
||||
def __init__(self):
|
||||
super().__init__(
|
||||
input_params=("height", "width"),
|
||||
output_params=("height", "width"),
|
||||
)
|
||||
|
||||
def process(self, pipe: ErnieImagePipeline, height, width):
|
||||
height, width = pipe.check_resize_height_width(height, width)
|
||||
return {"height": height, "width": width}
|
||||
|
||||
|
||||
class ErnieImageUnit_PromptEmbedder(PipelineUnit):
|
||||
def __init__(self):
|
||||
super().__init__(
|
||||
seperate_cfg=True,
|
||||
input_params_posi={"prompt": "prompt"},
|
||||
input_params_nega={"prompt": "negative_prompt"},
|
||||
output_params=("prompt_embeds", "prompt_embeds_mask"),
|
||||
onload_model_names=("text_encoder",)
|
||||
)
|
||||
|
||||
def encode_prompt(self, pipe: ErnieImagePipeline, prompt):
|
||||
if isinstance(prompt, str):
|
||||
prompt = [prompt]
|
||||
|
||||
text_hiddens = []
|
||||
text_lens_list = []
|
||||
for p in prompt:
|
||||
ids = pipe.tokenizer(
|
||||
p,
|
||||
add_special_tokens=True,
|
||||
truncation=True,
|
||||
padding=False,
|
||||
)["input_ids"]
|
||||
|
||||
if len(ids) == 0:
|
||||
if pipe.tokenizer.bos_token_id is not None:
|
||||
ids = [pipe.tokenizer.bos_token_id]
|
||||
else:
|
||||
ids = [0]
|
||||
|
||||
input_ids = torch.tensor([ids], device=pipe.device)
|
||||
outputs = pipe.text_encoder(
|
||||
input_ids=input_ids,
|
||||
)
|
||||
# Text encoder returns tuple of (hidden_states_tuple,) where each layer's hidden state is included
|
||||
all_hidden_states = outputs[0]
|
||||
hidden = all_hidden_states[-2][0] # [T, H] - second to last layer
|
||||
text_hiddens.append(hidden)
|
||||
text_lens_list.append(hidden.shape[0])
|
||||
|
||||
# Pad to uniform length
|
||||
if len(text_hiddens) == 0:
|
||||
text_in_dim = pipe.text_encoder.config.hidden_size if hasattr(pipe.text_encoder, 'config') else 3072
|
||||
return {
|
||||
"prompt_embeds": torch.zeros((0, 0, text_in_dim), device=pipe.device, dtype=pipe.torch_dtype),
|
||||
"prompt_embeds_mask": torch.zeros((0,), device=pipe.device, dtype=torch.long),
|
||||
}
|
||||
|
||||
normalized = [th.to(pipe.device).to(pipe.torch_dtype) for th in text_hiddens]
|
||||
text_lens = torch.tensor([t.shape[0] for t in normalized], device=pipe.device, dtype=torch.long)
|
||||
Tmax = int(text_lens.max().item())
|
||||
text_in_dim = normalized[0].shape[1]
|
||||
text_bth = torch.zeros((len(normalized), Tmax, text_in_dim), device=pipe.device, dtype=pipe.torch_dtype)
|
||||
for i, t in enumerate(normalized):
|
||||
text_bth[i, :t.shape[0], :] = t
|
||||
|
||||
return {"prompt_embeds": text_bth, "prompt_embeds_mask": text_lens}
|
||||
|
||||
def process(self, pipe: ErnieImagePipeline, prompt):
|
||||
pipe.load_models_to_device(self.onload_model_names)
|
||||
if pipe.text_encoder is not None:
|
||||
return self.encode_prompt(pipe, prompt)
|
||||
return {}
|
||||
|
||||
|
||||
class ErnieImageUnit_NoiseInitializer(PipelineUnit):
|
||||
def __init__(self):
|
||||
super().__init__(
|
||||
input_params=("height", "width", "seed", "rand_device"),
|
||||
output_params=("noise",),
|
||||
)
|
||||
|
||||
def process(self, pipe: ErnieImagePipeline, height, width, seed, rand_device):
|
||||
latent_h = height // pipe.height_division_factor
|
||||
latent_w = width // pipe.width_division_factor
|
||||
latent_channels = pipe.dit.in_channels
|
||||
|
||||
# Use pipeline device if rand_device is not specified
|
||||
if rand_device is None:
|
||||
rand_device = str(pipe.device)
|
||||
|
||||
noise = pipe.generate_noise(
|
||||
(1, latent_channels, latent_h, latent_w),
|
||||
seed=seed,
|
||||
rand_device=rand_device,
|
||||
rand_torch_dtype=pipe.torch_dtype,
|
||||
)
|
||||
return {"noise": noise}
|
||||
|
||||
|
||||
class ErnieImageUnit_InputImageEmbedder(PipelineUnit):
|
||||
def __init__(self):
|
||||
super().__init__(
|
||||
input_params=("input_image", "noise"),
|
||||
output_params=("latents", "input_latents"),
|
||||
onload_model_names=("vae",)
|
||||
)
|
||||
|
||||
def process(self, pipe: ErnieImagePipeline, input_image, noise):
|
||||
if input_image is None:
|
||||
# T2I path: use noise directly as initial latents
|
||||
return {"latents": noise, "input_latents": None}
|
||||
|
||||
# I2I path: VAE encode input image
|
||||
pipe.load_models_to_device(['vae'])
|
||||
image = pipe.preprocess_image(input_image).to(device=pipe.device, dtype=pipe.torch_dtype)
|
||||
input_latents = pipe.vae.encode(image)
|
||||
|
||||
if pipe.scheduler.training:
|
||||
return {"latents": noise, "input_latents": input_latents}
|
||||
else:
|
||||
# In inference mode, add noise to encoded latents
|
||||
latents = pipe.scheduler.add_noise(input_latents, noise, timestep=pipe.scheduler.timesteps[0])
|
||||
return {"latents": latents}
|
||||
|
||||
|
||||
def model_fn_ernie_image(
|
||||
dit: ErnieImageDiT,
|
||||
latents=None,
|
||||
timestep=None,
|
||||
prompt_embeds=None,
|
||||
prompt_embeds_mask=None,
|
||||
use_gradient_checkpointing=False,
|
||||
use_gradient_checkpointing_offload=False,
|
||||
**kwargs,
|
||||
):
|
||||
output = dit(
|
||||
hidden_states=latents,
|
||||
timestep=timestep,
|
||||
text_bth=prompt_embeds,
|
||||
text_lens=prompt_embeds_mask,
|
||||
use_gradient_checkpointing=use_gradient_checkpointing,
|
||||
use_gradient_checkpointing_offload=use_gradient_checkpointing_offload,
|
||||
)
|
||||
return output
|
||||
@@ -0,0 +1,21 @@
|
||||
def ErnieImageTextEncoderStateDictConverter(state_dict):
|
||||
"""
|
||||
Maps checkpoint keys from multimodal Mistral3Model format
|
||||
to text-only Ministral3Model format.
|
||||
|
||||
Checkpoint keys (Mistral3Model):
|
||||
language_model.model.layers.0.input_layernorm.weight
|
||||
language_model.model.norm.weight
|
||||
|
||||
Model keys (ErnieImageTextEncoder → self.model = Ministral3Model):
|
||||
model.layers.0.input_layernorm.weight
|
||||
model.norm.weight
|
||||
|
||||
Mapping: language_model. → model.
|
||||
"""
|
||||
new_state_dict = {}
|
||||
for key in state_dict:
|
||||
if key.startswith("language_model.model."):
|
||||
new_key = key.replace("language_model.model.", "model.", 1)
|
||||
new_state_dict[new_key] = state_dict[key]
|
||||
return new_state_dict
|
||||
133
docs/en/Model_Details/ERNIE-Image.md
Normal file
133
docs/en/Model_Details/ERNIE-Image.md
Normal file
@@ -0,0 +1,133 @@
|
||||
# ERNIE-Image
|
||||
|
||||
ERNIE-Image is a powerful image generation model with 8B parameters developed by Baidu, featuring a compact and efficient architecture with strong instruction-following capability. Based on an 8B DiT backbone, it delivers performance comparable to larger (20B+) models in certain scenarios while maintaining parameter efficiency. It offers reliable performance in instruction understanding and execution, text generation (English/Chinese/Japanese), and overall stability.
|
||||
|
||||
## Installation
|
||||
|
||||
Before performing model inference and training, please install DiffSynth-Studio first.
|
||||
|
||||
```shell
|
||||
git clone https://github.com/modelscope/DiffSynth-Studio.git
|
||||
cd DiffSynth-Studio
|
||||
pip install -e .
|
||||
```
|
||||
|
||||
For more information on installation, please refer to [Setup Dependencies](../Pipeline_Usage/Setup.md).
|
||||
|
||||
## Quick Start
|
||||
|
||||
Running the following code will load the [baidu/ERNIE-Image](https://www.modelscope.cn/models/baidu/ERNIE-Image) model for inference. VRAM management is enabled, the framework automatically controls parameter loading based on available VRAM, requiring a minimum of 3G VRAM.
|
||||
|
||||
```python
|
||||
from diffsynth.pipelines.ernie_image import ErnieImagePipeline, ModelConfig
|
||||
import torch
|
||||
|
||||
vram_config = {
|
||||
"offload_dtype": torch.bfloat16,
|
||||
"offload_device": "cpu",
|
||||
"onload_dtype": torch.bfloat16,
|
||||
"onload_device": "cpu",
|
||||
"preparing_dtype": torch.bfloat16,
|
||||
"preparing_device": "cuda",
|
||||
"computation_dtype": torch.bfloat16,
|
||||
"computation_device": "cuda",
|
||||
}
|
||||
pipe = ErnieImagePipeline.from_pretrained(
|
||||
torch_dtype=torch.bfloat16,
|
||||
device='cuda',
|
||||
model_configs=[
|
||||
ModelConfig(model_id="baidu/ERNIE-Image", origin_file_pattern="transformer/diffusion_pytorch_model*.safetensors", **vram_config),
|
||||
ModelConfig(model_id="baidu/ERNIE-Image", origin_file_pattern="text_encoder/model.safetensors", **vram_config),
|
||||
ModelConfig(model_id="baidu/ERNIE-Image", origin_file_pattern="vae/diffusion_pytorch_model.safetensors", **vram_config),
|
||||
],
|
||||
tokenizer_config=ModelConfig(model_id="baidu/ERNIE-Image", origin_file_pattern="tokenizer/"),
|
||||
vram_limit=torch.cuda.mem_get_info("cuda")[1] / (1024 ** 3) - 0.5,
|
||||
)
|
||||
|
||||
image = pipe(
|
||||
prompt="一只黑白相间的中华田园犬",
|
||||
negative_prompt="",
|
||||
height=1024,
|
||||
width=1024,
|
||||
seed=42,
|
||||
num_inference_steps=50,
|
||||
cfg_scale=4.0,
|
||||
)
|
||||
image.save("output.jpg")
|
||||
```
|
||||
|
||||
## Model Overview
|
||||
|
||||
|Model ID|Inference|Low VRAM Inference|Full Training|Full Training Validation|LoRA Training|LoRA Training Validation|
|
||||
|-|-|-|-|-|-|-|
|
||||
|[baidu/ERNIE-Image: T2I](https://www.modelscope.cn/models/baidu/ERNIE-Image)|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/ernie_image/model_inference/Ernie-Image-T2I.py)|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/ernie_image/model_inference_low_vram/Ernie-Image-T2I.py)|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/ernie_image/model_training/full/Ernie-Image-T2I.sh)|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/ernie_image/model_training/validate_full/Ernie-Image-T2I.py)|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/ernie_image/model_training/lora/Ernie-Image-T2I.sh)|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/ernie_image/model_training/validate_lora/Ernie-Image-T2I.py)|
|
||||
|
||||
## Model Inference
|
||||
|
||||
The model is loaded via `ErnieImagePipeline.from_pretrained`, see [Loading Models](../Pipeline_Usage/Model_Inference.md#loading-models) for details.
|
||||
|
||||
The input parameters for `ErnieImagePipeline` inference include:
|
||||
|
||||
* `prompt`: The prompt describing the content to appear in the image.
|
||||
* `negative_prompt`: The negative prompt describing what should not appear in the image, default value is `""`.
|
||||
* `cfg_scale`: Classifier-free guidance parameter, default value is 4.0.
|
||||
* `height`: Image height, must be a multiple of 16, default value is 1024.
|
||||
* `width`: Image width, must be a multiple of 16, default value is 1024.
|
||||
* `seed`: Random seed. Default is `None`, meaning completely random.
|
||||
* `rand_device`: The computing device for generating random Gaussian noise matrices, default is `"cuda"`. When set to `cuda`, different GPUs will produce different results.
|
||||
* `num_inference_steps`: Number of inference steps, default value is 50.
|
||||
|
||||
If VRAM is insufficient, please enable [VRAM Management](../Pipeline_Usage/VRAM_management.md). We provide recommended low-VRAM configurations for each model in the "Model Overview" table above.
|
||||
|
||||
## Model Training
|
||||
|
||||
ERNIE-Image series models are trained uniformly via [`examples/ernie_image/model_training/train.py`](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/ernie_image/model_training/train.py). The script parameters include:
|
||||
|
||||
* General Training Parameters
|
||||
* Dataset Configuration
|
||||
* `--dataset_base_path`: Root directory of the dataset.
|
||||
* `--dataset_metadata_path`: Path to the dataset metadata file.
|
||||
* `--dataset_repeat`: Number of dataset repeats per epoch.
|
||||
* `--dataset_num_workers`: Number of processes per DataLoader.
|
||||
* `--data_file_keys`: Field names to load from metadata, typically paths to image or video files, separated by `,`.
|
||||
* Model Loading Configuration
|
||||
* `--model_paths`: Paths to load models from, in JSON format.
|
||||
* `--model_id_with_origin_paths`: Model IDs with original paths, e.g., `"baidu/ERNIE-Image:transformer/diffusion_pytorch_model*.safetensors"`, separated by commas.
|
||||
* `--extra_inputs`: Additional input parameters required by the model Pipeline, separated by `,`.
|
||||
* `--fp8_models`: Models to load in FP8 format, currently only supported for models whose parameters are not updated by gradients.
|
||||
* Basic Training Configuration
|
||||
* `--learning_rate`: Learning rate.
|
||||
* `--num_epochs`: Number of epochs.
|
||||
* `--trainable_models`: Trainable models, e.g., `dit`, `vae`, `text_encoder`.
|
||||
* `--find_unused_parameters`: Whether unused parameters exist in DDP training.
|
||||
* `--weight_decay`: Weight decay magnitude.
|
||||
* `--task`: Training task, defaults to `sft`.
|
||||
* Output Configuration
|
||||
* `--output_path`: Path to save the model.
|
||||
* `--remove_prefix_in_ckpt`: Remove prefix in the model's state dict.
|
||||
* `--save_steps`: Interval in training steps to save the model.
|
||||
* LoRA Configuration
|
||||
* `--lora_base_model`: Which model to add LoRA to.
|
||||
* `--lora_target_modules`: Which layers to add LoRA to.
|
||||
* `--lora_rank`: Rank of LoRA.
|
||||
* `--lora_checkpoint`: Path to LoRA checkpoint.
|
||||
* `--preset_lora_path`: Path to preset LoRA checkpoint for LoRA differential training.
|
||||
* `--preset_lora_model`: Which model to integrate preset LoRA into, e.g., `dit`.
|
||||
* Gradient Configuration
|
||||
* `--use_gradient_checkpointing`: Whether to enable gradient checkpointing.
|
||||
* `--use_gradient_checkpointing_offload`: Whether to offload gradient checkpointing to CPU memory.
|
||||
* `--gradient_accumulation_steps`: Number of gradient accumulation steps.
|
||||
* Resolution Configuration
|
||||
* `--height`: Height of the image. Leave empty to enable dynamic resolution.
|
||||
* `--width`: Width of the image. Leave empty to enable dynamic resolution.
|
||||
* `--max_pixels`: Maximum pixel area, images larger than this will be scaled down during dynamic resolution.
|
||||
* ERNIE-Image Specific Parameters
|
||||
* `--tokenizer_path`: Path to the tokenizer, leave empty to auto-download from remote.
|
||||
|
||||
We provide an example image dataset for testing, which can be downloaded with the following command:
|
||||
|
||||
```shell
|
||||
modelscope download --dataset DiffSynth-Studio/diffsynth_example_dataset --local_dir ./data/diffsynth_example_dataset
|
||||
```
|
||||
|
||||
We provide recommended training scripts for each model, please refer to the table in "Model Overview" above. For guidance on writing model training scripts, see [Model Training](../Pipeline_Usage/Model_Training.md); for more advanced training algorithms, see [Training Framework Overview](https://github.com/modelscope/DiffSynth-Studio/tree/main/docs/en/Training/).
|
||||
@@ -29,6 +29,7 @@ Welcome to DiffSynth-Studio's Documentation
|
||||
Model_Details/Z-Image
|
||||
Model_Details/Anima
|
||||
Model_Details/LTX-2
|
||||
Model_Details/ERNIE-Image
|
||||
|
||||
.. toctree::
|
||||
:maxdepth: 2
|
||||
|
||||
133
docs/zh/Model_Details/ERNIE-Image.md
Normal file
133
docs/zh/Model_Details/ERNIE-Image.md
Normal file
@@ -0,0 +1,133 @@
|
||||
# ERNIE-Image
|
||||
|
||||
ERNIE-Image 是百度推出的拥有 8B 参数的图像生成模型,具有紧凑高效的架构和出色的指令跟随能力。基于 8B DiT 主干网络,其在某些场景下的性能可与 20B 以上的更大模型相媲美,同时保持了良好的参数效率。该模型在指令理解与执行、文本生成(如英文/中文/日文)以及整体稳定性方面提供了较为可靠的表现。
|
||||
|
||||
## 安装
|
||||
|
||||
在使用本项目进行模型推理和训练前,请先安装 DiffSynth-Studio。
|
||||
|
||||
```shell
|
||||
git clone https://github.com/modelscope/DiffSynth-Studio.git
|
||||
cd DiffSynth-Studio
|
||||
pip install -e .
|
||||
```
|
||||
|
||||
更多关于安装的信息,请参考[安装依赖](../Pipeline_Usage/Setup.md)。
|
||||
|
||||
## 快速开始
|
||||
|
||||
运行以下代码可以快速加载 [baidu/ERNIE-Image](https://www.modelscope.cn/models/baidu/ERNIE-Image) 模型并进行推理。显存管理已启动,框架会自动根据剩余显存控制模型参数的加载,最低 3G 显存即可运行。
|
||||
|
||||
```python
|
||||
from diffsynth.pipelines.ernie_image import ErnieImagePipeline, ModelConfig
|
||||
import torch
|
||||
|
||||
vram_config = {
|
||||
"offload_dtype": torch.bfloat16,
|
||||
"offload_device": "cpu",
|
||||
"onload_dtype": torch.bfloat16,
|
||||
"onload_device": "cpu",
|
||||
"preparing_dtype": torch.bfloat16,
|
||||
"preparing_device": "cuda",
|
||||
"computation_dtype": torch.bfloat16,
|
||||
"computation_device": "cuda",
|
||||
}
|
||||
pipe = ErnieImagePipeline.from_pretrained(
|
||||
torch_dtype=torch.bfloat16,
|
||||
device='cuda',
|
||||
model_configs=[
|
||||
ModelConfig(model_id="baidu/ERNIE-Image", origin_file_pattern="transformer/diffusion_pytorch_model*.safetensors", **vram_config),
|
||||
ModelConfig(model_id="baidu/ERNIE-Image", origin_file_pattern="text_encoder/model.safetensors", **vram_config),
|
||||
ModelConfig(model_id="baidu/ERNIE-Image", origin_file_pattern="vae/diffusion_pytorch_model.safetensors", **vram_config),
|
||||
],
|
||||
tokenizer_config=ModelConfig(model_id="baidu/ERNIE-Image", origin_file_pattern="tokenizer/"),
|
||||
vram_limit=torch.cuda.mem_get_info("cuda")[1] / (1024 ** 3) - 0.5,
|
||||
)
|
||||
|
||||
image = pipe(
|
||||
prompt="一只黑白相间的中华田园犬",
|
||||
negative_prompt="",
|
||||
height=1024,
|
||||
width=1024,
|
||||
seed=42,
|
||||
num_inference_steps=50,
|
||||
cfg_scale=4.0,
|
||||
)
|
||||
image.save("output.jpg")
|
||||
```
|
||||
|
||||
## 模型总览
|
||||
|
||||
|模型 ID|推理|低显存推理|全量训练|全量训练后验证|LoRA 训练|LoRA 训练后验证|
|
||||
|-|-|-|-|-|-|-|
|
||||
|[baidu/ERNIE-Image: T2I](https://www.modelscope.cn/models/baidu/ERNIE-Image)|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/ernie_image/model_inference/Ernie-Image-T2I.py)|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/ernie_image/model_inference_low_vram/Ernie-Image-T2I.py)|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/ernie_image/model_training/full/Ernie-Image-T2I.sh)|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/ernie_image/model_training/validate_full/Ernie-Image-T2I.py)|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/ernie_image/model_training/lora/Ernie-Image-T2I.sh)|[code](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/ernie_image/model_training/validate_lora/Ernie-Image-T2I.py)|
|
||||
|
||||
## 模型推理
|
||||
|
||||
模型通过 `ErnieImagePipeline.from_pretrained` 加载,详见[加载模型](../Pipeline_Usage/Model_Inference.md#加载模型)。
|
||||
|
||||
`ErnieImagePipeline` 推理的输入参数包括:
|
||||
|
||||
* `prompt`: 提示词,描述画面中出现的内容。
|
||||
* `negative_prompt`: 负向提示词,描述画面中不应该出现的内容,默认值为 `""`。
|
||||
* `cfg_scale`: Classifier-free guidance 的参数,默认值为 4.0。
|
||||
* `height`: 图像高度,需保证高度为 16 的倍数,默认值为 1024。
|
||||
* `width`: 图像宽度,需保证宽度为 16 的倍数,默认值为 1024。
|
||||
* `seed`: 随机种子。默认为 `None`,即完全随机。
|
||||
* `rand_device`: 生成随机高斯噪声矩阵的计算设备,默认为 `"cuda"`。当设置为 `cuda` 时,在不同 GPU 上会导致不同的生成结果。
|
||||
* `num_inference_steps`: 推理步数,默认值为 50。
|
||||
|
||||
如果显存不足,请开启[显存管理](../Pipeline_Usage/VRAM_management.md),我们在示例代码中提供了每个模型推荐的低显存配置,详见前文"模型总览"中的表格。
|
||||
|
||||
## 模型训练
|
||||
|
||||
ERNIE-Image 系列模型统一通过 [`examples/ernie_image/model_training/train.py`](https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/ernie_image/model_training/train.py) 进行训练,脚本的参数包括:
|
||||
|
||||
* 通用训练参数
|
||||
* 数据集基础配置
|
||||
* `--dataset_base_path`: 数据集的根目录。
|
||||
* `--dataset_metadata_path`: 数据集的元数据文件路径。
|
||||
* `--dataset_repeat`: 每个 epoch 中数据集重复的次数。
|
||||
* `--dataset_num_workers`: 每个 Dataloader 的进程数量。
|
||||
* `--data_file_keys`: 元数据中需要加载的字段名称,通常是图像或视频文件的路径,以 `,` 分隔。
|
||||
* 模型加载配置
|
||||
* `--model_paths`: 要加载的模型路径。JSON 格式。
|
||||
* `--model_id_with_origin_paths`: 带原始路径的模型 ID,例如 `"baidu/ERNIE-Image:transformer/diffusion_pytorch_model*.safetensors"`。用逗号分隔。
|
||||
* `--extra_inputs`: 模型 Pipeline 所需的额外输入参数,以 `,` 分隔。
|
||||
* `--fp8_models`:以 FP8 格式加载的模型,目前仅支持参数不被梯度更新的模型。
|
||||
* 训练基础配置
|
||||
* `--learning_rate`: 学习率。
|
||||
* `--num_epochs`: 轮数(Epoch)。
|
||||
* `--trainable_models`: 可训练的模型,例如 `dit`、`vae`、`text_encoder`。
|
||||
* `--find_unused_parameters`: DDP 训练中是否存在未使用的参数。
|
||||
* `--weight_decay`:权重衰减大小。
|
||||
* `--task`: 训练任务,默认为 `sft`。
|
||||
* 输出配置
|
||||
* `--output_path`: 模型保存路径。
|
||||
* `--remove_prefix_in_ckpt`: 在模型文件的 state dict 中移除前缀。
|
||||
* `--save_steps`: 保存模型的训练步数间隔。
|
||||
* LoRA 配置
|
||||
* `--lora_base_model`: LoRA 添加到哪个模型上。
|
||||
* `--lora_target_modules`: LoRA 添加到哪些层上。
|
||||
* `--lora_rank`: LoRA 的秩(Rank)。
|
||||
* `--lora_checkpoint`: LoRA 检查点的路径。
|
||||
* `--preset_lora_path`: 预置 LoRA 检查点路径,用于 LoRA 差分训练。
|
||||
* `--preset_lora_model`: 预置 LoRA 融入的模型,例如 `dit`。
|
||||
* 梯度配置
|
||||
* `--use_gradient_checkpointing`: 是否启用 gradient checkpointing。
|
||||
* `--use_gradient_checkpointing_offload`: 是否将 gradient checkpointing 卸载到内存中。
|
||||
* `--gradient_accumulation_steps`: 梯度累积步数。
|
||||
* 分辨率配置
|
||||
* `--height`: 图像的高度。留空启用动态分辨率。
|
||||
* `--width`: 图像的宽度。留空启用动态分辨率。
|
||||
* `--max_pixels`: 最大像素面积,动态分辨率时大于此值的图片会被缩小。
|
||||
* ERNIE-Image 专有参数
|
||||
* `--tokenizer_path`: tokenizer 的路径,留空则自动从远程下载。
|
||||
|
||||
我们构建了一个样例图像数据集,以方便您进行测试,通过以下命令可以下载这个数据集:
|
||||
|
||||
```shell
|
||||
modelscope download --dataset DiffSynth-Studio/diffsynth_example_dataset --local_dir ./data/diffsynth_example_dataset
|
||||
```
|
||||
|
||||
我们为每个模型编写了推荐的训练脚本,请参考前文"模型总览"中的表格。关于如何编写模型训练脚本,请参考[模型训练](../Pipeline_Usage/Model_Training.md);更多高阶训练算法,请参考[训练框架详解](https://github.com/modelscope/DiffSynth-Studio/tree/main/docs/zh/Training/)。
|
||||
@@ -29,6 +29,7 @@
|
||||
Model_Details/Z-Image
|
||||
Model_Details/Anima
|
||||
Model_Details/LTX-2
|
||||
Model_Details/ERNIE-Image
|
||||
|
||||
.. toctree::
|
||||
:maxdepth: 2
|
||||
|
||||
24
examples/ernie_image/model_inference/Ernie-Image-T2I.py
Normal file
24
examples/ernie_image/model_inference/Ernie-Image-T2I.py
Normal file
@@ -0,0 +1,24 @@
|
||||
from diffsynth.pipelines.ernie_image import ErnieImagePipeline, ModelConfig
|
||||
import torch
|
||||
|
||||
pipe = ErnieImagePipeline.from_pretrained(
|
||||
torch_dtype=torch.bfloat16,
|
||||
device='cuda',
|
||||
model_configs=[
|
||||
ModelConfig(model_id="baidu/ERNIE-Image", origin_file_pattern="transformer/diffusion_pytorch_model*.safetensors"),
|
||||
ModelConfig(model_id="baidu/ERNIE-Image", origin_file_pattern="text_encoder/model.safetensors"),
|
||||
ModelConfig(model_id="baidu/ERNIE-Image", origin_file_pattern="vae/diffusion_pytorch_model.safetensors"),
|
||||
],
|
||||
tokenizer_config=ModelConfig(model_id="baidu/ERNIE-Image", origin_file_pattern="tokenizer/"),
|
||||
)
|
||||
|
||||
image = pipe(
|
||||
prompt="一只黑白相间的中华田园犬",
|
||||
negative_prompt="",
|
||||
height=1024,
|
||||
width=1024,
|
||||
seed=42,
|
||||
num_inference_steps=50,
|
||||
cfg_scale=4.0,
|
||||
)
|
||||
image.save("output.jpg")
|
||||
@@ -0,0 +1,36 @@
|
||||
from diffsynth.pipelines.ernie_image import ErnieImagePipeline, ModelConfig
|
||||
import torch
|
||||
|
||||
vram_config = {
|
||||
"offload_dtype": torch.bfloat16,
|
||||
"offload_device": "cpu",
|
||||
"onload_dtype": torch.bfloat16,
|
||||
"onload_device": "cpu",
|
||||
"preparing_dtype": torch.bfloat16,
|
||||
"preparing_device": "cuda",
|
||||
"computation_dtype": torch.bfloat16,
|
||||
"computation_device": "cuda",
|
||||
}
|
||||
|
||||
pipe = ErnieImagePipeline.from_pretrained(
|
||||
torch_dtype=torch.bfloat16,
|
||||
device='cuda',
|
||||
model_configs=[
|
||||
ModelConfig(model_id="baidu/ERNIE-Image", origin_file_pattern="transformer/diffusion_pytorch_model*.safetensors", **vram_config),
|
||||
ModelConfig(model_id="baidu/ERNIE-Image", origin_file_pattern="text_encoder/model.safetensors", **vram_config),
|
||||
ModelConfig(model_id="baidu/ERNIE-Image", origin_file_pattern="vae/diffusion_pytorch_model.safetensors", **vram_config),
|
||||
],
|
||||
tokenizer_config=ModelConfig(model_id="baidu/ERNIE-Image", origin_file_pattern="tokenizer/"),
|
||||
vram_limit=torch.cuda.mem_get_info("cuda")[1] / (1024 ** 3) - 0.5,
|
||||
)
|
||||
|
||||
image = pipe(
|
||||
prompt="一只黑白相间的中华田园犬",
|
||||
negative_prompt="",
|
||||
height=1024,
|
||||
width=1024,
|
||||
seed=42,
|
||||
num_inference_steps=50,
|
||||
cfg_scale=4.0,
|
||||
)
|
||||
image.save("output.jpg")
|
||||
17
examples/ernie_image/model_training/full/Ernie-Image-T2I.sh
Normal file
17
examples/ernie_image/model_training/full/Ernie-Image-T2I.sh
Normal file
@@ -0,0 +1,17 @@
|
||||
# Dataset: data/diffsynth_example_dataset/ernie_image/Ernie-Image-T2I/
|
||||
|
||||
accelerate launch --config_file examples/ernie_image/model_training/full/accelerate_config_zero3.yaml \
|
||||
examples/ernie_image/model_training/train.py \
|
||||
--dataset_base_path data/diffsynth_example_dataset/ernie_image/Ernie-Image-T2I \
|
||||
--dataset_metadata_path data/diffsynth_example_dataset/ernie_image/Ernie-Image-T2I/metadata.csv \
|
||||
--max_pixels 1048576 \
|
||||
--dataset_repeat 50 \
|
||||
--model_id_with_origin_paths "baidu/ERNIE-Image:transformer/diffusion_pytorch_model*.safetensors,baidu/ERNIE-Image:text_encoder/model.safetensors,baidu/ERNIE-Image:vae/diffusion_pytorch_model.safetensors" \
|
||||
--learning_rate 1e-5 \
|
||||
--num_epochs 2 \
|
||||
--remove_prefix_in_ckpt "pipe.dit." \
|
||||
--output_path "./models/train/Ernie-Image-T2I_full" \
|
||||
--trainable_models "dit" \
|
||||
--use_gradient_checkpointing \
|
||||
--dataset_num_workers 8 \
|
||||
--find_unused_parameters
|
||||
@@ -0,0 +1,23 @@
|
||||
compute_environment: LOCAL_MACHINE
|
||||
debug: false
|
||||
deepspeed_config:
|
||||
gradient_accumulation_steps: 1
|
||||
offload_optimizer_device: none
|
||||
offload_param_device: none
|
||||
zero3_init_flag: true
|
||||
zero3_save_16bit_model: true
|
||||
zero_stage: 3
|
||||
distributed_type: DEEPSPEED
|
||||
downcast_bf16: 'no'
|
||||
enable_cpu_affinity: false
|
||||
machine_rank: 0
|
||||
main_training_function: main
|
||||
mixed_precision: bf16
|
||||
num_machines: 1
|
||||
num_processes: 8
|
||||
rdzv_backend: static
|
||||
same_network: true
|
||||
tpu_env: []
|
||||
tpu_use_cluster: false
|
||||
tpu_use_sudo: false
|
||||
use_cpu: false
|
||||
19
examples/ernie_image/model_training/lora/Ernie-Image-T2I.sh
Normal file
19
examples/ernie_image/model_training/lora/Ernie-Image-T2I.sh
Normal file
@@ -0,0 +1,19 @@
|
||||
# Dataset: data/diffsynth_example_dataset/ernie_image/Ernie-Image-T2I/
|
||||
# Download: modelscope download --dataset DiffSynth-Studio/diffsynth_example_dataset --include "ernie_image/Ernie-Image-T2I/*" --local_dir ./data/diffsynth_example_dataset
|
||||
|
||||
accelerate launch examples/ernie_image/model_training/train.py \
|
||||
--dataset_base_path data/diffsynth_example_dataset/ernie_image/Ernie-Image-T2I \
|
||||
--dataset_metadata_path data/diffsynth_example_dataset/ernie_image/Ernie-Image-T2I/metadata.csv \
|
||||
--max_pixels 1048576 \
|
||||
--dataset_repeat 50 \
|
||||
--model_id_with_origin_paths "baidu/ERNIE-Image:transformer/diffusion_pytorch_model*.safetensors,baidu/ERNIE-Image:text_encoder/model.safetensors,baidu/ERNIE-Image:vae/diffusion_pytorch_model.safetensors" \
|
||||
--learning_rate 1e-4 \
|
||||
--num_epochs 5 \
|
||||
--remove_prefix_in_ckpt "pipe.dit." \
|
||||
--output_path "./models/train/Ernie-Image-T2I_lora" \
|
||||
--lora_base_model "dit" \
|
||||
--lora_target_modules "to_q,to_k,to_v,to_out.0" \
|
||||
--lora_rank 32 \
|
||||
--use_gradient_checkpointing \
|
||||
--dataset_num_workers 8 \
|
||||
--find_unused_parameters
|
||||
129
examples/ernie_image/model_training/train.py
Normal file
129
examples/ernie_image/model_training/train.py
Normal file
@@ -0,0 +1,129 @@
|
||||
import torch, os, argparse, accelerate
|
||||
from diffsynth.core import UnifiedDataset
|
||||
from diffsynth.pipelines.ernie_image import ErnieImagePipeline, ModelConfig
|
||||
from diffsynth.diffusion import *
|
||||
from diffsynth.core.data.operators import *
|
||||
os.environ["TOKENIZERS_PARALLELISM"] = "false"
|
||||
|
||||
|
||||
class ErnieImageTrainingModule(DiffusionTrainingModule):
|
||||
def __init__(
|
||||
self,
|
||||
model_paths=None, model_id_with_origin_paths=None,
|
||||
tokenizer_path=None,
|
||||
trainable_models=None,
|
||||
lora_base_model=None, lora_target_modules="", lora_rank=32, lora_checkpoint=None,
|
||||
preset_lora_path=None, preset_lora_model=None,
|
||||
use_gradient_checkpointing=True,
|
||||
use_gradient_checkpointing_offload=False,
|
||||
extra_inputs=None,
|
||||
fp8_models=None,
|
||||
offload_models=None,
|
||||
device="cpu",
|
||||
task="sft",
|
||||
):
|
||||
super().__init__()
|
||||
# Load models
|
||||
model_configs = self.parse_model_configs(model_paths, model_id_with_origin_paths, fp8_models=fp8_models, offload_models=offload_models, device=device)
|
||||
tokenizer_config = ModelConfig(model_id="baidu/ERNIE-Image", origin_file_pattern="tokenizer/") if tokenizer_path is None else ModelConfig(tokenizer_path)
|
||||
self.pipe = ErnieImagePipeline.from_pretrained(torch_dtype=torch.bfloat16, device=device, model_configs=model_configs, tokenizer_config=tokenizer_config)
|
||||
self.pipe = self.split_pipeline_units(task, self.pipe, trainable_models, lora_base_model)
|
||||
|
||||
# Training mode
|
||||
self.switch_pipe_to_training_mode(
|
||||
self.pipe, trainable_models,
|
||||
lora_base_model, lora_target_modules, lora_rank, lora_checkpoint,
|
||||
preset_lora_path, preset_lora_model,
|
||||
task=task,
|
||||
)
|
||||
|
||||
# Other configs
|
||||
self.use_gradient_checkpointing = use_gradient_checkpointing
|
||||
self.use_gradient_checkpointing_offload = use_gradient_checkpointing_offload
|
||||
self.extra_inputs = extra_inputs.split(",") if extra_inputs is not None else []
|
||||
self.task = task
|
||||
self.task_to_loss = {
|
||||
"sft:data_process": lambda pipe, inputs_shared, inputs_posi, inputs_nega: (inputs_shared, inputs_posi, inputs_nega),
|
||||
"sft": lambda pipe, inputs_shared, inputs_posi, inputs_nega: FlowMatchSFTLoss(pipe, **inputs_shared, **inputs_posi),
|
||||
"sft:train": lambda pipe, inputs_shared, inputs_posi, inputs_nega: FlowMatchSFTLoss(pipe, **inputs_shared, **inputs_posi),
|
||||
}
|
||||
|
||||
def get_pipeline_inputs(self, data):
|
||||
inputs_posi = {"prompt": data["prompt"]}
|
||||
inputs_nega = {"negative_prompt": ""}
|
||||
inputs_shared = {
|
||||
"input_image": data["image"],
|
||||
"height": data["image"].size[1],
|
||||
"width": data["image"].size[0],
|
||||
"cfg_scale": 1,
|
||||
"rand_device": self.pipe.device,
|
||||
"use_gradient_checkpointing": self.use_gradient_checkpointing,
|
||||
"use_gradient_checkpointing_offload": self.use_gradient_checkpointing_offload,
|
||||
}
|
||||
inputs_shared = self.parse_extra_inputs(data, self.extra_inputs, inputs_shared)
|
||||
return inputs_shared, inputs_posi, inputs_nega
|
||||
|
||||
def forward(self, data, inputs=None):
|
||||
if inputs is None:
|
||||
inputs = self.get_pipeline_inputs(data)
|
||||
inputs = self.transfer_data_to_device(inputs, self.pipe.device, self.pipe.torch_dtype)
|
||||
for unit in self.pipe.units:
|
||||
inputs = self.pipe.unit_runner(unit, self.pipe, *inputs)
|
||||
loss = self.task_to_loss[self.task](self.pipe, *inputs)
|
||||
return loss
|
||||
|
||||
|
||||
def ernie_image_parser():
|
||||
parser = argparse.ArgumentParser(description="ERNIE-Image training.")
|
||||
parser = add_general_config(parser)
|
||||
parser = add_image_size_config(parser)
|
||||
parser.add_argument("--tokenizer_path", type=str, default=None, help="Path to tokenizer.")
|
||||
return parser
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
parser = ernie_image_parser()
|
||||
args = parser.parse_args()
|
||||
accelerator = accelerate.Accelerator(
|
||||
gradient_accumulation_steps=args.gradient_accumulation_steps,
|
||||
kwargs_handlers=[accelerate.DistributedDataParallelKwargs(find_unused_parameters=args.find_unused_parameters)],
|
||||
)
|
||||
dataset = UnifiedDataset(
|
||||
base_path=args.dataset_base_path,
|
||||
metadata_path=args.dataset_metadata_path,
|
||||
repeat=args.dataset_repeat,
|
||||
data_file_keys=args.data_file_keys.split(","),
|
||||
main_data_operator=lambda x: x,
|
||||
special_operator_map={
|
||||
"image": ToAbsolutePath(args.dataset_base_path) >> LoadImage() >> ImageCropAndResize(args.height, args.width, args.max_pixels, 16, 16),
|
||||
},
|
||||
)
|
||||
model = ErnieImageTrainingModule(
|
||||
model_paths=args.model_paths,
|
||||
model_id_with_origin_paths=args.model_id_with_origin_paths,
|
||||
tokenizer_path=args.tokenizer_path,
|
||||
trainable_models=args.trainable_models,
|
||||
lora_base_model=args.lora_base_model,
|
||||
lora_target_modules=args.lora_target_modules,
|
||||
lora_rank=args.lora_rank,
|
||||
lora_checkpoint=args.lora_checkpoint,
|
||||
preset_lora_path=args.preset_lora_path,
|
||||
preset_lora_model=args.preset_lora_model,
|
||||
use_gradient_checkpointing=args.use_gradient_checkpointing,
|
||||
use_gradient_checkpointing_offload=args.use_gradient_checkpointing_offload,
|
||||
extra_inputs=args.extra_inputs,
|
||||
fp8_models=args.fp8_models,
|
||||
offload_models=args.offload_models,
|
||||
task=args.task,
|
||||
device=accelerator.device,
|
||||
)
|
||||
model_logger = ModelLogger(
|
||||
args.output_path,
|
||||
remove_prefix_in_ckpt=args.remove_prefix_in_ckpt,
|
||||
)
|
||||
launcher_map = {
|
||||
"sft:data_process": launch_data_process_task,
|
||||
"sft": launch_training_task,
|
||||
"sft:train": launch_training_task,
|
||||
}
|
||||
launcher_map[args.task](accelerator, dataset, model, model_logger, args=args)
|
||||
@@ -0,0 +1,25 @@
|
||||
import torch
|
||||
from diffsynth.pipelines.ernie_image import ErnieImagePipeline, ModelConfig
|
||||
from diffsynth.core import load_state_dict
|
||||
|
||||
pipe = ErnieImagePipeline.from_pretrained(
|
||||
torch_dtype=torch.bfloat16,
|
||||
device="cuda",
|
||||
model_configs=[
|
||||
ModelConfig(model_id="baidu/ERNIE-Image", origin_file_pattern="transformer/diffusion_pytorch_model*.safetensors"),
|
||||
ModelConfig(model_id="baidu/ERNIE-Image", origin_file_pattern="text_encoder/model.safetensors"),
|
||||
ModelConfig(model_id="baidu/ERNIE-Image", origin_file_pattern="vae/diffusion_pytorch_model.safetensors"),
|
||||
],
|
||||
)
|
||||
|
||||
state_dict = load_state_dict("./models/train/Ernie-Image-T2I_full/epoch-1.safetensors")
|
||||
pipe.dit.load_state_dict(state_dict)
|
||||
|
||||
image = pipe(
|
||||
prompt="a professional photo of a cute dog",
|
||||
seed=0,
|
||||
num_inference_steps=50,
|
||||
cfg_scale=4.0,
|
||||
)
|
||||
image.save("image_full.jpg")
|
||||
print("Full validation image saved to image_full.jpg")
|
||||
@@ -0,0 +1,25 @@
|
||||
import torch
|
||||
from diffsynth.pipelines.ernie_image import ErnieImagePipeline, ModelConfig
|
||||
from diffsynth.core.loader.file import load_state_dict
|
||||
|
||||
pipe = ErnieImagePipeline.from_pretrained(
|
||||
torch_dtype=torch.bfloat16,
|
||||
device="cuda",
|
||||
model_configs=[
|
||||
ModelConfig(model_id="baidu/ERNIE-Image", origin_file_pattern="transformer/diffusion_pytorch_model*.safetensors"),
|
||||
ModelConfig(model_id="baidu/ERNIE-Image", origin_file_pattern="text_encoder/model.safetensors"),
|
||||
ModelConfig(model_id="baidu/ERNIE-Image", origin_file_pattern="vae/diffusion_pytorch_model.safetensors"),
|
||||
],
|
||||
)
|
||||
|
||||
lora_state_dict = load_state_dict("./models/train/Ernie-Image-T2I_lora/epoch-4.safetensors", torch_dtype=torch.bfloat16, device="cuda")
|
||||
pipe.load_lora(pipe.dit, state_dict=lora_state_dict, alpha=1.0)
|
||||
|
||||
image = pipe(
|
||||
prompt="a professional photo of a cute dog",
|
||||
seed=0,
|
||||
num_inference_steps=50,
|
||||
cfg_scale=4.0,
|
||||
)
|
||||
image.save("image_lora.jpg")
|
||||
print("LoRA validation image saved to image_lora.jpg")
|
||||
Reference in New Issue
Block a user