DiffSynth-Studio/docs/en/Model_Details/Qwen-Image.md
Zhongjie Duan ba0626e38f add example_dataset in training scripts (#1358)
* add example_dataset in training scripts

* fix example datasets
2026-03-18 15:37:03 +08:00

Qwen-Image

Qwen-Image is an image generation model trained and open-sourced by the Tongyi Lab Qwen Team of Alibaba.

Installation

Before using this project for model inference and training, please install DiffSynth-Studio first.

git clone https://github.com/modelscope/DiffSynth-Studio.git
cd DiffSynth-Studio
pip install -e .

For more information about installation, please refer to Install Dependencies.

Quick Start

Run the following code to load the Qwen/Qwen-Image model and run inference. VRAM management is enabled, so the framework automatically decides where model parameters are kept based on the remaining VRAM. At least 8 GB of VRAM is required.

from diffsynth.pipelines.qwen_image import QwenImagePipeline, ModelConfig
import torch

vram_config = {
    "offload_dtype": "disk",
    "offload_device": "disk",
    "onload_dtype": torch.float8_e4m3fn,
    "onload_device": "cpu",
    "preparing_dtype": torch.float8_e4m3fn,
    "preparing_device": "cuda",
    "computation_dtype": torch.bfloat16,
    "computation_device": "cuda",
}
pipe = QwenImagePipeline.from_pretrained(
    torch_dtype=torch.bfloat16,
    device="cuda",
    model_configs=[
        ModelConfig(model_id="Qwen/Qwen-Image", origin_file_pattern="transformer/diffusion_pytorch_model*.safetensors", **vram_config),
        ModelConfig(model_id="Qwen/Qwen-Image", origin_file_pattern="text_encoder/model*.safetensors", **vram_config),
        ModelConfig(model_id="Qwen/Qwen-Image", origin_file_pattern="vae/diffusion_pytorch_model.safetensors", **vram_config),
    ],
    tokenizer_config=ModelConfig(model_id="Qwen/Qwen-Image", origin_file_pattern="tokenizer/"),
    vram_limit=torch.cuda.mem_get_info("cuda")[1] / (1024 ** 3) - 0.5,
)
prompt = "精致肖像,水下少女,蓝裙飘逸,发丝轻扬,光影透澈,气泡环绕,面容恬静,细节精致,梦幻唯美。"
image = pipe(prompt, seed=0, num_inference_steps=40)
image.save("image.jpg")
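The vram_limit argument above is computed from torch.cuda.mem_get_info, which returns a (free, total) pair in bytes, so index [1] is the total device memory rather than the currently free memory. The arithmetic can be sketched in isolation as follows; the helper name vram_limit_gib is hypothetical and is not part of DiffSynth-Studio.

```python
def vram_limit_gib(total_bytes: int, headroom_gib: float = 0.5) -> float:
    """Convert a byte count to GiB and reserve some headroom (hypothetical helper)."""
    return total_bytes / (1024 ** 3) - headroom_gib


# Example: a card with 24 GiB of total memory gets a 23.5 GiB budget.
limit = vram_limit_gib(24 * 1024 ** 3)
print(limit)
```

Lowering the headroom gives the pipeline more room on the GPU but leaves less memory for other processes sharing the device.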

Model Overview

Model Lineage
graph LR;
    Qwen/Qwen-Image-->Qwen/Qwen-Image-Edit;
    Qwen/Qwen-Image-Edit-->Qwen/Qwen-Image-Edit-2509;
    Qwen/Qwen-Image-->EliGen-Series;
    EliGen-Series-->DiffSynth-Studio/Qwen-Image-EliGen;
    DiffSynth-Studio/Qwen-Image-EliGen-->DiffSynth-Studio/Qwen-Image-EliGen-V2;
    EliGen-Series-->DiffSynth-Studio/Qwen-Image-EliGen-Poster;
    Qwen/Qwen-Image-->Distill-Series;
    Distill-Series-->DiffSynth-Studio/Qwen-Image-Distill-Full;
    Distill-Series-->DiffSynth-Studio/Qwen-Image-Distill-LoRA;
    Qwen/Qwen-Image-->ControlNet-Series;
    ControlNet-Series-->Blockwise-ControlNet-Series;
    Blockwise-ControlNet-Series-->DiffSynth-Studio/Qwen-Image-Blockwise-ControlNet-Canny;
    Blockwise-ControlNet-Series-->DiffSynth-Studio/Qwen-Image-Blockwise-ControlNet-Depth;
    Blockwise-ControlNet-Series-->DiffSynth-Studio/Qwen-Image-Blockwise-ControlNet-Inpaint;
    ControlNet-Series-->DiffSynth-Studio/Qwen-Image-In-Context-Control-Union;
    Qwen/Qwen-Image-->DiffSynth-Studio/Qwen-Image-Edit-Lowres-Fix;

| Model ID | Inference | Low VRAM Inference | Full Training | Validation After Full Training | LoRA Training | Validation After LoRA Training |
|-|-|-|-|-|-|-|
| Qwen/Qwen-Image | code | code | code | code | code | code |
| Qwen/Qwen-Image-2512 | code | code | code | code | code | code |
| Qwen/Qwen-Image-Edit | code | code | code | code | code | code |
| Qwen/Qwen-Image-Edit-2509 | code | code | code | code | code | code |
| Qwen/Qwen-Image-Edit-2511 | code | code | code | code | code | code |
| FireRedTeam/FireRed-Image-Edit-1.0 | code | code | code | code | code | code |
| FireRedTeam/FireRed-Image-Edit-1.1 | code | code | code | code | code | code |
| lightx2v/Qwen-Image-Edit-2511-Lightning | code | code | - | - | - | - |
| Qwen/Qwen-Image-Layered | code | code | code | code | code | code |
| DiffSynth-Studio/Qwen-Image-Layered-Control | code | code | code | code | code | code |
| DiffSynth-Studio/Qwen-Image-Layered-Control-V2 | code | code | - | - | code | code |
| DiffSynth-Studio/Qwen-Image-EliGen | code | code | - | - | code | code |
| DiffSynth-Studio/Qwen-Image-EliGen-V2 | code | code | - | - | code | code |
| DiffSynth-Studio/Qwen-Image-EliGen-Poster | code | code | - | - | code | code |
| DiffSynth-Studio/Qwen-Image-Distill-Full | code | code | code | code | code | code |
| DiffSynth-Studio/Qwen-Image-Distill-LoRA | code | code | - | - | code | code |
| DiffSynth-Studio/Qwen-Image-Blockwise-ControlNet-Canny | code | code | code | code | code | code |
| DiffSynth-Studio/Qwen-Image-Blockwise-ControlNet-Depth | code | code | code | code | code | code |
| DiffSynth-Studio/Qwen-Image-Blockwise-ControlNet-Inpaint | code | code | code | code | code | code |
| DiffSynth-Studio/Qwen-Image-In-Context-Control-Union | code | code | - | - | code | code |
| DiffSynth-Studio/Qwen-Image-Edit-Lowres-Fix | code | code | - | - | - | - |
| DiffSynth-Studio/Qwen-Image-i2L | code | code | - | - | - | - |

Special Training Scripts:

  • Differential LoRA Training: doc, code
  • FP8 Precision Training: doc, code
  • Two-stage Split Training: doc, code
  • End-to-end Direct Distillation: doc, code

DeepSpeed ZeRO Stage 3 Training: The Qwen-Image series models support DeepSpeed ZeRO Stage 3 training, which partitions the model across multiple GPUs. Taking full parameter training of the Qwen-Image model as an example, the following modifications are required:

  • --config_file examples/qwen_image/model_training/full/accelerate_config_zero3.yaml
  • --initialize_model_on_cpu
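As a rough sketch of how the two modifications above might fit into a launch command, the snippet below assembles an argument list in Python. It assumes that --config_file is passed to accelerate launch and that --initialize_model_on_cpu is passed to the training script; this placement is an assumption, and the extra training arguments in the example are illustrative values only. The recommended scripts linked in the table remain the reference.

```python
import shlex

ZERO3_CONFIG = "examples/qwen_image/model_training/full/accelerate_config_zero3.yaml"
TRAIN_SCRIPT = "examples/qwen_image/model_training/train.py"


def zero3_command(train_args):
    """Build an `accelerate launch` argument list with the two ZeRO Stage 3 changes."""
    return [
        "accelerate", "launch",
        "--config_file", ZERO3_CONFIG,  # assumed to be a launcher option
        TRAIN_SCRIPT,
        *train_args,
        "--initialize_model_on_cpu",    # assumed to be a training-script option
    ]


# Illustrative arguments only; a real run needs dataset and model options too.
cmd = zero3_command(["--trainable_models", "dit", "--learning_rate", "1e-5"])
print(shlex.join(cmd))
```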

Model Inference

Models are loaded via QwenImagePipeline.from_pretrained, see Loading Models.

Input parameters for QwenImagePipeline inference include:

  • prompt: Prompt describing the content appearing in the image.
  • negative_prompt: Negative prompt describing content that should not appear in the image, default value is "".
  • cfg_scale: Classifier-free guidance scale, default value is 4. Setting it to 1 disables classifier-free guidance.
  • input_image: Input image for image-to-image generation, used in conjunction with denoising_strength.
  • denoising_strength: Denoising strength, range is 0~1, default value is 1. As the value approaches 0, the generated image stays close to the input image; as it approaches 1, the generated image differs more from the input image. If input_image is not provided, leave this at 1.
  • inpaint_mask: Mask image for image inpainting.
  • inpaint_blur_size: Edge blur width for image inpainting.
  • inpaint_blur_sigma: Edge blur strength for image inpainting.
  • height: Image height, must be a multiple of 16.
  • width: Image width, must be a multiple of 16.
  • seed: Random seed. Default is None, meaning completely random.
  • rand_device: Computing device for generating random Gaussian noise matrix, default is "cpu". When set to cuda, different GPUs will produce different generation results.
  • num_inference_steps: Number of inference steps, default value is 30.
  • exponential_shift_mu: Fixed parameter used in sampling timesteps. Leave blank to sample based on image width and height.
  • blockwise_controlnet_inputs: Blockwise ControlNet model inputs.
  • eligen_entity_prompts: EliGen partition control prompts.
  • eligen_entity_masks: EliGen partition control region mask images.
  • eligen_enable_on_negative: Whether to enable EliGen partition control on the negative side of CFG.
  • edit_image: Image(s) to be edited, used by the edit models. Multiple images are supported.
  • edit_image_auto_resize: Whether to automatically scale edit images.
  • edit_rope_interpolation: Whether to enable RoPE interpolation on low-resolution edit images.
  • context_image: In-Context Control input image.
  • tiled: Whether to enable VAE tiling inference, default is False. Setting it to True can significantly reduce VRAM usage during the VAE encoding/decoding stages, at the cost of slight numerical errors and slightly longer inference time.
  • tile_size: Tile size during VAE encoding/decoding stages, default is 128, only effective when tiled=True.
  • tile_stride: Tile stride during VAE encoding/decoding stages, default is 64, only effective when tiled=True, must be less than or equal to tile_size.
  • progress_bar_cmd: Progress bar, default is tqdm.tqdm. Can be disabled by setting to lambda x:x.
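Several of the parameters above come with constraints (multiples of 16, the 0~1 range, tile_stride no larger than tile_size). A minimal sketch of checking them before calling the pipeline is shown below. The helpers check_inference_kwargs and round_down_to_16 are hypothetical and not part of DiffSynth-Studio; the pipeline may perform its own validation.

```python
def round_down_to_16(value: int) -> int:
    """Snap a size down to the nearest multiple of 16 (at least 16)."""
    return max(16, value // 16 * 16)


def check_inference_kwargs(**kwargs):
    """Check a few documented constraints and return the kwargs unchanged."""
    for key in ("height", "width"):
        value = kwargs.get(key)
        if value is not None and value % 16 != 0:
            raise ValueError(f"{key}={value} must be a multiple of 16")

    strength = kwargs.get("denoising_strength", 1.0)
    if not 0.0 <= strength <= 1.0:
        raise ValueError("denoising_strength must be within [0, 1]")
    if kwargs.get("input_image") is None and strength != 1.0:
        raise ValueError("denoising_strength must stay at 1 without input_image")

    if kwargs.get("tiled", False):
        # Defaults taken from the parameter list above.
        tile_size = kwargs.get("tile_size", 128)
        tile_stride = kwargs.get("tile_stride", 64)
        if tile_stride > tile_size:
            raise ValueError("tile_stride must be <= tile_size")
    return kwargs


kwargs = check_inference_kwargs(
    prompt="a cat sitting on a windowsill",
    height=round_down_to_16(1000),
    width=1328,
    seed=0,
    num_inference_steps=30,
)
print(kwargs["height"], kwargs["width"])
```

The checked dictionary could then be passed on as pipe(**kwargs), with pipe created as in the Quick Start section.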

If VRAM is insufficient, please enable VRAM Management. We provide recommended low VRAM configurations for each model in the example code, see the table in the "Model Overview" section above.
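To illustrate why tile_stride must not exceed tile_size when tiled=True, here is a generic one-dimensional sketch of overlapping tiles. It is not the VAE tiling code used by DiffSynth-Studio, and the function tile_starts is hypothetical; it only shows that a stride no larger than the tile size leaves no gaps between neighbouring tiles.

```python
def tile_starts(length: int, tile_size: int = 128, tile_stride: int = 64):
    """Return start offsets of tiles that cover [0, length) without gaps."""
    if tile_stride > tile_size:
        raise ValueError("tile_stride must be <= tile_size, otherwise gaps appear")
    if length <= tile_size:
        return [0]
    starts = list(range(0, length - tile_size, tile_stride))
    starts.append(length - tile_size)  # last tile is aligned to the end
    return starts


starts = tile_starts(256)
print(starts)
```

With the default values, neighbouring tiles overlap by half a tile. A smaller stride increases the overlap and the number of tiles, which is consistent with the slightly longer inference time noted for tiled=True.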

Model Training

All Qwen-Image series models are trained with examples/qwen_image/model_training/train.py. The script parameters include:

  • General Training Parameters
    • Dataset Basic Configuration
      • --dataset_base_path: Root directory of the dataset.
      • --dataset_metadata_path: Metadata file path of the dataset.
      • --dataset_repeat: Number of times the dataset is repeated in each epoch.
      • --dataset_num_workers: Number of processes for each DataLoader.
      • --data_file_keys: Field names to be loaded from metadata, usually image or video file paths, separated by ,.
    • Model Loading Configuration
      • --model_paths: Paths of models to be loaded. JSON format.
      • --model_id_with_origin_paths: Model IDs with original paths, e.g., "Qwen/Qwen-Image:transformer/diffusion_pytorch_model*.safetensors". Separated by commas.
      • --extra_inputs: Extra input parameters required by the model pipeline, separated by ,. For example, edit_image is required when training the image editing model Qwen-Image-Edit.
      • --fp8_models: Models loaded in FP8 format, consistent with --model_paths or --model_id_with_origin_paths format. Currently only supports models whose parameters are not updated by gradients (no gradient backpropagation, or gradients only update their LoRA).
    • Training Basic Configuration
      • --learning_rate: Learning rate.
      • --num_epochs: Number of epochs.
      • --trainable_models: Trainable models, e.g., dit, vae, text_encoder.
      • --find_unused_parameters: Whether there are unused parameters in DDP training. Some models contain redundant parameters that do not participate in gradient calculation, and this setting needs to be enabled to avoid errors in multi-GPU training.
      • --weight_decay: Weight decay coefficient, see torch.optim.AdamW.
      • --task: Training task, default is sft. Some models support more training modes, please refer to the documentation of each specific model.
    • Output Configuration
      • --output_path: Model saving path.
      • --remove_prefix_in_ckpt: Remove prefix in the state dict of the model file.
      • --save_steps: Interval of training steps to save the model. If this parameter is left blank, the model is saved once per epoch.
    • LoRA Configuration
      • --lora_base_model: Which model to add LoRA to.
      • --lora_target_modules: Which layers to add LoRA to.
      • --lora_rank: Rank of LoRA.
      • --lora_checkpoint: Path of the LoRA checkpoint. If this path is provided, LoRA will be loaded from this checkpoint.
      • --preset_lora_path: Preset LoRA checkpoint path. If this path is provided, this LoRA is merged into the base model when loaded. This parameter is used for differential LoRA training.
      • --preset_lora_model: Model that the preset LoRA is merged into, e.g., dit.
    • Gradient Configuration
      • --use_gradient_checkpointing: Whether to enable gradient checkpointing.
      • --use_gradient_checkpointing_offload: Whether to offload gradient checkpointing to memory.
      • --gradient_accumulation_steps: Number of gradient accumulation steps.
    • Image Width/Height Configuration (Applicable to Image Generation and Video Generation Models)
      • --height: Height of image or video. Leave height and width blank to enable dynamic resolution.
      • --width: Width of image or video. Leave height and width blank to enable dynamic resolution.
      • --max_pixels: Maximum pixel area of an image or video frame. When dynamic resolution is enabled, images whose pixel area exceeds this value are downscaled, and images at or below it are left unchanged.
  • Qwen-Image Specific Parameters
    • --tokenizer_path: Path of the tokenizer, applicable to text-to-image models, leave blank to automatically download from remote.
    • --processor_path: Path of the processor, applicable to image editing models, leave blank to automatically download from remote.

We have built a sample image dataset for your testing. You can download this dataset with the following command:

modelscope download --dataset DiffSynth-Studio/diffsynth_example_dataset --local_dir ./data/diffsynth_example_dataset

We have written recommended training scripts for each model, please refer to the table in the "Model Overview" section above. For how to write model training scripts, please refer to Model Training; for more advanced training algorithms, please refer to Training Framework Detailed Explanation.
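For orientation only, the sketch below shows how some of the parameters listed above could be combined into a LoRA training command, including the "model_id:file_pattern" format of --model_id_with_origin_paths. Every value is illustrative: the metadata path is a placeholder, the rank, learning rate, and output path are made up, --use_gradient_checkpointing is assumed to be a boolean switch, and the command is not complete (for example, --lora_target_modules is omitted). The recommended scripts in the table are the authoritative reference.

```python
import shlex


def join_model_specs(specs):
    """Format (model_id, file_pattern) pairs as 'id:pattern,id:pattern'."""
    return ",".join(f"{model_id}:{pattern}" for model_id, pattern in specs)


# File patterns reused from the Quick Start section.
model_specs = [
    ("Qwen/Qwen-Image", "transformer/diffusion_pytorch_model*.safetensors"),
    ("Qwen/Qwen-Image", "text_encoder/model*.safetensors"),
    ("Qwen/Qwen-Image", "vae/diffusion_pytorch_model.safetensors"),
]

args = {
    "--dataset_base_path": "./data/diffsynth_example_dataset",
    "--dataset_metadata_path": "path/to/metadata.csv",  # placeholder path
    "--model_id_with_origin_paths": join_model_specs(model_specs),
    "--lora_base_model": "dit",
    "--lora_rank": "32",                                 # illustrative value
    "--learning_rate": "1e-4",                           # illustrative value
    "--num_epochs": "1",
    "--output_path": "./models/train/example_lora",      # illustrative path
}

cmd = ["accelerate", "launch", "examples/qwen_image/model_training/train.py"]
for flag, value in args.items():
    cmd += [flag, value]
cmd.append("--use_gradient_checkpointing")  # assumed to be a boolean switch

print(shlex.join(cmd))
```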