Qwen-Image FP8 (#761)

* support qwen-image-fp8
* refine README
* bugfix
* bugfix
2026-03-23 09:28:12 +00:00 · 2025-08-07 16:56:02 +08:00
parent 4f7c3b6a1e
commit 32cf5d32ce
7 changed files with 225 additions and 39 deletions
--- a/examples/qwen_image/README.md
+++ b/examples/qwen_image/README.md
@@ -164,6 +164,7 @@ After enabling VRAM management, the framework will automatically choose a memory
 * `vram_limit`: VRAM usage limit in GB. By default, it uses all free VRAM on the device. Note that this is not a strict limit. If the set limit is too low but actual free VRAM is enough, the model will run with minimal VRAM use. Set it to 0 for the smallest possible VRAM use.
 * `vram_buffer`: VRAM buffer size in GB. Default is 0.5GB. A buffer is needed because large network layers may use more VRAM than expected during loading. The best value is the VRAM size of the largest model layer.
 * `num_persistent_param_in_dit`: Number of parameters to keep in VRAM in the DiT model. Default is no limit. This option will be removed in the future. Do not rely on it.
+* `enable_dit_fp8_computation`: Whether to enable FP8 computation in the DiT model. This is only applicable to GPUs that support FP8 operations (e.g., H200, etc.). Disabled by default.

 </details>

@@ -172,7 +173,11 @@ After enabling VRAM management, the framework will automatically choose a memory

 <summary>Inference Acceleration</summary>

-Inference acceleration for Qwen-Image is under development. Please stay tuned!
+* FP8 Quantization: Choose the appropriate quantization method based on your hardware and requirements.
+    * GPUs that do not support FP8 computation (e.g., A100, 4090, etc.): FP8 quantization will only reduce VRAM usage without speeding up inference. Code: [./model_inference_low_vram/Qwen-Image.py](./model_inference_low_vram/Qwen-Image.py)
+    * GPUs that support FP8 operations (e.g., H200, etc.): Please install [Flash Attention 3](https://github.com/Dao-AILab/flash-attention). Otherwise, FP8 acceleration will only apply to Linear layers.
+        * Faster inference but higher VRAM usage: Use [./accelerate/Qwen-Image-FP8.py](./accelerate/Qwen-Image-FP8.py)
+        * Slightly slower inference but lower VRAM usage: Use [./accelerate/Qwen-Image-FP8-offload.py](./accelerate/Qwen-Image-FP8-offload.py)

 </details>
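
The hardware-based choice described in the new "Inference Acceleration" text can be sketched as a small helper. This is a minimal illustration and not part of the commit: `detect_capability`, `pick_example`, and the `FP8_MIN_CAPABILITY` threshold are hypothetical names, and the assumption that "supports FP8 computation" means CUDA compute capability 9.0 or higher is inferred only from the README's examples (A100 and 4090 listed as unsupported, H200 as supported).

```python
from typing import Optional, Tuple

# Assumption: the README's "supports FP8 computation" is approximated here as
# compute capability >= 9.0, matching its examples (A100 = 8.0 and 4090 = 8.9
# are listed as unsupported, H200 = 9.0 as supported). The framework may use
# a different check.
FP8_MIN_CAPABILITY = (9, 0)


def detect_capability() -> Optional[Tuple[int, int]]:
    """Return the current GPU's (major, minor) capability, or None if unknown."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_capability()


def pick_example(capability: Optional[Tuple[int, int]],
                 prefer_low_vram: bool = False) -> str:
    """Map a GPU capability to the example script name mentioned in the README."""
    if capability is None or tuple(capability) < FP8_MIN_CAPABILITY:
        # No FP8 computation: FP8 storage only reduces VRAM usage.
        return "Qwen-Image.py"
    if prefer_low_vram:
        # Slightly slower inference, lower VRAM usage.
        return "Qwen-Image-FP8-offload.py"
    # Faster inference, higher VRAM usage.
    return "Qwen-Image-FP8.py"


if __name__ == "__main__":
    print(pick_example(detect_capability()))
```

Passing the capability in as an argument keeps the selection logic usable on machines without a GPU or without PyTorch installed; only `detect_capability` touches `torch`.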