support qwen-image-fp8

2026-03-24 10:18:12 +00:00 · 2025-08-07 16:20:50 +08:00
parent d20680baae
commit a0eec8c673
7 changed files with 224 additions and 40 deletions
--- a/examples/qwen_image/README.md
+++ b/examples/qwen_image/README.md
@@ -172,7 +172,11 @@ After enabling VRAM management, the framework will automatically choose a memory

 <summary>Inference Acceleration</summary>

-Inference acceleration for Qwen-Image is under development. Please stay tuned!
+* FP8 Quantization: Choose the appropriate quantization method based on your hardware and requirements.
+    * GPUs that do not support FP8 computation (e.g., A100, 4090): FP8 quantization only reduces VRAM usage and does not speed up inference. Code: [./model_inference_lor_vram/Qwen-Image.py](./model_inference_lor_vram/Qwen-Image.py)
+    * GPUs that support FP8 computation (e.g., H200): Install [Flash Attention 3](https://github.com/Dao-AILab/flash-attention). Without it, FP8 acceleration applies only to Linear layers.
+        * Faster inference but higher VRAM usage: Use [./accelerate/Qwen-Image-FP8.py](./accelerate/Qwen-Image-FP8.py)
+        * Slightly slower inference but lower VRAM usage: Use [./accelerate/Qwen-Image-FP8-offload.py](./accelerate/Qwen-Image-FP8-offload.py)

 </details>
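
The selection logic in the README text above could be sketched roughly as follows. This is only an illustration, not part of the commit: `choose_script`, `FP8_MIN_CAPABILITY`, and the compute-capability threshold of 9.0 are assumptions, chosen so that the README's own examples fall on the stated sides (A100 at 8.0 and 4090 at 8.9 in the "no FP8 computation" group, H200 at 9.0 in the FP8 group). The script paths are copied verbatim from the README entries.

```python
from typing import Tuple

# Script paths copied verbatim from the README entries in this diff.
LOW_VRAM_SCRIPT = "./model_inference_lor_vram/Qwen-Image.py"
FP8_FAST_SCRIPT = "./accelerate/Qwen-Image-FP8.py"
FP8_OFFLOAD_SCRIPT = "./accelerate/Qwen-Image-FP8-offload.py"

# Assumption (not stated in the README): treat CUDA compute capability 9.0
# and above as "supports FP8 computation". This matches the README examples
# (A100 = 8.0, 4090 = 8.9, H200 = 9.0) but is not an official rule.
FP8_MIN_CAPABILITY = (9, 0)


def choose_script(capability: Tuple[int, int], prefer_low_vram: bool = False) -> str:
    """Return the example script the README suggests for a given GPU.

    capability: (major, minor) CUDA compute capability.
    prefer_low_vram: on FP8-capable GPUs, pick the offload variant
        (slightly slower inference, lower VRAM usage).
    """
    if capability < FP8_MIN_CAPABILITY:
        # FP8 here only reduces VRAM usage; it does not speed up inference.
        return LOW_VRAM_SCRIPT
    return FP8_OFFLOAD_SCRIPT if prefer_low_vram else FP8_FAST_SCRIPT
```

With PyTorch installed, the `(major, minor)` tuple can be obtained from `torch.cuda.get_device_capability()` and passed directly, for example `choose_script(torch.cuda.get_device_capability(), prefer_low_vram=True)`.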