update lora loading in docs

2026-03-23 17:38:10 +00:00 · 2026-02-10 10:48:44 +08:00
parent dc94614c80
commit ff10fde47f
4 changed files with 138 additions and 2 deletions
--- a/docs/en/QA.md
+++ b/docs/en/QA.md
@@ -25,4 +25,11 @@ Even with suitable hardware conditions, we currently have no plans to support na
 * The main challenge of native FP8 precision training is precision overflow caused by gradient explosion. To ensure training stability, the model structure needs to be redesigned accordingly. However, no model developers are willing to do so at present.
 * Additionally, models trained with native FP8 precision can only be computed with BF16 precision during inference without Hopper architecture GPUs, theoretically resulting in generation quality inferior to FP8.

-Therefore, native FP8 precision training technology is extremely immature. We will observe the technological developments in the open-source community.
+Therefore, native FP8 precision training technology is extremely immature. We will observe the technological developments in the open-source community.
+
+## How to dynamically load LoRA models during inference?
+
+LoRA models can be loaded in two ways, depending on whether VRAM management is enabled for the base model. See [LoRA Loading](/docs/en/Pipeline_Usage/Model_Inference.md#loading-lora) for details.
+
+* Cold Loading: When [VRAM Management](/docs/en/Pipeline_Usage/VRAM_management.md) is not enabled for the base model, the LoRA is fused into the base model weights. Inference speed is unchanged, but the LoRA cannot be unloaded once it has been loaded.
+* Hot Loading: When [VRAM Management](/docs/en/Pipeline_Usage/VRAM_management.md) is enabled for the base model, the LoRA is kept separate from the base model weights. Inference is slower, but the LoRA can be unloaded afterwards via `pipe.clear_lora()`.
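
The trade-off between the two loading modes can be illustrated with a toy linear layer. This is a minimal sketch in pure Python, not the project's implementation: `ToyLinear`, `fuse_lora`, and `attach_lora` are hypothetical names invented for the example, and only `clear_lora` mirrors a name from the docs above. It assumes the usual LoRA formulation, in which the update to a weight `W` is `alpha * B @ A`.

```python
# Illustrative only: a toy linear layer, not the library's actual code.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b, scale=1.0):
    """Return a + scale * b element-wise."""
    return [[x + scale * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


class ToyLinear:
    def __init__(self, weight):
        self.weight = weight  # base weight, shape (out, in)
        self.loras = []       # hot-loaded (B, A, alpha) triples

    def fuse_lora(self, lora_b, lora_a, alpha=1.0):
        """Cold loading (hypothetical name): merge alpha * B @ A into the weight."""
        self.weight = add(self.weight, matmul(lora_b, lora_a), alpha)

    def attach_lora(self, lora_b, lora_a, alpha=1.0):
        """Hot loading (hypothetical name): keep the LoRA apart from the weight."""
        self.loras.append((lora_b, lora_a, alpha))

    def clear_lora(self):
        """Drop all hot-loaded LoRAs; the base weight was never modified."""
        self.loras = []

    def forward(self, x):
        """Apply the layer to a column vector given as a list of one-element rows."""
        y = matmul(self.weight, x)
        for lora_b, lora_a, alpha in self.loras:
            # These extra matmuls on every call are the cost of hot loading.
            y = add(y, matmul(lora_b, matmul(lora_a, x)), alpha)
        return y


W = [[1.0, 2.0], [3.0, 4.0]]
B = [[1.0], [0.0]]  # shape (2, 1)
A = [[0.0, 2.0]]    # shape (1, 2), i.e. a rank-1 LoRA
x = [[1.0], [1.0]]

cold = ToyLinear([row[:] for row in W])
cold.fuse_lora(B, A, alpha=0.5)

hot = ToyLinear([row[:] for row in W])
hot.attach_lora(B, A, alpha=0.5)

cold_out = cold.forward(x)  # fused weight, single matmul
hot_out = hot.forward(x)    # base matmul plus the LoRA path
hot.clear_lora()
base_out = hot.forward(x)   # back to the unmodified base layer
```

In this sketch both modes produce the same output while the LoRA is active. The fused layer pays no extra cost per call but no longer holds the original weight, whereas the hot-loaded layer does extra work on every call and can return to the base behavior at any time.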