DiffSynth-Studio/docs/zh/Research_Tutorial/inference_time_scaling.ipynb
Zhongjie Duan 31ba103d8e Merge pull request #1330 from modelscope/ses-doc
Research Tutorial Sec 2
2026-03-06 14:25:45 +08:00

{
"cells": [
{
"cell_type": "markdown",
"id": "8db54992",
"metadata": {},
"source": [
"# Inference-Time Scaling Techniques\n",
"\n",
"DiffSynth-Studio aims to drive technical innovation on top of a foundational framework. Taking inference-time scaling as an example, this article shows how to build a training-free enhancement scheme for image generation with DiffSynth-Studio."
]
},
{
"cell_type": "markdown",
"id": "0911cad4",
"metadata": {},
"source": [
"## 1. Quantifying Image Quality\n",
"\n",
"First, we need a metric that quantifies the quality of the images a generative model produces. The most direct approach is human rating, but it is far too expensive to apply at scale. However, once human ratings have been collected, it is entirely feasible to train a scoring model that predicts them. PickScore [[1]](https://arxiv.org/abs/2305.01569) is exactly such a model. Running the code below will automatically download and load the [PickScore model](https://modelscope.cn/models/AI-ModelScope/PickScore_v1)."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4faca4ca",
"metadata": {},
"outputs": [],
"source": [
"from modelscope import AutoProcessor, AutoModel\n",
"import torch\n",
"\n",
"class PickScore(torch.nn.Module):\n",
" def __init__(self):\n",
" super().__init__()\n",
" self.processor = AutoProcessor.from_pretrained(\"laion/CLIP-ViT-H-14-laion2B-s32B-b79K\")\n",
" self.model = AutoModel.from_pretrained(\"AI-ModelScope/PickScore_v1\").eval().to(\"cuda\")\n",
"\n",
" def forward(self, image, prompt):\n",
" image_inputs = self.processor(images=image, padding=True, truncation=True, max_length=77, return_tensors=\"pt\").to(\"cuda\")\n",
" text_inputs = self.processor(text=prompt, padding=True, truncation=True, max_length=77, return_tensors=\"pt\").to(\"cuda\")\n",
" with torch.inference_mode():\n",
" image_embs = self.model.get_image_features(**image_inputs).pooler_output\n",
" image_embs = image_embs / torch.norm(image_embs, dim=-1, keepdim=True)\n",
" text_embs = self.model.get_text_features(**text_inputs).pooler_output\n",
" text_embs = text_embs / torch.norm(text_embs, dim=-1, keepdim=True)\n",
" score = (text_embs @ image_embs.T).flatten().item()\n",
" return score\n",
"\n",
"reward_model = PickScore()"
]
},
{
"cell_type": "markdown",
"id": "5f807cec",
"metadata": {},
"source": [
"## 2. Inference-Time Scaling\n",
"\n",
"Inference-time scaling [[2]](https://arxiv.org/abs/2504.00294) is a family of techniques that trades extra inference-time compute for higher-quality outputs. In language models, for example, [Qwen/Qwen3.5-27B](https://modelscope.cn/models/Qwen/Qwen3.5-27B), [deepseek-ai/DeepSeek-R1](https://modelscope.cn/models/deepseek-ai/DeepSeek-R1), and similar models use a \"thinking mode\" that lets the model spend more time reasoning carefully, producing more accurate answers. Next, taking [black-forest-labs/FLUX.2-klein-4B](https://modelscope.cn/models/black-forest-labs/FLUX.2-klein-4B) as an example, we explore how to design an inference-time scaling scheme for an image generation model.\n",
"\n",
"> Before starting, we slightly modified the `Flux2ImagePipeline` code so that it can be initialized from a given Gaussian noise tensor, which makes results reproducible; see `Flux2Unit_NoiseInitializer` in [diffsynth/pipelines/flux2_image.py](https://github.com/modelscope/DiffSynth-Studio/blob/main/diffsynth/pipelines/flux2_image.py).\n",
"\n",
"Run the following code to load [black-forest-labs/FLUX.2-klein-4B](https://modelscope.cn/models/black-forest-labs/FLUX.2-klein-4B)."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c5818a87",
"metadata": {},
"outputs": [],
"source": [
"from diffsynth.pipelines.flux2_image import Flux2ImagePipeline, ModelConfig\n",
"\n",
"pipe = Flux2ImagePipeline.from_pretrained(\n",
" torch_dtype=torch.bfloat16,\n",
" device=\"cuda\",\n",
" model_configs=[\n",
" ModelConfig(model_id=\"black-forest-labs/FLUX.2-klein-4B\", origin_file_pattern=\"text_encoder/*.safetensors\"),\n",
" ModelConfig(model_id=\"black-forest-labs/FLUX.2-klein-4B\", origin_file_pattern=\"transformer/*.safetensors\"),\n",
" ModelConfig(model_id=\"black-forest-labs/FLUX.2-klein-4B\", origin_file_pattern=\"vae/diffusion_pytorch_model.safetensors\"),\n",
" ],\n",
" tokenizer_config=ModelConfig(model_id=\"black-forest-labs/FLUX.2-klein-4B\", origin_file_pattern=\"tokenizer/\"),\n",
")"
]
},
{
"cell_type": "markdown",
"id": "f58e9945",
"metadata": {},
"source": [
"Generate a pencil-sketch cat with the prompt `\"sketch, a cat\"` and score it with the PickScore model."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6ea2d258",
"metadata": {},
"outputs": [],
"source": [
"def evaluate_noise(noise, pipe, reward_model, prompt):\n",
" # Generate an image and compute the score.\n",
" image = pipe(\n",
" prompt=prompt,\n",
" num_inference_steps=4,\n",
" initial_noise=noise,\n",
" progress_bar_cmd=lambda x: x,\n",
" )\n",
" score = reward_model(image, prompt)\n",
" return score\n",
"\n",
"torch.manual_seed(1)\n",
"prompt = \"sketch, a cat\"\n",
"noise = pipe.generate_noise((1, 128, 64, 64), rand_device=\"cuda\", rand_torch_dtype=pipe.torch_dtype)\n",
"\n",
"image_1 = pipe(prompt, num_inference_steps=4, initial_noise=noise)\n",
"print(\"Score:\", reward_model(image_1, prompt))\n",
"image_1"
]
},
{
"cell_type": "markdown",
"id": "5e11694e",
"metadata": {},
"source": [
"### 2.1 Best-of-N Random Search\n",
"\n",
"Generation is inherently stochastic: different random seeds yield different images, sometimes of high quality and sometimes of low quality. This suggests a simple inference-time scaling scheme: generate images with several different random seeds, score each with PickScore, and keep only the highest-scoring one."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "241f10d2",
"metadata": {},
"outputs": [],
"source": [
"from tqdm import tqdm\n",
"\n",
"def random_search(base_latents, objective_reward_fn, total_eval_budget):\n",
" # Search for the noise randomly.\n",
" best_noise = base_latents\n",
" best_score = objective_reward_fn(base_latents)\n",
" for it in tqdm(range(total_eval_budget - 1)):\n",
" noise = pipe.generate_noise((1, 128, 64, 64), seed=None)\n",
" score = objective_reward_fn(noise)\n",
" if score > best_score:\n",
" best_score, best_noise = score, noise\n",
" return best_noise\n",
"\n",
"best_noise = random_search(\n",
" base_latents=noise,\n",
" objective_reward_fn=lambda noise: evaluate_noise(noise, pipe, reward_model, prompt),\n",
" total_eval_budget=50,\n",
")\n",
"image_2 = pipe(prompt, num_inference_steps=4, initial_noise=best_noise)\n",
"print(\"Score:\", reward_model(image_2, prompt))\n",
"image_2"
]
},
{
"cell_type": "markdown",
"id": "8e9bf966",
"metadata": {},
"source": [
"After many rounds of random search, the selected cat clearly shows richer fur detail, and its PickScore improves noticeably. However, this brute-force random search is extremely inefficient: generation time grows many times over, and it quickly hits a quality ceiling. We therefore want a more efficient search method that reaches a higher score under the same compute budget."
]
},
{
"cell_type": "markdown",
"id": "c9578349",
"metadata": {},
"source": [
"### 2.2 SES Search\n",
"\n",
"To break through the bottleneck of random search, we introduce the SES (Spectral Evolution Search) algorithm [[3]](https://arxiv.org/abs/2602.03208); the full implementation lives in [diffsynth/utils/ses](https://github.com/modelscope/DiffSynth-Studio/blob/main/diffsynth/utils/ses).\n",
"\n",
"The image a diffusion model generates is largely determined by the low-frequency components of the initial noise. SES decomposes the Gaussian noise with a wavelet transform, keeps the high-frequency details fixed, and runs an evolutionary search with the cross-entropy method on the low-frequency part only, finding high-quality initial noise much more efficiently.\n",
"\n",
"Run the code below to search for the best Gaussian noise tensor more efficiently with SES."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "adeed2aa",
"metadata": {},
"outputs": [],
"source": [
"from diffsynth.utils.ses import ses_search\n",
"\n",
"best_noise = ses_search(\n",
" base_latents=noise,\n",
" objective_reward_fn=lambda noise: evaluate_noise(noise, pipe, reward_model, prompt),\n",
" total_eval_budget=50,\n",
")\n",
"image_3 = pipe(prompt, num_inference_steps=4, initial_noise=best_noise)\n",
"print(\"Score:\", reward_model(image_3, prompt))\n",
"image_3"
]
},
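{
"cell_type": "markdown",
"id": "a7c31d02",
"metadata": {},
"source": [
"To make the idea behind SES concrete, here is a minimal, self-contained sketch of a cross-entropy-method search over only the low-frequency component of the noise. It is an illustration, not the real `ses_search`: average pooling stands in for the wavelet decomposition, `toy_reward` is a hypothetical stand-in for the PickScore reward, and the search keeps the best noise found so far, so its score never falls below the starting point."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b8d42e13",
"metadata": {},
"outputs": [],
"source": [
"import torch\n",
"import torch.nn.functional as F\n",
"\n",
"def split_low_high(noise, scale=2):\n",
"    # Assumption: average pooling approximates the wavelet low-pass filter.\n",
"    low = F.avg_pool2d(noise, scale)\n",
"    high = noise - F.interpolate(low, scale_factor=scale, mode=\"nearest\")\n",
"    return low, high\n",
"\n",
"def merge_low_high(low, high, scale=2):\n",
"    return F.interpolate(low, scale_factor=scale, mode=\"nearest\") + high\n",
"\n",
"def cem_low_freq_search(base_noise, reward_fn, iterations=5, population=16, elite_frac=0.25):\n",
"    # Cross-entropy method: sample low-frequency candidates, keep the elites,\n",
"    # refit the sampling distribution, repeat. High frequencies stay fixed.\n",
"    low, high = split_low_high(base_noise)\n",
"    mean, std = low.clone(), torch.full_like(low, 0.5)\n",
"    n_elite = max(2, int(population * elite_frac))\n",
"    best, best_score = base_noise, reward_fn(base_noise)\n",
"    for _ in range(iterations):\n",
"        samples = [mean + std * torch.randn_like(mean) for _ in range(population)]\n",
"        samples.sort(key=lambda s: reward_fn(merge_low_high(s, high)), reverse=True)\n",
"        top = merge_low_high(samples[0], high)\n",
"        if reward_fn(top) > best_score:\n",
"            best, best_score = top, reward_fn(top)\n",
"        elite = torch.stack(samples[:n_elite])\n",
"        mean, std = elite.mean(dim=0), elite.std(dim=0) + 1e-6\n",
"    return best\n",
"\n",
"# Hypothetical toy reward: prefer noise with small low-frequency energy.\n",
"toy_reward = lambda n: -split_low_high(n)[0].pow(2).mean().item()\n",
"torch.manual_seed(0)\n",
"base = torch.randn(1, 4, 8, 8)\n",
"found = cem_low_freq_search(base, toy_reward)\n",
"print(toy_reward(base), toy_reward(found))"
]
},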
{
"cell_type": "markdown",
"id": "940a97f1",
"metadata": {},
"source": [
"Under the same compute budget, SES achieves a markedly higher PickScore than random search. The sketched cat shows a more polished overall composition and richer, more layered light-and-dark contrast.\n",
"\n",
"Inference-time scaling trades longer inference time for higher image quality. The images it generates can in turn be fed back into the model itself via DPO [[4]](https://arxiv.org/abs/2311.12908) or differential training [[5]](https://arxiv.org/abs/2412.12888), which is another interesting direction to explore."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "dzj8",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.19"
}
},
"nbformat": 4,
"nbformat_minor": 5
}