{
"cells": [
{
"cell_type": "markdown",
"id": "8db54992",
"metadata": {},
"source": [
"# Inference-Time Optimization Techniques\n",
"\n",
"DiffSynth-Studio aims to drive technical innovation through its base framework. Using inference-time scaling as an example, this article shows how to build a training-free image-generation enhancement scheme on top of DiffSynth-Studio."
]
},
{
"cell_type": "markdown",
"id": "0911cad4",
"metadata": {},
"source": [
"## 1. Quantifying Image Quality\n",
"\n",
"First, we need a metric that quantifies the quality of images produced by a generative model. The most direct approach is human scoring, but that is far too expensive to use at scale. However, once human ratings have been collected, it is entirely feasible to train a model that predicts them. PickScore [[1]](https://arxiv.org/abs/2305.01569) is such a model; running the code below will automatically download and load the [PickScore model](https://modelscope.cn/models/AI-ModelScope/PickScore_v1)."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4faca4ca",
"metadata": {},
"outputs": [],
"source": [
"from modelscope import AutoProcessor, AutoModel\n",
"import torch\n",
"\n",
"class PickScore(torch.nn.Module):\n",
"    def __init__(self):\n",
"        super().__init__()\n",
"        self.processor = AutoProcessor.from_pretrained(\"laion/CLIP-ViT-H-14-laion2B-s32B-b79K\")\n",
"        self.model = AutoModel.from_pretrained(\"AI-ModelScope/PickScore_v1\").eval().to(\"cuda\")\n",
"\n",
"    def forward(self, image, prompt):\n",
"        # Embed the image and the prompt, then score with cosine similarity.\n",
"        image_inputs = self.processor(images=image, padding=True, truncation=True, max_length=77, return_tensors=\"pt\").to(\"cuda\")\n",
"        text_inputs = self.processor(text=prompt, padding=True, truncation=True, max_length=77, return_tensors=\"pt\").to(\"cuda\")\n",
"        with torch.inference_mode():\n",
"            image_embs = self.model.get_image_features(**image_inputs).pooler_output\n",
"            image_embs = image_embs / torch.norm(image_embs, dim=-1, keepdim=True)\n",
"            text_embs = self.model.get_text_features(**text_inputs).pooler_output\n",
"            text_embs = text_embs / torch.norm(text_embs, dim=-1, keepdim=True)\n",
"            score = (text_embs @ image_embs.T).flatten().item()\n",
"        return score\n",
"\n",
"reward_model = PickScore()"
]
},
{
"cell_type": "markdown",
"id": "5f807cec",
"metadata": {},
"source": [
"## 2. Inference-Time Scaling\n",
"\n",
"Inference-time scaling [[2]](https://arxiv.org/abs/2504.00294) is a family of techniques that improves output quality by spending more compute at inference time. In language models, for example, [Qwen/Qwen3.5-27B](https://modelscope.cn/models/Qwen/Qwen3.5-27B) and [deepseek-ai/DeepSeek-R1](https://modelscope.cn/models/deepseek-ai/DeepSeek-R1) use a \"thinking mode\" that guides the model to spend more time reasoning carefully, making its answers more accurate. Below, we take [black-forest-labs/FLUX.2-klein-4B](https://modelscope.cn/models/black-forest-labs/FLUX.2-klein-4B) as an example and explore how to design an inference-time scaling scheme for image generation models.\n",
"\n",
"> Before starting, we slightly modified the `Flux2ImagePipeline` code so that it can be initialized from a given Gaussian noise tensor, which makes results reproducible; see `Flux2Unit_NoiseInitializer` in [diffsynth/pipelines/flux2_image.py](https://github.com/modelscope/DiffSynth-Studio/blob/main/diffsynth/pipelines/flux2_image.py).\n",
"\n",
"Run the following code to load [black-forest-labs/FLUX.2-klein-4B](https://modelscope.cn/models/black-forest-labs/FLUX.2-klein-4B)."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c5818a87",
"metadata": {},
"outputs": [],
"source": [
"from diffsynth.pipelines.flux2_image import Flux2ImagePipeline, ModelConfig\n",
"\n",
"pipe = Flux2ImagePipeline.from_pretrained(\n",
"    torch_dtype=torch.bfloat16,\n",
"    device=\"cuda\",\n",
"    model_configs=[\n",
"        ModelConfig(model_id=\"black-forest-labs/FLUX.2-klein-4B\", origin_file_pattern=\"text_encoder/*.safetensors\"),\n",
"        ModelConfig(model_id=\"black-forest-labs/FLUX.2-klein-4B\", origin_file_pattern=\"transformer/*.safetensors\"),\n",
"        ModelConfig(model_id=\"black-forest-labs/FLUX.2-klein-4B\", origin_file_pattern=\"vae/diffusion_pytorch_model.safetensors\"),\n",
"    ],\n",
"    tokenizer_config=ModelConfig(model_id=\"black-forest-labs/FLUX.2-klein-4B\", origin_file_pattern=\"tokenizer/\"),\n",
")"
]
},
{
"cell_type": "markdown",
"id": "f58e9945",
"metadata": {},
"source": [
"Generate a pencil-sketch cat with the prompt `\"sketch, a cat\"` and score it with the PickScore model."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6ea2d258",
"metadata": {},
"outputs": [],
"source": [
"def evaluate_noise(noise, pipe, reward_model, prompt):\n",
"    # Generate an image and compute the score.\n",
"    image = pipe(\n",
"        prompt=prompt,\n",
"        num_inference_steps=4,\n",
"        initial_noise=noise,\n",
"        progress_bar_cmd=lambda x: x,\n",
"    )\n",
"    score = reward_model(image, prompt)\n",
"    return score\n",
"\n",
"torch.manual_seed(1)\n",
"prompt = \"sketch, a cat\"\n",
"noise = pipe.generate_noise((1, 128, 64, 64), rand_device=\"cuda\", rand_torch_dtype=pipe.torch_dtype)\n",
"\n",
"image_1 = pipe(prompt, num_inference_steps=4, initial_noise=noise)\n",
"print(\"Score:\", reward_model(image_1, prompt))\n",
"image_1"
]
},
{
"cell_type": "markdown",
"id": "5e11694e",
"metadata": {},
"source": [
"### 2.1 Best-of-N Random Search\n",
"\n",
"Generation is inherently stochastic: different random seeds produce different images, sometimes of high quality and sometimes low. This suggests a simple inference-time scaling scheme: generate images with several different random seeds, score each one with PickScore, and keep only the highest-scoring image."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "241f10d2",
"metadata": {},
"outputs": [],
"source": [
"from tqdm import tqdm\n",
"\n",
"def random_search(base_latents, objective_reward_fn, total_eval_budget):\n",
"    # Search for the noise randomly.\n",
"    best_noise = base_latents\n",
"    best_score = objective_reward_fn(base_latents)\n",
"    for it in tqdm(range(total_eval_budget - 1)):\n",
"        noise = pipe.generate_noise((1, 128, 64, 64), seed=None)\n",
"        score = objective_reward_fn(noise)\n",
"        if score > best_score:\n",
"            best_score, best_noise = score, noise\n",
"    return best_noise\n",
"\n",
"best_noise = random_search(\n",
"    base_latents=noise,\n",
"    objective_reward_fn=lambda noise: evaluate_noise(noise, pipe, reward_model, prompt),\n",
"    total_eval_budget=50,\n",
")\n",
"image_2 = pipe(prompt, num_inference_steps=4, initial_noise=best_noise)\n",
"print(\"Score:\", reward_model(image_2, prompt))\n",
"image_2"
]
},
{
"cell_type": "markdown",
"id": "8e9bf966",
"metadata": {},
"source": [
"After many rounds of random search, the selected cat clearly shows richer fur detail, and its PickScore rating improves noticeably. But this brute-force random search is very inefficient: generation time grows many-fold, and it quickly hits a quality ceiling. We would therefore like a more efficient search method that reaches a higher score under the same compute budget."
]
},
{
"cell_type": "markdown",
"id": "c9578349",
"metadata": {},
"source": [
"### 2.2 SES Search\n",
"\n",
"To break through the bottleneck of random search, we introduce the SES (Spectral Evolution Search) algorithm [[3]](https://arxiv.org/abs/2602.03208); the full code lives in [diffsynth/utils/ses](https://github.com/modelscope/DiffSynth-Studio/blob/main/diffsynth/utils/ses).\n",
"\n",
"The images a diffusion model produces are largely determined by the low-frequency components of the initial noise. SES decomposes the Gaussian noise with a wavelet transform, keeps the high-frequency details fixed, and runs an evolutionary search with the cross-entropy method over the low-frequency part only, finding good initial noise far more efficiently.\n",
"\n",
"Run the code below to search for the best Gaussian noise tensor more efficiently with SES."
]
},
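{
"cell_type": "markdown",
"id": "ses-split-note",
"metadata": {},
"source": [
"To build intuition for the decomposition step, the optional sketch below splits a noise tensor into low- and high-frequency parts. It is an illustrative stand-in only: it uses a simple FFT low-pass filter rather than the wavelet transform SES actually uses, and `split_frequencies` and its `cutoff` parameter are hypothetical names chosen here. Perturbing only the low-frequency part while keeping the high-frequency part fixed mirrors the search space SES operates in."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ses-split-sketch",
"metadata": {},
"outputs": [],
"source": [
"def split_frequencies(noise, cutoff=8):\n",
"    # Illustration only: FFT low-pass instead of SES's wavelet transform.\n",
"    freq = torch.fft.fftshift(torch.fft.fft2(noise), dim=(-2, -1))\n",
"    h, w = noise.shape[-2:]\n",
"    mask = torch.zeros(h, w, device=noise.device)\n",
"    mask[h // 2 - cutoff:h // 2 + cutoff, w // 2 - cutoff:w // 2 + cutoff] = 1.0\n",
"    low = torch.fft.ifft2(torch.fft.ifftshift(freq * mask, dim=(-2, -1))).real\n",
"    high = noise - low  # residual: the fixed high-frequency detail\n",
"    return low, high\n",
"\n",
"low, high = split_frequencies(noise.float())\n",
"# A candidate in the SES-style search space: new low frequencies, same high frequencies.\n",
"candidate = low + 0.1 * torch.randn_like(low) + high"
]
},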
{
"cell_type": "code",
"execution_count": null,
"id": "adeed2aa",
"metadata": {},
"outputs": [],
"source": [
"from diffsynth.utils.ses import ses_search\n",
"\n",
"best_noise = ses_search(\n",
"    base_latents=noise,\n",
"    objective_reward_fn=lambda noise: evaluate_noise(noise, pipe, reward_model, prompt),\n",
"    total_eval_budget=50,\n",
")\n",
"image_3 = pipe(prompt, num_inference_steps=4, initial_noise=best_noise)\n",
"print(\"Score:\", reward_model(image_3, prompt))\n",
"image_3"
]
},
{
"cell_type": "markdown",
"id": "940a97f1",
"metadata": {},
"source": [
"Under the same compute budget, SES achieves a markedly higher PickScore than random search: the sketched cat shows a more refined overall composition and more layered light-and-shadow contrast.\n",
"\n",
"Inference-time scaling trades longer inference time for higher image quality. The images it produces can in turn be distilled back into the model itself via DPO [[4]](https://arxiv.org/abs/2311.12908) or differential training [[5]](https://arxiv.org/abs/2412.12888), which is another interesting direction to explore."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "dzj8",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.19"
}
},
"nbformat": 4,
"nbformat_minor": 5
}