mirror of
https://github.com/modelscope/DiffSynth-Studio.git
synced 2026-03-19 06:32:27 +00:00
237 lines
10 KiB
Plaintext
{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "8db54992",
   "metadata": {},
   "source": [
    "# Inference Optimization Techniques\n",
    "\n",
    "DiffSynth-Studio aims to drive technological innovation through its foundational framework. This article demonstrates how to build a training-free image generation enhancement solution with DiffSynth-Studio, using inference-time scaling as an example."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0911cad4",
   "metadata": {},
   "source": [
    "## 1. Image Quality Quantification\n",
    "\n",
    "First, we need a metric that quantifies the quality of generated images. Manual scoring is the most straightforward option, but it is too costly for large-scale use. However, once human scores have been collected, training an image-scoring model to predict them is entirely feasible. PickScore [[1]](https://arxiv.org/abs/2305.01569) is such a model. Running the following code will automatically download and load the [PickScore model](https://modelscope.cn/models/AI-ModelScope/PickScore_v1)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4faca4ca",
   "metadata": {},
   "outputs": [],
   "source": [
    "from modelscope import AutoProcessor, AutoModel\n",
    "import torch\n",
    "\n",
    "class PickScore(torch.nn.Module):\n",
    "    def __init__(self):\n",
    "        super().__init__()\n",
    "        # CLIP processor for preprocessing and the PickScore reward model.\n",
    "        self.processor = AutoProcessor.from_pretrained(\"laion/CLIP-ViT-H-14-laion2B-s32B-b79K\")\n",
    "        self.model = AutoModel.from_pretrained(\"AI-ModelScope/PickScore_v1\").eval().to(\"cuda\")\n",
    "\n",
    "    def forward(self, image, prompt):\n",
    "        # Encode the image and the prompt, then score by cosine similarity.\n",
    "        image_inputs = self.processor(images=image, padding=True, truncation=True, max_length=77, return_tensors=\"pt\").to(\"cuda\")\n",
    "        text_inputs = self.processor(text=prompt, padding=True, truncation=True, max_length=77, return_tensors=\"pt\").to(\"cuda\")\n",
    "        with torch.inference_mode():\n",
    "            image_embs = self.model.get_image_features(**image_inputs).pooler_output\n",
    "            image_embs = image_embs / torch.norm(image_embs, dim=-1, keepdim=True)\n",
    "            text_embs = self.model.get_text_features(**text_inputs).pooler_output\n",
    "            text_embs = text_embs / torch.norm(text_embs, dim=-1, keepdim=True)\n",
    "            score = (text_embs @ image_embs.T).flatten().item()\n",
    "        return score\n",
    "\n",
    "reward_model = PickScore()"
   ]
  },
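  {
   "cell_type": "markdown",
   "id": "pickscore-demo-md",
   "metadata": {},
   "source": [
    "As a quick sanity check, the reward model can rank several candidate images for the same prompt. The cell below is a minimal sketch using placeholder solid-color images; in practice the candidates would come from a generation pipeline."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "pickscore-demo-code",
   "metadata": {},
   "outputs": [],
   "source": [
    "from PIL import Image\n",
    "\n",
    "# Placeholder candidates; real ones would come from the pipeline.\n",
    "candidates = [Image.new(\"RGB\", (512, 512), color=c) for c in (\"white\", \"gray\")]\n",
    "scores = [reward_model(img, \"sketch, a cat\") for img in candidates]\n",
    "best_image = candidates[scores.index(max(scores))]\n",
    "print(scores)"
   ]
  },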
  {
   "cell_type": "markdown",
   "id": "5f807cec",
   "metadata": {},
   "source": [
    "## 2. Inference-time Scaling Techniques\n",
    "\n",
    "Inference-time Scaling [[2]](https://arxiv.org/abs/2504.00294) is a technique that improves generation quality by spending more computation at inference time. In language models, for example, models such as [Qwen/Qwen3.5-27B](https://modelscope.cn/models/Qwen/Qwen3.5-27B) and [deepseek-ai/DeepSeek-R1](https://modelscope.cn/models/deepseek-ai/DeepSeek-R1) use a \"thinking mode\" that lets the model spend more time reasoning before answering, producing more accurate results. Next, we'll use the [black-forest-labs/FLUX.2-klein-4B](https://modelscope.cn/models/black-forest-labs/FLUX.2-klein-4B) model as an example to explore how to design inference-time scaling solutions for image generation models.\n",
    "\n",
    "> Before starting, we slightly modified the `Flux2ImagePipeline` code so that it can be initialized with a specific Gaussian noise matrix, making results reproducible. See `Flux2Unit_NoiseInitializer` in [diffsynth/pipelines/flux2_image.py](https://github.com/modelscope/DiffSynth-Studio/blob/main/diffsynth/pipelines/flux2_image.py).\n",
    "\n",
    "Run the following code to load the [black-forest-labs/FLUX.2-klein-4B](https://modelscope.cn/models/black-forest-labs/FLUX.2-klein-4B) model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c5818a87",
   "metadata": {},
   "outputs": [],
   "source": [
    "from diffsynth.pipelines.flux2_image import Flux2ImagePipeline, ModelConfig\n",
    "\n",
    "pipe = Flux2ImagePipeline.from_pretrained(\n",
    "    torch_dtype=torch.bfloat16,\n",
    "    device=\"cuda\",\n",
    "    model_configs=[\n",
    "        ModelConfig(model_id=\"black-forest-labs/FLUX.2-klein-4B\", origin_file_pattern=\"text_encoder/*.safetensors\"),\n",
    "        ModelConfig(model_id=\"black-forest-labs/FLUX.2-klein-4B\", origin_file_pattern=\"transformer/*.safetensors\"),\n",
    "        ModelConfig(model_id=\"black-forest-labs/FLUX.2-klein-4B\", origin_file_pattern=\"vae/diffusion_pytorch_model.safetensors\"),\n",
    "    ],\n",
    "    tokenizer_config=ModelConfig(model_id=\"black-forest-labs/FLUX.2-klein-4B\", origin_file_pattern=\"tokenizer/\"),\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f58e9945",
   "metadata": {},
   "source": [
    "Generate a sketch cat image using the prompt `\"sketch, a cat\"` and score it with the PickScore model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6ea2d258",
   "metadata": {},
   "outputs": [],
   "source": [
    "def evaluate_noise(noise, pipe, reward_model, prompt):\n",
    "    # Generate an image from the given initial noise and compute its score.\n",
    "    image = pipe(\n",
    "        prompt=prompt,\n",
    "        num_inference_steps=4,\n",
    "        initial_noise=noise,\n",
    "        progress_bar_cmd=lambda x: x,\n",
    "    )\n",
    "    score = reward_model(image, prompt)\n",
    "    return score\n",
    "\n",
    "torch.manual_seed(1)\n",
    "prompt = \"sketch, a cat\"\n",
    "noise = pipe.generate_noise((1, 128, 64, 64), rand_device=\"cuda\", rand_torch_dtype=pipe.torch_dtype)\n",
    "\n",
    "image_1 = pipe(prompt, num_inference_steps=4, initial_noise=noise)\n",
    "print(\"Score:\", reward_model(image_1, prompt))\n",
    "image_1"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5e11694e",
   "metadata": {},
   "source": [
    "### 2.1 Best-of-N Random Search\n",
    "\n",
    "Model generation results have inherent randomness: different random seeds produce different images, sometimes of high quality and sometimes not. This suggests a simple inference-time scaling solution: generate images with multiple random seeds, score them with PickScore, and keep only the highest-scoring image."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "241f10d2",
   "metadata": {},
   "outputs": [],
   "source": [
    "from tqdm import tqdm\n",
    "\n",
    "def random_search(base_latents, objective_reward_fn, total_eval_budget):\n",
    "    # Randomly sample initial noise and keep the best-scoring candidate.\n",
    "    best_noise = base_latents\n",
    "    best_score = objective_reward_fn(base_latents)\n",
    "    for it in tqdm(range(total_eval_budget - 1)):\n",
    "        noise = pipe.generate_noise((1, 128, 64, 64), seed=None)\n",
    "        score = objective_reward_fn(noise)\n",
    "        if score > best_score:\n",
    "            best_score, best_noise = score, noise\n",
    "    return best_noise\n",
    "\n",
    "best_noise = random_search(\n",
    "    base_latents=noise,\n",
    "    objective_reward_fn=lambda noise: evaluate_noise(noise, pipe, reward_model, prompt),\n",
    "    total_eval_budget=50,\n",
    ")\n",
    "image_2 = pipe(prompt, num_inference_steps=4, initial_noise=best_noise)\n",
    "print(\"Score:\", reward_model(image_2, prompt))\n",
    "image_2"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8e9bf966",
   "metadata": {},
   "source": [
    "We can clearly see that after multiple rounds of random search, the selected cat image shows richer fur details and a significantly higher PickScore. However, brute-force random search is highly inefficient: generation time grows linearly with the number of samples, and quality quickly plateaus. We therefore need a more efficient search method that achieves higher scores within the same computational budget."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c9578349",
   "metadata": {},
   "source": [
    "### 2.2 SES Search\n",
    "\n",
    "To overcome the limitations of random search, we introduce the Spectral Evolution Search (SES) algorithm [[3]](https://arxiv.org/abs/2602.03208). The detailed code is available at [diffsynth/utils/ses](https://github.com/modelscope/DiffSynth-Studio/blob/main/diffsynth/utils/ses).\n",
    "\n",
    "In diffusion models, the generated image is largely determined by the low-frequency components of the initial noise. SES decomposes the Gaussian noise with wavelet transforms, keeps the high-frequency details fixed, and runs an evolutionary search based on the cross-entropy method over only the low-frequency components, finding better initial noise more efficiently.\n",
    "\n",
    "Run the following code to search for the best initial Gaussian noise efficiently with SES."
   ]
  },
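  {
   "cell_type": "markdown",
   "id": "ses-sketch-md",
   "metadata": {},
   "source": [
    "To build intuition before calling the real implementation, the next cell is a toy sketch of this idea, not the actual `ses_search` code: it splits noise into low- and high-frequency parts with a simple FFT low-pass mask (the real algorithm uses wavelet transforms), then runs a cross-entropy-method search over the low-frequency part only, keeping the high-frequency part fixed. All function names and hyperparameters here are illustrative assumptions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ses-sketch-code",
   "metadata": {},
   "outputs": [],
   "source": [
    "import torch\n",
    "\n",
    "def lowpass_split(noise, cutoff=4):\n",
    "    # Split noise into low- and high-frequency parts via an FFT mask\n",
    "    # (a simplification; SES itself uses wavelet transforms).\n",
    "    freq = torch.fft.fftshift(torch.fft.fft2(noise), dim=(-2, -1))\n",
    "    h, w = noise.shape[-2:]\n",
    "    mask = torch.zeros_like(freq)\n",
    "    mask[..., h // 2 - cutoff:h // 2 + cutoff, w // 2 - cutoff:w // 2 + cutoff] = 1\n",
    "    low = torch.fft.ifft2(torch.fft.ifftshift(freq * mask, dim=(-2, -1))).real\n",
    "    return low, noise - low\n",
    "\n",
    "def cem_search(noise, reward_fn, iterations=5, population=8, elites=2):\n",
    "    # Cross-entropy method over the low-frequency component only.\n",
    "    low, high = lowpass_split(noise)\n",
    "    mean, std = low, torch.ones_like(low)\n",
    "    for _ in range(iterations):\n",
    "        samples = [mean + std * torch.randn_like(std) for _ in range(population)]\n",
    "        samples.sort(key=lambda s: reward_fn(s + high), reverse=True)\n",
    "        elite = torch.stack(samples[:elites])\n",
    "        mean, std = elite.mean(dim=0), elite.std(dim=0) + 1e-6\n",
    "    return mean + high\n",
    "\n",
    "# Toy reward so the sketch runs standalone; in this notebook it would be evaluate_noise.\n",
    "toy_reward = lambda n: -abs(n.mean().item() - 0.5)\n",
    "result = cem_search(torch.randn(1, 4, 32, 32), toy_reward)"
   ]
  },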
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "adeed2aa",
   "metadata": {},
   "outputs": [],
   "source": [
    "from diffsynth.utils.ses import ses_search\n",
    "\n",
    "best_noise = ses_search(\n",
    "    base_latents=noise,\n",
    "    objective_reward_fn=lambda noise: evaluate_noise(noise, pipe, reward_model, prompt),\n",
    "    total_eval_budget=50,\n",
    ")\n",
    "image_3 = pipe(prompt, num_inference_steps=4, initial_noise=best_noise)\n",
    "print(\"Score:\", reward_model(image_3, prompt))\n",
    "image_3"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "940a97f1",
   "metadata": {},
   "source": [
    "Comparing the results, SES achieves a significantly higher PickScore than random search under the same computational budget. The \"sketch cat\" shows a more refined overall composition and more layered contrast between light and shadow.\n",
    "\n",
    "Inference-time scaling trades longer inference time for higher image quality. The generated images can in turn be used to train the model itself through methods such as DPO [[4]](https://arxiv.org/abs/2311.12908) or differential training [[5]](https://arxiv.org/abs/2412.12888), which opens up another interesting research direction."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "dzj8",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.19"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}