diffsynth.core.data: Data Processing Operators and Universal Dataset
Data Processing Operators
Available Data Processing Operators
diffsynth.core.data provides a series of data processing operators, including:
- Data format conversion operators
  - ToInt: Convert to int format
  - ToFloat: Convert to float format
  - ToStr: Convert to str format
  - ToList: Convert to list format by wrapping the data in a list
  - ToAbsolutePath: Convert relative paths to absolute paths
- File loading operators
  - LoadImage: Read image files
  - LoadVideo: Read video files
  - LoadAudio: Read audio files
  - LoadGIF: Read GIF files
  - LoadTorchPickle: Read binary files saved by torch.save [Warning: a malicious binary file can inject code through this operator, so use it with caution!]
- Media file processing operators
  - ImageCropAndResize: Crop and resize images
- Meta operators
  - SequencialProcess: Route each item in a sequence to an operator
  - RouteByExtensionName: Route to a specific operator by file extension
  - RouteByType: Route to a specific operator by data type
Operator Usage
Data operators are connected with the >> symbol to form data processing pipelines, for example:
from diffsynth.core.data.operators import *
data = "image.jpg"
data_pipeline = ToAbsolutePath(base_path="/data") >> LoadImage() >> ImageCropAndResize(max_pixels=512*512)
data = data_pipeline(data)
The data is processed by each operator in sequence. The result after each step is:
- ToAbsolutePath(base_path="/data"): "/data/image.jpg"
- LoadImage(): <PIL.Image.Image image mode=RGB size=1024x1024 at 0x7F8E7AAEFC10>
- ImageCropAndResize(max_pixels=512*512): <PIL.Image.Image image mode=RGB size=512x512 at 0x7F8E7A936F20>
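The chaining behaviour of `>>` can be illustrated with a minimal sketch built on Python's `__rshift__` hook. The classes `Operator`, `Pipeline`, `JoinBasePath`, and `Upper` below are hypothetical stand-ins written for this example; they are not the diffsynth implementation, which may differ in its details.

```python
import posixpath


class Operator:
    """Hypothetical base class: an operator is a callable that transforms data."""

    def __call__(self, data):
        raise NotImplementedError

    def __rshift__(self, other):
        # `a >> b` builds a pipeline that runs a first, then b.
        return Pipeline([self, other])


class Pipeline(Operator):
    """Runs a list of operators in order, feeding each output to the next."""

    def __init__(self, operators):
        self.operators = list(operators)

    def __call__(self, data):
        for operator in self.operators:
            data = operator(data)
        return data

    def __rshift__(self, other):
        # Extending an existing pipeline keeps the chain flat.
        return Pipeline(self.operators + [other])


class JoinBasePath(Operator):
    """Toy operator (assumption): prefix a relative path with a base path."""

    def __init__(self, base_path):
        self.base_path = base_path

    def __call__(self, data):
        return posixpath.join(self.base_path, data)


class Upper(Operator):
    """Toy operator (assumption): upper-case a string."""

    def __call__(self, data):
        return data.upper()


pipeline = JoinBasePath("/data") >> Upper()
result = pipeline("image.jpg")
```

Because each operator only needs to be callable, a pipeline is itself an operator and can be nested inside other pipelines or passed as an argument, as `frame_processor` is in the examples in this document.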
Operators can be composed into complete data pipelines. For example, the default video data operator for the universal dataset is:
RouteByType(operator_map=[
    (str, ToAbsolutePath(base_path) >> RouteByExtensionName(operator_map=[
        (("jpg", "jpeg", "png", "webp"), LoadImage() >> ImageCropAndResize(height, width, max_pixels, height_division_factor, width_division_factor) >> ToList()),
        (("gif",), LoadGIF(
            num_frames, time_division_factor, time_division_remainder,
            frame_processor=ImageCropAndResize(height, width, max_pixels, height_division_factor, width_division_factor),
        )),
        (("mp4", "avi", "mov", "wmv", "mkv", "flv", "webm"), LoadVideo(
            num_frames, time_division_factor, time_division_remainder,
            frame_processor=ImageCropAndResize(height, width, max_pixels, height_division_factor, width_division_factor),
        )),
    ])),
])
It includes the following logic:
- If the data is of type str
  - If it's a "jpg", "jpeg", "png", or "webp" file
    - Load this image
    - Crop and scale to a specific resolution
    - Pack into a list, treating it as a single-frame video
  - If it's a "gif" file
    - Load the GIF file content
    - Crop and scale each frame to a specific resolution
  - If it's a "mp4", "avi", "mov", "wmv", "mkv", "flv", or "webm" file
    - Load the video file content
    - Crop and scale each frame to a specific resolution
- If the data is not of type str, an error is reported
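The routing logic above can be sketched in a few lines of self-contained Python. `RouteByTypeSketch` and `RouteByExtensionSketch` are hypothetical stand-ins, not the library's classes, and the loaders are replaced by lambdas that only tag the path. Lower-casing the extension and raising `ValueError` on an unmatched input are assumptions of this sketch.

```python
import os


class RouteByTypeSketch:
    """Hypothetical stand-in: dispatch on the Python type of the data."""

    def __init__(self, operator_map):
        self.operator_map = operator_map

    def __call__(self, data):
        for dtype, operator in self.operator_map:
            if isinstance(data, dtype):
                return operator(data)
        raise ValueError(f"Unsupported data type: {type(data).__name__}")


class RouteByExtensionSketch:
    """Hypothetical stand-in: dispatch on the file extension of a path."""

    def __init__(self, operator_map):
        self.operator_map = operator_map

    def __call__(self, data):
        # Assumption: extensions are compared case-insensitively.
        extension = os.path.splitext(data)[1].lstrip(".").lower()
        for extensions, operator in self.operator_map:
            if extension in extensions:
                return operator(data)
        raise ValueError(f"Unsupported file extension: {extension!r}")


# The lambdas stand in for the real image, GIF, and video loading pipelines.
router = RouteByTypeSketch([
    (str, RouteByExtensionSketch([
        (("jpg", "jpeg", "png", "webp"), lambda path: ("image", path)),
        (("gif",), lambda path: ("gif", path)),
        (("mp4", "avi", "mov", "wmv", "mkv", "flv", "webm"), lambda path: ("video", path)),
    ])),
])

image_route = router("photo.PNG")
video_route = router("clip.mp4")
```

Keeping the routes as an ordered list of `(key, operator)` pairs means the first match wins, so more specific entries can be placed before more general ones.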
Universal Dataset
diffsynth.core.data provides a unified dataset implementation. The dataset requires the following parameters:
- base_path: Root directory. If the dataset contains relative paths to image files, set this field so that the files they point to can be loaded
- metadata_path: Path to the metadata file, which records the file paths of all data; supports csv, json, and jsonl formats
- repeat: Data repetition count, defaults to 1. This parameter affects the number of training steps in an epoch
- data_file_keys: Names of the data fields that need to be loaded, for example (image, edit_image)
- main_data_operator: Main loading operator, a data processing pipeline assembled from data processing operators
- special_operator_map: Special operator mapping, which maps fields that require special processing to their own operators
Metadata
The dataset's metadata_path points to a metadata file in csv, json, or jsonl format. An example of each format follows:
csv format: High readability, does not support list data, small memory footprint
image,prompt
image_1.jpg,"a dog"
image_2.jpg,"a cat"
json format: High readability, supports list data, large memory footprint
[
{
"image": "image_1.jpg",
"prompt": "a dog"
},
{
"image": "image_2.jpg",
"prompt": "a cat"
}
]
jsonl format: Low readability, supports list data, small memory footprint
{"image": "image_1.jpg", "prompt": "a dog"}
{"image": "image_2.jpg", "prompt": "a cat"}
How to choose the best metadata format?
- If the data volume is large, reaching tens of millions of records, json is not suitable because parsing a json file requires additional memory. Use csv or jsonl instead.
- If the dataset contains list data, for example for edit models that take multiple images as input, csv is not suitable because it cannot store lists. Use json or jsonl instead.
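The trade-off between the three formats shows up when reading them with the Python standard library. The helper `iter_metadata` below is a hypothetical name for this sketch and is not part of diffsynth; it assumes UTF-8 files and selects the parser by file extension. csv and jsonl can be read one record at a time, while json must be parsed as a whole before any record is available.

```python
import csv
import json
import os


def iter_metadata(path):
    """Yield metadata records (dicts) from a csv, jsonl, or json file."""
    extension = os.path.splitext(path)[1].lower()
    if extension == ".csv":
        with open(path, newline="", encoding="utf-8") as f:
            # Streams row by row; every value is a string, so lists are not supported.
            yield from csv.DictReader(f)
    elif extension == ".jsonl":
        with open(path, encoding="utf-8") as f:
            # Streams line by line; each line is an independent JSON object.
            for line in f:
                if line.strip():
                    yield json.loads(line)
    elif extension == ".json":
        with open(path, encoding="utf-8") as f:
            # The whole document is parsed into memory before the first record is yielded.
            yield from json.load(f)
    else:
        raise ValueError(f"Unsupported metadata format: {extension!r}")
```

For example, `list(iter_metadata("metadata.jsonl"))` collects every record into a list, where `metadata.jsonl` is a placeholder file name.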
Data Loading Logic
By default, the dataset outputs the records from the metadata unchanged, so image and video file paths are output as strings. To load these files, you need to set data_file_keys, main_data_operator, and special_operator_map.
Each field is processed according to the following logic:
- If the field is in special_operator_map, call the corresponding operator in special_operator_map for processing
- If the field is not in special_operator_map
  - If the field is in data_file_keys, call the main_data_operator operator for processing
  - If the field is not in data_file_keys, no processing is done
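The per-field dispatch described above can be sketched as a plain function. `process_record` is a hypothetical helper written for this example, not the dataset's actual method, and the lambdas stand in for real loading operators.

```python
def process_record(record, data_file_keys, main_data_operator, special_operator_map):
    """Apply the per-field dispatch rules to one metadata record (a dict)."""
    processed = {}
    for key, value in record.items():
        if key in special_operator_map:
            # Fields with a dedicated operator take priority.
            processed[key] = special_operator_map[key](value)
        elif key in data_file_keys:
            # Other file fields go through the main loading operator.
            processed[key] = main_data_operator(value)
        else:
            # Everything else (for example, prompts) passes through unchanged.
            processed[key] = value
    return processed


record = {"image": "image_1.jpg", "face": "face_1.mp4", "prompt": "a dog"}
sample = process_record(
    record,
    data_file_keys=("image", "face"),
    main_data_operator=lambda path: f"main:{path}",
    special_operator_map={"face": lambda path: f"special:{path}"},
)
```

In this sketch the `face` field appears in both `data_file_keys` and `special_operator_map`, and the special operator wins, matching the order of the rules above.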
special_operator_map can be used to implement special data processing. For example, in the model Wan-AI/Wan2.2-Animate-14B, the input character face video animate_face_video is processed at a fixed resolution that differs from the resolution of the output video. Therefore, this field is processed by a dedicated operator:
special_operator_map={
    "animate_face_video": ToAbsolutePath(args.dataset_base_path) >> LoadVideo(args.num_frames, 4, 1, frame_processor=ImageCropAndResize(512, 512, None, 16, 16)),
}
Other Notes
When the data volume is small, you can increase repeat to extend the training time of a single epoch. This avoids the considerable overhead of saving the model too frequently.
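The effect of repeat on epoch length can be sketched as follows. `RepeatedDataset` is a hypothetical class for illustration only. It assumes that repeat multiplies the reported dataset length and that indices wrap around the underlying records, which may not match the library's implementation exactly.

```python
class RepeatedDataset:
    """Hypothetical sketch: present `records` as if they were repeated `repeat` times."""

    def __init__(self, records, repeat=1):
        self.records = list(records)
        self.repeat = repeat

    def __len__(self):
        # One epoch visits every record `repeat` times.
        return len(self.records) * self.repeat

    def __getitem__(self, index):
        if not 0 <= index < len(self):
            raise IndexError(index)
        # Wrap the index so no data is actually duplicated in memory.
        return self.records[index % len(self.records)]


dataset = RepeatedDataset(["image_1.jpg", "image_2.jpg"], repeat=3)
epoch_length = len(dataset)
```

Under these assumptions, two records with `repeat=3` give an epoch of six samples, so per-epoch work such as checkpointing happens one third as often per sample seen.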
When data volume * repeat exceeds 10^9, we observe that the dataset speed becomes significantly slower. This seems to be a PyTorch bug, and we are not sure if newer versions of PyTorch have fixed this issue.