Generating Multi-Image Synthetic Data for Text-to-Image Customization

arXiv:2502.01720 · 15 citations · ranked #243 of 2701 papers in ICCV 2025

Abstract

Customization of text-to-image models enables users to insert new concepts or objects and generate them in unseen settings. Existing methods either rely on comparatively expensive test-time optimization or train encoders on single-image datasets without multi-image supervision, which can limit image quality. We propose a simple approach to address these challenges. We first leverage existing text-to-image models and 3D datasets to create a high-quality Synthetic Customization Dataset (SynCD) consisting of multiple images of the same object in different lighting, backgrounds, and poses. Using this dataset, we train an encoder-based model that incorporates fine-grained visual details from reference images via a shared attention mechanism. Finally, we propose an inference technique that normalizes text and image guidance vectors to mitigate overexposure issues in sampled images. Through extensive experiments, we show that our encoder-based model, trained on SynCD, and with the proposed inference algorithm, improves upon existing encoder-based methods on standard customization benchmarks.
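The abstract names two technical ingredients that lend themselves to short sketches. First, the shared attention mechanism: the abstract does not spell out its exact formulation, but a common way to realize shared attention is to let the target image's queries attend jointly over its own keys/values and those computed from the reference images, so fine-grained visual detail can flow from the references. The function below is a minimal PyTorch sketch under that assumption; `shared_attention` and its argument names are hypothetical, not from the paper.

```python
import torch
import torch.nn.functional as F

def shared_attention(q_tgt, k_tgt, v_tgt, k_ref, v_ref):
    """Hypothetical shared-attention step (not the paper's exact code).

    q_tgt, k_tgt, v_tgt: target-image projections, shape [B, heads, N_tgt, d]
    k_ref, v_ref:        reference-image projections, shape [B, heads, N_ref, d]

    The target queries attend over the concatenation of target and
    reference keys/values, injecting reference detail into the target.
    """
    k = torch.cat([k_tgt, k_ref], dim=2)
    v = torch.cat([v_tgt, v_ref], dim=2)
    return F.scaled_dot_product_attention(q_tgt, k, v)
```

Second, the normalized guidance at inference. Assuming a dual classifier-free guidance setup with separate text and image scales, one plausible reading of "normalizes text and image guidance vectors" is to rescale the combined guided prediction so its per-sample norm matches a reference (here, the text-conditioned prediction), counteracting the overexposure that large guidance scales can cause. The paper defines the exact rule; everything below (`normalized_guidance`, the default scales, the choice of reference norm) is an assumption for illustration.

```python
import torch

def normalized_guidance(eps_uncond, eps_text, eps_image,
                        s_text=7.5, s_image=3.0, eps=1e-8):
    """Hypothetical dual classifier-free guidance with norm rescaling.

    eps_uncond, eps_text, eps_image: model noise predictions under no
    conditioning, text-only conditioning, and text+image conditioning,
    each of shape [B, C, H, W].
    """
    guided = (eps_uncond
              + s_text * (eps_text - eps_uncond)
              + s_image * (eps_image - eps_text))
    # Rescale each sample so its L2 norm matches the text-conditioned
    # prediction, limiting the magnitude blow-up from large scales.
    shape = (-1,) + (1,) * (guided.ndim - 1)
    ref_norm = eps_text.flatten(1).norm(dim=1).view(shape)
    cur_norm = guided.flatten(1).norm(dim=1).view(shape)
    return guided * (ref_norm / (cur_norm + eps))
```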

Citation History

Date          Citations
Jan 24, 2026  14
Jan 27, 2026  14
Feb 3, 2026   15 (+1)
Feb 13, 2026  15