MoMA: Multimodal LLM Adapter for Fast Personalized Image Generation

41citations

arXiv:2404.05674 PDF Project

citations

#308

in ECCV 2024

of 2387 papers

Top Authors

Data Points

Top Authors

KUNPENG SONG Yizhe Zhu Bingchen Liu Qing Yan Ahmed Elgammal Xiao Yang

Abstract

In this paper, we present MoMA: an open-vocabulary, training-free personalized image model that boasts flexible zero-shot capabilities. As foundational text-to-image models rapidly evolve, the demand for robust image-to-image translation grows. Addressing this need, MoMA specializes in subject-driven personalized image generation. Utilizing an open-source, Multimodal Large Language Model (MLLM), we train MoMA to serve a dual role as both a feature extractor and a generator. This approach effectively synergizes reference image and text prompt information to produce valuable image features, facilitating an image diffusion model. To better leverage the generated features, we further introduce a novel self-attention shortcut method that efficiently transfers image features to an image diffusion model, improving the resemblance of the target object in generated images. Remarkably, as a tuning-free plug-and-play module, our model requires only a single reference image and outperforms existing methods in generating images with high detail fidelity, enhanced identity-preservation and prompt faithfulness. We commit to making our work open-source, thereby providing universal access to these advancements.

Citation History

Jan 25, 2026

Jan 27, 2026

Jan 28, 2026

Feb 13, 2026

41+41

Feb 13, 2026