OpenING: A Comprehensive Benchmark for Judging Open-ended Interleaved Image-Text Generation

20 citations

arXiv:2411.18499

#401 in CVPR 2025 of 2873 papers

Top Authors

Pengfei Zhou, Xiaopeng Peng, Jiajun Song, Chuanhao Li, Zhaopan Xu, Yue Yang, Ziyao Guo, Hao Zhang, Yuqi Lin, Yefei He, Lirui Zhao, Shuo Liu, Tianhua Li, Yuxuan Xie, Xiaojun Chang, Yu Qiao, Wenqi Shao, Kaipeng Zhang

Topics

interleaved image-text generation, multimodal large language models, benchmark evaluation, open-ended generation, multimodal understanding, judge model, real-world tasks

Abstract

Multimodal Large Language Models (MLLMs) have made significant strides in visual understanding and generation tasks. However, generating interleaved image-text content remains a challenge because it requires integrated multimodal understanding and generation abilities. While the progress in unified models offers new solutions, existing benchmarks are insufficient for evaluating these methods due to limitations in data size and diversity. To bridge this gap, we introduce OpenING, a comprehensive benchmark comprising 5,400 high-quality human-annotated instances across 56 real-world tasks. OpenING covers diverse daily scenarios such as travel guides, design, and brainstorming, offering a robust platform for challenging interleaved generation methods. In addition, we present IntJudge, a judge model for evaluating open-ended multimodal generation methods. Trained with a novel data pipeline, IntJudge achieves an agreement rate of 82.42% with human judgments, outperforming GPT-based evaluators by 11.34%. Extensive experiments on OpenING reveal that current interleaved generation methods still have substantial room for improvement. Key findings on interleaved image-text generation are further presented to guide the development of next-generation models.

Citation History

[Chart: citation count over time, Jan 26, 2026 to Feb 13, 2026; data labels: 19+1, 20+1]