OmniMMI: A Comprehensive Multi-modal Interaction Benchmark in Streaming Video Contexts

10 citations

arXiv:2503.22952

#822 of 2873 papers in CVPR 2025

Top Authors

Yuxuan Wang, Yueqian Wang, Bo Chen, Tong Wu, Dongyan Zhao, Zilong Zheng

Abstract

The rapid advancement of multi-modal language models (MLLMs) such as GPT-4o has propelled the development of Omni language models (OmniLLMs), which are designed to process and proactively respond to continuous streams of multi-modal data. Despite their potential, evaluating their real-world interactive capabilities in streaming video contexts remains a formidable challenge. In this work, we introduce OmniMMI, a comprehensive multi-modal interaction benchmark tailored for OmniLLMs in streaming video contexts. OmniMMI encompasses over 1,121 videos and 2,290 questions across six distinct subtasks, addressing two critical yet underexplored challenges in existing video benchmarks: streaming video understanding and proactive reasoning. Moreover, we propose a novel framework, Multi-modal Multiplexing Modeling (M4), designed to enable an inference-efficient streaming model that can see and listen while generating.

Citation History

[Chart: citation count over time, Jan 25, 2026 to Feb 13, 2026]