Can Large Language Models Help Multimodal Language Analysis? MMLA: A Comprehensive Benchmark

2 citations

arXiv:2504.16427

#1951 in NeurIPS 2025 (of 5858 papers)

Top Authors

Hanlei Zhang, Zhuohang Li, Hua Xu, Yeshuang Zhu, Peiwu Wang, Haige Zhu, Jie Zhou, Jinchao Zhang

Topics

multimodal language analysis, large language models, multimodal semantics, instruction tuning, zero-shot inference, supervised fine-tuning, cognitive-level semantics, multimodal utterances

Abstract

Multimodal language analysis is a rapidly evolving field that leverages multiple modalities to enhance the understanding of high-level semantics underlying human conversational utterances. Despite its significance, little research has investigated the capability of multimodal large language models (MLLMs) to comprehend cognitive-level semantics. In this paper, we introduce MMLA, a comprehensive benchmark specifically designed to address this gap. MMLA comprises over 61K multimodal utterances drawn from both staged and real-world scenarios, covering six core dimensions of multimodal semantics: intent, emotion, dialogue act, sentiment, speaking style, and communication behavior. We evaluate eight mainstream branches of LLMs and MLLMs using three methods: zero-shot inference, supervised fine-tuning, and instruction tuning. Extensive experiments reveal that even fine-tuned models achieve only about 60% to 70% accuracy, underscoring the limitations of current MLLMs in understanding complex human language. We believe that MMLA will serve as a solid foundation for exploring the potential of large language models in multimodal language analysis and provide valuable resources to advance this field. The datasets and code are open-sourced at https://github.com/thuiar/MMLA.

Citation History

[Chart: citation count over time, Jan 26, 2026 to Feb 13, 2026]