The Audio-Visual Conversational Graph: From an Egocentric-Exocentric Perspective

arXiv:2312.12870
16 citations
#1438 of 2716 papers in CVPR 2024

Abstract

In recent years, the thriving development of research on egocentric videos has provided a unique perspective for the study of conversational interactions, where both visual and audio signals play a crucial role. While most prior work focuses on behaviors that directly involve the camera wearer, we introduce the Ego-Exocentric Conversational Graph Prediction problem, marking the first attempt to infer exocentric conversational interactions from egocentric videos. We propose a unified multi-modal framework -- Audio-Visual Conversational Attention (AV-CONV) -- for the joint prediction of conversation behaviors -- speaking and listening -- for both the camera wearer and all other social partners present in the egocentric video. Specifically, we adopt the self-attention mechanism to model representations across time, across subjects, and across modalities. To validate our method, we conduct experiments on a challenging egocentric video dataset that includes multi-speaker and multi-conversation scenarios. Our results demonstrate the superior performance of our method compared to a series of baselines. We also present detailed ablation studies to assess the contribution of each component in our model. Check our project page at https://vjwq.github.io/AV-CONV/.
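To make the abstract's "self-attention across time, subjects, and modalities" concrete, here is a minimal sketch of that idea in PyTorch. This is not the authors' AV-CONV implementation; the feature extractors, token layout, pooling, and prediction head below are assumptions chosen only to illustrate joint attention over (time, subject, modality) tokens.

import torch
import torch.nn as nn


class ConversationalAttentionSketch(nn.Module):
    """Illustrative only: joint self-attention over time x subject x modality tokens."""

    def __init__(self, dim=256, num_heads=4, num_layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        # One logit each for "speaking" and "listening" per subject (assumed head).
        self.head = nn.Linear(dim, 2)

    def forward(self, visual_tokens, audio_tokens):
        # visual_tokens, audio_tokens: (batch, time, subjects, dim),
        # produced by some per-subject audio/visual backbones (not shown).
        b, t, s, d = visual_tokens.shape
        # Stack the two modalities, then flatten (time, subject, modality) into a
        # single token sequence so self-attention can relate any pair of tokens.
        tokens = torch.stack([visual_tokens, audio_tokens], dim=3)  # (b, t, s, 2, d)
        tokens = tokens.reshape(b, t * s * 2, d)
        encoded = self.encoder(tokens)
        # Pool back to one vector per subject and predict the two behaviors.
        encoded = encoded.reshape(b, t, s, 2, d).mean(dim=(1, 3))   # (b, s, d)
        return self.head(encoded)                                   # (b, s, 2)


# Toy usage with random features standing in for real audio-visual inputs:
# 8 frames, 4 subjects (including the camera wearer), 256-dim features.
model = ConversationalAttentionSketch()
v = torch.randn(1, 8, 4, 256)
a = torch.randn(1, 8, 4, 256)
print(model(v, a).shape)  # torch.Size([1, 4, 2])

The single flattened token sequence is one simple way to realize cross-time, cross-subject, and cross-modality attention in one encoder; the paper's actual model may factorize these attention steps differently.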

Citation History

Jan 27, 2026: 15
Feb 13, 2026: 16