Open-Vocabulary Video Relation Extraction

AAAI 2024 · 2 citations · ranked #1499 of 2289 papers
Abstract

A comprehensive understanding of videos is inseparable from describing the action with its contextual action-object interactions. However, many current video understanding tasks prioritize general action classification and overlook the actors and relationships that shape the nature of the action, resulting in a superficial understanding of the action. Motivated by this, we introduce Open-vocabulary Video Relation Extraction (OVRE), a novel task that views action understanding through the lens of action-centric relation triplets. OVRE focuses on the pairwise relations that take part in the action and describes these relation triplets in natural language. Moreover, we curate the Moments-OVRE dataset, which comprises 180K videos with action-centric relation triplets, sourced from a multi-label action classification dataset. With Moments-OVRE, we further propose a cross-modal mapping model to generate relation triplets as a sequence. Finally, we benchmark existing cross-modal generation models on the new task of OVRE.
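The abstract describes a cross-modal mapping model that emits relation triplets as a token sequence. The sketch below is not the authors' implementation; it is a minimal illustration, under assumed design choices, of the general idea: a pooled video feature is mapped to a short sequence of prefix embeddings that condition an autoregressive text decoder, which then decodes triplets such as "person hold cup ; person stand_on floor" token by token. All class names, dimensions, prefix length, and the toy vocabulary are illustrative assumptions.

```python
import torch
import torch.nn as nn


class CrossModalTripletGenerator(nn.Module):
    """Hypothetical sketch: map video features to a decoder prefix and
    generate relation triplets as one flat token sequence."""

    def __init__(self, video_dim=768, embed_dim=256, vocab_size=1000,
                 prefix_len=8, num_layers=4, num_heads=8, max_len=64):
        super().__init__()
        self.prefix_len = prefix_len
        # Project one pooled video feature into `prefix_len` embeddings
        # that condition the text decoder (an assumed mapping design).
        self.mapper = nn.Sequential(
            nn.Linear(video_dim, embed_dim * prefix_len),
            nn.Tanh(),
        )
        self.token_embed = nn.Embedding(vocab_size, embed_dim)
        self.pos_embed = nn.Embedding(max_len + prefix_len, embed_dim)
        layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads, batch_first=True)
        # Decoder-only stack: causal self-attention over [prefix; tokens].
        self.decoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.lm_head = nn.Linear(embed_dim, vocab_size)

    def forward(self, video_feat, token_ids):
        # video_feat: (B, video_dim); token_ids: (B, T) triplet tokens so far.
        B = token_ids.size(0)
        prefix = self.mapper(video_feat).view(B, self.prefix_len, -1)
        tokens = self.token_embed(token_ids)
        x = torch.cat([prefix, tokens], dim=1)            # (B, P+T, D)
        pos = torch.arange(x.size(1), device=x.device)
        x = x + self.pos_embed(pos)
        mask = nn.Transformer.generate_square_subsequent_mask(
            x.size(1)).to(x.device)
        h = self.decoder(x, mask=mask)
        return self.lm_head(h[:, self.prefix_len:])       # logits per token

    @torch.no_grad()
    def generate(self, video_feat, bos_id=1, eos_id=2, max_new=32):
        # Greedy decoding of the triplet sequence, one token at a time.
        ids = torch.full((video_feat.size(0), 1), bos_id,
                         dtype=torch.long, device=video_feat.device)
        for _ in range(max_new):
            logits = self.forward(video_feat, ids)
            next_id = logits[:, -1].argmax(dim=-1, keepdim=True)
            ids = torch.cat([ids, next_id], dim=1)
            if (next_id == eos_id).all():
                break
        return ids


if __name__ == "__main__":
    model = CrossModalTripletGenerator()
    dummy_video = torch.randn(2, 768)   # stand-in for pooled video features
    print(model.generate(dummy_video).shape)
```

An untrained model will of course emit arbitrary tokens; the point of the sketch is only the interface, where triplets are linearized into one sequence (e.g. separated by a ";" token) so that an open vocabulary of subjects, predicates, and objects can be decoded without a fixed relation label set.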
