Disentangled Concepts Speak Louder Than Words: Explainable Video Action Recognition

0 citations

arXiv:2511.03725 Project

#3347 of 5858 papers in NeurIPS 2025

Top Authors

Jongseo Lee, Wooil Lee, Gyeong-Moon Park, Seong Tae Kim, Jinwoo Choi

Topics

video action recognition, explainable AI, concept bottleneck models, motion dynamics disentanglement, human pose sequences, large language models, model debugging, failure analysis

Abstract

Effective explanations of video action recognition models should disentangle how movements unfold over time from the surrounding spatial context. However, existing methods based on saliency produce entangled explanations, making it unclear whether predictions rely on motion or spatial context. Language-based approaches offer structure but often fail to explain motions because motion is tacit: intuitively understood but difficult to verbalize. To address these challenges, we propose Disentangled Action aNd Context concept-based Explainable (DANCE) video action recognition, a framework that predicts actions through disentangled concept types: motion dynamics, objects, and scenes. We define motion dynamics concepts as human pose sequences. We employ a large language model to automatically extract object and scene concepts. Built on an ante-hoc concept bottleneck design, DANCE enforces prediction through these concepts. Experiments on four datasets (KTH, Penn Action, HAA500, and UCF-101) demonstrate that DANCE significantly improves explanation clarity while maintaining competitive performance. We validate the superior interpretability of DANCE through a user study. Experimental results also show that DANCE is beneficial for model debugging, editing, and failure analysis.
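
To make the concept bottleneck idea concrete, the following is a minimal sketch of a prediction that is forced to pass through three groups of concept scores (motion dynamics, objects, scenes) and that reports how much each group contributed. It is not the DANCE implementation: the concept names, the action classes, the weights, and the purely linear scoring are all assumptions chosen for illustration.

```python
# Hypothetical concept-bottleneck sketch; names and weights are illustrative
# assumptions, not taken from the DANCE paper or its code.
from typing import Dict, List, Tuple

# Assumed concept vocabulary, grouped by the three concept types in the abstract.
CONCEPTS: Dict[str, List[str]] = {
    "motion": ["arm_swing_pose_seq", "leg_kick_pose_seq"],
    "object": ["tennis_racket", "soccer_ball"],
    "scene": ["tennis_court", "grass_field"],
}

# Assumed linear weights from concepts to action classes (missing entries are 0).
WEIGHTS: Dict[str, Dict[str, float]] = {
    "tennis_serve": {"arm_swing_pose_seq": 2.0, "tennis_racket": 1.0, "tennis_court": 0.5},
    "soccer_kick": {"leg_kick_pose_seq": 2.0, "soccer_ball": 1.0, "grass_field": 0.5},
}


def contributions(scores: Dict[str, float], action: str) -> Dict[str, float]:
    """Sum weight * concept activation separately for each concept type."""
    weights = WEIGHTS[action]
    return {
        ctype: sum(weights.get(name, 0.0) * scores.get(name, 0.0) for name in names)
        for ctype, names in CONCEPTS.items()
    }


def predict(scores: Dict[str, float]) -> Tuple[str, Dict[str, float]]:
    """Pick the action with the highest logit, using only concept scores."""
    logits = {action: sum(contributions(scores, action).values()) for action in WEIGHTS}
    best = max(logits, key=logits.get)
    # The per-type breakdown serves as the explanation for the prediction.
    return best, contributions(scores, best)


if __name__ == "__main__":
    # Made-up concept activations for one video clip.
    clip_scores = {
        "arm_swing_pose_seq": 0.9,
        "tennis_racket": 0.8,
        "tennis_court": 0.1,
        "grass_field": 0.6,
    }
    action, explanation = predict(clip_scores)
    print(action, explanation)
```

Because the classifier sees only concept scores, the per-type breakdown shows directly whether a prediction leaned on motion or on spatial context. In this toy clip the motion group dominates even though the scene evidence points elsewhere.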
