Counterfactual Debiasing for Physical Audiovisual Commonsense Reasoning
Abstract
Physical commonsense is an essential aspect of human cognition, involving an intuitive understanding of the physical properties and interactions of everyday objects and materials. Though physical commonsense reasoning should inherently be a multisensory task, integrating both video and audio signals, existing physical audiovisual commonsense reasoning (PACR) models predominantly rely on visual information. This reliance leads to spurious correlations and undermines the models’ reasoning and generalization abilities. To counteract this, we introduce a model-agnostic Counterfactual Physical Audiovisual Commonsense Reasoning (CF-PACR) framework aimed at mitigating visual bias-induced spurious effects. Specifically, we construct a traditional PACR model using both audio and visual information as the factual reasoning model. Subsequently, in the counterfactual reasoning model, we isolate visual information to estimate direct effects. Finally, we subtract the direct effects from the total effects across modalities to derive indirect effects, thereby mitigating visual biases. Extensive experiments validate the effectiveness and generalizability of CF-PACR in alleviating the spurious correlations between visual modality and model predictions.