Learning Dynamic Similarity by Bidirectional Hierarchical Sliding Semantic Probe for Efficient Text Video Retrieval

3 citations
#1222 of 3028 papers in AAAI 2025

Abstract

Text-video retrieval is a foundational task in multi-modal research that aims to align texts and videos in the embedding space. The key challenge is to learn the similarity between videos and texts. A conventional approach directly aligns video-text pairs using cosine similarity. However, because videos and texts convey unequal amounts of information (a single video can be described from multiple perspectives), retrieval accuracy is suboptimal. An alternative approach employs cross-modal interaction so that videos can dynamically acquire distinct features from different texts, which facilitates similarity calculation. Nevertheless, this solution incurs a computational complexity of O(n^2) during retrieval. To address this, this paper proposes a novel method called Bidirectional Hierarchical Sliding Semantic Probe (BiHSSP), which calculates dynamic similarity between videos and texts with O(n) complexity during retrieval. We introduce a hierarchical semantic probe module that learns semantic probes at different scales for both video and text features. Semantic probing is a sliding calculation of the cross-correlation between semantic probes at different scales and embeddings from the other modality, which allows dynamic similarity computation between a video and text descriptions written from various perspectives. Specifically, for text descriptions from different angles, we calculate the similarity at different locations within the video features, and vice versa. This approach preserves the complete information of the video and addresses the unequal information between video and text without requiring cross-modal interaction. Additionally, our method can function as a plug-and-play module in various existing methods, enhancing their performance. Experimental results demonstrate that our BiHSSP significantly outperforms the baseline.
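The sliding cross-correlation described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the averaging over the probe length, and the max-based aggregation across positions and scales are all assumptions made for illustration, and the probes are taken as given arrays rather than learned. A probe of length k is slid along a sequence of T embeddings from the other modality, producing one score per position.

```python
import numpy as np

def sliding_cross_correlation(probe, features):
    """Slide a (k, D) probe along (T, D) features; return T - k + 1 scores.

    Hypothetical helper: dividing by k is an assumed normalization.
    """
    k, _ = probe.shape
    # windows has shape (T - k + 1, D, k): one k-step window per start position.
    windows = np.lib.stride_tricks.sliding_window_view(features, k, axis=0)
    # Contract over the feature dimension and the window offset.
    return np.einsum("tdk,kd->t", windows, probe) / k

def dynamic_similarity(probes, features):
    """Assumed aggregation: best-matching position over all probe scales."""
    return max(sliding_cross_correlation(p, features).max() for p in probes)

# Toy example: six unit-norm "frame" embeddings in four dimensions.
features = np.eye(4)[[0, 1, 2, 3, 0, 1]]
# A scale-2 probe that matches frames 2 and 3 exactly.
probe = features[2:4]
scores = sliding_cross_correlation(probe, features)
```

In this toy case the score peaks at the position where the probe matches the underlying frames. Each probe costs one pass over the T positions of a single candidate and needs no joint text-video encoding, which is consistent with the abstract's claim that cross-modal interaction is avoided at retrieval time.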

Citation History

Jan 27, 2026: 3
Feb 7, 2026: 3