Adaptive Prompt-Based Semantic Embedding with Inspired Potential of Implicit Knowledge for Cross-Modal Retrieval

AAAI 2025
Abstract

In the era of big data, cross-modal retrieval is increasingly important in both research and application. Given the latent complexity and non-intuitive nature of cross-modal relationships, leveraging external knowledge, such as large pre-trained models, has become a popular way to facilitate modality alignment. Existing methods typically address these challenges by fine-tuning model encoders or using a fixed number of prompts. However, such approaches struggle with the significant information asymmetry between image-text pairs and the high distribution diversity of image data; these limitations not only introduce noise during training but also reduce accuracy and generalization in cross-modal retrieval tasks. To address these issues, this paper proposes Adaptive Prompt-Based Semantic Embedding with Inspired Potential of Implicit Knowledge (APSE-IPIK). On one hand, we propose an inspired-potential strategy that extracts fine-grained, multi-perspective text descriptions from large-scale pre-trained multimodal models, which can be seen as implicit knowledge injection. These descriptions are integrated into the visual-semantic embedding through cross-modal semantic alignment with images, balancing the information asymmetry between modalities and reducing the embedding of inaccurate mapping relationships. On the other hand, we construct an instance-level, query-based prompt pool that adaptively selects the most relevant prompts for each input, addressing alignment biases caused by intra-modal (especially image) data diversity and improving alignment accuracy. Extensive experiments on two widely used datasets, Flickr30k and MSCOCO, demonstrate the effectiveness of the proposed method.
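The abstract describes two mechanisms in algorithmic terms: a query-based prompt pool that adaptively selects the most relevant prompts per instance, and a cross-modal semantic alignment between generated descriptions and images. The sketch below is a minimal illustration of how such a pipeline is commonly realized, assuming an L2P-style prompt pool keyed by a frozen image feature and a symmetric InfoNCE loss standing in for the alignment term. All identifiers (PromptPool, pool_size, top_k, alignment_loss, etc.) are illustrative assumptions, not names from the paper.

```python
# Hypothetical sketch of an instance-level query-based prompt pool plus a
# description-image alignment loss; identifiers are illustrative, not from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PromptPool(nn.Module):
    def __init__(self, pool_size=20, prompt_len=4, dim=512, top_k=5):
        super().__init__()
        # Learnable keys used to match each instance's query embedding.
        self.keys = nn.Parameter(torch.randn(pool_size, dim))
        # Learnable prompt tokens: one length-`prompt_len` sequence per key.
        self.prompts = nn.Parameter(torch.randn(pool_size, prompt_len, dim))
        self.top_k = top_k

    def forward(self, query):
        # query: (B, dim) instance-level feature, e.g. from a frozen image encoder.
        q = F.normalize(query, dim=-1)
        k = F.normalize(self.keys, dim=-1)
        sim = q @ k.t()                          # (B, pool_size) cosine similarity
        topk = sim.topk(self.top_k, dim=-1)      # most relevant prompts per instance
        selected = self.prompts[topk.indices]    # (B, top_k, prompt_len, dim)
        batch = query.shape[0]
        # Flatten to a token sequence that can be prepended to the encoder input.
        selected = selected.reshape(batch, -1, selected.shape[-1])
        # Pull matched keys toward the queries that selected them.
        match_loss = (1.0 - topk.values).mean()
        return selected, match_loss

def alignment_loss(img_emb, txt_emb, temperature=0.07):
    # Symmetric InfoNCE between image embeddings and embeddings of the
    # generated descriptions: a standard stand-in for cross-modal alignment.
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature
    labels = torch.arange(logits.shape[0], device=logits.device)
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

# Usage: select prompts per image and prepend them to the encoder's tokens.
pool = PromptPool()
img_feat = torch.randn(8, 512)              # stand-in for frozen encoder features
prompt_tokens, match_loss = pool(img_feat)  # (8, top_k * prompt_len, 512)
```

In this formulation the match loss makes each key specialize to a region of the image distribution, so different instances retrieve different prompts, which is one plausible way to counter the intra-modal diversity the abstract highlights.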
