Zero-Shot Image Captioning with Multi-type Entity Representations

AAAI 2025

Abstract

As data and computational resources continue to expand, incorporating diverse knowledge during pre-training endows large models with strong zero-shot capabilities. Because vision-language models align features across modalities, zero-shot image captioning no longer requires pre-training on paired image-text data, enabling accurate descriptions of previously unseen images. Recent work uses retrieved entities as anchors to bridge the gap between modalities, but these approaches rarely analyze how entity retrieval recall affects zero-shot generation quality. To address this issue, we propose MERCap, a zero-shot image captioning method based on Multi-type Entity representation Retrieval. Specifically, we first approximate the image representation with the CLIP text representation plus Gaussian noise to narrow the modality gap. We then train a GPT-2 decoder to reconstruct text, using entities as hard prompts and CLIP representations as soft prompts. Additionally, we construct a domain-specific entity set, assign multiple representations to each entity, and refine these representation vectors through contrastive learning. At inference time, we retrieve entities and feed them to the decoder to generate the corresponding captions. Extensive experiments show that our approach is efficient, setting a new state of the art in cross-domain captioning and remaining highly competitive with existing methods in in-domain captioning.
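The abstract describes two mechanisms: approximating image embeddings by adding Gaussian noise to CLIP text embeddings, and feeding the decoder a soft prefix (projected CLIP vector) alongside hard entity prompts. The sketch below illustrates both steps in PyTorch under stated assumptions; the encoder stand-in, the `noise_std` value, the hidden sizes, and all function names are illustrative, not the authors' released code.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the frozen CLIP text tower; in the paper this
# would be the real CLIP text encoder producing (e.g.) 512-d embeddings.
clip_text_encoder = nn.Linear(768, 512)

def noisy_text_embedding(token_features: torch.Tensor,
                         noise_std: float = 0.016) -> torch.Tensor:
    """Approximate an image embedding from a text embedding by injecting
    Gaussian noise, a common trick for bridging the CLIP modality gap.
    noise_std is an assumed hyperparameter, not the paper's value."""
    text_emb = clip_text_encoder(token_features)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    noisy = text_emb + torch.randn_like(text_emb) * noise_std
    return noisy / noisy.norm(dim=-1, keepdim=True)

# Maps the CLIP vector into the GPT-2 embedding space (768-d for gpt2-base),
# yielding the soft prompt that prefixes the decoder input.
soft_proj = nn.Linear(512, 768)

def build_decoder_inputs(clip_vec: torch.Tensor,
                         entity_token_embeds: torch.Tensor,
                         caption_embeds: torch.Tensor) -> torch.Tensor:
    """Concatenate [soft prefix | hard entity prompts | caption tokens]
    as the decoder's input embedding sequence."""
    prefix = soft_proj(clip_vec).unsqueeze(1)          # (B, 1, 768)
    return torch.cat([prefix, entity_token_embeds, caption_embeds], dim=1)

# Toy usage with dummy tensors (batch of 2, 3 entity tokens, 10 caption tokens).
tokens = torch.randn(2, 768)
clip_vec = noisy_text_embedding(tokens)                # (2, 512)
inputs = build_decoder_inputs(clip_vec,
                              torch.randn(2, 3, 768),
                              torch.randn(2, 10, 768)) # (2, 14, 768)
```

At training time the noisy text embedding stands in for the image embedding, so no paired image-text data is needed; at inference the real CLIP image embedding would take its place, with retrieved entities filling the hard-prompt slots.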
