CVPT: Cross Visual Prompt Tuning

2citations

arXiv:2408.14961

citations

#1094

in ICCV 2025

of 2701 papers

Top Authors

Data Points

Top Authors

Lingyun Huang Jianxu Mao Junfei YI Ziming Tao Yaonan Wang

Abstract

Parameter-Efficient Fine-Tuning (PEFT) has emerged to mitigate the computational demands of large-scale models. Within computer vision, adapter-based PEFT methods are often favored over prompt-based approaches like Visual Prompt Tuning (VPT) due to the latter's performance and efficiency limitations. Our analysis reveals that VPT's shortcomings stem from its prompt deployment strategy, which can distort the model's inherent self-attention mechanism. To address this, we propose Cross Visual Prompt Tuning (CVPT). CVPT introduces a cross-attention module to directly model interactions between prompts and image tokens. This design decouples the prompts from the input sequence, preserving the original self-attention integrity while enabling efficient feature integration. Furthermore, we employ a weight-sharing mechanism for cross-attention initialization, which enhances representative capability without a large parameter overhead. Extensive experiments across 25 datasets show that CVPT significantly outperforms VPT. For instance, on the VTAB-1K benchmark, CVPT achieves over 4% higher average accuracy, rivaling leading adapter-based methods in both performance and efficiency. Our work confirms that prompt-based methods can achieve exceptional results in visual fine-tuning. The code is available at https://github.com/Lingyun0419/CVPT

Citation History

Jan 25, 2026

Jan 26, 2026

Jan 28, 2026

Feb 13, 2026

2+2

Feb 13, 2026