"vision-language models" Papers

570 papers found • Page 8 of 12

The Narrow Gate: Localized Image-Text Communication in Native Multimodal Models

Alessandro Serra, Francesco Ortu, Emanuele Panizon et al.

NEURIPS 2025 • arXiv:2412.06646
1 citation

TopV: Compatible Token Pruning with Inference Time Optimization for Fast and Low-Memory Multimodal Vision Language Model

Cheng Yang, Yang Sui, Jinqi Xiao et al.

CVPR 2025 • arXiv:2503.18278
24 citations

Towards Cross-modal Backward-compatible Representation Learning for Vision-Language Models

Young Kyun Jang, Ser-Nam Lim

ICCV 2025 • arXiv:2405.14715
2 citations

Towards Higher Effective Rank in Parameter-Efficient Fine-tuning using Khatri-Rao Product

Paul Albert, Frederic Zhang, Hemanth Saratchandran et al.

ICCV 2025 • arXiv:2508.00230
4 citations

Towards Natural Language-Based Document Image Retrieval: New Dataset and Benchmark

Hao Guo, Xugong Qin, Jun Jie Ou Yang et al.

CVPR 2025 • arXiv:2512.20174
1 citation

Towards Understanding How Knowledge Evolves in Large Vision-Language Models

Sudong Wang, Yunjian Zhang, Yao Zhu et al.

CVPR 2025 • arXiv:2504.02862
3 citations

Training-Free Generation of Temporally Consistent Rewards from VLMs

Yinuo Zhao, Jiale Yuan, Zhiyuan Xu et al.

ICCV 2025 • arXiv:2507.04789
2 citations

Training-Free Test-Time Adaptation via Shape and Style Guidance for Vision-Language Models

Shenglong Zhou, Manjiang Yin, Leiyu Sun et al.

NEURIPS 2025

TRAP: Targeted Redirecting of Agentic Preferences

Hangoo Kang, Jehyeok Yeon, Gagandeep Singh

NEURIPS 2025 • arXiv:2505.23518
3 citations

Tri-MARF: A Tri-Modal Multi-Agent Responsive Framework for Comprehensive 3D Object Annotation

Jusheng Zhang, Yijia Fan, Zimo Wen et al.

NEURIPS 2025

TRoVe: Discovering Error-Inducing Static Feature Biases in Temporal Vision-Language Models

Maya Varma, Jean-Benoit Delbrouck, Sophie Ostmeier et al.

NEURIPS 2025 (oral) • arXiv:2512.01048

TULIP: Token-length Upgraded CLIP

Ivona Najdenkoska, Mohammad Mahdi Derakhshani, Yuki Asano et al.

ICLR 2025 • arXiv:2410.10034
17 citations

UIPro: Unleashing Superior Interaction Capability For GUI Agents

Hongxin Li, Jingran Su, Jingfan Chen et al.

ICCV 2025 • arXiv:2509.17328

Unbiased Region-Language Alignment for Open-Vocabulary Dense Prediction

Yunheng Li, Yuxuan Li, Quan-Sheng Zeng et al.

ICCV 2025 • arXiv:2412.06244
6 citations

Unbiasing through Textual Descriptions: Mitigating Representation Bias in Video Benchmarks

Nina Shvetsova, Arsha Nagrani, Bernt Schiele et al.

CVPR 2025 • arXiv:2503.18637
1 citation

Understanding Co-speech Gestures in-the-wild

Sindhu Hegde, K R Prajwal, Taein Kwon et al.

ICCV 2025 • arXiv:2503.22668
2 citations

Understanding Museum Exhibits using Vision-Language Reasoning

Ada-Astrid Balauca, Sanjana Garai, Stefan Balauca et al.

ICCV 2025 • arXiv:2412.01370
1 citation

Unified Reinforcement and Imitation Learning for Vision-Language Models

Byung-Kwan Lee, Ryo Hachiuma, Yong Man Ro et al.

NEURIPS 2025 • arXiv:2510.19307
2 citations

Unlearning the Noisy Correspondence Makes CLIP More Robust

Haochen Han, Alex Jinpeng Wang, Peijun Ye et al.

ICCV 2025 • arXiv:2507.03434
1 citation

UPRE: Zero-Shot Domain Adaptation for Object Detection via Unified Prompt and Representation Enhancement

Xiao Zhang, Fei Wei, Yong Wang et al.

ICCV 2025 • arXiv:2507.00721

V2PE: Improving Multimodal Long-Context Capability of Vision-Language Models with Variable Visual Position Encoding

Junqi Ge, Ziyi Chen, Jintao Lin et al.

ICCV 2025 • arXiv:2412.09616
17 citations

V2Xum-LLM: Cross-Modal Video Summarization with Temporal Prompt Instruction Tuning

Hang Hua, Yunlong Tang, Chenliang Xu et al.

AAAI 2025 • arXiv:2404.12353
50 citations

VaMP: Variational Multi-Modal Prompt Learning for Vision-Language Models

Silin Cheng, Kai Han

NEURIPS 2025 • arXiv:2511.22664
1 citation

VCA: Video Curious Agent for Long Video Understanding

Zeyuan Yang, Delin Chen, Xueyang Yu et al.

ICCV 2025 • arXiv:2412.10471
31 citations

VCM: Vision Concept Modeling with Adaptive Vision Token Compression via Instruction Fine-Tuning

Run Luo, Renke Shan, Longze Chen et al.

NEURIPS 2025

VDocRAG: Retrieval-Augmented Generation over Visually-Rich Documents

Ryota Tanaka, Taichi Iki, Taku Hasegawa et al.

CVPR 2025 • arXiv:2504.09795
27 citations

VERA: Explainable Video Anomaly Detection via Verbalized Learning of Vision-Language Models

Muchao Ye, Weiyang Liu, Pan He

CVPR 2025 • arXiv:2412.01095
10 citations

Verbalized Representation Learning for Interpretable Few-Shot Generalization

Cheng-Fu Yang, Da Yin, Wenbo Hu et al.

ICCV 2025 • arXiv:2411.18651
1 citation

VideoAuteur: Towards Long Narrative Video Generation

Junfei Xiao, Feng Cheng, Lu Qi et al.

ICCV 2025 • arXiv:2501.06173
9 citations

VideoGameQA-Bench: Evaluating Vision-Language Models for Video Game Quality Assurance

Mohammad Reza Taesiri, Abhijay Ghildyal, Saman Zadtootaghaj et al.

NEURIPS 2025 • arXiv:2505.15952
4 citations

VideoGEM: Training-free Action Grounding in Videos

Felix Vogel, Walid Bousselham, Anna Kukleva et al.

CVPR 2025 • arXiv:2503.20348

VideoHallu: Evaluating and Mitigating Multi-modal Hallucinations on Synthetic Video Understanding

Zongxia Li, Xiyang Wu, Guangyao Shi et al.

NEURIPS 2025 • arXiv:2505.01481
15 citations

VIKI‑R: Coordinating Embodied Multi-Agent Cooperation via Reinforcement Learning

Li Kang, Xiufeng Song, Heng Zhou et al.

NEURIPS 2025 • arXiv:2506.09049
9 citations

ViLU: Learning Vision-Language Uncertainties for Failure Prediction

Marc Lafon, Yannis Karmim, Julio Silva-Rodríguez et al.

ICCV 2025 • arXiv:2507.07620
2 citations

VISCO: Benchmarking Fine-Grained Critique and Correction Towards Self-Improvement in Visual Reasoning

Xueqing Wu, Yuheng Ding, Bingxuan Li et al.

CVPR 2025 • arXiv:2412.02172
13 citations

VisionArena: 230k Real World User-VLM Conversations with Preference Labels

Christopher Chou, Lisa Dunlap, Wei-Lin Chiang et al.

CVPR 2025 • arXiv:2412.08687
15 citations

Vision-aware Multimodal Prompt Tuning for Uploadable Multi-source Few-shot Domain Adaptation

Kuanghong Liu, Jin Wang, Kangjian He et al.

AAAI 2025 • arXiv:2503.06106
2 citations

Vision-centric Token Compression in Large Language Model

Ling Xing, Alex Jinpeng Wang, Rui Yan et al.

NEURIPS 2025 (spotlight) • arXiv:2502.00791
11 citations

Vision-Language Model IP Protection via Prompt-based Learning

Lianyu Wang, Meng Wang, Huazhu Fu et al.

CVPR 2025 • arXiv:2503.02393

Vision-Language Models Can't See the Obvious

Yasser Abdelaziz Dahou Djilali, Ngoc Huynh, Phúc Lê Khắc et al.

ICCV 2025 • arXiv:2507.04741
7 citations

Vision-Language Models Do Not Understand Negation

Kumail Alhamoud, Shaden Alshammari, Yonglong Tian et al.

CVPR 2025 • arXiv:2501.09425
38 citations

Vision‑Language‑Vision Auto‑Encoder: Scalable Knowledge Distillation from Diffusion Models

Tiezheng Zhang, Yitong Li, Yu-Cheng Chou et al.

NEURIPS 2025 • arXiv:2507.07104
2 citations

Vision Transformers Don't Need Trained Registers

Nicholas Jiang, Amil Dravid, Alexei Efros et al.

NEURIPS 2025 (spotlight) • arXiv:2506.08010
15 citations

VisOnlyQA: Large Vision Language Models Still Struggle with Visual Perception of Geometric Information

Ryo Kamoi, Yusen Zhang, Sarkar Snigdha Sarathi Das et al.

COLM 2025 • arXiv:2412.00947
22 citations

ViSpec: Accelerating Vision-Language Models with Vision-Aware Speculative Decoding

Jialiang Kang, Han Shu, Wenshuo Li et al.

NEURIPS 2025 • arXiv:2509.15235
2 citations

VisRAG: Vision-based Retrieval-augmented Generation on Multi-modality Documents

Shi Yu, Chaoyue Tang, Bokai Xu et al.

ICLR 2025 • arXiv:2410.10594
127 citations

Visual Description Grounding Reduces Hallucinations and Boosts Reasoning in LVLMs

Sreyan Ghosh, Chandra Kiran Evuru, Sonal Kumar et al.

ICLR 2025 • arXiv:2405.15683
17 citations

Visual-O1: Understanding Ambiguous Instructions via Multi-modal Multi-turn Chain-of-thoughts Reasoning

Minheng Ni, YuTao Fan, Lei Zhang et al.

ICLR 2025 • arXiv:2410.03321
20 citations

Visual Persona: Foundation Model for Full-Body Human Customization

Jisu Nam, Soowon Son, Zhan Xu et al.

CVPR 2025 • arXiv:2503.15406
6 citations

Visual-RFT: Visual Reinforcement Fine-Tuning

Ziyu Liu, Zeyi Sun, Yuhang Zang et al.

ICCV 2025 • arXiv:2503.01785
357 citations