"vision-language models" Papers
570 papers found • Page 2 of 12
Class Distribution-induced Attention Map for Open-vocabulary Semantic Segmentations
Dong Un Kang, Hayeon Kim, Se Young Chun
CogCoM: A Visual Language Model with Chain-of-Manipulations Reasoning
Ji Qi, Ming Ding, Weihan Wang et al.
Collaborative Instance Object Navigation: Leveraging Uncertainty-Awareness to Minimize Human-Agent Dialogues
Francesco Taioli, Edoardo Zorzi, Gianni Franchi et al.
Compositional Entailment Learning for Hyperbolic Vision-Language Models
Avik Pal, Max van Spengler, Guido D'Amely di Melendugno et al.
CoMT: A Novel Benchmark for Chain of Multi-modal Thought on Large Vision-Language Models
Zihui Cheng, Qiguang Chen, Jin Zhang et al.
Conditional Representation Learning for Customized Tasks
Honglin Liu, Chao Sun, Peng Hu et al.
Context-Aware Academic Emotion Dataset and Benchmark
Luming Zhao, Jingwen Xuan, Jiamin Lou et al.
Context-Aware Multimodal Pretraining
Karsten Roth, Zeynep Akata, Dima Damen et al.
COSMIC: Clique-Oriented Semantic Multi-space Integration for Robust CLIP Test-Time Adaptation
Fanding Huang, Jingyan Jiang, Qinting Jiang et al.
Critic-V: VLM Critics Help Catch VLM Errors in Multimodal Reasoning
Di Zhang, Jingdi Lei, Junxian Li et al.
Cropper: Vision-Language Model for Image Cropping through In-Context Learning
Seung Hyun Lee, Jijun Jiang, Yiran Xu et al.
Cross-Lingual Text-Rich Visual Comprehension: An Information Theory Perspective
Xinmiao Yu, Xiaocheng Feng, Yun Li et al.
Cross-Modal and Uncertainty-Aware Agglomeration for Open-Vocabulary 3D Scene Understanding
Jinlong Li, Cristiano Saltori, Fabio Poiesi et al.
Cross-modal Associations in Vision and Language Models: Revisiting the Bouba-Kiki Effect
Tom Kouwenhoven, Kiana Shahrasbi, Tessa Verhoef
Cross the Gap: Exposing the Intra-modal Misalignment in CLIP via Modality Inversion
Marco Mistretta, Alberto Baldrati, Lorenzo Agnolucci et al.
CrypticBio: A Large Multimodal Dataset for Visually Confusing Species
Georgiana Manolache, Gerard Schouten, Joaquin Vanschoren
CuRe: Cultural Gaps in the Long Tail of Text-to-Image Systems
Aniket Rege, Zinnia Nie, Unmesh Raskar et al.
CURV: Coherent Uncertainty-Aware Reasoning in Vision-Language Models for X-Ray Report Generation
Ziao Wang, Sixing Yan, Kejing Yin et al.
CVLUE: A New Benchmark Dataset for Chinese Vision-Language Understanding Evaluation
Yuxuan Wang, Yijun Liu, Fei Yu et al.
CXReasonBench: A Benchmark for Evaluating Structured Diagnostic Reasoning in Chest X-rays
Hyungyung Lee, Geon Choi, Jung-Oh Lee et al.
Cyclic Contrastive Knowledge Transfer for Open-Vocabulary Object Detection
Chuhan Zhang, Chaoyang Zhu, Pingcheng Dong et al.
DAMO: Decoding by Accumulating Activations Momentum for Mitigating Hallucinations in Vision-Language Models
Kaishen Wang, Hengrui Gu, Meijun Gao et al.
Dataset Distillation via Vision-Language Category Prototype
Yawen Zou, Guang Li, Duo Su et al.
DeCLIP: Decoupled Learning for Open-Vocabulary Dense Perception
Junjie Wang, Bin Chen, Yulin Li et al.
Describe Anything: Detailed Localized Image and Video Captioning
Long Lian, Yifan Ding, Yunhao Ge et al.
Detecting Visual Information Manipulation Attacks in Augmented Reality: A Multimodal Semantic Reasoning Approach
Yanming Xiu, Maria Gorlatova
DictAS: A Framework for Class-Generalizable Few-Shot Anomaly Segmentation via Dictionary Lookup
Zhen Qu, Xian Tao, Xinyi Gong et al.
Diffusion Model as a Noise-Aware Latent Reward Model for Step-Level Preference Optimization
Tao Zhang, Cheng Da, Kun Ding et al.
Disentanglement Beyond Static vs. Dynamic: A Benchmark and Evaluation Framework for Multi-Factor Sequential Representations
Tal Barami, Nimrod Berman, Ilan Naiman et al.
Distilling Spectral Graph for Object-Context Aware Open-Vocabulary Semantic Segmentation
Chanyoung Kim, Dayun Ju, Woojung Han et al.
DistinctAD: Distinctive Audio Description Generation in Contexts
Bo Fang, Wenhao Wu, Qiangqiang Wu et al.
Divergence-enhanced Knowledge-guided Context Optimization for Visual-Language Prompt Tuning
Yilun Li, Miaomiao Cheng, Xu Han et al.
DocVLM: Make Your VLM an Efficient Reader
Mor Shpigel Nacson, Aviad Aberdam, Roy Ganz et al.
Do LVLMs Truly Understand Video Anomalies? Revealing Hallucination via Co-Occurrence Patterns
Menghao Zhang, Huazheng Wang, Pengfei Ren et al.
Do Vision & Language Decoders use Images and Text equally? How Self-consistent are their Explanations?
Letitia Parcalabescu, Anette Frank
Do Vision-Language Models Represent Space and How? Evaluating Spatial Frame of Reference under Ambiguities
Zheyuan Zhang, Fengyuan Hu, Jayjun Lee et al.
DrVD-Bench: Do Vision-Language Models Reason Like Human Doctors in Medical Image Diagnosis?
Tianhong Zhou, Xu Yin, Yingtao Zhu et al.
DS-VLM: Diffusion Supervision Vision Language Model
Zhen Sun, Yunhang Shen, Jie Li et al.
DualCnst: Enhancing Zero-Shot Out-of-Distribution Detection via Text-Image Consistency in Vision-Language Models
Fayi Le, Wenwu He, Chentao Cao et al.
Dual-Process Image Generation
Grace Luo, Jonathan Granskog, Aleksander Holynski et al.
Dual-Stage Value-Guided Inference with Margin-Based Reward Adjustment for Fast and Faithful VLM Captioning
Ankan Deria, Adinath Dukre, Feilong Tang et al.
DyMU: Dynamic Merging and Virtual Unmerging for Efficient Variable-Length VLMs
Zhenhailong Wang, Senthil Purushwalkam, Caiming Xiong et al.
Dynamic Group Detection using VLM-augmented Temporal Groupness Graph
Kaname Yokoyama, Chihiro Nakatani, Norimichi Ukita
Dynamic Multimodal Evaluation with Flexible Complexity by Vision-Language Bootstrapping
Yue Yang, Shuibo Zhang, Kaipeng Zhang et al.
EA3D: Online Open-World 3D Object Extraction from Streaming Videos
Xiaoyu Zhou, Jingqi Wang, Yuang Jia et al.
Each Complexity Deserves a Pruning Policy
Hanshi Wang, Yuhao Xu, Zekun Xu et al.
ECBench: Can Multi-modal Foundation Models Understand the Egocentric World? A Holistic Embodied Cognition Benchmark
Ronghao Dang, Yuqian Yuan, Wenqi Zhang et al.
EmoNet-Face: An Expert-Annotated Benchmark for Synthetic Emotion Recognition
Christoph Schuhmann, Robert Kaczmarczyk, Gollam Rabby et al.
EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions
Kai Chen, Yunhao Gou, Runhui Huang et al.
Enhanced OoD Detection through Cross-Modal Alignment of Multi-Modal Representations
Jeonghyeon Kim, Sangheum Hwang