"visual grounding" Papers

54 papers found • Page 1 of 2

Acknowledging Focus Ambiguity in Visual Questions

Chongyan Chen, Yu-Yun Tseng, Zhuoheng Li et al.

ICCV 2025 · arXiv:2501.02201
4 citations

AerialVG: A Challenging Benchmark for Aerial Visual Grounding by Exploring Positional Relations

Junli Liu, Qizhi Chen, Zhigang Wang et al.

ICCV 2025 · arXiv:2504.07836
7 citations

Are VLMs Ready for Autonomous Driving? An Empirical Study from the Reliability, Data and Metric Perspectives

Shaoyuan Xie, Lingdong Kong, Yuhao Dong et al.

ICCV 2025 · arXiv:2501.04003
74 citations

Bootstrapping Grounded Chain-of-Thought in Multimodal LLMs for Data-Efficient Model Adaptation

Jiaer Xia, Bingkui Tong, Yuhang Zang et al.

ICCV 2025 (highlight) · arXiv:2507.02859
3 citations

ChartSketcher: Reasoning with Multimodal Feedback and Reflection for Chart Understanding

Muye Huang, Lingling Zhang, Jie Ma et al.

NEURIPS 2025 · arXiv:2505.19076
5 citations

Controlling Multimodal LLMs via Reward-guided Decoding

Oscar Mañas, Pierluca D'Oro, Koustuv Sinha et al.

ICCV 2025 · arXiv:2508.11616

COUNTS: Benchmarking Object Detectors and Multimodal Large Language Models under Distribution Shifts

Jiansheng Li, Xingxuan Zhang, Hao Zou et al.

CVPR 2025 (highlight) · arXiv:2504.10158
1 citation

CVLUE: A New Benchmark Dataset for Chinese Vision-Language Understanding Evaluation

Yuxuan Wang, Yijun Liu, Fei Yu et al.

AAAI 2025 · arXiv:2407.01081
7 citations

DAMO: Decoding by Accumulating Activations Momentum for Mitigating Hallucinations in Vision-Language Models

Kaishen Wang, Hengrui Gu, Meijun Gao et al.

ICLR 2025
7 citations

Dual-Stage Value-Guided Inference with Margin-Based Reward Adjustment for Fast and Faithful VLM Captioning

Ankan Deria, Adinath Dukre, Feilong Tang et al.

NEURIPS 2025 (oral) · arXiv:2506.15649

ESCA: Contextualizing Embodied Agents via Scene-Graph Generation

Jiani Huang, Amish Sethi, Matthew Kuo et al.

NEURIPS 2025 (oral) · arXiv:2510.15963

F-LMM: Grounding Frozen Large Multimodal Models

Size Wu, Sheng Jin, Wenwei Zhang et al.

CVPR 2025 · arXiv:2406.05821
22 citations

GroundFlow: A Plug-in Module for Temporal Reasoning on 3D Point Cloud Sequential Grounding

Zijun Lin, Shuting He, Cheston Tan et al.

ICCV 2025 · arXiv:2506.21188
2 citations

Grounding Language with Vision: A Conditional Mutual Information Calibrated Decoding Strategy for Reducing Hallucinations in LVLMs

Hao Fang, Changle Zhou, Jiawei Kong et al.

NEURIPS 2025 · arXiv:2505.19678
9 citations

Latent Expression Generation for Referring Image Segmentation and Grounding

Seonghoon Yu, Junbeom Hong, Joonseok Lee et al.

ICCV 2025 · arXiv:2508.05123
1 citation

MLLMs Need 3D-Aware Representation Supervision for Scene Understanding

Xiaohu Huang, Jingjing Wu, Qunyi Xie et al.

NEURIPS 2025 · arXiv:2506.01946
22 citations

Move to Understand a 3D Scene: Bridging Visual Grounding and Exploration for Efficient and Versatile Embodied Navigation

Ziyu Zhu, Xilin Wang, Yixuan Li et al.

ICCV 2025 (highlight) · arXiv:2507.04047
28 citations

PerturboLLaVA: Reducing Multimodal Hallucinations with Perturbative Visual Training

Cong Chen, Mingyu Liu, Chenchen Jing et al.

ICLR 2025 · arXiv:2503.06486
29 citations

Point-RFT: Improving Multimodal Reasoning with Visually Grounded Reinforcement Finetuning

Minheng Ni, Zhengyuan Yang, Linjie Li et al.

NEURIPS 2025 · arXiv:2505.19702
13 citations

Progressive Multi-granular Alignments for Grounded Reasoning in Large Vision-Language Models

Quang-Hung Le, Long Hoang Dang, Ngan Hoang Le et al.

AAAI 2025 · arXiv:2412.08125
3 citations

PropVG: End-to-End Proposal-Driven Visual Grounding with Multi-Granularity Discrimination

Ming Dai, Wenxuan Cheng, Jiedong Zhuang et al.

ICCV 2025 · arXiv:2509.04833
3 citations

ROD-MLLM: Towards More Reliable Object Detection in Multimodal Large Language Models

Heng Yin, Yuqiang Ren, Ke Yan et al.

CVPR 2025
8 citations

Seeing the Trees for the Forest: Rethinking Weakly-Supervised Medical Visual Grounding

Huy Ta, Duy Anh Huynh, Yutong Xie et al.

ICCV 2025 (highlight) · arXiv:2505.15123
2 citations

STING-BEE: Towards Vision-Language Model for Real-World X-ray Baggage Security Inspection

Divya Velayudhan, Abdelfatah Ahmed, Mohamad Alansari et al.

CVPR 2025 (highlight) · arXiv:2504.02823
2 citations

Teaching VLMs to Localize Specific Objects from In-context Examples

Sivan Doveh, Nimrod Shabtay, Eli Schwartz et al.

ICCV 2025 · arXiv:2411.13317
3 citations

TWIST & SCOUT: Grounding Multimodal LLM-Experts by Forget-Free Tuning

Aritra Bhowmik, Mohammad Mahdi Derakhshani, Dennis Koelma et al.

ICCV 2025 · arXiv:2410.10491

VHM: Versatile and Honest Vision Language Model for Remote Sensing Image Analysis

Chao Pang, Xingxing Weng, Jiang Wu et al.

AAAI 2025 · arXiv:2403.20213
54 citations

Vision Function Layer in Multimodal LLMs

Cheng Shi, Yizhou Yu, Sibei Yang

NEURIPS 2025 · arXiv:2509.24791
4 citations

Visually Consistent Hierarchical Image Classification

Seulki Park, Youren Zhang, Stella Yu et al.

ICLR 2025 · arXiv:2406.11608
4 citations

Your Large Vision-Language Model Only Needs A Few Attention Heads For Visual Grounding

Seil Kang, Jinyeong Kim, Junhyeok Kim et al.

CVPR 2025 (highlight) · arXiv:2503.06287
37 citations

An Efficient and Effective Transformer Decoder-Based Framework for Multi-Task Visual Grounding

Wei Chen, Long Chen, Yu Wu

ECCV 2024 · arXiv:2408.01120
17 citations

Cycle-Consistency Learning for Captioning and Grounding

Ning Wang, Jiajun Deng, Mingbo Jia

AAAI 2024 · arXiv:2312.15162
14 citations

DetToolChain: A New Prompting Paradigm to Unleash Detection Ability of MLLM

Yixuan Wu, Yizhou Wang, Shixiang Tang et al.

ECCV 2024 · arXiv:2403.12488
48 citations

Exploring Phrase-Level Grounding with Text-to-Image Diffusion Model

Danni Yang, Ruohan Dong, Jiayi Ji et al.

ECCV 2024 · arXiv:2407.05352
9 citations

Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs

Shengbang Tong, Zhuang Liu, Yuexiang Zhai et al.

CVPR 2024 · arXiv:2401.06209
593 citations

General Object Foundation Model for Images and Videos at Scale

Junfeng Wu, Yi Jiang, Qihao Liu et al.

CVPR 2024 (highlight) · arXiv:2312.09158
82 citations

Griffon: Spelling out All Object Locations at Any Granularity with Large Language Models

Yufei Zhan, Yousong Zhu, Zhiyang Chen et al.

ECCV 2024 · arXiv:2311.14552
31 citations

Groma: Localized Visual Tokenization for Grounding Multimodal Large Language Models

Chuofan Ma, Yi Jiang, Jiannan Wu et al.

ECCV 2024 · arXiv:2404.13013
107 citations

GroundVLP: Harnessing Zero-Shot Visual Grounding from Vision-Language Pre-training and Open-Vocabulary Object Detection

Haozhan Shen, Tiancheng Zhao, Mingwei Zhu et al.

AAAI 2024 · arXiv:2312.15043
26 citations

LLaVA-Grounding: Grounded Visual Chat with Large Multimodal Models

Hao Zhang, Hongyang Li, Feng Li et al.

ECCV 2024 · arXiv:2312.02949
114 citations

Look Hear: Gaze Prediction for Speech-directed Human Attention

Sounak Mondal, Seoyoung Ahn, Zhibo Yang et al.

ECCV 2024 · arXiv:2407.19605
3 citations

Mismatch Quest: Visual and Textual Feedback for Image-Text Misalignment

Brian Gordon, Yonatan Bitton, Yonatan Shafir et al.

ECCV 2024 · arXiv:2312.03766
17 citations

Mitigating Object Hallucinations in Large Vision-Language Models through Visual Contrastive Decoding

Sicong Leng, Hang Zhang, Guanzheng Chen et al.

CVPR 2024 (highlight) · arXiv:2311.16922
487 citations

NExT-Chat: An LMM for Chat, Detection and Segmentation

Ao Zhang, Yuan Yao, Wei Ji et al.

ICML 2024 · arXiv:2311.04498
78 citations

Open-Vocabulary 3D Semantic Segmentation with Text-to-Image Diffusion Models

Xiaoyu Zhu, Hao Zhou, Pengfei Xing et al.

ECCV 2024 · arXiv:2407.13642
11 citations

Parallel Vertex Diffusion for Unified Visual Grounding

Zesen Cheng, Kehan Li, Peng Jin et al.

AAAI 2024 · arXiv:2303.07216
37 citations

PartGLEE: A Foundation Model for Recognizing and Parsing Any Objects

Junyi Li, Junfeng Wu, Weizhi Zhao et al.

ECCV 2024 · arXiv:2407.16696
13 citations

SegVG: Transferring Object Bounding Box to Segmentation for Visual Grounding

Weitai Kang, Gaowen Liu, Mubarak Shah et al.

ECCV 2024 · arXiv:2407.03200
19 citations

ShapeLLM: Universal 3D Object Understanding for Embodied Interaction

Zekun Qi, Runpei Dong, Shaochen Zhang et al.

ECCV 2024 · arXiv:2402.17766
120 citations

Unifying Visual and Vision-Language Tracking via Contrastive Learning

AAAI 2024 · arXiv:2401.11228
47 citations