Poster Papers Matching "visual grounding"
37 papers found
Acknowledging Focus Ambiguity in Visual Questions
Chongyan Chen, Yu-Yun Tseng, Zhuoheng Li et al.
AerialVG: A Challenging Benchmark for Aerial Visual Grounding by Exploring Positional Relations
Junli Liu, Qizhi Chen, Zhigang Wang et al.
Are VLMs Ready for Autonomous Driving? An Empirical Study from the Reliability, Data and Metric Perspectives
Shaoyuan Xie, Lingdong Kong, Yuhao Dong et al.
ChartSketcher: Reasoning with Multimodal Feedback and Reflection for Chart Understanding
Muye Huang, Lingling Zhang, Jie Ma et al.
Controlling Multimodal LLMs via Reward-guided Decoding
Oscar Mañas, Pierluca D'Oro, Koustuv Sinha et al.
DAMO: Decoding by Accumulating Activations Momentum for Mitigating Hallucinations in Vision-Language Models
Kaishen Wang, Hengrui Gu, Meijun Gao et al.
F-LMM: Grounding Frozen Large Multimodal Models
Size Wu, Sheng Jin, Wenwei Zhang et al.
GroundFlow: A Plug-in Module for Temporal Reasoning on 3D Point Cloud Sequential Grounding
Zijun Lin, Shuting He, Cheston Tan et al.
Grounding Language with Vision: A Conditional Mutual Information Calibrated Decoding Strategy for Reducing Hallucinations in LVLMs
Hao Fang, Changle Zhou, Jiawei Kong et al.
Latent Expression Generation for Referring Image Segmentation and Grounding
Seonghoon Yu, Junbeom Hong, Joonseok Lee et al.
MLLMs Need 3D-Aware Representation Supervision for Scene Understanding
Xiaohu Huang, Jingjing Wu, Qunyi Xie et al.
PerturboLLaVA: Reducing Multimodal Hallucinations with Perturbative Visual Training
Cong Chen, Mingyu Liu, Chenchen Jing et al.
Point-RFT: Improving Multimodal Reasoning with Visually Grounded Reinforcement Finetuning
Minheng Ni, Zhengyuan Yang, Linjie Li et al.
PropVG: End-to-End Proposal-Driven Visual Grounding with Multi-Granularity Discrimination
Ming Dai, Wenxuan Cheng, Jiedong Zhuang et al.
ROD-MLLM: Towards More Reliable Object Detection in Multimodal Large Language Models
Heng Yin, Yuqiang Ren, Ke Yan et al.
Teaching VLMs to Localize Specific Objects from In-context Examples
Sivan Doveh, Nimrod Shabtay, Eli Schwartz et al.
TWIST & SCOUT: Grounding Multimodal LLM-Experts by Forget-Free Tuning
Aritra Bhowmik, Mohammad Mahdi Derakhshani, Dennis Koelma et al.
Vision Function Layer in Multimodal LLMs
Cheng Shi, Yizhou Yu, Sibei Yang
Visually Consistent Hierarchical Image Classification
Seulki Park, Youren Zhang, Stella Yu et al.
An Efficient and Effective Transformer Decoder-Based Framework for Multi-Task Visual Grounding
Wei Chen, Long Chen, Yu Wu
DetToolChain: A New Prompting Paradigm to Unleash Detection Ability of MLLM
Yixuan Wu, Yizhou Wang, Shixiang Tang et al.
Exploring Phrase-Level Grounding with Text-to-Image Diffusion Model
Danni Yang, Ruohan Dong, Jiayi Ji et al.
Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs
Shengbang Tong, Zhuang Liu, Yuexiang Zhai et al.
Griffon: Spelling out All Object Locations at Any Granularity with Large Language Models
Yufei Zhan, Yousong Zhu, Zhiyang Chen et al.
Groma: Localized Visual Tokenization for Grounding Multimodal Large Language Models
Chuofan Ma, Yi Jiang, Jiannan Wu et al.
LLaVA-Grounding: Grounded Visual Chat with Large Multimodal Models
Hao Zhang, Hongyang Li, Feng Li et al.
Look Hear: Gaze Prediction for Speech-directed Human Attention
Sounak Mondal, Seoyoung Ahn, Zhibo Yang et al.
Mismatch Quest: Visual and Textual Feedback for Image-Text Misalignment
Brian Gordon, Yonatan Bitton, Yonatan Shafir et al.
NExT-Chat: An LMM for Chat, Detection and Segmentation
Ao Zhang, Yuan Yao, Wei Ji et al.
Open-Vocabulary 3D Semantic Segmentation with Text-to-Image Diffusion Models
Xiaoyu Zhu, Hao Zhou, Pengfei Xing et al.
PartGLEE: A Foundation Model for Recognizing and Parsing Any Objects
Junyi Li, Junfeng Wu, Weizhi Zhao et al.
SegVG: Transferring Object Bounding Box to Segmentation for Visual Grounding
Weitai Kang, Gaowen Liu, Mubarak Shah et al.
ShapeLLM: Universal 3D Object Understanding for Embodied Interaction
Zekun Qi, Runpei Dong, Shaochen Zhang et al.
VidLA: Video-Language Alignment at Scale
Mamshad Nayeem Rizve, Fan Fei, Jayakrishnan Unnikrishnan et al.
Visual Grounding for Object-Level Generalization in Reinforcement Learning
Haobin Jiang, Zongqing Lu
Visual Relationship Transformation
Xiaoyu Xu, Jiayan Qiu, Baosheng Yu et al.
Zero-shot Referring Expression Comprehension via Structural Similarity Between Images and Captions
Zeyu Han, Fangrui Zhu, Qianru Lao et al.