Poster "referring expression comprehension" Papers

20 papers found

Diff-Prompt: Diffusion-driven Prompt Generator with Mask Supervision

Weicai Yan, Wang Lin, Zirun Guo et al.

ICLR 2025arXiv:2504.21423
7
citations

Fine-grained Spatiotemporal Grounding on Egocentric Videos

Shuo LIANG, Yiwu Zhong, Zi-Yuan Hu et al.

ICCV 2025arXiv:2508.00518
5
citations

From Objects to Anywhere: A Holistic Benchmark for Multi-level Visual Grounding in 3D Scenes

Tianxu Wang, Zhuofan Zhang, Ziyu Zhu et al.

NEURIPS 2025arXiv:2506.04897
1
citations

Ground-V: Teaching VLMs to Ground Complex Instructions in Pixels

Yongshuo Zong, Qin ZHANG, DONGSHENG An et al.

CVPR 2025arXiv:2505.13788
3
citations

Latent Expression Generation for Referring Image Segmentation and Grounding

Seonghoon Yu, Junbeom Hong, Joonseok Lee et al.

ICCV 2025arXiv:2508.05123
1
citations

Omni-RGPT: Unifying Image and Video Region-level Understanding via Token Marks

Miran Heo, Min-Hung Chen, De-An Huang et al.

CVPR 2025arXiv:2501.08326
9
citations

Referring Expression Comprehension for Small Objects

Kanoko Goto, Takumi Hirose, Mahiro Ukai et al.

ICCV 2025arXiv:2510.03701
1
citations

Referring to Any Person

Qing Jiang, Lin Wu, Zhaoyang Zeng et al.

ICCV 2025arXiv:2503.08507
14
citations

ROD-MLLM: Towards More Reliable Object Detection in Multimodal Large Language Models

Heng Yin, Yuqiang Ren, Ke Yan et al.

CVPR 2025
8
citations

Synthetic Visual Genome

Jae Sung Park, Zixian Ma, Linjie Li et al.

CVPR 2025arXiv:2506.07643
2
citations

Vision-Language Models Can't See the Obvious

YASSER ABDELAZIZ DAHOU DJILALI, Ngoc Huynh, Phúc Lê Khắc et al.

ICCV 2025arXiv:2507.04741
7
citations

WeakMCN: Multi-task Collaborative Network for Weakly Supervised Referring Expression Comprehension and Segmentation

Silin Cheng, Yang Liu, Xinwei He et al.

CVPR 2025arXiv:2505.18686
3
citations

DetToolChain: A New Prompting Paradigm to Unleash Detection Ability of MLLM

Yixuan Wu, Yizhou Wang, Shixiang Tang et al.

ECCV 2024arXiv:2403.12488
48
citations

FALIP: Visual Prompt as Foveal Attention Boosts CLIP Zero-Shot Performance

Jiedong Zhuang, Jiaqi Hu, Lianrui Mu et al.

ECCV 2024arXiv:2407.05578
8
citations

Griffon: Spelling out All Object Locations at Any Granularity with Large Language Models

Yufei Zhan, Yousong Zhu, Zhiyang Chen et al.

ECCV 2024arXiv:2311.14552
31
citations

Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection

Shilong Liu, Zhaoyang Zeng, Tianhe Ren et al.

ECCV 2024arXiv:2303.05499
3442
citations

LLaVA-Grounding: Grounded Visual Chat with Large Multimodal Models

Hao Zhang, Hongyang Li, Feng Li et al.

ECCV 2024arXiv:2312.02949
114
citations

Look Hear: Gaze Prediction for Speech-directed Human Attention

Sounak Mondal, Seoyoung Ahn, Zhibo Yang et al.

ECCV 2024arXiv:2407.19605
3
citations

ScanFormer: Referring Expression Comprehension by Iteratively Scanning

Wei Su, Peihan Miao, Huanzhang Dou et al.

CVPR 2024arXiv:2406.18048
16
citations

Zero-shot Referring Expression Comprehension via Structural Similarity Between Images and Captions

Zeyu Han, Fangrui Zhu, Qianru Lao et al.

CVPR 2024arXiv:2311.17048
21
citations