"vision-language model" Papers

21 papers found

CalliReader: Contextualizing Chinese Calligraphy via an Embedding-Aligned Vision-Language Model

Yuxuan Luo, Jiaqi Tang, Chenyi Huang et al.

ICCV 2025 arXiv:2503.06472

Classifier-guided CLIP Distillation for Unsupervised Multi-label Classification

Dongseob Kim, Hyunjung Shim

CVPR 2025 arXiv:2503.16873

Focus-Then-Reuse: Fast Adaptation in Visual Perturbation Environments

Jiahui Wang, Chao Chen, Jiacheng Xu et al.

NEURIPS 2025

Image as a World: Generating Interactive World from Single Image via Panoramic Video Generation

Dongnan Gui, Xun Guo, Wengang Zhou et al.

NEURIPS 2025 oral
1 citation

ImgEdit: A Unified Image Editing Dataset and Benchmark

Yang Ye, Xianyi He, Zongjian Li et al.

NEURIPS 2025 arXiv:2505.20275
98 citations

IntelliCap: Intelligent Guidance for Consistent View Sampling

Ayaka Yasunaga, Hideo Saito, Dieter Schmalstieg et al.

ISMAR 2025 paper arXiv:2508.13043

Normal and Abnormal Pathology Knowledge-Augmented Vision-Language Model for Anomaly Detection in Pathology Images

Jinsol Song, Jiamu Wang, Anh Nguyen et al.

ICCV 2025 arXiv:2508.15256
1 citation

One-for-All Few-Shot Anomaly Detection via Instance-Induced Prompt Learning

Wenxi Lv, Qinliang Su, Wenchao Xu

ICLR 2025
15 citations

Stable Cinemetrics: Structured Taxonomy and Evaluation for Professional Video Generation

Agneet Chatterjee, Rahim Entezari, Maksym Zhuravinskyi et al.

NEURIPS 2025 arXiv:2509.26555

TOGA: Temporally Grounded Open-Ended Video QA with Weak Supervision

Ayush Gupta, Anirban Roy, Rama Chellappa et al.

ICCV 2025 arXiv:2506.09445

Understanding Fine-tuning CLIP for Open-vocabulary Semantic Segmentation in Hyperbolic Space

Zelin Peng, Zhengqin Xu, Zhilin Zeng et al.

CVPR 2025
5 citations

AutoDIR: Automatic All-in-One Image Restoration with Latent Diffusion

Yitong Jiang, Zhaoyang Zhang, Tianfan Xue et al.

ECCV 2024 arXiv:2310.10123
88 citations

Bottom-Up Domain Prompt Tuning for Generalized Face Anti-Spoofing

Si-Qi Liu, Qirui Wang, Pong Chi Yuen

ECCV 2024
8 citations

Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding

Peng Jin, Ryuichi Takanobu, Cai Zhang et al.

CVPR 2024 highlight arXiv:2311.08046
364 citations

Dolphins: Multimodal Language Model for Driving

Yingzi Ma, Yulong Cao, Jiachen Sun et al.

ECCV 2024 arXiv:2312.00438
128 citations

Image Fusion via Vision-Language Model

Zixiang Zhao, Lilun Deng, Haowen Bai et al.

ICML 2024 arXiv:2402.02235
67 citations

Image-Text Co-Decomposition for Text-Supervised Semantic Segmentation

Ji-Jia Wu, Andy Chia-Hao Chang, Chieh-Yu Chuang et al.

CVPR 2024 arXiv:2404.04231
20 citations

PALM: Predicting Actions through Language Models

Sanghwan Kim, Daoji Huang, Yongqin Xian et al.

ECCV 2024 arXiv:2311.17944
23 citations

PromptAD: Learning Prompts with only Normal Samples for Few-Shot Anomaly Detection

Xiaofan Li, Zhizhong Zhang, Xin Tan et al.

CVPR 2024 arXiv:2404.05231
114 citations

Reinforcement Learning Friendly Vision-Language Model for Minecraft

Haobin Jiang, Junpeng Yue, Hao Luo et al.

ECCV 2024 arXiv:2303.10571
15 citations

Retrieval Across Any Domains via Large-scale Pre-trained Model

Jiexi Yan, Zhihui Yin, Chenghao Xu et al.

ICML 2024