"vision language models" Papers
69 papers found
Aligning Effective Tokens with Video Anomaly in Large Language Models
Yingxian Chen, Jiahui Liu, Ruidi Fan et al.
Are Large Vision Language Models Good Game Players?
Xinyu Wang, Bohan Zhuang, Qi Wu
ATP-LLaVA: Adaptive Token Pruning for Large Vision Language Models
Xubing Ye, Yukang Gan, Yixiao Ge et al.
Attribution Analysis Meets Model Editing: Advancing Knowledge Correction in Vision Language Models with VisEdit
Qizhou Chen, Taolin Zhang, Chengyu Wang et al.
Automated Generation of Challenging Multiple-Choice Questions for Vision Language Model Evaluation
Yuhui Zhang, Yuchang Su, Yiming Liu et al.
Balanced Token Pruning: Accelerating Vision Language Models Beyond Local Optimization
Kaiyuan Li, Xiaoyue Chen, Chen Gao et al.
BigCharts-R1: Enhanced Chart Reasoning with Visual Reinforcement Finetuning
Ahmed Masry, Abhay Puri, Masoud Hashemi et al.
Building a Mind Palace: Structuring Environment-Grounded Semantic Graphs for Effective Long Video Analysis with LLMs
Zeyi Huang, Yuyang Ji, Xiaofang Wang et al.
Can We Talk Models Into Seeing the World Differently?
Paul Gavrikov, Jovita Lukasik, Steffen Jung et al.
CAPTURE: Evaluating Spatial Reasoning in Vision Language Models via Occluded Object Counting
Atin Pothiraj, Jaemin Cho, Elias Stengel-Eskin et al.
ColPali: Efficient Document Retrieval with Vision Language Models
Manuel Faysse, Hugues Sibille, Tony Wu et al.
COMBO: Compositional World Models for Embodied Multi-Agent Cooperation
Hongxin Zhang, Zeyuan Wang, Qiushi Lyu et al.
DenseDPO: Fine-Grained Temporal Preference Optimization for Video Diffusion Models
Ziyi Wu, Anil Kag, Ivan Skorokhodov et al.
Enhancing Multi-Robot Semantic Navigation Through Multimodal Chain-of-Thought Score Collaboration
Zhixuan Shen, Haonan Luo, Kexun Chen et al.
ETA: Evaluating Then Aligning Safety of Vision Language Models at Inference Time
Yi Ding, Bolian Li, Ruqi Zhang
FastVLM: Efficient Vision Encoding for Vision Language Models
Pavan Kumar Anasosalu Vasu, Fartash Faghri, Chun-Liang Li et al.
Forensics-Bench: A Comprehensive Forgery Detection Benchmark Suite for Large Vision Language Models
Jin Wang, Chenghui Lv, Xian Li et al.
FrameFusion: Combining Similarity and Importance for Video Token Reduction on Large Vision Language Models
Tianyu Fu, Tengxuan Liu, Qinghao Han et al.
Generalizing from SIMPLE to HARD Visual Reasoning: Can We Mitigate Modality Imbalance in VLMs?
Simon Park, Abhishek Panigrahi, Yun Cheng et al.
GUI Exploration Lab: Enhancing Screen Navigation in Agents via Multi-Turn Reinforcement Learning
Haolong Yan, Yeqing Shen, Xin Huang et al.
HalLoc: Token-level Localization of Hallucinations for Vision Language Models
Eunkyu Park, Minyeong Kim, Gunhee Kim
Hate in Plain Sight: On the Risks of Moderating AI-Generated Hateful Illusions
Yiting Qu, Ziqing Yang, Yihan Ma et al.
HRScene: How Far Are VLMs from Effective High-Resolution Image Understanding?
Yusen Zhang, Wenliang Zheng, Aashrith Madasu et al.
Improving Large Vision and Language Models by Learning from a Panel of Peers
Jefferson Hernandez, Jing Shi, Simon Jenni et al.
Inference-Time Text-to-Video Alignment with Diffusion Latent Beam Search
Yuta Oshima, Masahiro Suzuki, Yutaka Matsuo et al.
Keyframe-oriented Vision Token Pruning: Enhancing Efficiency of Large Vision Language Models on Long-Form Video Processing
Yudong Liu, Jingwei Sun, Yueqian Lin et al.
Knowledge Transfer from Interaction Learning
Yilin Gao, Kangyi Chen, Zhongxing Peng et al.
Making Large Vision Language Models to Be Good Few-Shot Learners
Fan Liu, Wenwen Cai, Jian Huo et al.
Mechanistic Interpretability Meets Vision Language Models: Insights and Limitations
Yiming Liu, Yuhui Zhang, Serena Yeung
MEgoHand: Multimodal Egocentric Hand-Object Interaction Motion Generation
Bohan Zhou, Yi Zhan, Zhongbin Zhang et al.
METEOR: Multi-Encoder Collaborative Token Pruning for Efficient Vision Language Models
Yuchen Liu, Yaoming Wang, Bowen Shi et al.
MindOmni: Unleashing Reasoning Generation in Vision Language Models with RGPO
Yicheng Xiao, Lin Song, Yukang Chen et al.
MuSLR: Multimodal Symbolic Logical Reasoning
Jundong Xu, Hao Fei, Yuhui Zhang et al.
MVGBench: a Comprehensive Benchmark for Multi-view Generation Models
Xianghui Xie, Jan Lenssen, Gerard Pons-Moll
OOD-Barrier: Build a Middle-Barrier for Open-Set Single-Image Test Time Adaptation via Vision Language Models
Boyang Peng, Sanqing Qu, Tianpei Zou et al.
Open-ended Hierarchical Streaming Video Understanding with Vision Language Models
Hyolim Kang, Yunsu Park, Youngbeom Yoo et al.
PARC: A Quantitative Framework Uncovering the Symmetries within Vision Language Models
Jenny Schmalfuss, Nadine Chang, Vibashan VS et al.
Progress-Aware Video Frame Captioning
Zihui Xue, Joungbin An, Xitong Yang et al.
Pruning All-Rounder: Rethinking and Improving Inference Efficiency for Large Vision Language Models
Wei Suo, Ji Ma, Mengyang Sun et al.
Rethinking Layered Graphic Design Generation with a Top-Down Approach
Jingye Chen, Zhaowen Wang, Nanxuan Zhao et al.
Semantic Discrepancy-aware Detector for Image Forgery Identification
Wang Ziye, Minghang Yu, Chunyan Xu et al.
SharpZO: Hybrid Sharpness-Aware Vision Language Model Prompt Tuning via Forward-Only Passes
Yifan Yang, Zhen Zhang, Rupak Vignesh Swaminathan et al.
SparseVILA: Decoupling Visual Sparsity for Efficient VLM Inference
Samir Khaki, Junxian Guo, Jiaming Tang et al.
SPA-VL: A Comprehensive Safety Preference Alignment Dataset for Vision Language Models
Yongting Zhang, Lu Chen, Guodong Zheng et al.
SpiritSight Agent: Advanced GUI Agent with One Look
Zhiyuan Huang, Ziming Cheng, Junting Pan et al.
Systematic Reward Gap Optimization for Mitigating VLM Hallucinations
Lehan He, Zeren Chen, Zhelun Shi et al.
Texture or Semantics? Vision-Language Models Get Lost in Font Recognition
Zhecheng Li, Guoxian Song, Yujun Cai et al.
Training-Free Personalization via Retrieval and Reasoning on Fingerprints
Deepayan Das, Davide Talon, Yiming Wang et al.
Video-XL: Extra-Long Vision Language Model for Hour-Scale Video Understanding
Yan Shu, Zheng Liu, Peitian Zhang et al.
Vision Language Models are In-Context Value Learners
Yecheng Jason Ma, Joey Hejna, Chuyuan Fu et al.