"large multimodal models" Papers
51 papers found • Page 1 of 2
Conference
A-Bench: Are LMMs Masters at Evaluating AI-generated Images?
Zicheng Zhang, Haoning Wu, Chunyi Li et al.
AIpparel: A Multimodal Foundation Model for Digital Garments
Kiyohiro Nakayama, Jan Ackermann, Timur Levent Kesdogan et al.
Apollo: An Exploration of Video Understanding in Large Multimodal Models
Orr Zohar, Xiaohan Wang, Yann Dubois et al.
Can Large Multimodal Models Understand Agricultural Scenes? Benchmarking with AgroMind
Qingmei Li, Yang Zhang, Zurong Mai et al.
CC-OCR: A Comprehensive and Challenging OCR Benchmark for Evaluating Large Multimodal Models in Literacy
Zhibo Yang, Jun Tang, Zhaohai Li et al.
ChartMimic: Evaluating LMM's Cross-Modal Reasoning Capability via Chart-to-Code Generation
Cheng Yang, Chufan Shi, Yaxin Liu et al.
ConViS-Bench: Estimating Video Similarity Through Semantic Concepts
Benedetta Liberatori, Alessandro Conti, Lorenzo Vaquero et al.
CPath-Omni: A Unified Multimodal Foundation Model for Patch and Whole Slide Image Analysis in Computational Pathology
Yuxuan Sun, Yixuan Si, Chenglu Zhu et al.
Does Spatial Cognition Emerge in Frontier Models?
Santhosh Kumar Ramakrishnan, Erik Wijmans, Philipp Krähenbühl et al.
DyFo: A Training-Free Dynamic Focus Visual Search for Enhancing LMMs in Fine-Grained Visual Understanding
Geng Li, Jinglin Xu, Yunzhen Zhao et al.
EEE-Bench: A Comprehensive Multimodal Electrical And Electronics Engineering Benchmark
Ming Li, Jike Zhong, Tianle Chen et al.
Federated Continual Instruction Tuning
Haiyang Guo, Fanhu Zeng, Fei Zhu et al.
Fine-Tuning Token-Based Large Multimodal Models: What Works, What Doesn’t and What's Next
Zhulin Hu, Yan Ma, Jiadi Su et al.
F-LMM: Grounding Frozen Large Multimodal Models
Size Wu, Sheng Jin, Wenwei Zhang et al.
FlowPrune: Accelerating Attention Flow Calculation by Pruning Flow Network
Shuo Xu, Yu Chen, Shuxia Lin et al.
From Elements to Design: A Layered Approach for Automatic Graphic Design Composition
Jiawei Lin, Shizhao Sun, Danqing Huang et al.
GoodDrag: Towards Good Practices for Drag Editing with Diffusion Models
Zewei Zhang, Huan Liu, Jun Chen et al.
GRAB: A Challenging GRaph Analysis Benchmark for Large Multimodal Models
Jonathan Roberts, Kai Han, Samuel Albanie
KiVA: Kid-inspired Visual Analogies for Testing Large Multimodal Models
Eunice Yiu, Maan Qraitem, Anisa Majhi et al.
LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token
Shaolei Zhang, Qingkai Fang, Yang et al.
LMM4LMM: Benchmarking and Evaluating Large-multimodal Image Generation with LMMs
Jiarui Wang, Huiyu Duan, Yu Zhao et al.
LOKI: A Comprehensive Synthetic Data Detection Benchmark using Large Multimodal Models
Junyan Ye, Baichuan Zhou, Zilong Huang et al.
MAGNET: A Multi-agent Framework for Finding Audio-Visual Needles by Reasoning over Multi-Video Haystacks
Sanjoy Chowdhury, Mohamed Elmoghany, Yohan Abeysinghe et al.
Mimic In-Context Learning for Multimodal Tasks
Yuchu Jiang, Jiale Fu, chenduo hao et al.
MMSearch: Unveiling the Potential of Large Models as Multi-modal Search Engines
Dongzhi Jiang, Renrui Zhang, Ziyu Guo et al.
MS-Bench: Evaluating LMMs in Ancient Manuscript Study through a Dunhuang Case Study
Yuqing Zhang, Yue Han, Shuanghe Zhu et al.
OmniEdit: Building Image Editing Generalist Models Through Specialist Supervision
Cong Wei, Zheyang Xiong, Weiming Ren et al.
On Large Multimodal Models as Open-World Image Classifiers
Alessandro Conti, Massimiliano Mancini, Enrico Fini et al.
On the Out-Of-Distribution Generalization of Large Multimodal Models
Xingxuan Zhang, Jiansheng Li, Wenjing Chu et al.
PromptDresser: Improving the Quality and Controllability of Virtual Try-On via Generative Textual Prompt and Prompt-aware Mask
Jeongho Kim, Hoiyeong Jin, Sunghyun Park et al.
Re-Imagining Multimodal Instruction Tuning: A Representation View
Yiyang Liu, James Liang, Ruixiang Tang et al.
RoboTron-Drive: All-in-One Large Multimodal Model for Autonomous Driving
Zhijian Huang, Chengjian Feng, Baihui Xiao et al.
Seeing the Arrow of Time in Large Multimodal Models
Zihui (Sherry) Xue, Romy Luo, Kristen Grauman
SIDA: Social Media Image Deepfake Detection, Localization and Explanation with Large Multimodal Model
Zhenglin Huang, Jinwei Hu, Yiwei He et al.
The Curse of Multi-Modalities: Evaluating Hallucinations of Large Multimodal Models across Language, Visual, and Audio
Sicong Leng, Yun Xing, Zesen Cheng et al.
UrBench: A Comprehensive Benchmark for Evaluating Large Multimodal Models in Multi-View Urban Scenarios
Baichuan Zhou, Haote Yang, Dairong Chen et al.
VideoAutoArena: An Automated Arena for Evaluating Large Multimodal Models in Video Analysis through User Simulation
Ziyang Luo, Haoning Wu, Dongxu Li et al.
VisRL: Intention-Driven Visual Perception via Reinforced Reasoning
Zhangquan Chen, Xufang Luo, Dongsheng Li
Where am I? Cross-View Geo-localization with Natural Language Descriptions
Junyan Ye, Honglin Lin, Leyan Ou et al.
ConTextual: Evaluating Context-Sensitive Text-Rich Visual Reasoning in Large Multimodal Models
Rohan Wadhawan, Hritik Bansal, Kai-Wei Chang et al.
GPT-4V(ision) is a Generalist Web Agent, if Grounded
Boyuan Zheng, Boyu Gou, Jihyung Kil et al.
LLaVA-Grounding: Grounded Visual Chat with Large Multimodal Models
Hao Zhang, Hongyang Li, Feng Li et al.
M3DBench: Towards Omni 3D Assistant with Interleaved Multi-modal Instructions
Mingsheng Li, Xin Chen, Chi Zhang et al.
MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities
Weihao Yu, Zhengyuan Yang, Linjie Li et al.
Monkey: Image Resolution and Text Label Are Important Things for Large Multi-modal Models
Zhang Li, Biao Yang, Qiang Liu et al.
NExT-Chat: An LMM for Chat, Detection and Segmentation
Ao Zhang, Yuan Yao, Wei Ji et al.
OpenPSG: Open-set Panoptic Scene Graph Generation via Large Multimodal Models
Zijian Zhou, Zheng Zhu, Holger Caesar et al.
PathMMU: A Massive Multimodal Expert-Level Benchmark for Understanding and Reasoning in Pathology
YUXUAN SUN, Hao Wu, Chenglu Zhu et al.
PSALM: Pixelwise Segmentation with Large Multi-modal Model
Zheng Zhang, YeYao Ma, Enming Zhang et al.
VideoLLM-online: Online Video Large Language Model for Streaming Video
Joya Chen, Zhaoyang Lv, Shiwei Wu et al.