Linjie Li
Affiliation: Microsoft
29 papers · 5,975 total citations

Papers (29)
- MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities · ICML 2024 · 1,066 citations
- Less Is More: ClipBERT for Video-and-Language Learning via Sparse Sampling · CVPR 2021 · 756 citations
- Segment Everything Everywhere All at Once · NeurIPS 2023 · 703 citations
- Large-Scale Adversarial Training for Vision-and-Language Representation Learning · NeurIPS 2020 · 540 citations
- UNITER: UNiversal Image-TExt Representation Learning · ECCV 2020 · 469 citations
- Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning · ICLR 2024 · 422 citations
- Generalized Decoding for Pixel, Image, and Language · CVPR 2023 · 336 citations
- SwinBERT: End-to-End Transformers With Sparse Attention for Video Captioning · CVPR 2022 · 309 citations
- ReCo: Region-Controlled Text-to-Image Generation · CVPR 2023 · 194 citations
- Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone · NeurIPS 2022 · 153 citations
- DisCo: Disentangled Control for Realistic Human Dance Generation · CVPR 2024 · 139 citations
- ShowUI: One Vision-Language-Action Model for GUI Visual Agent · CVPR 2025 · 131 citations
- UC2: Universal Cross-Lingual Cross-Modal Vision-and-Language Pre-Training · CVPR 2021 · 108 citations
- Can MLLMs Reason in Multimodality? EMMA: An Enhanced MultiModal ReAsoning Benchmark · ICML 2025 · 100 citations
- LAVENDER: Unifying Video-Language Understanding As Masked Language Modeling · CVPR 2023 · 94 citations
- Adversarial VQA: A New Benchmark for Evaluating the Robustness of VQA Models · ICCV 2021 · 94 citations
- An Empirical Study of End-to-End Video-Language Transformers With Masked Visual Modeling · CVPR 2023 · 83 citations
- Equivariant Similarity for Vision-Language Foundation Models · ICCV 2023 · 63 citations
- MM-Narrator: Narrating Long-form Videos with Multimodal In-Context Learning · CVPR 2024 · 50 citations
- MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation in Videos · ICLR 2025 · 36 citations
- Cross-Modal Representation Learning for Zero-Shot Action Recognition · CVPR 2022 · 31 citations
- ImageGen-CoT: Enhancing Text-to-Image In-context Learning with Chain-of-Thought Reasoning · ICCV 2025 · 22 citations
- Scaling Inference-Time Search with Vision Value Model for Improved Visual Comprehension · ICCV 2025 · 19 citations
- MMSum: A Dataset for Multimodal Summarization and Thumbnail Generation of Videos · CVPR 2024 · 14 citations
- Point-RFT: Improving Multimodal Reasoning with Visually Grounded Reinforcement Finetuning · NeurIPS 2025 · 13 citations
- Adaptive Human Matting for Dynamic Videos · CVPR 2023 · 13 citations
- IDOL: Unified Dual-Modal Latent Diffusion for Human-Centric Joint Video-Depth Generation · ECCV 2024 · 11 citations
- LiVOS: Light Video Object Segmentation with Gated Linear Matching · CVPR 2025 · 4 citations
- Synthetic Visual Genome · CVPR 2025 · 2 citations