Zhengyuan Yang
29 papers, 3,616 total citations
papers (29)
MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities
ICML 2024
arXiv
1,066
citations
TransVG: End-to-End Visual Grounding With Transformers
ICCV 2021
arXiv
443
citations
Scaling Up Vision-Language Pre-Training for Image Captioning
CVPR 2022
arXiv
300
citations
Improving One-stage Visual Grounding by Recursive Sub-query Construction
ECCV 2020
arXiv
292
citations
ReCo: Region-Controlled Text-to-Image Generation
CVPR 2023
arXiv
194
citations
TAP: Text-Aware Pre-Training for Text-VQA and Text-Caption
CVPR 2021
arXiv
160
citations
SAT: 2D Semantics Assisted Training for 3D Visual Grounding
ICCV 2021
arXiv
159
citations
DisCo: Disentangled Control for Realistic Human Dance Generation
CVPR 2024
arXiv
139
citations
UniTAB: Unifying Text and Box Outputs for Grounded Vision-Language Modeling
ECCV 2022
arXiv
135
citations
ShowUI: One Vision-Language-Action Model for GUI Visual Agent
CVPR 2025
arXiv
131
citations
Can MLLMs Reason in Multimodality? EMMA: An Enhanced MultiModal ReAsoning Benchmark
ICML 2025
arXiv
100
citations
Improving Weakly Supervised Visual Grounding by Contrastive Knowledge Distillation
CVPR 2021
arXiv
96
citations
Equivariant Similarity for Vision-Language Foundation Models
ICCV 2023
arXiv
63
citations
Training Diffusion Models Towards Diverse Image Generation with Reinforcement Learning
CVPR 2024
53
citations
MM-Narrator: Narrating Long-form Videos with Multimodal In-Context Learning
CVPR 2024
arXiv
50
citations
ReFocus: Visual Editing as a Chain of Thought for Structured Image Understanding
ICML 2025
arXiv
49
citations
MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation in Videos
ICLR 2025
arXiv
36
citations
SGFormer: Semantic Graph Transformer for Point Cloud-Based 3D Scene Graph Generation
AAAI 2024
arXiv
26
citations
ImageGen-CoT: Enhancing Text-to-Image In-context Learning with Chain-of-Thought Reasoning
ICCV 2025
arXiv
22
citations
SlowFast-VGen: Slow-Fast Learning for Action-Driven Long Video Generation
ICLR 2025
arXiv
19
citations
Scaling Inference-Time Search with Vision Value Model for Improved Visual Comprehension
ICCV 2025
arXiv
19
citations
Tuning Timestep-Distilled Diffusion Model Using Pairwise Sample Optimization
ICLR 2025
arXiv
15
citations
MMSum: A Dataset for Multimodal Summarization and Thumbnail Generation of Videos
CVPR 2024
arXiv
14
citations
Point-RFT: Improving Multimodal Reasoning with Visually Grounded Reinforcement Finetuning
NeurIPS 2025
arXiv
13
citations
IDOL: Unified Dual-Modal Latent Diffusion for Human-Centric Joint Video-Depth Generation
ECCV 2024
arXiv
11
citations
SITE: towards Spatial Intelligence Thorough Evaluation
ICCV 2025
arXiv
7
citations
LiVOS: Light Video Object Segmentation with Gated Linear Matching
CVPR 2025
arXiv
4
citations
StrokeNUWA—Tokenizing Strokes for Vector Graphic Synthesis
ICML 2024
0
citations
PromptCap: Prompt-Guided Image Captioning for VQA with GPT-3
ICCV 2023
0
citations