rleak.com - Spot the Future of AI Research

#1

Visual-RFT: Visual Reinforcement Fine-Tuning

Ziyu Liu, Zeyi Sun, Yuhang Zang et al.

ICCV 2025

347

citations

#2

LLaVA-CoT: Let Vision Language Models Reason Step-by-Step

Guowei Xu, Peng Jin, Ziang Wu et al.

ICCV 2025

338

citations

#3

R1-Onevision: Advancing Generalized Multimodal Reasoning through Cross-Modal Formalization

Yi Yang, Xiaoxuan He, Hongkun Pan et al.

ICCV 2025

247

citations

#4

OminiControl: Minimal and Universal Control for Diffusion Transformer

Zhenxiong Tan, Songhua Liu, Xingyi Yang et al.

ICCV 2025

214

citations

#5

CoTracker3: Simpler and Better Point Tracking by Pseudo-Labelling Real Videos

Nikita Karaev, Iurii Makarov, Jianyuan Wang et al.

ICCV 2025

213

citations

#6

LVBench: An Extreme Long Video Understanding Benchmark

Weihan Wang, Zehai He, Wenyi Hong et al.

ICCV 2025

208

citations

#7

R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization

Jingyi Zhang, Jiaxing Huang, Huanjin Yao et al.

ICCV 2025

206

citations

#8

VACE: All-in-One Video Creation and Editing

Zeyinzi Jiang, Zhen Han, Chaojie Mao et al.

ICCV 2025

169

citations

#9

LLaVA-3D: A Simple yet Effective Pathway to Empowering LMMs with 3D Capabilities

Chenming Zhu, Tai Wang, Wenwei Zhang et al.

ICCV 2025

127

citations

#10

GUIOdyssey: A Comprehensive Dataset for Cross-App GUI Navigation on Mobile Devices

Quanfeng Lu, Wenqi Shao, Zitao Liu et al.

ICCV 2025

96

citations

#11

DimensionX: Create Any 3D and 4D Scenes from a Single Image with Decoupled Video Diffusion

Wenqiang Sun, Shuo Chen, Fangfu Liu et al.

ICCV 2025

89

citations

#12

OmniHuman-1: Rethinking the Scaling-Up of One-Stage Conditioned Human Animation Models

Gaojie Lin, Jianwen Jiang, Jiaqi Yang et al.

ICCV 2025

86

citations

#13

Stable Virtual Camera: Generative View Synthesis with Diffusion Models

Jensen Zhou, Hang Gao, Vikram Voleti et al.

ICCV 2025

83

citations

#14

MV-Adapter: Multi-View Consistent Image Generation Made Easy

Zehuan Huang, Yuan-Chen Guo, Haoran Wang et al.

ICCV 2025

73

citations

#15

REPA-E: Unlocking VAE for End-to-End Tuning of Latent Diffusion Transformers

Xingjian Leng, Jaskirat Singh, Yunzhong Hou et al.

ICCV 2025

73

citations

#16

Are VLMs Ready for Autonomous Driving? An Empirical Study from the Reliability, Data and Metric Perspectives

Shaoyuan Xie, Lingdong Kong, Yuhao Dong et al.

ICCV 2025

71

citations

#17

EasyControl: Adding Efficient and Flexible Control for Diffusion Transformer

Yuxuan Zhang, Yirui Yuan, Yiren Song et al.

ICCV 2025

70

citations

#18

GameFactory: Creating New Games with Generative Interactive Videos

Jiwen Yu, Yiran Qin, Xintao Wang et al.

ICCV 2025

63

citations

#19

ORION: A Holistic End-to-End Autonomous Driving Framework by Vision-Language Instructed Action Generation

Haoyu Fu, Diankun Zhang, Zongchuang Zhao et al.

ICCV 2025

62

citations

#20

StreamDiffusion: A Pipeline-level Solution for Real-Time Interactive Generation

Akio Kodaira, Chenfeng Xu, Toshiki Hazama et al.

ICCV 2025

62

citations

Top Papers in ICCV 2025