rleak.com - Spot the Future of AI Research

#1

LLaVA-CoT: Let Vision Language Models Reason Step-by-Step

Guowei Xu, Peng Jin, Ziang Wu et al.

ICCV 2025

360

citations

#2

Visual-RFT: Visual Reinforcement Fine-Tuning

Ziyu Liu, Zeyi Sun, Yuhang Zang et al.

ICCV 2025

357

citations

#3

R1-Onevision: Advancing Generalized Multimodal Reasoning through Cross-Modal Formalization

Yi Yang, Xiaoxuan He, Hongkun Pan et al.

ICCV 2025

265

citations

#4

LLaVA-PruMerge: Adaptive Token Reduction for Efficient Large Multimodal Models

Yuzhang Shang, Mu Cai, Bingxin Xu et al.

ICCV 2025

234

citations

#5

LVBench: An Extreme Long Video Understanding Benchmark

Weihan Wang, Zehai He, Wenyi Hong et al.

ICCV 2025

229

citations

#6

OminiControl: Minimal and Universal Control for Diffusion Transformer

Zhenxiong Tan, Songhua Liu, Xingyi Yang et al.

ICCV 2025

225

citations

#7

CoTracker3: Simpler and Better Point Tracking by Pseudo-Labelling Real Videos

Nikita Karaev, Iurii Makarov, Jianyuan Wang et al.

ICCV 2025

223

citations

#8

R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization

Jingyi Zhang, Jiaxing Huang, Huanjin Yao et al.

ICCV 2025

220

citations

#9

Shape of Motion: 4D Reconstruction from a Single Video

Qianqian Wang, Vickie Ye, Hang Gao et al.

ICCV 2025

186

citations

#10

VACE: All-in-One Video Creation and Editing

Zeyinzi Jiang, Zhen Han, Chaojie Mao et al.

ICCV 2025

181

citations

#11

MetaMorph: Multimodal Understanding and Generation via Instruction Tuning

Shengbang Tong, David Fan, Jiachen Zhu et al.

ICCV 2025

150

citations

#12

LLaVA-3D: A Simple yet Effective Pathway to Empowering LMMs with 3D Capabilities

Chenming Zhu, Tai Wang, Wenwei Zhang et al.

ICCV 2025

127

citations

#13

GUIOdyssey: A Comprehensive Dataset for Cross-App GUI Navigation on Mobile Devices

Quanfeng Lu, Wenqi Shao, Zitao Liu et al.

ICCV 2025

113

citations

#14

ReCamMaster: Camera-Controlled Generative Rendering from A Single Video

Jianhong Bai, Menghan Xia, Xiao Fu et al.

ICCV 2025

110

citations

#15

FlowEdit: Inversion-Free Text-Based Editing Using Pre-Trained Flow Models

Vladimir Kulikov, Matan Kleiner, Inbar Huberman-Spiegelglas et al.

ICCV 2025

101

citations

#16

Randomized Autoregressive Visual Generation

Qihang Yu, Ju He, Xueqing Deng et al.

ICCV 2025

93

citations

#17

OmniHuman-1: Rethinking the Scaling-Up of One-Stage Conditioned Human Animation Models

Gaojie Lin, Jianwen Jiang, Jiaqi Yang et al.

ICCV 2025

91

citations

#18

Less-to-More Generalization: Unlocking More Controllability by In-Context Generation

Shaojin Wu, Mengqi Huang, Wenxu Wu et al.

ICCV 2025

90

citations

#19

DimensionX: Create Any 3D and 4D Scenes from a Single Image with Decoupled Video Diffusion

Wenqiang Sun, Shuo Chen, Fangfu Liu et al.

ICCV 2025

89

citations

#20

Stable Virtual Camera: Generative View Synthesis with Diffusion Models

Jensen Zhou, Hang Gao, Vikram Voleti et al.

ICCV 2025

87

citations
