rleak.com - Spot the Future of AI Research

#1

Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection

Shilong Liu, Zhaoyang Zeng, Tianhe Ren et al.

ECCV 2024

3,440 citations

#2

YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information

Chien-Yao Wang, I-Hau Yeh, Hong-Yuan Mark Liao

ECCV 2024

3,033 citations

#3

MMBENCH: Is Your Multi-Modal Model an All-around Player?

Yuan Liu, Haodong Duan, Yuanhan Zhang et al.

ECCV 2024

1,745 citations

#4

ShareGPT4V: Improving Large Multi-Modal Models with Better Captions

Lin Chen, Jinsong Li, Xiaoyi Dong et al.

ECCV 2024

970 citations

#5

LGM: Large Multi-View Gaussian Model for High-Resolution 3D Content Creation

Jiaxiang Tang, Zhaoxi Chen, Xiaokang Chen et al.

ECCV 2024

639 citations

#6

Adversarial Diffusion Distillation

Axel Sauer, Dominik Lorenz, Andreas Blattmann et al.

ECCV 2024

629 citations

#7

MambaIR: A Simple Baseline for Image Restoration with State-Space Model

Hang Guo, Jinmin Li, Tao Dai et al.

ECCV 2024

560 citations

#8

Grounding Image Matching in 3D with MASt3R

Vincent Leroy, Yohann Cabon, Jerome Revaud

ECCV 2024

541 citations

#9

LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models

Yanwei Li, Chengyao Wang, Jiaya Jia

ECCV 2024

499 citations

#10

MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems?

Renrui Zhang, Dongzhi Jiang, Yichi Zhang et al.

ECCV 2024

498 citations

#11

CoTracker: It is Better to Track Together

Nikita Karaev, Ignacio Rocco, Ben Graham et al.

ECCV 2024

466 citations

#12

SiT: Exploring Flow and Diffusion-based Generative Models with Scalable Interpolant Transformers

Nanye Ma, Mark Goldstein, Michael Albergo et al.

ECCV 2024

448 citations

#13

MobileNetV4: Universal Models for the Mobile Ecosystem

Danfeng Qin, Chas Leichner, Manolis Delakis et al.

ECCV 2024

434 citations

#14

DynamiCrafter: Animating Open-domain Images with Video Diffusion Priors

Jinbo Xing, Menghan Xia, Yong Zhang et al.

ECCV 2024

424 citations

#15

VideoMamba: State Space Model for Efficient Video Understanding

Kunchang Li, Xinhao Li, Yi Wang et al.

ECCV 2024

407 citations

#16

DriveLM: Driving with Graph Visual Question Answering

Chonghao Sima, Katrin Renz, Kashyap Chitta et al.

ECCV 2024

376 citations

#17

MVSplat: Efficient 3D Gaussian Splatting from Sparse Multi-View Images

Yuedong Chen, Haofei Xu, Chuanxia Zheng et al.

ECCV 2024

374 citations

#18

An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models

Liang Chen, Haozhe Zhao, Tianyu Liu et al.

ECCV 2024

368 citations

#19

Evaluating Text-to-Visual Generation with Image-to-Text Generation

Zhiqiu Lin, Deepak Pathak, Baiqi Li et al.

ECCV 2024

357 citations

#20

Wavelet Convolutions for Large Receptive Fields

Shahaf Finder, Roy Amoyal, Eran Treister et al.

ECCV 2024

348 citations

Top Papers in ECCV 2024
