Lewei Lu

papers

5,718

total citations

papers (24)

InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

CVPR 2024 arXiv

2,295

citations

Planning-Oriented Autonomous Driving

CVPR 2023 arXiv

1,076

citations

InternImage: Exploring Large-Scale Vision Foundation Models With Deformable Convolutions

CVPR 2023 arXiv

994

citations

BEVFormer v2: Adapting Modern Image Backbones to Bird's-Eye-View Recognition via Perspective Supervision

CVPR 2023 arXiv

386

citations

Scene as Occupancy

ICCV 2023 arXiv

228

citations

FuseFormer: Fusing Fine-Grained Information in Transformers for Video Inpainting

ICCV 2021 arXiv

156

citations

Efficient Deformable ConvNets: Rethinking Dynamic and Sparse Operator for Vision Applications

CVPR 2024 arXiv

148

citations

The All-Seeing Project V2: Towards General Relation Comprehension of the Open World

ECCV 2024 arXiv

citations

Auto MC-Reward: Automated Dense Reward Design with Large Language Models for Minecraft

CVPR 2024 arXiv

citations

ControlLLM: Augment Language Models with Tools by Searching on Graphs

ECCV 2024 arXiv

citations

Towards All-in-One Pre-Training via Maximizing Multi-Modal Mutual Information

CVPR 2023 arXiv

citations

OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text

ICLR 2025 arXiv

citations

SynerGen-VL: Towards Synergistic Image Understanding and Generation with Vision Experts and Token Folding

CVPR 2025 arXiv

citations

MaskGWM: A Generalizable Driving World Model with Video Mask Reconstruction

CVPR 2025 arXiv

citations

ADDP: Learning General Representations for Image Recognition and Generation with Alternating Denoising Diffusion Process

ICLR 2024 arXiv

citations

Docopilot: Improving Multimodal Models for Document-Level Understanding

CVPR 2025 arXiv

citations

PVC: Progressive Visual Token Compression for Unified Image and Video Processing in Large Vision-Language Models

CVPR 2025 arXiv

citations

HoVLE: Unleashing the Power of Monolithic Vision-Language Models with Holistic Vision-Language Embedding

CVPR 2025 arXiv

citations

Enhancing the Outcome Reward-based RL Training of MLLMs with Self-Consistency Sampling

NeurIPS 2025 arXiv

citations

Spatial Preference Rewarding for MLLMs Spatial Understanding

ICCV 2025 arXiv

citations

Distilling Focal Knowledge From Imperfect Expert for 3D Object Detection

CVPR 2023

citations

Weakly Supervised Monocular 3D Detection with a Single-View Image

Modeling Continuous Motion for 3D Point Cloud Object Tracking

Masked AutoDecoder is Effective Multi-Task Vision Generalist

