Tong Lu

25 papers

12,212 total citations

papers (25)

Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction Without Convolutions

ICCV 2021, arXiv

4,656

citations

InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

CVPR 2024, arXiv

2,295

citations

BEVFormer: Learning Bird’s-Eye-View Representation from Multi-Camera Images via Spatiotemporal Transformers

ECCV 2022, arXiv

1,720

citations

InternImage: Exploring Large-Scale Vision Foundation Models With Deformable Convolutions

CVPR 2023, arXiv

994

citations

VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks

NeurIPS 2023, arXiv

625

citations

Efficient Deformable ConvNets: Rethinking Dynamic and Sparse Operator for Vision Applications

CVPR 2024, arXiv

148

citations

FB-BEV: BEV Representation from Forward-Backward View Transformations

ICCV 2023, arXiv

126

citations

The All-Seeing Project: Towards Panoptic Visual Recognition and Understanding of the Open World

ICLR 2024, arXiv

118

citations

AVSegFormer: Audio-Visual Segmentation with Transformer

AAAI 2024, arXiv

Memory-and-Anticipation Transformer for Online Action Understanding

ICCV 2023, arXiv

CG-Bench: Clue-grounded Question Answering Benchmark for Long Video Understanding

ICLR 2025, arXiv

CRA-PCN: Point Cloud Completion with Intra- and Inter-level Cross-Resolution Transformers

AAAI 2024, arXiv

Spectrum-to-Kernel Translation for Accurate Blind Image Super-Resolution

NeurIPS 2021, arXiv

AE TextSpotter: Learning Visual and Linguistic Representation for Ambiguous Text Spotting

ECCV 2020, arXiv

Docopilot: Improving Multimodal Models for Document-Level Understanding

CVPR 2025, arXiv

EgoExoBench: A Benchmark for First- and Third-person View Video Understanding in MLLMs

NeurIPS 2025, arXiv

Deconfound Semantic Shift and Incompleteness in Incremental Few-shot Semantic Segmentation

AAAI 2025

MOERL: When Mixture-of-Experts Meet Reinforcement Learning for Adverse Weather Image Restoration

ICCV 2025

RepKPU: Point Cloud Upsampling with Kernel Point Representation and Deformation

CVPR 2024

papers (25)

Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction Without Convolutions

InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

BEVFormer: Learning Bird’s-Eye-View Representation from Multi-Camera Images via Spatiotemporal Transformers

InternImage: Exploring Large-Scale Vision Foundation Models With Deformable Convolutions

VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks

TAM: Temporal Adaptive Module for Video Recognition

DDP: Diffusion Model for Dense Visual Prediction

Panoptic SegFormer: Delving Deeper Into Panoptic Segmentation With Transformers

SeedFormer: Patch Seeds Based Point Cloud Completion with Upsample Transformer

Is Ego Status All You Need for Open-Loop End-to-End Autonomous Driving?

Adaptive Graph Convolution for Point Cloud Analysis

Efficient Deformable ConvNets: Rethinking Dynamic and Sparse Operator for Vision Applications

FB-BEV: BEV Representation from Forward-Backward View Transformations

The All-Seeing Project: Towards Panoptic Visual Recognition and Understanding of the Open World

AVSegFormer: Audio-Visual Segmentation with Transformer

Memory-and-Anticipation Transformer for Online Action Understanding

CG-Bench: Clue-grounded Question Answering Benchmark for Long Video Understanding

CRA-PCN: Point Cloud Completion with Intra- and Inter-level Cross-Resolution Transformers

Spectrum-to-Kernel Translation for Accurate Blind Image Super-Resolution

AE TextSpotter: Learning Visual and Linguistic Representation for Ambiguous Text Spotting

Docopilot: Improving Multimodal Models for Document-Level Understanding

EgoExoBench: A Benchmark for First- and Third-person View Video Understanding in MLLMs

Deconfound Semantic Shift and Incompleteness in Incremental Few-shot Semantic Segmentation

MOERL: When Mixture-of-Experts Meet Reinforcement Learning for Adverse Weather Image Restoration

RepKPU: Point Cloud Upsampling with Kernel Point Representation and Deformation
