Kunchang Li

OpenReview

papers

3,232

total citations

papers (15)

InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation

ICLR 2024, arXiv

419

citations

VideoMamba: State Space Model for Efficient Video Understanding

ECCV 2024, arXiv

407

citations

Unmasked Teacher: Towards Training-Efficient Video Foundation Models

ICCV 2023, arXiv

246

citations

Vlogger: Make Your Dream A Vlog

CVPR 2024, arXiv

citations

Self-Slimmed Vision Transformer

ECCV 2022, arXiv

citations

MorphMLP: An Efficient MLP-Like Backbone for Spatial-Temporal Representation Learning

ECCV 2022, arXiv

citations

Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment

CVPR 2025, arXiv

citations

Bootstrapping Language-Guided Navigation Learning with Self-Refining Data Flywheel

ICLR 2025, arXiv

citations

Make Your Training Flexible: Towards Deployment-Efficient Video Models

ICCV 2025, arXiv

citations

Muses: 3D-Controllable Image Generation via Multi-Modal Agent Collaboration

AAAI 2025, arXiv

citations

V-Stylist: Video Stylization via Collaboration and Reflection of MLLM Agents

CVPR 2025, arXiv

citations

UniFormerV2: Unlocking the Potential of Image ViTs for Video Understanding

ICCV 2023

citations

papers (15)

MVBench: A Comprehensive Multi-modal Video Understanding Benchmark

PointCLIP: Point Cloud Understanding by CLIP

Tip-Adapter: Training-Free Adaption of CLIP for Few-Shot Classification

InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation

VideoMamba: State Space Model for Efficient Video Understanding

Unmasked Teacher: Towards Training-Efficient Video Foundation Models

Vlogger: Make Your Dream A Vlog

Self-Slimmed Vision Transformer

MorphMLP: An Efficient MLP-Like Backbone for Spatial-Temporal Representation Learning

Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment

Bootstrapping Language-Guided Navigation Learning with Self-Refining Data Flywheel

Make Your Training Flexible: Towards Deployment-Efficient Video Models

Muses: 3D-Controllable Image Generation via Multi-Modal Agent Collaboration

V-Stylist: Video Stylization via Collaboration and Reflection of MLLM Agents

UniFormerV2: Unlocking the Potential of Image ViTs for Video Understanding
