Wenwei Zhang

papers

2,665

total citations

papers (24)

LLaVA-3D: A Simple yet Effective Pathway to Empowering LMMs with 3D Capabilities

ICCV 2025

127

citations

EcoNAS: Finding Proxies for Economical Neural Architecture Search

CVPR 2020, arXiv

125

citations

Video K-Net: A Simple, Strong, and Unified Baseline for Video Segmentation

CVPR 2022, arXiv

111

citations

CLIPSelf: Vision Transformer Distills Itself for Open-Vocabulary Dense Prediction

ICLR 2024, arXiv

110

citations

OMG-Seg: Is One Model Good Enough For All Segmentation?

CVPR 2024, arXiv

108

citations

Unified Human-Scene Interaction via Prompted Chain-of-Contacts

ICLR 2024, arXiv

101

citations

Are VLMs Ready for Autonomous Driving? An Empirical Study from the Reliability, Data and Metric Perspectives

ICCV 2025, arXiv

citations

Can AI Assistants Know What They Don't Know?

ICML 2024, arXiv

citations

MV-JAR: Masked Voxel Jigsaw and Reconstruction for LiDAR-Based Self-Supervised Pre-Training

CVPR 2023, arXiv

citations

Harmonizing Visual Representations for Unified Multimodal Understanding and Generation

ICCV 2025, arXiv

citations

OV-PARTS: Towards Open-Vocabulary Part Segmentation

NeurIPS 2023, arXiv

citations

Tube-Link: A Flexible Cross Tube Framework for Universal Video Segmentation

ICCV 2023, arXiv

citations

CLIM: Contrastive Language-Image Mosaic for Region Representation

AAAI 2024, arXiv

citations

F-LMM: Grounding Frozen Large Multimodal Models

CVPR 2025, arXiv

citations

Dense Siamese Network for Dense Unsupervised Learning

ECCV 2022, arXiv

citations

Rethinking Verification for LLM Code Generation: From Generation to Testing

NeurIPS 2025, arXiv

citations

papers (24)

K-Net: Towards Unified Image Segmentation

Seesaw Loss for Long-Tailed Instance Segmentation

Dense Distinct Query for End-to-End Object Detection

Aligning Bag of Regions for Open-Vocabulary Object Detection

Side-Aware Boundary Localization for More Precise Object Detection

Segment Any Point Cloud Sequences by Distilling Vision Foundation Models

Robo3D: Towards Robust and Reliable 3D Perception against Corruptions

EmbodiedScan: A Holistic Multi-Modal 3D Perception Suite Towards Embodied AI

LLaVA-3D: A Simple yet Effective Pathway to Empowering LMMs with 3D Capabilities

EcoNAS: Finding Proxies for Economical Neural Architecture Search

Video K-Net: A Simple, Strong, and Unified Baseline for Video Segmentation

CLIPSelf: Vision Transformer Distills Itself for Open-Vocabulary Dense Prediction

OMG-Seg: Is One Model Good Enough For All Segmentation?

Unified Human-Scene Interaction via Prompted Chain-of-Contacts

Are VLMs Ready for Autonomous Driving? An Empirical Study from the Reliability, Data and Metric Perspectives

Can AI Assistants Know What They Don't Know?

MV-JAR: Masked Voxel Jigsaw and Reconstruction for LiDAR-Based Self-Supervised Pre-Training

Harmonizing Visual Representations for Unified Multimodal Understanding and Generation

OV-PARTS: Towards Open-Vocabulary Part Segmentation

Tube-Link: A Flexible Cross Tube Framework for Universal Video Segmentation

CLIM: Contrastive Language-Image Mosaic for Region Representation

F-LMM: Grounding Frozen Large Multimodal Models

Dense Siamese Network for Dense Unsupervised Learning

Rethinking Verification for LLM Code Generation: From Generation to Testing
