Ruimao Zhang

27 papers

3,032 total citations

papers (27)

WorldSimBench: Towards Video Generation Models as World Simulators

ICML 2025

842

citations

AMOS: A Large-Scale Abdominal Multi-Organ Benchmark for Versatile Medical Image Segmentation

NeurIPS 2022

461

citations

InstanceRefer: Cooperative Holistic Understanding for Visual Grounding on Point Clouds Through Instance Multi-Level Contextual Referring

ICCV 2021

175

citations

SmartEdit: Exploring Complex Instruction-based Image Editing with Multimodal Large Language Models

CVPR 2024

143

citations

HumanTOMATO: Text-aligned Whole-body Motion Generation

ICML 2024

111

citations

MP5: A Multi-modal Open-ended Embodied System in Minecraft via Active Perception

CVPR 2024

Weakly Supervised Object Localization via Transformer with Implicit Spatial Calibration

ECCV 2022

Open-World Human-Object Interaction Detection via Multi-modal Prompts

CVPR 2024

SupFusion: Supervised LiDAR-Camera Fusion for 3D Object Detection

ICCV 2023

ScaMo: Exploring the Scaling Law in Autoregressive Motion Generation Model

CVPR 2025

F-HOI: Toward Fine-grained Semantic-Aligned 3D Human-Object Interactions

ECCV 2024

Semantic Human Parsing via Scalable Semantic Transfer Over Multiple Label Domains

CVPR 2023

Neural Interactive Keypoint Detection

ICCV 2023

Exemplar Normalization for Learning Deep Representation

CVPR 2020

RoboFactory: Exploring Embodied Agent Collaboration with Compositional Constraints

ICCV 2025

DriveGEN: Generalized and Robust 3D Detection in Driving via Controllable Text-to-Image Diffusion Generation

CVPR 2025

X4D-SceneFormer: Enhanced Scene Understanding on 4D Point Cloud Videos through Cross-Modal Knowledge Transfer

AAAI 2024

FreeMan: Towards Benchmarking 3D Human Pose Estimation under Real-World Conditions

CVPR 2024

Discovering Intrinsic Spatial-Temporal Logic Rules to Explain Human Actions

NeurIPS 2023

Towards Content-Independent Multi-Reference Super-Resolution: Adaptive Pattern Matching and Feature Aggregation

ECCV 2020

SEED-Bench: Benchmarking Multimodal Large Language Models

CVPR 2024

Towards Photo-Realistic Virtual Try-On by Adaptively Generating-Preserving Image Content

CVPR 2020

Let Images Give You More: Point Cloud Cross-Modal Training for Shape Analysis

NeurIPS 2022

papers (27)

WorldSimBench: Towards Video Generation Models as World Simulators

AMOS: A Large-Scale Abdominal Multi-Organ Benchmark for Versatile Medical Image Segmentation

2DPASS: 2D Priors Assisted Semantic Segmentation on LiDAR Point Clouds

Parser-Free Virtual Try-On via Distilling Appearance Flows

End-to-End Dense Video Captioning With Parallel Decoding

Motion-X: A Large-scale 3D Expressive Whole-body Human Motion Dataset

InstanceRefer: Cooperative Holistic Understanding for Visual Grounding on Point Clouds Through Instance Multi-Level Contextual Referring

SmartEdit: Exploring Complex Instruction-based Image Editing with Multimodal Large Language Models

HumanTOMATO: Text-aligned Whole-body Motion Generation

MP5: A Multi-modal Open-ended Embodied System in Minecraft via Active Perception

Weakly Supervised Object Localization via Transformer with Implicit Spatial Calibration

Open-World Human-Object Interaction Detection via Multi-modal Prompts

SupFusion: Supervised LiDAR-Camera Fusion for 3D Object Detection

ScaMo: Exploring the Scaling Law in Autoregressive Motion Generation Model

F-HOI: Toward Fine-grained Semantic-Aligned 3D Human-Object Interactions

Semantic Human Parsing via Scalable Semantic Transfer Over Multiple Label Domains

Neural Interactive Keypoint Detection

Exemplar Normalization for Learning Deep Representation

RoboFactory: Exploring Embodied Agent Collaboration with Compositional Constraints

DriveGEN: Generalized and Robust 3D Detection in Driving via Controllable Text-to-Image Diffusion Generation

X4D-SceneFormer: Enhanced Scene Understanding on 4D Point Cloud Videos through Cross-Modal Knowledge Transfer

FreeMan: Towards Benchmarking 3D Human Pose Estimation under Real-World Conditions

Discovering Intrinsic Spatial-Temporal Logic Rules to Explain Human Actions

Towards Content-Independent Multi-Reference Super-Resolution: Adaptive Pattern Matching and Feature Aggregation

SEED-Bench: Benchmarking Multimodal Large Language Models

Towards Photo-Realistic Virtual Try-On by Adaptively Generating-Preserving Image Content

Let Images Give You More: Point Cloud Cross-Modal Training for Shape Analysis
