Rui Qian

papers

3,678

total citations

papers (22)

Simple Copy-Paste Is a Strong Data Augmentation Method for Instance Segmentation

CVPR 2021, arXiv

1,178

citations

VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text

NeurIPS 2021, arXiv

689

citations

Discriminative Sounding Objects Localization via Self-supervised Audiovisual Matching

NeurIPS 2020, arXiv

149

citations

Learning Hierarchical Cross-Modal Association for Co-Speech Gesture Generation

CVPR 2022, arXiv

135

citations

VideoPrism: A Foundational Visual Encoder for Video Understanding

ICML 2024, arXiv

citations

Motion-Aware Contrastive Video Representation Learning via Foreground-Background Merging

CVPR 2022, arXiv

citations

SAM2Long: Enhancing SAM 2 for Long Video Segmentation with a Training-Free Memory Tree

ICCV 2025, arXiv

citations

Enhancing Self-Supervised Video Representation Learning via Multi-Level Feature Optimization

ICCV 2021, arXiv

citations

OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video Understanding?

CVPR 2025, arXiv

citations

Dispider: Enabling Video LLMs with Active Real-Time Interaction via Disentangled Perception, Decision, and Reaction

CVPR 2025, arXiv

citations

Prune Spatio-temporal Tokens by Semantic-aware Temporal Accumulation

ICCV 2023, arXiv

citations

Static and Dynamic Concepts for Self-Supervised Video Representation Learning

ECCV 2022, arXiv

citations

Semantics Meets Temporal Correspondence: Self-supervised Object-centric Learning in Videos

ICCV 2023, arXiv

citations

Contextualized Spatio-Temporal Contrastive Learning With Self-Supervision

CVPR 2022, arXiv

citations

Exploring Fine-Grained Audiovisual Categorization with the SSW60 Dataset

ECCV 2022, arXiv

citations

Betrayed by Attention: A Simple yet Effective Approach for Self-supervised Video Object Segmentation

ECCV 2024, arXiv

citations

Rethinking Image-to-Video Adaptation: An Object-centric Perspective

ECCV 2024, arXiv

citations

Reasoning to Attend: Try to Understand How <SEG> Token Works

CVPR 2025, arXiv

citations

papers (22)

Simple Copy-Paste Is a Strong Data Augmentation Method for Instance Segmentation

VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text

Spatiotemporal Contrastive Video Representation Learning

End-to-End Pseudo-LiDAR for Image-Based 3D Object Detection

Multiple Sound Sources Localization from Coarse to Fine

Taming Diffusion Models for Audio-Driven Co-Speech Gesture Generation

Discriminative Sounding Objects Localization via Self-supervised Audiovisual Matching

Learning Hierarchical Cross-Modal Association for Co-Speech Gesture Generation

VideoPrism: A Foundational Visual Encoder for Video Understanding

Motion-Aware Contrastive Video Representation Learning via Foreground-Background Merging

SAM2Long: Enhancing SAM 2 for Long Video Segmentation with a Training-Free Memory Tree

Enhancing Self-Supervised Video Representation Learning via Multi-Level Feature Optimization

OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video Understanding?

Dispider: Enabling Video LLMs with Active Real-Time Interaction via Disentangled Perception, Decision, and Reaction

Prune Spatio-temporal Tokens by Semantic-aware Temporal Accumulation

Static and Dynamic Concepts for Self-Supervised Video Representation Learning

Semantics Meets Temporal Correspondence: Self-supervised Object-centric Learning in Videos

Contextualized Spatio-Temporal Contrastive Learning With Self-Supervision

Exploring Fine-Grained Audiovisual Categorization with the SSW60 Dataset

Betrayed by Attention: A Simple yet Effective Approach for Self-supervised Video Object Segmentation

Rethinking Image-to-Video Adaptation: An Object-centric Perspective

Reasoning to Attend: Try to Understand How <SEG> Token Works
