Rohit Girdhar

Affiliations

Facebook AI Research

9,559 total citations

papers (20)

Masked-Attention Mask Transformer for Universal Image Segmentation

Ego4D: Around the World in 3,000 Hours of Egocentric Video

ImageBind: One Embedding Space To Bind Them All

Detecting Twenty-Thousand Classes Using Image-Level Supervision

An End-to-End Transformer Model for 3D Object Detection

Self-Supervised Pretraining of 3D Features on Any Point-Cloud

Omnivore: A Single Model for Many Visual Modalities

Anticipative Video Transformer

Cut and Learn for Unsupervised Object Detection and Instance Segmentation

Learning Video Representations From Large Language Models

InstanceDiffusion: Instance-level Control for Image Generation

OmniMAE: Single Model Masked Pretraining on Images and Videos

The Effectiveness of MAE Pre-Pretraining for Billion-Scale Pretraining

HierVL: Learning Hierarchical Video-Language Embeddings

3D Spatial Recognition Without Spatially Labeled 3D

VideoCutLER: Surprisingly Simple Unsupervised Video Instance Segmentation

SoundingActions: Learning How Actions Sound from Narrated Egocentric Videos

MotiF: Making Text Count in Image Animation with Motion Focal Loss

Generating Illustrated Instructions

LLMs can see and hear without any training
