Li Fei-Fei

papers

1,810

total citations

papers (21)

Action Genome: Actions As Compositions of Spatio-Temporal Scene Graphs

CVPR 2020 · arXiv

393

citations

Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces

CVPR 2025 · arXiv

371

citations

Rethinking Architecture Design for Tackling Data Heterogeneity in Federated Learning

CVPR 2022 · arXiv

217

citations

Flow to the Mode: Mode-Seeking Diffusion Autoencoders for State-of-the-Art Image Tokenization

ICCV 2025 · arXiv

citations

PrivHAR: Recognizing Human Actions from Privacy-Preserving Lens

ECCV 2022 · arXiv

citations

Metadata Normalization

CVPR 2021 · arXiv

citations

Rendering Humans from Object-Occluded Monocular Videos

ICCV 2023 · arXiv

citations

The Language of Motion: Unifying Verbal and Non-verbal Language of 3D Human Motion

CVPR 2025 · arXiv

citations

BEHAVIOR Vision Suite: Customizable Dataset Generation via Simulation

CVPR 2024 · arXiv

citations

Repurposing 2D Diffusion Models with Gaussian Atlas for 3D Generation

ICCV 2025 · arXiv

citations

Scalable Differential Privacy With Sparse Network Finetuning

CVPR 2021

citations

Revisiting the "Video" in Video-Language Understanding

CVPR 2022

citations

RubiksNet: Learnable 3D-Shift for Efficient Video Action Recognition

ECCV 2020

citations

papers (21)

Action Genome: Actions As Compositions of Spatio-Temporal Scene Graphs

Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces

Rethinking Architecture Design for Tackling Data Heterogeneity in Federated Learning

Chain of Code: Reasoning with a Language Model-Augmented Code Emulator

Procedure Planning in Instructional Videos

Greedy Hierarchical Variational Autoencoders for Large-Scale Video Prediction

ObjectFolder 2.0: A Multisensory Object Dataset for Sim2Real Transfer

ZeroNVS: Zero-Shot 360-Degree View Synthesis from a Single Image

The ObjectFolder Benchmark: Multisensory Learning With Neural and Real Objects

WorldScore: Unified Evaluation Benchmark for World Generation

Re-thinking Temporal Search for Long-Form Video Understanding

Flow to the Mode: Mode-Seeking Diffusion Autoencoders for State-of-the-Art Image Tokenization

PrivHAR: Recognizing Human Actions from Privacy-Preserving Lens

Metadata Normalization

Rendering Humans from Object-Occluded Monocular Videos

The Language of Motion: Unifying Verbal and Non-verbal Language of 3D Human Motion

BEHAVIOR Vision Suite: Customizable Dataset Generation via Simulation

Repurposing 2D Diffusion Models with Gaussian Atlas for 3D Generation

Scalable Differential Privacy With Sparse Network Finetuning

Revisiting the "Video" in Video-Language Understanding

RubiksNet: Learnable 3D-Shift for Efficient Video Action Recognition
