Gedas Bertasius

papers

1,951

total citations

papers (22)

Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives

CVPR 2024 (arXiv)

343

citations

Classifying, Segmenting, and Tracking Object Instances in Video with Mask Propagation

CVPR 2020 (arXiv)

191

citations

VideoTree: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos

CVPR 2025 (arXiv)

156

citations

Vx2Text: End-to-End Learning of Video-Based Text Generation From Multimodal Inputs

CVPR 2021 (arXiv)

citations

Efficient Movie Scene Detection Using State-Space Transformers

CVPR 2023 (arXiv)

citations

Long-Short Temporal Contrastive Learning of Video Transformers

CVPR 2022 (arXiv)

citations

ECLIPSE: Efficient Long-Range Video Retrieval Using Sight and Sound

ECCV 2022 (arXiv)

citations

ReAgent-V: A Reward-Driven Multi-Agent Framework for Video Understanding

NeurIPS 2025 (arXiv)

citations

LoCoNet: Long-Short Context Network for Active Speaker Detection

CVPR 2024 (arXiv)

citations

COBE: Contextualized Object Embeddings from Narrated Instructional Video

NeurIPS 2020 (arXiv)

citations

BIMBA: Selective-Scan Compression for Long-Range Video Question Answering

CVPR 2025 (arXiv)

citations

BASKET: A Large-Scale Video Dataset for Fine-Grained Skill Estimation

CVPR 2025 (arXiv)

citations

Propose, Assess, Search: Harnessing LLMs for Goal-Oriented Planning in Instructional Videos

ECCV 2024 (arXiv)

citations

ReVisionLLM: Recursive Vision-Language Model for Temporal Grounding in Hour-Long Videos

CVPR 2025 (arXiv)

citations

papers (22)

Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives

Classifying, Segmenting, and Tracking Object Instances in Video with Mask Propagation

VideoTree: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos

SimpleClick: Interactive Image Segmentation with Simple Vision Transformers

Long Movie Clip Classification with State-Space Video Models

TALLFormer: Temporal Action Localization with a Long-Memory Transformer

Vision Transformers Are Parameter-Efficient Audio-Visual Learners

Learning To Recognize Procedural Activities With Distant Supervision

VindLU: A Recipe for Effective Video-and-Language Pretraining

Video ReCap: Recursive Captioning of Hour-Long Videos

Unified Coarse-to-Fine Alignment for Video-Text Retrieval

Vx2Text: End-to-End Learning of Video-Based Text Generation From Multimodal Inputs

Efficient Movie Scene Detection Using State-Space Transformers

Long-Short Temporal Contrastive Learning of Video Transformers

ECLIPSE: Efficient Long-Range Video Retrieval Using Sight and Sound

ReAgent-V: A Reward-Driven Multi-Agent Framework for Video Understanding

LoCoNet: Long-Short Context Network for Active Speaker Detection

COBE: Contextualized Object Embeddings from Narrated Instructional Video

BIMBA: Selective-Scan Compression for Long-Range Video Question Answering

BASKET: A Large-Scale Video Dataset for Fine-Grained Skill Estimation

Propose, Assess, Search: Harnessing LLMs for Goal-Oriented Planning in Instructional Videos

ReVisionLLM: Recursive Vision-Language Model for Temporal Grounding in Hour-Long Videos
