Josef Sivic

papers

3,040

total citations

papers (26)

End-to-End Learning of Visual Representations From Uncurated Instructional Videos

CVPR 2020 arXiv

761

citations

CosyPose: Consistent multi-view multi-object 6D pose estimation

ECCV 2020 arXiv

501

citations

Just Ask: Learning To Answer Questions From Millions of Narrated Videos

ICCV 2021 arXiv

338

citations

Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning

CVPR 2023 arXiv

332

citations

Look for the Change: Learning Object States and State-Modifying Actions From Untrimmed Web Videos

CVPR 2022 arXiv

citations

VidChapters-7M: Video Chapters at Scale

NeurIPS 2023 arXiv

citations

Language-Guided Music Recommendation for Video via Prompt Analogies

CVPR 2023 arXiv

citations

Drive&Segment: Unsupervised Semantic Segmentation of Urban Scenes via Cross-Modal Distillation

ECCV 2022 arXiv

citations

Learning to design protein-protein interactions with enhanced generalization

ICLR 2024 arXiv

citations

Focal Length and Object Pose Estimation via Render and Compare

CVPR 2022 arXiv

citations

Weakly Supervised Human-Object Interaction Detection in Video via Contrastive Spatiotemporal Regions

ICCV 2021 arXiv

citations

Meta-Personalizing Vision-Language Models To Find Named Instances in Video

CVPR 2023 arXiv

citations

ShowHowTo: Generating Scene-Conditioned Step-by-Step Visual Instructions

CVPR 2025 arXiv

citations

Learning to engineer protein flexibility

ICLR 2025 arXiv

citations

Large-scale Pre-training for Grounded Video Caption Generation

ICCV 2025 arXiv

citations

Improving Personalized Search with Regularized Low-Rank Parameter Updates

CVPR 2025 arXiv

citations

ResidualViT for Efficient Temporally Dense Video Encoding

ICCV 2025 arXiv

citations

Learning Actionness via Long-range Temporal Order Verification

ECCV 2020

citations

Discovering Divergent Representations between Text-to-Image Models

ICCV 2025 arXiv

citations

GenHowTo: Learning to Generate Actions and State Transformations from Instructional Videos

CVPR 2024

citations

papers (26)

End-to-End Learning of Visual Representations From Uncurated Instructional Videos

CosyPose: Consistent multi-view multi-object 6D pose estimation

Just Ask: Learning To Answer Questions From Millions of Narrated Videos

Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning

Zero-Shot Video Question Answering via Frozen Bidirectional Language Models

Efficient Neighbourhood Consensus Networks via Submanifold Sparse Convolutions

Thinking Fast and Slow: Efficient Text-to-Visual Retrieval With Transformers

TubeDETR: Spatio-Temporal Video Grounding With Transformers

Single-View Robot Pose and Joint Angle Estimation via Render & Compare

POP-3D: Open-Vocabulary 3D Occupancy Prediction from Images

Look for the Change: Learning Object States and State-Modifying Actions From Untrimmed Web Videos

VidChapters-7M: Video Chapters at Scale

Language-Guided Music Recommendation for Video via Prompt Analogies

Drive&Segment: Unsupervised Semantic Segmentation of Urban Scenes via Cross-Modal Distillation

Learning to design protein-protein interactions with enhanced generalization

Focal Length and Object Pose Estimation via Render and Compare

Weakly Supervised Human-Object Interaction Detection in Video via Contrastive Spatiotemporal Regions

Meta-Personalizing Vision-Language Models To Find Named Instances in Video

ShowHowTo: Generating Scene-Conditioned Step-by-Step Visual Instructions

Learning to engineer protein flexibility

Large-scale Pre-training for Grounded Video Caption Generation

Improving Personalized Search with Regularized Low-Rank Parameter Updates

ResidualViT for Efficient Temporally Dense Video Encoding

Learning Actionness via Long-range Temporal Order Verification

Discovering Divergent Representations between Text-to-Image Models

GenHowTo: Learning to Generate Actions and State Transformations from Instructional Videos
