Youngjae Yu

1,022 total citations

papers (18)

ACAV100M: Automatic Curation of Large-Scale Datasets for Audio-Visual Video Representation Learning

ICCV 2021, arXiv

CHAMPAGNE: Learning Real-world Conversation from Large-Scale Web Videos

ICCV 2023, arXiv

Localized Symbolic Knowledge Distillation for Visual Commonsense Models

NeurIPS 2023, arXiv

DEEPTalk: Dynamic Emotion Embedding for Probabilistic Speech-Driven 3D Face Animation

AAAI 2025, arXiv

DisCoRD: Discrete Tokens to Continuous Motion via Rectified Flow Decoding

ICCV 2025, arXiv

VAGUE: Visual Contexts Clarify Ambiguous Expressions

ICCV 2025, arXiv

ISR-DPO: Aligning Large Multimodal Models for Videos by Iterative Self-Retrospective DPO

AAAI 2025, arXiv

ActionSwitch: Class-agnostic Detection of Simultaneous Actions in Streaming Videos

ECCV 2024, arXiv

V.I.P. : Iterative Online Preference Distillation for Efficient Video Diffusion Models

ICCV 2025, arXiv

Pano-AVQA: Grounded Audio-Visual Question Answering on 360° Videos

ICCV 2021

Fusing Pre-Trained Language Models With Multimodal Prompts Through Reinforcement Learning

CVPR 2023

Character Grounding and Re-Identification in Story of Videos and Text Descriptions

ECCV 2020

Diffusion-Driven Two-Stage Active Learning for Low-Budget Semantic Segmentation

NeurIPS 2025, arXiv

MASS: Overcoming Language Bias in Image-Text Matching

AAAI 2025, arXiv

Transitional Adaptation of Pretrained Models for Visual Storytelling

CVPR 2021

MERLOT: Multimodal Neural Script Knowledge Models

MERLOT Reserve: Neural Script Knowledge Through Vision and Language and Sound

Multimodal C4: An Open, Billion-scale Corpus of Images Interleaved with Text

