Licheng Yu

papers

1,702

total citations

papers (22)

UNITER: UNiversal Image-TExt Representation Learning

ECCV 2020, arXiv

469

citations

TVR: A Large-Scale Dataset for Video-Subtitle Moment Retrieval

ECCV 2020, arXiv

329

citations

Behind the Scene: Revealing the Secrets of Pre-trained Vision-and-Language Models

ECCV 2020, arXiv

139

citations

VideoSwap: Customized Video Subject Swapping with Interactive Semantic Point Correspondence

CVPR 2024, arXiv

FashionViL: Fashion-Focused Vision-and-Language Representation Learning

ECCV 2022, arXiv

Apollo: An Exploration of Video Understanding in Large Multimodal Models

CVPR 2025, arXiv

Learning Procedure-Aware Video Representation From Instructional Videos and Their Narrations

CVPR 2023, arXiv

BachGAN: High-Resolution Image Synthesis From Salient Object Layout

CVPR 2020, arXiv

Tell Me What Happened: Unifying Text-Guided Video Completion via Multimodal Masked Video Generation

CVPR 2023, arXiv

Fairy: Fast Parallelized Instruction-Guided Video-to-Video Synthesis

CVPR 2024, arXiv

Unsupervised Vision-and-Language Pre-Training via Retrieval-Based Multi-Granular Alignment

CVPR 2022, arXiv

Connecting What To Say With Where To Look by Modeling Human Attention Traces

CVPR 2021, arXiv

CiT: Curation in Training for Effective Vision-Language Data

ICCV 2023, arXiv

Layout-Agnostic Scene Text Image Synthesis with Diffusion Models

CVPR 2024, arXiv

Accelerating Multimodal Large Language Models by Searching Optimal Vision Token Reduction

CVPR 2025, arXiv

Building a Mind Palace: Structuring Environment-Grounded Semantic Graphs for Effective Long Video Analysis with LLMs

CVPR 2025, arXiv

ROICtrl: Boosting Instance Control for Visual Generation

CVPR 2025, arXiv

GEB+: A Benchmark for Generic Event Boundary Captioning, Grounding and Retrieval

ECCV 2022


papers (22)

UNITER: UNiversal Image-TExt Representation Learning

TVR: A Large-Scale Dataset for Video-Subtitle Moment Retrieval

Behind the Scene: Revealing the Secrets of Pre-trained Vision-and-Language Models

Violin: A Large-Scale Dataset for Video-and-Language Inference

AVID: Any-Length Video Inpainting with Diffusion Model

FlowVid: Taming Imperfect Optical Flows for Consistent Video-to-Video Synthesis

FAME-ViL: Multi-Tasking Vision-Language Model for Heterogeneous Fashion Tasks

VideoSwap: Customized Video Subject Swapping with Interactive Semantic Point Correspondence

FashionViL: Fashion-Focused Vision-and-Language Representation Learning

Apollo: An Exploration of Video Understanding in Large Multimodal Models

Learning Procedure-Aware Video Representation From Instructional Videos and Their Narrations

BachGAN: High-Resolution Image Synthesis From Salient Object Layout

Tell Me What Happened: Unifying Text-Guided Video Completion via Multimodal Masked Video Generation

Fairy: Fast Parallelized Instruction-Guided Video-to-Video Synthesis

Unsupervised Vision-and-Language Pre-Training via Retrieval-Based Multi-Granular Alignment

Connecting What To Say With Where To Look by Modeling Human Attention Traces

CiT: Curation in Training for Effective Vision-Language Data

Layout-Agnostic Scene Text Image Synthesis with Diffusion Models

Accelerating Multimodal Large Language Models by Searching Optimal Vision Token Reduction

Building a Mind Palace: Structuring Environment-Grounded Semantic Graphs for Effective Long Video Analysis with LLMs

ROICtrl: Boosting Instance Control for Visual Generation

GEB+: A Benchmark for Generic Event Boundary Captioning, Grounding and Retrieval
