Jiasen Lu

Affiliations

Allen Institute for AI

1,326 total citations

papers (11)

12-in-1: Multi-Task Vision and Language Representation Learning

CVPR 2020, arXiv

500 citations

Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision Language Audio and Action

CVPR 2024, arXiv

280 citations

MERLOT Reserve: Neural Script Knowledge Through Vision and Language and Sound

CVPR 2022, arXiv

241 citations

Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models

CVPR 2025, arXiv

111 citations

SlowFast-LLaVA-1.5: A Family of Token-Efficient Video Large Language Models for Long-Form Video Understanding

COLM 2025, arXiv

Dialog without Dialog Data: Learning Visual Dialog Agents from VQA Data

NeurIPS 2020, arXiv

MMEgo: Towards Building Egocentric Multimodal LLMs for Video QA

ICLR 2025

Container: Context Aggregation Networks

NeurIPS 2021

Spatially Aware Multimodal Transformers for TextVQA

One Diffusion to Generate Them All

STIV: Scalable Text and Image Conditioned Video Generation