LaVin-DiT: Large Vision Diffusion Transformer

20 citations

arXiv:2411.11505

#401 in CVPR 2025 of 2873 papers

Top Authors

Zhaoqing Wang, Xiaobo Xia, Runnan Chen, Dongdong Yu, Changhu Wang, Mingming Gong, Tongliang Liu

Topics

diffusion transformer, vision foundation model, spatial-temporal variational autoencoder, in-context learning, unified multi-task training, generative vision tasks, latent space encoding, scalable vision model

Abstract

This paper presents the Large Vision Diffusion Transformer (LaVin-DiT), a scalable and unified foundation model designed to tackle over 20 computer vision tasks in a generative framework. Unlike existing large vision models directly adapted from natural language processing architectures, which rely on less efficient autoregressive techniques and disrupt spatial relationships essential for vision data, LaVin-DiT introduces key innovations to optimize generative performance for vision tasks. First, to address the high dimensionality of visual data, we incorporate a spatial-temporal variational autoencoder that encodes data into a continuous latent space. Second, for generative modeling, we develop a joint diffusion transformer that progressively produces vision outputs. Third, for unified multi-task training, in-context learning is implemented. Input-target pairs serve as task context, which guides the diffusion transformer to align outputs with specific tasks within the latent space. During inference, a task-specific context set and test data as queries allow LaVin-DiT to generalize across tasks without fine-tuning. Trained on extensive vision datasets, the model is scaled from 0.1B to 3.4B parameters, demonstrating substantial scalability and state-of-the-art performance across diverse vision tasks. This work introduces a novel pathway for large vision foundation models, underscoring the promising potential of diffusion transformers. The code and models are available.
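The inference-time data flow described in the abstract (input-target pairs as task context, test data as the query, and an output produced progressively in latent space) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not the authors' implementation: latents are plain lists of floats standing in for autoencoder outputs, `build_sequence`, `toy_velocity`, and `generate` are hypothetical names, the "model" is a hand-written stand-in that infers the task as an additive offset, and the Euler-style refinement loop is a generic choice because the abstract does not specify the sampler.

```python
import random


def build_sequence(context_pairs, query, noisy_target):
    """Lay out the conditioning sequence: context pairs, then query, then noisy target."""
    seq = []
    for inp, tgt in context_pairs:
        seq.append(("context_input", inp))
        seq.append(("context_target", tgt))
    seq.append(("query", query))
    seq.append(("noisy_target", noisy_target))
    return seq


def toy_velocity(seq, t):
    """Hypothetical stand-in for the diffusion transformer (NOT the paper's model).

    It infers the task from the context as a per-element offset
    (mean of target - input) and returns a velocity that moves the
    noisy target toward query + offset.
    """
    inputs = [lat for role, lat in seq if role == "context_input"]
    targets = [lat for role, lat in seq if role == "context_target"]
    query = next(lat for role, lat in seq if role == "query")
    x = next(lat for role, lat in seq if role == "noisy_target")
    n = len(inputs)
    offset = [
        sum(tg[i] - ip[i] for ip, tg in zip(inputs, targets)) / n
        for i in range(len(query))
    ]
    pred = [q + o for q, o in zip(query, offset)]
    # Velocity along a straight path from the current state to the prediction.
    return [(p - xi) / (1.0 - t) for p, xi in zip(pred, x)]


def generate(context_pairs, query, steps=10, seed=0):
    """Start from Gaussian noise and refine it step by step, guided by the context."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in query]
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt  # t stays below 1, so the division in toy_velocity is safe
        v = toy_velocity(build_sequence(context_pairs, query, x), t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Two context pairs that both demonstrate the same toy task: "add 2".
context = [([1.0, 2.0], [3.0, 4.0]), ([0.0, 5.0], [2.0, 7.0])]
result = generate(context, [10.0, 20.0])
```

With this toy task, `result` ends up close to `[12.0, 22.0]`: the task is never named explicitly, and swapping in a different context set changes the output without changing any model code, which is the behaviour the abstract attributes to in-context learning.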

Citation History

Chart dates: Jan 26, 2026; Feb 2, 2026; Feb 13, 2026

Feb 13, 2026: 20 citations (+1)