"diffusion transformer" Papers

42 papers found

ACE: All-round Creator and Editor Following Instructions via Diffusion Transformer

Zhen Han, Zeyinzi Jiang, Yulin Pan et al.

ICLR 2025 · arXiv:2410.00086 · 43 citations

Acquire and then Adapt: Squeezing out Text-to-Image Model for Image Restoration

Junyuan Deng, Xinyi Wu, Yongxing Yang et al.

CVPR 2025 · arXiv:2504.15159 · 3 citations

CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

Zhuoyi Yang, Jiayan Teng, Wendi Zheng et al.

ICLR 2025 (oral) · arXiv:2408.06072 · 1409 citations

Dense2MoE: Restructuring Diffusion Transformer to MoE for Efficient Text-to-Image Generation

Youwei Zheng, Yuxi Ren, Xin Xia et al.

ICCV 2025 · arXiv:2510.09094 · 5 citations

Diff-Prompt: Diffusion-driven Prompt Generator with Mask Supervision

Weicai Yan, Wang Lin, Zirun Guo et al.

ICLR 2025 · arXiv:2504.21423 · 7 citations

Diffusion Transformer Captures Spatial-Temporal Dependencies: A Theory for Gaussian Process Data

Hengyu Fu, Zehao Dou, Jiawei Guo et al.

ICLR 2025 (oral) · arXiv:2407.16134 · 3 citations

Direct3D-S2: Gigascale 3D Generation Made Easy with Spatial Sparse Attention

Shuang Wu, Youtian Lin, Feihu Zhang et al.

NeurIPS 2025 · arXiv:2505.17412 · 37 citations

DiTaiListener: Controllable High Fidelity Listener Video Generation with Diffusion

Maksim Siniukov, Di Chang, Minh Tran et al.

ICCV 2025 · arXiv:2504.04010 · 3 citations

Dita: Scaling Diffusion Transformer for Generalist Vision-Language-Action Policy

Zhi Hou, Tianyi Zhang, Yuwen Xiong et al.

ICCV 2025 · arXiv:2503.19757 · 40 citations

DiTCtrl: Exploring Attention Control in Multi-Modal Diffusion Transformer for Tuning-Free Multi-Prompt Longer Video Generation

Minghong Cai, Xiaodong Cun, Xiaoyu Li et al.

CVPR 2025 · arXiv:2412.18597 · 46 citations

DreamFuse: Adaptive Image Fusion with Diffusion Transformer

Junjia Huang, Pengxiang Yan, Jiyang Liu et al.

ICCV 2025 · arXiv:2504.08291 · 6 citations

EasyControl: Adding Efficient and Flexible Control for Diffusion Transformer

Yuxuan Zhang, Yirui Yuan, Yiren Song et al.

ICCV 2025 · arXiv:2503.07027 · 75 citations

Efficient Long Video Tokenization via Coordinate-based Patch Reconstruction

Huiwon Jang, Sihyun Yu, Jinwoo Shin et al.

CVPR 2025 · arXiv:2411.14762 · 4 citations

FlashMo: Geometric Interpolants and Frequency-Aware Sparsity for Scalable Efficient Motion Generation

Zeyu Zhang, Yiran Wang, Danning Li et al.

NeurIPS 2025 (oral)

From Reflection to Perfection: Scaling Inference-Time Optimization for Text-to-Image Diffusion Models via Reflection Tuning

Le Zhuo, Liangbing Zhao, Sayak Paul et al.

ICCV 2025 · arXiv:2504.16080 · 32 citations

IRASim: A Fine-Grained World Model for Robot Manipulation

Fangqi Zhu, Hongtao Wu, Song Guo et al.

ICCV 2025 · arXiv:2406.14540 · 22 citations

Language-Guided Image Tokenization for Generation

Kaiwen Zha, Lijun Yu, Alireza Fathi et al.

CVPR 2025 · arXiv:2412.05796 · 25 citations

LaVin-DiT: Large Vision Diffusion Transformer

Zhaoqing Wang, Xiaobo Xia, Runnan Chen et al.

CVPR 2025 · arXiv:2411.11505 · 20 citations

Lumina-T2X: Scalable Flow-based Large Diffusion Transformer for Flexible Resolution Generation

Gao Peng, Le Zhuo, Dongyang Liu et al.

ICLR 2025 (oral) · 9 citations

MotionCraft: Crafting Whole-Body Motion with Plug-and-Play Multimodal Controls

Yuxuan Bian, Ailing Zeng, Xuan Ju et al.

AAAI 2025 (paper) · arXiv:2407.21136 · 19 citations

MotionStone: Decoupled Motion Intensity Modulation with Diffusion Transformer for Image-to-Video Generation

Shuwei Shi, Biao Gong, Xi Chen et al.

CVPR 2025 · arXiv:2412.05848 · 14 citations

Multi-subject Open-set Personalization in Video Generation

Tsai-Shien Chen, Aliaksandr Siarohin, Willi Menapace et al.

CVPR 2025 · arXiv:2501.06187 · 40 citations

Neural-Driven Image Editing

Pengfei Zhou, Jie Xia, Xiaopeng Peng et al.

NeurIPS 2025 · arXiv:2507.05397 · 2 citations

OminiControl: Minimal and Universal Control for Diffusion Transformer

Zhenxiong Tan, Songhua Liu, Xingyi Yang et al.

ICCV 2025 (highlight) · arXiv:2411.15098 · 225 citations

Pippo: High-Resolution Multi-View Humans from a Single Image

Yash Kant, Ethan Weber, Jin Kyu Kim et al.

CVPR 2025 (highlight) · arXiv:2502.07785 · 14 citations

Pyramidal Flow Matching for Efficient Video Generative Modeling

Yang Jin, Zhicheng Sun, Ningyuan Li et al.

ICLR 2025 (oral) · arXiv:2410.05954 · 227 citations

REGEN: Learning Compact Video Embedding with (Re-)Generative Decoder

Yitian Zhang, Long Mai, Aniruddha Mahapatra et al.

ICCV 2025 · arXiv:2503.08665 · 1 citation

ROSE: Remove Objects with Side Effects in Videos

Chenxuan Miao, Yutong Feng, Jianshu Zeng et al.

NeurIPS 2025 · arXiv:2508.18633 · 6 citations

Stable Flow: Vital Layers for Training-Free Image Editing

Omri Avrahami, Or Patashnik, Ohad Fried et al.

CVPR 2025 · arXiv:2411.14430 · 60 citations

TokMan: Tokenize Manhattan Mask Optimization for Inverse Lithography

Yiwen Wu, Yuyang Chen, Ye Xia et al.

NeurIPS 2025

Tora: Trajectory-oriented Diffusion Transformer for Video Generation

Zhenghao Zhang, Junchao Liao, Menghao Li et al.

CVPR 2025 · arXiv:2407.21705 · 115 citations

UniCombine: Unified Multi-Conditional Combination with Diffusion Transformer

Haoxuan Wang, Jinlong Peng, Qingdong He et al.

ICCV 2025 · arXiv:2503.09277 · 17 citations

VACE: All-in-One Video Creation and Editing

Zeyinzi Jiang, Zhen Han, Chaojie Mao et al.

ICCV 2025 · arXiv:2503.07598 · 181 citations

VideoVLA: Video Generators Can Be Generalizable Robot Manipulators

Yichao Shen, Fangyun Wei, Zhiying Du et al.

NeurIPS 2025 · arXiv:2512.06963 · 5 citations

VIRES: Video Instance Repainting via Sketch and Text Guided Generation

Shuchen Weng, Haojie Zheng, Peixuan Zhang et al.

CVPR 2025 · arXiv:2411.16199 · 1 citation

X2I: Seamless Integration of Multimodal Understanding into Diffusion Transformer via Attention Distillation

Jian Ma, Qirong Peng, Xu Guo et al.

ICCV 2025 · arXiv:2503.06134 · 5 citations

CasCast: Skillful High-resolution Precipitation Nowcasting via Cascaded Modelling

Junchao Gong, Lei Bai, Peng Ye et al.

ICML 2024 · arXiv:2402.04290 · 46 citations

Inf-DiT: Upsampling Any-Resolution Image with Memory-Efficient Diffusion Transformer

Zhuoyi Yang, Heyang Jiang, Wenyi Hong et al.

ECCV 2024 · arXiv:2405.04312 · 11 citations

Large Motion Model for Unified Multi-Modal Motion Generation

Mingyuan Zhang, Daisheng Jin, Chenyang Gu et al.

ECCV 2024 · arXiv:2404.01284 · 63 citations

Lazy Diffusion Transformer for Interactive Image Editing

Yotam Nitzan, Zongze Wu, Richard Zhang et al.

ECCV 2024 · arXiv:2404.12382 · 17 citations

PixArt-Sigma: Weak-to-Strong Training of Diffusion Transformer for 4K Text-to-Image Generation

Junsong Chen, Chongjian GE, Enze Xie et al.

ECCV 2024 · 223 citations

SLEDGE: Synthesizing Driving Environments with Generative Models and Rule-Based Traffic

Kashyap Chitta, Daniel Dauner, Andreas Geiger

ECCV 2024 · arXiv:2403.17933 · 22 citations