Dynamic Diffusion Transformer

Part of the International Conference on Learning Representations 2025 (ICLR 2025)


Authors

Wangbo Zhao, Yizeng Han, Jiasheng Tang, Kai Wang, Yibing Song, Gao Huang, Fan Wang, Yang You

Abstract

Diffusion Transformer (DiT), an emerging diffusion model for image generation, has demonstrated superior performance but suffers from substantial computational costs. Our investigations reveal that these costs stem from the static inference paradigm, which inevitably introduces redundant computation in certain diffusion timesteps and spatial regions. To address this inefficiency, we propose Dynamic Diffusion Transformer (DyDiT), an architecture that dynamically adjusts its computation along both timestep and spatial dimensions during generation. Specifically, we introduce a Timestep-wise Dynamic Width (TDW) approach that adapts model width conditioned on the generation timesteps. In addition, we design a Spatial-wise Dynamic Token (SDT) strategy to avoid redundant computation at unnecessary spatial locations. Extensive experiments on various datasets and different-sized models verify the superiority of DyDiT. Notably, with <3% additional fine-tuning iterations, our method reduces the FLOPs of DiT-XL by 51%, accelerates generation by 1.73×, and achieves a competitive FID score of 2.07 on ImageNet.
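The two mechanisms named in the abstract can be pictured with a minimal PyTorch sketch. The module names (`TimestepWidthGate`, `SpatialTokenSelector`), the sigmoid gating over attention heads, and the fixed `keep_ratio` below are illustrative assumptions for exposition, not the paper's actual implementation of TDW and SDT.

```python
# Hedged sketch: timestep-conditioned width gating (TDW-like) and
# spatial token selection (SDT-like). Names and details are hypothetical.
import torch
import torch.nn as nn


class TimestepWidthGate(nn.Module):
    """Sketch of timestep-wise dynamic width: predict per-head gates from the
    timestep embedding; at inference, gated-off heads could be skipped to save FLOPs."""

    def __init__(self, t_dim: int, num_heads: int):
        super().__init__()
        self.router = nn.Linear(t_dim, num_heads)

    def forward(self, t_emb: torch.Tensor) -> torch.Tensor:
        # Soft gates in [0, 1]; thresholding them yields a timestep-dependent width.
        return torch.sigmoid(self.router(t_emb))


class SpatialTokenSelector(nn.Module):
    """Sketch of spatial-wise dynamic tokens: score patch tokens and keep only the
    top-k, so 'easy' spatial locations bypass the expensive transformer blocks."""

    def __init__(self, dim: int, keep_ratio: float = 0.5):
        super().__init__()
        self.scorer = nn.Linear(dim, 1)
        self.keep_ratio = keep_ratio

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, num_tokens, dim)
        scores = self.scorer(tokens).squeeze(-1)                  # (B, N)
        k = max(1, int(tokens.shape[1] * self.keep_ratio))
        keep_idx = scores.topk(k, dim=1).indices                  # (B, k)
        return torch.gather(
            tokens, 1, keep_idx.unsqueeze(-1).expand(-1, -1, tokens.shape[-1])
        )


if __name__ == "__main__":
    t_emb = torch.randn(4, 256)        # timestep embeddings
    tokens = torch.randn(4, 196, 768)  # patch tokens for a 14x14 grid
    head_gates = TimestepWidthGate(256, num_heads=12)(t_emb)
    kept = SpatialTokenSelector(768, keep_ratio=0.5)(tokens)
    print(head_gates.shape, kept.shape)  # torch.Size([4, 12]) torch.Size([4, 98, 768])
```

In this toy version, the compute saving comes from thresholding the head gates per timestep and from processing only the retained tokens in subsequent blocks; the paper should be consulted for how DyDiT actually parameterizes and trains these decisions.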