Large Language Models (LLMs) effectively process variable-length text by training directly on data in its native format. This inherent adaptability raises a critical question for image synthesis: can diffusion models achieve similar flexibility, learning to generate images directly at their diverse native resolutions and aspect ratios? Conventional diffusion models struggle to generalize to resolutions beyond their training regime. This limitation stems from three core difficulties: (1) the strong coupling between the fixed receptive fields of convolutional architectures and the feature scales they learn; (2) the fragility of positional encodings and spatial-coordinate dependencies in transformer architectures; and (3) inefficient and unstable training dynamics arising from variable-size inputs.
We overcome these limitations by proposing a novel diffusion-transformer architecture that directly models image data at its native resolution. Drawing inspiration from the variable-length sequence processing of Vision Transformers, we reformulate image generative modeling within diffusion transformers as "native-resolution generation" and present the Native-resolution diffusion Transformer (NiT), which generates images across a wide spectrum of resolutions and aspect ratios.
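To make the reformulation concrete, the sketch below illustrates the kind of representation that "native-resolution generation" relies on: each image, at its own resolution, is patchified into a variable-length token sequence paired with 2D patch coordinates, as a variable-sequence transformer would consume it. This is an illustrative assumption, not NiT's actual implementation; the function name patchify_native and the patch size are hypothetical placeholders.

```python
import torch


def patchify_native(image: torch.Tensor, patch_size: int = 16):
    """Split one image of arbitrary (H, W) into a variable-length sequence
    of flattened patches plus their 2D grid coordinates.

    Illustrative sketch only (not NiT's actual tokenizer).
    image: (C, H, W) tensor whose H and W are multiples of patch_size.
    Returns:
        tokens: (N, C * patch_size**2), where N = (H / p) * (W / p)
        coords: (N, 2) integer (row, col) patch indices for positional encoding
    """
    c, h, w = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    gh, gw = h // patch_size, w // patch_size
    # (C, gh, p, gw, p) -> (gh, gw, C, p, p) -> (N, C * p * p)
    patches = image.reshape(c, gh, patch_size, gw, patch_size)
    tokens = patches.permute(1, 3, 0, 2, 4).reshape(gh * gw, -1)
    # 2D patch coordinates, so positional information is tied to the grid
    # location rather than a fixed canvas size.
    rows = torch.arange(gh).repeat_interleave(gw)
    cols = torch.arange(gw).repeat(gh)
    coords = torch.stack([rows, cols], dim=1)
    return tokens, coords


# Images at their native resolutions yield sequences of different lengths:
# 256x256 -> 256 tokens, 384x512 -> 768 tokens (patch_size = 16).
tok_a, pos_a = patchify_native(torch.randn(3, 256, 256))
tok_b, pos_b = patchify_native(torch.randn(3, 384, 512))
print(tok_a.shape, tok_b.shape)  # torch.Size([256, 768]) torch.Size([768, 768])
```

Because sequence length now varies per sample, the transformer sees each image at its own aspect ratio rather than a resized, fixed-size canvas, which is the flexibility the native-resolution formulation targets.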