Native-Resolution Image Synthesis

¹MMLab, CUHK ²Shanghai AI Lab
*Correspondence

Figure 1: A single NiT model can generate images across diverse, arbitrary resolutions (from 256x256 to 2048x2048) and aspect ratios (from 1:5 to 3:1).


Figure 2: (a) ImageNet resolutions are mainly concentrated between 200 and 600 pixels (width/height), with sparse data beyond 800 pixels. Despite this, (b) shows NiT's superior generalization to unseen high resolutions (e.g., 1024, 1536), evidenced by significantly lower FID scores. (c) further confirms that NiT also exhibits the strongest generalization across various aspect ratios.

🚀 Key Contributions

🔮

New visual generative paradigm

Introduce native-resolution image synthesis, a novel generative modeling paradigm capable of synthesizing images at arbitrary resolutions and aspect ratios.

🔍

Native-resolution Modeling Architecture

Propose Native-resolution diffusion Transformer (NiT), an architecture designed for explicitly modeling varying resolutions and aspect ratios within its denoising process.

🏆

State-of-the-art Performance

A single NiT model simultaneously achieves SOTA performance on both the ImageNet 256x256 (2.03 FID) and 512x512 (1.45 FID) benchmarks, and exhibits strong zero-shot generalization (e.g., 4.52 FID at the unseen 1024x1024 resolution).

Qualitative Showcase

Introduction

Large Language Models (LLMs) effectively process variable-length text by training directly on native data formats. This inherent adaptability raises a critical question for image synthesis: Can diffusion models achieve similar flexibility, learning to generate images directly at their diverse, native resolutions and aspect ratios? Conventional diffusion models struggle to generalize to resolutions beyond their training regime. This limitation stems from three core difficulties: (1) the strong coupling between fixed receptive fields and learned feature scales in convolutional architectures; (2) the fragility of positional encodings and spatial-coordinate dependencies in transformer architectures; and (3) inefficient and unstable training dynamics when inputs vary in size.

We overcome these limitations by proposing a novel diffusion transformer architecture that directly models native-resolution image data for generation. Drawing inspiration from the variable-length sequence processing of Vision Transformers, we reformulate image generative modeling within diffusion transformers as "native-resolution generation" and present the Native-resolution diffusion Transformer (NiT), which generates images across a wide spectrum of resolutions and aspect ratios.

NiT: Native-resolution diffusion Transformer

NiT introduces three key architectural innovations; code sketches illustrating them follow this list.

1. Dynamic Tokenization: converts images at their native resolutions into variable-length token sequences, paired with tuples of the corresponding heights and widths. Because no input padding is required, substantial computational overhead is avoided.

2. Variable-Length Sequence Processing: uses Flash Attention to natively process the heterogeneous, unpadded token sequences, delimiting each image by cumulative sequence lengths and relying on its memory-tiling strategy.

3. 2D Structural Prior Injection: introduces an axial 2D Rotary Position Embedding (RoPE) that factorizes positional encoding along the height and width axes, injecting a 2D structural prior through relative positions.
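
To make innovations 1 and 2 concrete, the sketch below patchifies each native-resolution latent into an unpadded token sequence and records the cumulative sequence lengths (cu_seqlens) that variable-length attention kernels consume. Everything here (the patch_size value, function names, shapes) is an illustrative assumption, not the authors' implementation:

import torch

def patchify(latent: torch.Tensor, patch_size: int = 2):
    """Turn one native-resolution latent (C, H, W) into a variable-length
    token sequence plus its (height, width) tuple -- no padding involved."""
    c, H, W = latent.shape
    h, w = H // patch_size, W // patch_size
    tokens = (
        latent.reshape(c, h, patch_size, w, patch_size)
              .permute(1, 3, 2, 4, 0)                       # (h, w, p, p, c)
              .reshape(h * w, patch_size * patch_size * c)  # one token per patch
    )
    return tokens, (h, w)

def pack_batch(latents):
    """Pack differently sized images into one unpadded sequence and record
    the cumulative sequence lengths that delimit each image."""
    token_list, sizes = zip(*(patchify(x) for x in latents))
    lens = torch.tensor([t.shape[0] for t in token_list], dtype=torch.int32)
    # cu_seqlens = [0, n1, n1+n2, ...]: each image's span in the packed sequence
    cu_seqlens = torch.nn.functional.pad(torch.cumsum(lens, 0).to(torch.int32), (1, 0))
    return torch.cat(token_list, dim=0), cu_seqlens, sizes

Given q, k, v projections of the packed sequence (each of shape (total_tokens, num_heads, head_dim)), attention can then run without padding, e.g. via flash_attn_varlen_func(q, k, v, cu_seqlens, cu_seqlens, max_len, max_len) from the flash-attn package, whose tiled kernel never attends across image boundaries. The paper states that Flash Attention is used; this exact call is our assumption about the interface.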

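Similarly, for innovation 3, here is a minimal sketch of an axial 2D RoPE, assuming the common convention of splitting each head's channels evenly between the row and column axes (head_dim divisible by 4); the paper's exact formulation may differ:

import torch

def rope_angles(pos: torch.Tensor, dim: int, base: float = 10000.0):
    """Standard 1D RoPE angles for integer positions (dim must be even)."""
    freqs = 1.0 / base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim)
    angles = pos.float()[:, None] * freqs[None, :]  # (N, dim/2)
    return angles.cos(), angles.sin()

def rotate(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor):
    """Rotate consecutive channel pairs of x by the given angles."""
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def axial_rope_2d(x: torch.Tensor, h: int, w: int):
    """Apply row-index RoPE to the first half of head channels and
    column-index RoPE to the second half; x is (h*w, num_heads, head_dim)."""
    d = x.shape[-1]
    rows = torch.arange(h).repeat_interleave(w)  # row index of each token
    cols = torch.arange(w).repeat(h)             # column index of each token
    cos_r, sin_r = rope_angles(rows, d // 2)
    cos_c, sin_c = rope_angles(cols, d // 2)
    x_r = rotate(x[..., : d // 2], cos_r[:, None, :], sin_r[:, None, :])
    x_c = rotate(x[..., d // 2 :], cos_c[:, None, :], sin_c[:, None, :])
    return torch.cat([x_r, x_c], dim=-1)

Because rotary embeddings encode positions as rotations, the attention logits between two tokens depend only on their relative row and column offsets, which is the 2D relative structural prior this design injects.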

Experiments

NiT improves training efficiency and generation quality


A single NiT model is competitive on both benchmarks, whereas all baselines are resolution-expert methods that independently train two models, one per benchmark. To the best of our knowledge, this is the first time a single model competes on both benchmarks simultaneously. NiT-XL achieves the best FID of 1.45 on the 512x512 benchmark, outperforming the previous SOTA, EDM2-XXL, with half its model size. On the 256x256 benchmark, our model surpasses DiT-XL and FiTv2-XL in FID at the same model size, and outperforms LlamaGen-3B with far fewer parameters.
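
For context, the FID values above are Fréchet Inception Distances; with (\mu_r, \Sigma_r) and (\mu_g, \Sigma_g) denoting the mean and covariance of Inception features for real and generated images respectively, lower is better:

\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2 + \operatorname{Tr}\bigl(\Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2}\bigr)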

NiT improves generalization across diverse resolutions and aspect ratios


NiT-XL significantly surpasses all the baselines on resolution generalization. Remarkably, NiT-XL demonstrates almost no performance degradation when scaling to unseen higher resolutions.


NiT-XL can generalize to arbitrary aspect ratios, greatly outperforming all the baselines.

BibTeX

If you find our work useful, please cite our paper. The BibTeX entry is provided below:

@article{wang2025native,
    title={Native-Resolution Image Synthesis}, 
    author={Wang, Zidong and Bai, Lei and Yue, Xiangyu and Ouyang, Wanli and Zhang, Yiyuan},
    year={2025},
    eprint={2506.03131},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
}