Official PyTorch implementation of DiffiT: Diffusion Vision Transformers for Image Generation.
Code and pretrained DiffiT models will be released soon!
DiffiT achieves a new SOTA FID score of 1.73 on the ImageNet-256 dataset!
In addition, DiffiT sets a new SOTA FID score of 2.22 on the FFHQ-64 dataset!
We introduce a new Time-dependent Multihead Self-Attention (TMSA) mechanism that jointly learns spatial and temporal dependencies and enables attention conditioning with fine-grained control.
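Concretely, TMSA forms queries, keys, and values from both the spatial tokens and a time-step embedding, so the attention weights themselves depend on the diffusion time step. Below is a minimal PyTorch sketch of that idea; the class name `TMSA`, the projection layout, and the hyperparameters are our assumptions for illustration, not the released implementation:

```python
import torch
import torch.nn as nn

class TMSA(nn.Module):
    """Sketch of time-dependent multihead self-attention (illustrative, not official code).

    q, k, v are each a spatial projection of the image tokens plus a temporal
    projection of the time-step embedding, e.g. q = x W_qs + t W_qt, so the
    attention map varies with the diffusion time step.
    """

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        # Separate spatial and temporal linear projections for q, k, v.
        self.qkv_spatial = nn.Linear(dim, 3 * dim, bias=False)
        self.qkv_temporal = nn.Linear(dim, 3 * dim, bias=False)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor, t_emb: torch.Tensor) -> torch.Tensor:
        # x: (B, N, dim) spatial tokens; t_emb: (B, dim) time-step embedding.
        B, N, C = x.shape
        # Broadcast the temporal projection across all spatial positions.
        qkv = self.qkv_spatial(x) + self.qkv_temporal(t_emb).unsqueeze(1)
        qkv = qkv.reshape(B, N, 3, self.num_heads, C // self.num_heads)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)  # each: (B, heads, N, head_dim)
        attn = (q @ k.transpose(-2, -1) * self.scale).softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)
```

For example, `TMSA(dim=128)(torch.randn(2, 64, 128), torch.randn(2, 128))` returns a `(2, 64, 128)` tensor. The windowed variant and relative positional bias described in the paper are omitted here for brevity.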
- [04.02.2024] 🔥 Updated manuscript now available on arXiv!
- [12.04.2023] 🔥 Paper is published on arXiv!
Model | Dataset | Resolution | FID-50K | Inception Score |
---|---|---|---|---|
Latent DiffiT | ImageNet | 256x256 | 1.73 | 276.49 |
Latent DiffiT | ImageNet | 512x512 | 2.67 | 252.12 |
Model | Dataset | Resolution | FID-50K |
---|---|---|---|
DiffiT | CIFAR-10 | 32x32 | 1.95 |
DiffiT | FFHQ-64 | 64x64 | 2.22 |
@article{hatamizadeh2023diffit,
title={DiffiT: Diffusion Vision Transformers for Image Generation},
author={Hatamizadeh, Ali and Song, Jiaming and Liu, Guilin and Kautz, Jan and Vahdat, Arash},
journal={arXiv preprint arXiv:2312.02139},
year={2023}
}
Copyright © 2024, NVIDIA Corporation. All rights reserved.