An implementation of text-to-image generation based on "High-Resolution Image Synthesis with Latent Diffusion Models" by Rombach et al., with a focus on training on a TPU-v3-8 VM. Includes full training code for the VAE and the diffusion model (DM), with minimal changes from the original paper.
Non-cherry-picked generated outputs after 90k training steps at batch size 1024 and lr=1e-4, on a ~25 million image subset of laion2B-en: