
DALL-E-Explained

Description and applications of OpenAI's paper about the DALL-E model and implementation of a text-to-image generation scheme using CLIP.

What is this notebook about?

The primary purpose of the notebook in this repository is to give a brief explanation of OpenAI's Zero-Shot Text-to-Image Generation paper (1), where they introduce DALL-E, a deep-learning model that generates images directly from a text prompt. I will also showcase some of the outputs that can be accomplished with the model described in their paper and walk through how you can generate your own images from text captions (although using a different methodology than the one described in the paper).



What is DALL-E and Zero-Shot Text-to-Image Generation?

On January 5th, 2021, OpenAI released a blog post introducing their new deep learning model DALL-E [1], a transformer language model trained to generate images from text captions with a remarkable degree of coherence. A few months later, they published the paper Zero-Shot Text-to-Image Generation describing their approach to creating this model, along with code to replicate the discrete Variational Autoencoder (dVAE) used in their research.

Zero-Shot Text-to-Image generation refers to the concept of generating an image from a text input in a way that makes the image consistent with the text. If the prompt "A giraffe wearing a red scarf" is given, then one would expect the output to be an image that resembles a giraffe with a red piece of cloth around its neck. The "Zero-Shot" part comes from the fact that the model wasn't explicitly trained on a fixed set of text prompts, meaning that it can, in principle, generalize to any text input (with varying degrees of success).

How does DALL-E work?

DALL-E is a language model that is, at its core, an autoregressive network with 12 billion parameters trained on 250 million image-text pairs. In the paper, they explain the methodology behind the model in two parts, corresponding to the two stages of learning they had to carry out:

  • The first part was about learning the vocabulary of the image-text pairs. What they did was train a discrete Variational Autoencoder (dVAE) to compress the 256x256x3 training images into 32x32 grids of discrete image tokens with a vocabulary size of 8192. That is, they learned to map and reconstruct an image to and from an embedding (or latent) space of 32*32=1024 integers (image tokens); a minimal reconstruction sketch using the released dVAE is shown right after this list.
[Figures: example of a VQ-VAE taken from van den Oord et al. 2017 [2]; reconstruction of an original image by DALL-E's dVAE]
  • The second part was about learning the prior distribution over the text and image tokens. What they did here was concatenate the 256 tokens obtained from encoding each input text prompt with the 1024 tokens from its corresponding encoded image, and train a transformer to model this autoregressively as a single stream of 256+1024 = 1280 tokens. The result is that, from an initial set of at least 256 tokens, the model will "autocomplete" the remaining ones such that the generated image is consistent with the initial tokens [3].
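As a concrete illustration of the first stage (the transformer from the second stage was never released), here is a minimal sketch that encodes an image into its 32x32 grid of discrete tokens and decodes it back with the dVAE weights OpenAI published (5). It follows the usage example from the openai/DALL-E repository; "photo.jpg" is a placeholder path for any image you want to reconstruct.

```python
# Minimal dVAE reconstruction sketch based on the usage example from the
# openai/DALL-E repository (5). "photo.jpg" is a placeholder path.
import torch
import torch.nn.functional as F
import torchvision.transforms as T
import torchvision.transforms.functional as TF
from PIL import Image
from dall_e import map_pixels, unmap_pixels, load_model

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
enc = load_model("https://cdn.openai.com/dall-e/encoder.pkl", device)
dec = load_model("https://cdn.openai.com/dall-e/decoder.pkl", device)

# Preprocess: resize/crop to 256x256 and map the pixels into the dVAE's input range.
img = Image.open("photo.jpg").convert("RGB")
img = TF.center_crop(TF.resize(img, 256), [256, 256])
x = map_pixels(T.ToTensor()(img).unsqueeze(0).to(device))

# Encode: 256x256x3 image -> 32x32 grid of discrete tokens (vocabulary of 8192).
z_logits = enc(x)
z = torch.argmax(z_logits, dim=1)  # shape (1, 32, 32), integer image tokens
z = F.one_hot(z, num_classes=enc.vocab_size).permute(0, 3, 1, 2).float()

# Decode: 32x32 token grid -> reconstructed 256x256x3 image.
x_rec = unmap_pixels(torch.sigmoid(dec(z).float()[:, :3]))
T.ToPILImage(mode="RGB")(x_rec[0].cpu()).save("reconstruction.jpg")
```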

In summary, with the dVAE from the first stage and the autoregressive transformer from the second, a single generation pass of DALL-E would (1) use the transformer to predict the 1024 image tokens that follow the 256 tokens obtained from the input text prompt, and (2) take the full stream of 1024 image tokens produced by the transformer and generate an image with the dVAE decoder, mapping from the embedding space back onto the image space.

DALL-E Results

The results published in their blog and paper show a remarkable capability for generating completely new images that are coherent with the input text prompt. The model is also capable of completing images whose bottom half is missing, or of taking a given image at the top and generating a new image at the bottom that follows the relationship described in the prompt.

[Figures: text-to-image examples such as "an armchair in the shape of an avocado", "the exact same cat at the top as a sketch at the bottom", and a bust of Homer]

[1] The name DALL-E comes from a wordplay combining WALL-E, the Disney-Pixar character, and Dalí, from Salvador Dalí, the famous Spanish painter.

[2] Oord, Aaron van den, Oriol Vinyals, and Koray Kavukcuoglu. "Neural discrete representation learning." (2017) [Link]

[3] This is similar to what GPT-3 (another language model by OpenAI) does to generate text from an initial text input, although GPT-3 is more than 10 times larger than DALL-E, with 175 billion parameters (Source).


Implementation of another Text-to-Image Generation Scheme using OpenAI's CLIP

Even though a lot of people would love to play with DALL-E and/or see more of it in action, OpenAI hasn't fully released it to the public yet, and they sadly haven't expressed any plans to do so in the near future. What they did release is the dVAE described in the first stage of their paper. But, even though it can be used to map and reconstruct existing images to and from its latent space, it is missing the component that is actually able to turn text into images (the transformer).

Additionally, for most people and companies it is prohibitively expensive to attempt to train a model as large as DALL-E themselves (it would cost more than a hundred thousand dollars to train such a model!). Because of that, and until they release the full model (if ever), we are bound to look for or come up with other schemes that can do text-to-image generation in a different way.

Ryan Murdoch is one person who has come up with a simple scheme to accomplish this. He devised a method that combines OpenAI's own CLIP with any generative image model (like DALL-E's dVAE) to generate images from any text prompt.

Text-to-Image generation with CLIP

What is CLIP?

CLIP was introduced by OpenAI in another blog post on the same day that they introduced DALL-E. CLIP is a neural network that is extremely good at telling whether an image and a text label fit together: given an image and any set of text labels, CLIP will output how likely each label is to be representative of the image. So if you show CLIP an image of a cat and the labels ["a dog", "a giraffe", "a house", "a cat"], it will assign the highest probability to the label that matches the picture ("a cat" in this case).
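To make this concrete, the snippet below scores a cat picture against those four labels. It is adapted from the usage example in the openai/CLIP repository (2); "cat.jpg" is a placeholder path for the image.

```python
# Score an image against a set of text labels with CLIP
# (adapted from the openai/CLIP README). "cat.jpg" is a placeholder path.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("cat.jpg")).unsqueeze(0).to(device)
text = clip.tokenize(["a dog", "a giraffe", "a house", "a cat"]).to(device)

with torch.no_grad():
    logits_per_image, logits_per_text = model(image, text)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()

print(probs)  # the highest probability should fall on "a cat"
```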

[Figure: CLIP is really good at telling you whether an image fits a text label, and it is fully differentiable]

The beauty of CLIP is that the network is fully differentiable. If we take a generator, feed every image it creates to CLIP, and define our loss so that a high CLIP score is rewarded, then the "error" between the given label(s) and the image can be backpropagated through the generator model to incrementally get closer and closer to an image that CLIP recognizes as resembling the text label. So if we start with any image obtained from the generator (it can be random, or just noise), we only need to traverse the generator's latent space in the direction that minimizes CLIP's error until we arrive at an image that is good enough at matching the text (by CLIP's standards).

[Figure: Backpropagating through CLIP and the generator network]
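The sketch below is a toy version of that loop, just to show the mechanics: instead of a dVAE or VQGAN, the "generator" is simply the pixel tensor of the image itself, which we optimize directly so that CLIP's similarity with the prompt goes up. The notebook's actual method optimizes the latent codes of a proper generator, but the backpropagation idea is the same. (CLIP's usual input normalization and the augmentations that help in practice are omitted here for brevity.)

```python
# Toy CLIP-guided generation: optimize raw pixels so that CLIP thinks the
# image matches the prompt. A real setup would optimize a generator's latents
# and would also apply CLIP's input normalization and image augmentations.
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device, jit=False)
model.float()  # full precision so the gradients are better behaved

prompt = clip.tokenize(["a city landscape in the style of Van Gogh"]).to(device)
with torch.no_grad():
    text_features = model.encode_text(prompt)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)

# Start from random noise; this tensor plays the role of the generator's output.
image = torch.rand(1, 3, 224, 224, device=device, requires_grad=True)
optimizer = torch.optim.Adam([image], lr=0.05)

for step in range(300):
    optimizer.zero_grad()
    image_features = model.encode_image(image.clamp(0, 1))
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    # Loss = negative cosine similarity between the image and the text prompt;
    # minimizing it pushes the pixels toward something CLIP associates with the text.
    loss = -(image_features * text_features).sum()
    loss.backward()
    optimizer.step()
```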

In the notebook in this repository you will find an implementation of the methodology described above for Zero-Shot Text-to-Image generation. Most of the code in it was adapted from other notebooks published by Ryan Murdoch (4). I've only expanded the ways the outputs are visualized and integrated two different generators into a single notebook, so that it is possible to choose between the dVAE used by DALL-E (5) and a VQGAN created by CompVis that uses Taming Transformers (3).

The following are some examples of media I've been able to generate with this method:

Input: "A city landscape in the style of Van Gogh" (DALL-E dVAE) Input: Selfie of me + "A cat" (progression video, VQGAN)
cityscape in the style of Van Gogh me+cat = gif

While the results of this method don't seem to be as realistic as what DALL-E produces, they are still very impressive (and extremely fun to play with!).


Credits and References:

  1. Zero-Shot Text-to-Image Generation: https://paperswithcode.com/paper/zero-shot-text-to-image-generation (Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, Ilya Sutskever)

  2. OpenAI CLIP: https://github.com/openai/CLIP (Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever)

  3. CompVis Taming Transformers: https://github.com/CompVis/taming-transformers (Patrick Esser, Robin Rombach, Bjorn Ommer)

  4. Ryan Murdoch's work (@advadnoun on Twitter). Most of the code implementations here are taken and/or adapted from some of his notebooks.

  5. OpenAI DALL-E's dVAE: https://github.com/openai/DALL-E/ (Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, Ilya Sutskever)