This repository implements the idea of "caption upsampling" from DALL-E 3 with Zephyr-7B and gathers results with SDXL.
"Caption upsampling" is the $10 term for deriving a highly descriptive caption from a short caption. Here is an example:
Short: A bird scaring a scarecrow
Upsampled: A large, vibrant bird with an impressive wingspan swoops down from the sky, letting out a piercing call as it approaches a weathered scarecrow in a sunlit field. The scarecrow, dressed in tattered clothing and a straw hat, appears to tremble, almost as if it’s coming to life in fear of the approaching bird.
This is particularly useful in the context of text-to-image generation.
🌟 Update 23/10/2023: Got featured in this TLDR newsletter.
DALL-E 3 uses GPT-4 for upsampling the captions. This repository provides a similar implementation with an open-source model, so you don't have to pay for usage. It uses the "zephyr-7b-alpha" model, fine-tuned from the mighty Mistral-7B model.
You can find the upsampled captions from the DrawBench (introduced in Imagen) benchmark dataset here: sayakpaul/drawbench.
Refer to the upsample_drawbench_captions.py script for implementation details.
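For orientation, here is a minimal sketch of what such an upsampling step could look like, assuming the Hugging Face `transformers` text-generation pipeline with the `HuggingFaceH4/zephyr-7b-alpha` checkpoint; the system prompt shown is a placeholder, not the exact template from the DALL-E 3 report that the repository uses:

```python
from transformers import pipeline

# Load zephyr-7b-alpha as a chat model (assumes a GPU with enough memory).
pipe = pipeline(
    "text-generation",
    model="HuggingFaceH4/zephyr-7b-alpha",
    torch_dtype="auto",
    device_map="auto",
)

# Placeholder system prompt; the repository uses the template from the
# DALL-E 3 technical report (Appendix C) instead.
system_prompt = (
    "You rewrite short image captions into rich, highly descriptive prompts "
    "for a text-to-image model. Keep the original subject and intent."
)

def upsample_caption(caption: str) -> str:
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": caption},
    ]
    prompt = pipe.tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    outputs = pipe(prompt, max_new_tokens=256, do_sample=True, temperature=0.7)
    # Strip the prompt prefix so only the newly generated caption remains.
    return outputs[0]["generated_text"][len(prompt):].strip()

print(upsample_caption("A bird scaring a scarecrow"))
```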
After the DrawBench prompts were "upsampled", the generate_images.py script was used to generate images with the regular DrawBench prompts and the upsampled ones. You can find all the images here: sayakpaul/drawbench-sdxl.
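As a rough illustration of that step, the sketch below loads SDXL with `diffusers` and generates one image for a short prompt and one for its upsampled counterpart with a fixed seed; the actual script's arguments and batching may differ:

```python
import torch
from diffusers import StableDiffusionXLPipeline

# Load SDXL base in half precision (assumes a CUDA GPU).
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

short_prompt = "A bird scaring a scarecrow"
upsampled_prompt = (
    "A large, vibrant bird with an impressive wingspan swoops down from the sky, "
    "letting out a piercing call as it approaches a weathered scarecrow in a sunlit field."
)

# Fix the seed so the two generations differ only in the prompt.
for name, prompt in [("short", short_prompt), ("upsampled", upsampled_prompt)]:
    generator = torch.Generator("cuda").manual_seed(0)
    image = pipe(prompt, generator=generator, num_inference_steps=30).images[0]
    image.save(f"{name}.png")
```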
This section presents results from the SDXL Refiner and Kandinsky V2.2, generated using the scripts from the additional_examples directory.
- Since SDXL uses CLIP, upsampled captions longer than 77 tokens will not be fully utilized (see the token-count sketch after this list). One way to remedy this is to change the system prompt here so that the underlying generation model is more length-aware.
This repository uses the prompt template from the DALL-E 3 technical report (Appendix C).
- DALL-E 3 conducts training on a recaptioned dataset where the captions were regenerated to be much more detailed using GPT-4. It then demonstrates the effectiveness of using detailed prompts during inference. However, existing works (as noted here) show that it's possible to improve the generation quality of existing systems like SDXL with detailed prompts, even when those systems weren't trained on datasets with very detailed captions.
- It's important to investigate the output of the language model that produces the descriptive captions, as it directly impacts the quality of the generated images. As mentioned above, the prompt template is the original one used in the DALL-E 3 report. However, different language models might respond differently to that template, so figuring out which template gives the best output most of the time is crucial.
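As referenced in the first note above, here is a small sketch, assuming the first CLIP tokenizer bundled with the SDXL base checkpoint, for checking whether an upsampled caption exceeds the 77-token limit:

```python
from transformers import CLIPTokenizer

# SDXL's first text encoder uses a CLIP tokenizer with a 77-token context.
tokenizer = CLIPTokenizer.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", subfolder="tokenizer"
)

def exceeds_clip_limit(caption: str, max_length: int = 77) -> bool:
    # Count tokens without truncation so overly long captions are flagged.
    num_tokens = len(tokenizer(caption, truncation=False).input_ids)
    return num_tokens > max_length

print(exceeds_clip_limit("A bird scaring a scarecrow"))
```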
The core idea of using detailed prompts to improve the quality of the generated samples has been explored before. Readers are welcome to check out the following resources in this regard:
- "Better prompt engineering" section from this doc
- lllyasviel/Fooocus
Additionally, PixArt-Alpha shows that fine-tuning on a dataset with highly detailed captions can lead to substantial quality improvements.