caption-upsampling

This repository implements the idea of "caption upsampling" from DALL-E 3 with Zephyr-7B and gathers results with SDXL.


A white colored sandwich.	A white car and a red sheep.	A side view of an owl sitting in a field.

A white-bread sandwich with delicate layers of fluffy turkey, crisp lettuce, and juicy tomatoes is placed on a wooden cutting board. The sandwich is surrounded by various condiments, including mayonnaise, mustard, and a small jar of pickles. The scene is set in a cozy kitchen, with natural light pouring in through a window.	A white car is parked on the side of a road in a green meadow. In the distance, a flock of red sheep can be seen grazing. The car seems to be abandoned, and the windows are shattered. The scene is eerie, and there is an unsettling feeling in the air.	A regal-looking snowy owl perches on a rocky outcropping, its feathers fluffed against the chilly wind. The bird's large, yellow eyes are fixed on a rabbit nibbling on some grass in the distance. The sun sets behind the owl, casting a warm orange glow over the landscape.

_{Explore more samples here. Find additional examples below with SDXL Refiner and Kandinsky V2.2.}

"Caption upsampling" is the $10 term for deriving a highly descriptive caption from a short caption. Here is an example:

Short: A bird scaring a scarecrow

Upsampled: A large, vibrant bird with an impressive wingspan swoops down from the sky, letting out a piercing call as it approaches a weathered scarecrow in a sunlit field. The scarecrow, dressed in tattered clothing and a straw hat, appears to tremble, almost as if it’s coming to life in fear of the approaching bird.

This is particularly useful in the context of text-to-image generation.

🌟 Update 23/10/2023: Got featured in this TLDR newsletter.

Why does this repo exist?

DALL-E 3 uses GPT-4 for upsampling the captions. This repository aims at providing an implementation with an open-source model that is capable of performing something similar but doesn't require you to pay for the usage. As such it makes use of the "zephyr-7b-alpha" model, fine-tuned from the mighty Mistral-7B model.

You can find the upsampled captions from the DrawBench (introduced in Imagen) benchmark dataset here: sayakpaul/drawbench.

Refer to the upsample_drawbench_captions.py script for implementation details.

Images with and without caption upsampling

After the DrawBench prompts were "upsampled", the generate_images.py script was used to generate images with the regular DrawBench prompts and the upsampled ones. You can find all the images here: sayakpaul/drawbench-sdxl.

Additional examples

This section presents results generated using the SDXL Refiner and Kandinsky V2.2. These were generated using the scripts from the additional_examples directory.

SDXL Refiner


A white colored sandwich.	A white car and a red sheep.	A side view of an owl sitting in a field.

A white-bread sandwich with delicate layers of fluffy turkey, crisp lettuce, and juicy tomatoes is placed on a wooden cutting board. The sandwich is surrounded by various condiments, including mayonnaise, mustard, and a small jar of pickles. The scene is set in a cozy kitchen, with natural light pouring in through a window.	A white car is parked on the side of a road in a green meadow. In the distance, a flock of red sheep can be seen grazing. The car seems to be abandoned, and the windows are shattered. The scene is eerie, and there is an unsettling feeling in the air.	A regal-looking snowy owl perches on a rocky outcropping, its feathers fluffed against the chilly wind. The bird's large, yellow eyes are fixed on a rabbit nibbling on some grass in the distance. The sun sets behind the owl, casting a warm orange glow over the landscape.

_{Explore more samples here.}

Kandinsky V2.2


A white colored sandwich.	A white car and a red sheep.	A side view of an owl sitting in a field.

A white-bread sandwich with delicate layers of fluffy turkey, crisp lettuce, and juicy tomatoes is placed on a wooden cutting board. The sandwich is surrounded by various condiments, including mayonnaise, mustard, and a small jar of pickles. The scene is set in a cozy kitchen, with natural light pouring in through a window.	A white car is parked on the side of a road in a green meadow. In the distance, a flock of red sheep can be seen grazing. The car seems to be abandoned, and the windows are shattered. The scene is eerie, and there is an unsettling feeling in the air.	A regal-looking snowy owl perches on a rocky outcropping, its feathers fluffed against the chilly wind. The bird's large, yellow eyes are fixed on a rabbit nibbling on some grass in the distance. The sun sets behind the owl, casting a warm orange glow over the landscape.

_{Explore more samples here.}

Limitations ⛔️

Since SDXL uses CLIP, upsampled captions leading to more than 77 tokens will not be fully utilized. One way to remedy this would be to change the system prompt here so that the underlying generation model is more length-aware.

This repository uses the prompt template from the DALL-E 3 technical report (Appendix C).
DALL-E 3 conducts training on a recaptioned dataset where the captions were regenerated to be much more detailed using GPT-4. It then demonstrates the effectiveness of using detailed prompts during inference. However, existing works (as noted in here) show that it's possible to improve the generation quality of existing systems like SDXL with detailed prompts even when they weren't particularly trained on similar datasets with very detailed captions.
It's important to investigate the output of the language model that's producing the descriptive captions. This directly impacts the quality of the images. As mentioned above, the prompt template is the original one used in the DALL-E 3 report. However, different language models might respond differently to that template. So, figuring out which template gives the best output most of the time is crucial.

Notes

The core idea of using detailed prompts to improve the quality of the generated samples has been explored before. Readers are welcome to check out the following resources in this regard:

"Better prompt engineering" section from this doc
lllyasviel/Fooocus

Additionally, PixArt-Alpha shows that fine-tuning on a dataset with highly detailed captions can lead to substantial quality improvements.

p1atdev/caption-upsampling