by @nousr
Invert CLIP text embeds to image embeds and visualize them with Deep Image Prior
Example prompt: "An oil painting of mountains, in the style of monet"
The following command will download all of the weights and run a prediction with your inputs inside a Docker container.
cog predict r8.im/laion-ai/deep-image-diffusion-prior \
-i prompt=... \
-i offset_type=... \
-i num_scales=... \
-i input_noise_strength=... \
-i lr=... \
-i offset_lr_fac=... \
-i lr_decay=... \
-i param_noise_strength=... \
-i display_freq=... \
-i iterations=... \
-i num_samples_per_batch=... \
-i num_cutouts=... \
-i guidance_scale=... \
-i seed=...
Alternatively, you can use the Jupyter notebook.
Thanks to:
- LAION for support, resources, and community
- @RiversHaveWings for making me aware of this technique
- Stability AI for compute which makes these models possible
- lucidrains for spearheading the open-source replication of DALLE 2
See the world "through CLIP's eyes" by taking advantage of the diffusion prior, as replicated by LAION, to invert CLIP "ViT-L/14" text embeds to image embeds (as in unCLIP/DALLE2). Afterwards, a CLIP-guided deep-image-prior process developed by Katherine Crowson is run to visualize the features in CLIP's weights corresponding to the activations from your prompt.
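To make the two-stage idea concrete, here is a minimal sketch in PyTorch; it is illustrative, not this repo's actual code. It assumes OpenAI's clip package for the ViT-L/14 encoder, uses only a commented placeholder for LAION's diffusion prior (whose real loading and sampling interface is not shown), and swaps in a toy generator for the deep-image-prior network.

# Stage 1: a diffusion prior maps a CLIP text embed to a plausible image embed.
# Stage 2: a deep-image-prior generator is optimized so that CLIP "sees" that
# image embed in the generator's output. Everything here is a hedged sketch.
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, _ = clip.load("ViT-L/14", device=device)
clip_model.eval()
for p in clip_model.parameters():
    p.requires_grad_(False)  # CLIP stays frozen; only the generator is trained

prompt = "An oil painting of mountains, in the style of monet"
tokens = clip.tokenize([prompt]).to(device)
with torch.no_grad():
    text_embed = clip_model.encode_text(tokens).float()
    text_embed = text_embed / text_embed.norm(dim=-1, keepdim=True)

# Stage 1 placeholder: `prior.sample(...)` stands in for LAION's replicated
# diffusion prior (its interface is an assumption, so it is left commented out
# and the text embed is reused to keep the sketch runnable end to end).
# image_embed = prior.sample(text_embed)
image_embed = text_embed

# Stage 2: deep image prior. A small conv net is fit from a fixed noise input so
# that CLIP's embedding of its output matches the target image embed.
generator = torch.nn.Sequential(
    torch.nn.Conv2d(32, 64, 3, padding=1), torch.nn.ReLU(),
    torch.nn.Conv2d(64, 3, 3, padding=1), torch.nn.Sigmoid(),
).to(device)
noise = torch.randn(1, 32, 224, 224, device=device)  # fixed input, never optimized
opt = torch.optim.Adam(generator.parameters(), lr=1e-3)

for step in range(300):
    image = generator(noise)                           # candidate image in [0, 1]
    img_embed = clip_model.encode_image(image).float()
    img_embed = img_embed / img_embed.norm(dim=-1, keepdim=True)
    loss = (1 - (img_embed * image_embed).sum(dim=-1)).mean()  # cosine distance
    opt.zero_grad()
    loss.backward()
    opt.step()
# generator(noise) is now an image that CLIP scores as close to the target embed.

In the actual predictor, inputs such as lr, lr_decay, num_cutouts, and param_noise_strength presumably control this second stage, while guidance_scale and num_samples_per_batch apply to sampling from the prior.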
Just to avoid any confusion: this research is a recreation of (one part of) OpenAI's DALLE2 paper. It is not "DALLE2", the product/service from OpenAI you may have seen on the web.
These visualizations can be quite abstract compared to those of other text-to-image models, but that abstraction often gives them a dream-like quality. Many outputs are artistically fantastic as a result; whether they match your prompt as reliably is another matter.