by @nousr
Invert CLIP text embeds to image embeds and visualize them with Deep Image Prior
Example prompt: "An oil painting of mountains, in the style of monet"
The following command will download all of the weights and run a prediction with your inputs inside a Docker container.
cog predict r8.im/laion-ai/deep-image-diffusion-prior \
-i prompt=... \
-i offset_type=... \
-i num_scales=... \
-i input_noise_strength=... \
-i lr=... \
-i offset_lr_fac=... \
-i lr_decay=... \
-i param_noise_strength=... \
-i display_freq=... \
-i iterations=... \
-i num_samples_per_batch=... \
-i num_cutouts=... \
-i guidance_scale=... \
-i seed=...
Alternatively, you can use the Jupyter notebook.
Thanks to:
- LAION for support, resources, and community
- @RiversHaveWings for making me aware of this technique
- Stability AI for compute which makes these models possible
- lucidrains for spearheading the open-source replication of DALLE 2
See the world "through CLIP's eyes" by taking advantage of the diffusion prior, as replicated by LAION, to invert CLIP "ViT-L/14" text embeds to image embeds (as in unCLIP/DALLE2). Afterwards, a CLIP-guided deep-image-prior process developed by Katherine Crowson is run to visualize the features in CLIP's weights corresponding to the activations from your prompt.
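To make the two-stage idea concrete, here is a minimal sketch in PyTorch; it is illustrative, not this repo's actual code. It assumes OpenAI's clip package for the ViT-L/14 encoder, uses only a commented placeholder for LAION's diffusion prior (whose real loading and sampling interface is not shown), and swaps in a toy generator for the deep-image-prior network.

# Stage 1: a diffusion prior maps a CLIP text embed to a plausible image embed.
# Stage 2: a deep-image-prior generator is optimized so that CLIP "sees" that
# image embed in the generator's output. Everything here is a hedged sketch.
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, _ = clip.load("ViT-L/14", device=device)
clip_model.eval()
for p in clip_model.parameters():
    p.requires_grad_(False)  # CLIP stays frozen; only the generator is trained

prompt = "An oil painting of mountains, in the style of monet"
tokens = clip.tokenize([prompt]).to(device)
with torch.no_grad():
    text_embed = clip_model.encode_text(tokens).float()
    text_embed = text_embed / text_embed.norm(dim=-1, keepdim=True)

# Stage 1 placeholder: `prior.sample(...)` stands in for LAION's replicated
# diffusion prior (its interface is an assumption, so it is left commented out
# and the text embed is reused to keep the sketch runnable end to end).
# image_embed = prior.sample(text_embed)
image_embed = text_embed

# Stage 2: deep image prior. A small conv net is fit from a fixed noise input so
# that CLIP's embedding of its output matches the target image embed.
generator = torch.nn.Sequential(
    torch.nn.Conv2d(32, 64, 3, padding=1), torch.nn.ReLU(),
    torch.nn.Conv2d(64, 3, 3, padding=1), torch.nn.Sigmoid(),
).to(device)
noise = torch.randn(1, 32, 224, 224, device=device)  # fixed input, never optimized
opt = torch.optim.Adam(generator.parameters(), lr=1e-3)

for step in range(300):
    image = generator(noise)                           # candidate image in [0, 1]
    img_embed = clip_model.encode_image(image).float()
    img_embed = img_embed / img_embed.norm(dim=-1, keepdim=True)
    loss = (1 - (img_embed * image_embed).sum(dim=-1)).mean()  # cosine distance
    opt.zero_grad()
    loss.backward()
    opt.step()
# generator(noise) is now an image that CLIP scores as close to the target embed.

In the actual predictor, inputs such as lr, lr_decay, num_cutouts, and param_noise_strength presumably control this second stage, while guidance_scale and num_samples_per_batch apply to sampling from the prior.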
Just to avoid any confusion: this research is a recreation of (one part of) OpenAI's DALLE2 paper. It is not "DALLE2", the product/service from OpenAI you may have seen on the web.
These visualizations can be quite abstract compared to those of other text-to-image models, but that abstraction often gives them a dream-like quality. Many outputs are artistically fantastic as a result; whether they match your prompt as reliably is another matter.