StyleGAN-NADA: CLIP-Guided Domain Adaptation of Image Generators

Key points of the architecture

StyleGAN-NADA adapts a StyleGAN2 generator to a new domain. It does so by minimizing the directional CLIP loss, in which E_T and E_I are the text and image encoders provided by the CLIP model, G_train is the adapted generator that StyleGAN-NADA produces, and G_frozen is a copy of the original generator that is kept without training.
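
For reference, the directional CLIP loss can be written as below, following the formulation in the original paper. The symbols t_source and t_target (the source and target text prompts) and w (a latent code fed to both generators) are not defined in this README and are introduced here only for illustration.

```latex
\[
\begin{aligned}
\Delta T &= E_T(t_{\text{target}}) - E_T(t_{\text{source}}) \\
\Delta I &= E_I\big(G_{\text{train}}(w)\big) - E_I\big(G_{\text{frozen}}(w)\big) \\
\mathcal{L}_{\text{direction}} &= 1 - \frac{\Delta I \cdot \Delta T}{\lVert \Delta I \rVert \, \lVert \Delta T \rVert}
\end{aligned}
\]
```

The loss is smallest when the change in the image embedding points in the same direction as the change in the text embedding.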

Conceptually, the method computes a direction in CLIP space from the source and target text prompts and shifts the generator's output along that direction in CLIP space.
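
The idea can be illustrated with a minimal sketch. It is not the repository's implementation: plain lists of floats stand in for CLIP embeddings, the function name `directional_clip_loss` is hypothetical, and returning 1.0 when a direction is undefined is an assumption made for this example.

```python
import math

def directional_clip_loss(img_train, img_frozen, txt_target, txt_source):
    """Return 1 - cosine similarity between the image shift and the text shift.

    All arguments are lists of floats standing in for CLIP embeddings.
    """
    # Direction the image embedding moved between G_frozen and G_train
    delta_i = [a - b for a, b in zip(img_train, img_frozen)]
    # Direction from the source prompt to the target prompt
    delta_t = [a - b for a, b in zip(txt_target, txt_source)]

    dot = sum(a * b for a, b in zip(delta_i, delta_t))
    norm_i = math.sqrt(sum(a * a for a in delta_i))
    norm_t = math.sqrt(sum(a * a for a in delta_t))

    if norm_i == 0.0 or norm_t == 0.0:
        # Assumption: an undefined direction is treated as zero similarity
        return 1.0
    return 1.0 - dot / (norm_i * norm_t)
```

A value near 0 means the images moved in the direction described by the prompts, while a value near 2 means they moved the opposite way.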

Not all layers of the generator are trained. Only the subset of layers estimated to have the most influence on the output is trained, and the remaining layers stay frozen. This is called adaptive layer freezing.
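
The selection step can be sketched as a top-k choice over per-layer scores. This is only an illustration: the function name `select_trainable_layers` and the `relevance` list are hypothetical, and the sketch assumes a relevance score per layer is already available, however it is computed.

```python
def select_trainable_layers(relevance, k):
    """Return the indices of the k layers with the highest relevance score.

    relevance: one score per generator layer (hypothetical input).
    The result is sorted in ascending layer order.
    """
    # Rank layer indices from most to least relevant
    ranked = sorted(range(len(relevance)), key=lambda i: relevance[i], reverse=True)
    # Keep the top k; every other layer would stay frozen
    return sorted(ranked[:k])
```

For example, with scores `[0.1, 0.9, 0.3, 0.7]` and `k = 2`, layers 1 and 3 would be trained and layers 0 and 2 would stay frozen.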

For more details, the original paper is available here.

Run and train the network

A publicly accessible Colab notebook for training and running the adapted network is available here. It lets you select a model to adapt, enter the source and target domains, train the network, and use it to generate an arbitrary number of images.

Experiments and comparison

Some details of the implementation were changed. Here we present some results and a comparison with the original model.

Additional work

The adaptive layer freezing approach was made scalable. Instead of computing the best k layers to train at every iteration, the selection is done only every auto_layer_interval iterations. In addition, every auto_layer_falloff iterations the number of trained layers decreases, allowing for finer tuning.
Global loss was reintroduced. The loss is now computed as a weighted sum of the directional and global CLIP losses. The weighting can be adjusted via a slider in the Colab.
The original paper uses a set of prompts generated from templates, starting from the inserted prompts. I experimented without this feature and found no major changes. The feature is disabled by default, but it is still possible to use templates.
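
The first two changes above can be sketched as follows. This is an illustration under stated assumptions, not the code used in the Colab: only `auto_layer_interval` and `auto_layer_falloff` come from this README, while the function names, the `min_k` floor, the decrement of one layer per falloff period, and the single `directional_weight` in [0, 1] are hypothetical.

```python
def should_reselect(iteration, auto_layer_interval):
    """True when the set of trainable layers is recomputed at this iteration."""
    return iteration % auto_layer_interval == 0

def layers_to_train(iteration, initial_k, auto_layer_falloff, min_k=1):
    """Number of layers trained at this iteration.

    Assumption: k drops by one every auto_layer_falloff iterations
    and never goes below min_k.
    """
    return max(min_k, initial_k - iteration // auto_layer_falloff)

def combined_clip_loss(directional, global_loss, directional_weight):
    """Weighted sum of the directional and global CLIP losses.

    Assumption: a single slider value in [0, 1] splits the weight
    between the two terms.
    """
    return directional_weight * directional + (1.0 - directional_weight) * global_loss
```

With `initial_k = 4` and `auto_layer_falloff = 10`, four layers are trained during iterations 0 to 9, three during iterations 10 to 19, and so on down to `min_k`.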