orpatashnik/StyleCLIP

Training test case question

jimb2834 opened this issue · 5 comments

Hello @orpatashnik - Great work!

I wanted to run a test and create a new StyleGAN2 model with CLIP: simply add "captions" like male and female to the images, and learn how it works.

Example:

  • image001.png
  • image001.txt <-- inside is the keyword male or female

What quantity of images do you suggest?
And if I used only 10,000, would that be enough to locate the latent space, or at least get some idea of how it works?

Thanks!

Hi @jimb2834 ,
Thanks for your interest in our work :)
Could you please clarify what you mean by creating a new StyleGAN model with CLIP?

Hi @orpatashnik - Yes your work is great

Yes. Here is a little background: in the past, I have experimented with and created new GAN, SG, SG-ada, and SG3 models for various use cases, such as medical imaging. Recently I became interested in CLIP and how to train models with a "keyword", or, as in diffusion, with "captions". So, simply put, I gathered a training set from FFHQ, labeled the images "male" and "female", and want to use your training script to train a StyleCLIP model.

Thanks for reading this also

Hi @jimb2834 ,

StyleCLIP is a method that employs CLIP and StyleGAN for editing; we didn't fine-tune StyleGAN or change its architecture.
So, given an image, you can change its attributes using text.

If you are only interested in controlling gender, a possible solution is to use one of StyleCLIP's methods (the latent mapper or global directions). You can sample a random latent code and shift it towards the target gender, either with a global direction or with a trained mapper. For both of these methods you don't need labeled data, since CLIP provides the guidance.
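The shift described above is just a linear move in latent space. Here is a toy, self-contained sketch of the idea; the function name, the 4-dimensional latent, and the direction values are all illustrative (real StyleGAN latents are 512-dimensional and the direction comes from CLIP-guided analysis):

```python
# Toy sketch of a "global directions"-style edit: a latent code w is
# shifted along an edit direction delta, w_edit = w + alpha * delta.
# All names and values here are illustrative, not from the StyleCLIP code.

def apply_global_direction(w, delta, alpha):
    """Shift latent code w by strength alpha along edit direction delta."""
    assert len(w) == len(delta)
    return [wi + alpha * di for wi, di in zip(w, delta)]

# toy 4-dim latent and a hypothetical "gender" direction
w = [0.1, -0.2, 0.3, 0.0]
delta = [1.0, 0.0, -1.0, 0.5]
w_a = apply_global_direction(w, delta, alpha=2.0)   # push one way
w_b = apply_global_direction(w, delta, alpha=-2.0)  # push the other way
print(w_a)
```

The sign and magnitude of `alpha` control the direction and strength of the edit; the edited code is then fed back through the (frozen) StyleGAN generator.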

If you are interested in training your own GAN, you could take a look here: https://github.com/JiauZhang/GigaGAN. This is a new GAN architecture that can be conditioned on text.

Hi @orpatashnik - Thank you again for the replies;

I see, but what baffles me is this, and perhaps you could advise me on where my logic is flawed.

  • CLIP is a neural network trained on a variety of (image, text) pairs, i.e. each image must be paired with text.
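What CLIP learns from those pairs is a shared embedding space where matching image and text embeddings have high cosine similarity. This toy sketch shows the matching step only; the embeddings are made up, and real CLIP embeddings are 512-dimensional vectors produced by the trained encoders:

```python
# Toy illustration of how CLIP-style matching works: score an image
# embedding against candidate text embeddings by cosine similarity.
# The vectors below are fabricated for illustration.
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

image_emb = [0.9, 0.1, 0.0]                       # toy image embedding
text_embs = {"male": [1.0, 0.0, 0.0],             # toy text embeddings
             "female": [0.0, 1.0, 0.0]}
best = max(text_embs, key=lambda t: cosine_similarity(image_emb, text_embs[t]))
print(best)  # "male"
```

This is why StyleCLIP needs no labels: the pretrained CLIP encoders already provide the image-text similarity signal.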


In my eventual use case, we cannot use others' terminology, aka "text", within our future model, so we need to train a new one matching our images with our own "text". This then gives us the future ability to use the text "embedding" to isolate a feature.

My original "man/woman" example may be confusing, since it seems quite trivial and already solved, but my goal is the contrary: I need to train specific images with NEW terminology ("text"). I cannot seem to find a working example where someone has trained a GAN model with keywords/text, other than in diffusion models using BLIP/WD14, etc.

Does this make sense?

Two GANs I am aware of that use text as input are:

  1. GigaGAN - https://mingukkang.github.io/GigaGAN/
  2. StyleGAN-T - https://sites.google.com/view/stylegan-t/

Both of them train a GAN on a dataset consisting of (text, image) pairs. Neither has an official implementation, but you can find some unofficial ones.
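The (text, image) training setup these models use can be caricatured in a few lines. This is only a toy sketch of the data flow, not either model's actual architecture: real text-conditioned GANs use learned text encoders and per-layer modulation rather than simple concatenation, and every name below is illustrative:

```python
# Toy sketch of feeding a text-conditioned GAN generator: each training
# step pairs a noise vector with the caption's embedding. List
# concatenation here stands in for something like torch.cat.

def conditioned_input(z, text_emb):
    """Form the generator input from noise z and a text embedding."""
    return z + text_emb

z = [0.5, -0.5]               # toy noise vector
text_emb = [1.0, 0.0, 0.0]    # toy embedding for the caption "male"
g_in = conditioned_input(z, text_emb)
print(len(g_in))  # 5
```

The discriminator is conditioned the same way, so the pair (caption, image) is what gets judged real or fake; that is the sense in which these models "train a GAN with text".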