ShihaoZhaoZSH/LaVi-Bridge

Use CLIP as text encoder

Espere-1119-Song opened this issue · 2 comments

Thanks for your great contribution to the community.

I noticed that the paper includes an experiment using CLIP as the text encoder, but I couldn't find the corresponding code. Will you release the CLIP version? I'm also wondering how to handle the linear layers inside the attention blocks of the CLIP text encoder, since they appear to be NonDynamicallyQuantizableLinear rather than plain nn.Linear.
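For reference, this is what I mean (a minimal check, assuming an implementation like the original OpenAI CLIP code, which builds its attention on torch.nn.MultiheadAttention):

```python
import torch.nn as nn

# torch.nn.MultiheadAttention exposes its output projection as
# NonDynamicallyQuantizableLinear rather than a plain nn.Linear module.
attn = nn.MultiheadAttention(embed_dim=768, num_heads=12)
print(type(attn.out_proj))
# <class 'torch.nn.modules.linear.NonDynamicallyQuantizableLinear'>

# It does still subclass nn.Linear, so isinstance checks pass.
print(isinstance(attn.out_proj, nn.Linear))  # True
```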

Thank you for your interest in LaVi-Bridge! We will schedule the release of the code for the CLIP text encoder. In the meantime, you can refer to test/t5_unet.py. The main difference is switching the text encoder and tokenizer from transformers.T5EncoderModel and AutoTokenizer to transformers.CLIPTextModel and CLIPTokenizer. The pre-trained weights come from the "CompVis/stable-diffusion-v1-4" repository on Hugging Face. You can also refer to the standard Stable Diffusion 1.4 pipeline, which uses CLIP as its language model.
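A minimal sketch of that swap (the subfolder names assume the standard diffusers repository layout of "CompVis/stable-diffusion-v1-4"; the exact wiring into test/t5_unet.py will differ):

```python
from transformers import CLIPTextModel, CLIPTokenizer

# Load the CLIP tokenizer and text encoder shipped with Stable Diffusion 1.4.
pretrained = "CompVis/stable-diffusion-v1-4"
tokenizer = CLIPTokenizer.from_pretrained(pretrained, subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained(pretrained, subfolder="text_encoder")

prompt = "a photograph of an astronaut riding a horse"
tokens = tokenizer(
    prompt,
    padding="max_length",
    max_length=tokenizer.model_max_length,
    truncation=True,
    return_tensors="pt",
)

# last_hidden_state takes the place of the T5 encoder output in the pipeline.
text_embeddings = text_encoder(tokens.input_ids)[0]
print(text_embeddings.shape)  # (1, 77, 768) for SD 1.4's CLIP ViT-L/14 text encoder
```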

Thanks a lot for your help! I will follow the instructions you provided, and I really look forward to the release of the CLIP text encoder version :)