Confused with image_features minus text_features
CharlesGong12 opened this issue · 2 comments
Hi, thanks for your amazing work!
I am confused by the subtraction operation image_features minus text_features. The image features are encoded by CLIPVisionModelWithProjection, but the text features are encoded by CLIPTextModel, which has no projection layer. So why can we directly compute image_features minus text_features? It seems that the image features and text features are not in the same space.
The SD pipeline's encode_prompt is here: https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion.py#L302
Good eye! Thanks for your feedback! @CharlesGong12
To clarify: for the SDXL model, the second text encoder is CLIPTextModelWithProjection, and text_features is the pooled feature from that second encoder only. For the SD1.5 model, the text encoder is indeed a CLIPTextModel, so in our inference code its text_features is extracted manually as here.
Hope this helps. Please let me know if you have further questions.
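For anyone reading later, here is a minimal sketch of why the projection matters. It is not the repository's inference code: it builds tiny, randomly initialized CLIP encoders from configs (all sizes and token ids below are made up for illustration) and only checks shapes. The pooled output of a plain CLIPTextModel has the text hidden size, while the WithProjection variants both emit vectors of size projection_dim, which is what makes image_features minus text_features well defined.

```python
import torch
from transformers import (
    CLIPTextConfig,
    CLIPTextModel,
    CLIPTextModelWithProjection,
    CLIPVisionConfig,
    CLIPVisionModelWithProjection,
)

# Hypothetical toy sizes, chosen only so the example runs quickly.
PROJECTION_DIM = 16
TEXT_HIDDEN = 32
VISION_HIDDEN = 48

text_config = CLIPTextConfig(
    vocab_size=100,
    hidden_size=TEXT_HIDDEN,
    intermediate_size=64,
    num_hidden_layers=1,
    num_attention_heads=2,
    max_position_embeddings=8,
    projection_dim=PROJECTION_DIM,
    pad_token_id=0,
    bos_token_id=98,
    eos_token_id=99,
)
vision_config = CLIPVisionConfig(
    hidden_size=VISION_HIDDEN,
    intermediate_size=96,
    num_hidden_layers=1,
    num_attention_heads=2,
    image_size=32,
    patch_size=16,
    projection_dim=PROJECTION_DIM,
)

# Randomly initialized models: no pretrained weights are downloaded.
plain_text_encoder = CLIPTextModel(text_config).eval()
text_encoder = CLIPTextModelWithProjection(text_config).eval()
image_encoder = CLIPVisionModelWithProjection(vision_config).eval()

# Fake inputs: one token sequence (bos ... eos, padded) and one random image.
input_ids = torch.tensor([[98, 5, 6, 7, 99, 0, 0, 0]])
pixel_values = torch.randn(1, 3, 32, 32)

with torch.no_grad():
    # Pooled output without projection: lives in the text hidden space.
    unprojected = plain_text_encoder(input_ids=input_ids).pooler_output
    # Projected embeddings: both live in the shared projection space.
    text_features = text_encoder(input_ids=input_ids).text_embeds
    image_features = image_encoder(pixel_values=pixel_values).image_embeds

difference = image_features - text_features

print("unprojected text:", tuple(unprojected.shape))
print("text_features:   ", tuple(text_features.shape))
print("image_features:  ", tuple(image_features.shape))
print("difference:      ", tuple(difference.shape))
```

With real checkpoints the two projected encoders would also need to come from the same CLIP model so that their projection spaces match; matching dimensions alone only makes the subtraction shape-compatible, not meaningful.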