Confused with image_features minus text_features

Question

CharlesGong12 opened this issue 5 months ago · 2 comments

Hi, thanks for your amazing work!
I am confused by the subtraction image_features minus text_features. The image features are encoded by CLIPVisionModelWithProjection, but the text features are encoded by CLIPTextModel, which has no projection layer. So why can we directly compute image_features minus text_features? It seems that the image features and text features are not in the same space.

Answer 1 · 2024-06-23T06:24:36.000Z

The SD pipeline's encode_prompt is here: https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion.py#L302

Answer 2 · 2024-06-24T16:11:56.000Z

Good eye! Thanks for your feedback! @CharlesGong12

To clarify: for the SDXL model, the second text encoder is CLIPTextModelWithProjection, and the pooled feature used as text_features comes only from that second encoder. For the SD1.5 model, the text encoder is indeed a CLIPTextModel, so in our inference code its text_feature is extracted manually, as here.
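To make the "same space" point concrete, here is a minimal, dependency-free sketch. It is not the repository's inference code: the widths, the random weights, and the helper names (`project`, `subtract`, `random_matrix`) are all assumptions made up for illustration. It only shows the idea that each tower's pooled output goes through its own bias-free linear projection into a shared dimension, and that the subtraction is done on the projected vectors.

```python
import random

# Toy widths chosen only for illustration (NOT the real CLIP sizes).
TEXT_WIDTH, VISION_WIDTH, SHARED_DIM = 6, 8, 4


def random_matrix(rows, cols, rng):
    """Hypothetical stand-in for a learned projection weight (rows x cols)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def project(weight, vec):
    """Bias-free linear projection: weight (out x in) applied to vec (in)."""
    if any(len(row) != len(vec) for row in weight):
        raise ValueError("projection input width does not match feature width")
    return [sum(w * x for w, x in zip(row, vec)) for row in weight]


def subtract(a, b):
    """Element-wise a - b; only defined for vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("features have different dimensions")
    return [x - y for x, y in zip(a, b)]


rng = random.Random(0)

# Stand-ins for the pooled outputs of the two towers (random, not real features).
text_pooled = [rng.gauss(0.0, 1.0) for _ in range(TEXT_WIDTH)]
vision_pooled = [rng.gauss(0.0, 1.0) for _ in range(VISION_WIDTH)]

# Each tower has its own projection into the shared embedding space.
text_projection = random_matrix(SHARED_DIM, TEXT_WIDTH, rng)
visual_projection = random_matrix(SHARED_DIM, VISION_WIDTH, rng)

text_features = project(text_projection, text_pooled)
image_features = project(visual_projection, vision_pooled)

# The subtraction is performed on projected features of equal dimension.
delta = subtract(image_features, text_features)
print(len(text_features), len(image_features), len(delta))
```

In this toy setup, subtracting the unprojected vectors raises an error because the widths differ. Note that matching widths alone would not make two unprojected features comparable; the projection is what maps both into the shared space.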

Hope this helps. Please let me know if you have further questions.