Missing documentation for encode for image embedding models
KennethEnevoldsen opened this issue · 2 comments
I can't seem to find the documentation for `encode` when encoding images:
```python
from sentence_transformers import SentenceTransformer
from PIL import Image

model = SentenceTransformer('clip-ViT-B-32')  # Load CLIP model
img_emb = model.encode(Image.open('two_dogs_in_snow.jpg'))  # no documentation for this step
```
I am asking for this because we want to make a compatible interface for image embeddings for mteb.
We are also working on the multimodal interface (e.g. for models like https://huggingface.co/TIGER-Lab/VLM2Vec-Full).
Hello!
Indeed, this is not documented very nicely because I'm considering deprecating the current CLIPModel module in favor of making the much more common Transformer module multimodal.
I did some experiments with this today, and I think there's potential. We would move towards `AutoProcessor` instead of `AutoTokenizer`. We can then feed the tokenizer/processor/feature extractor, etc., with whatever inputs the user has, and then feed that directly into the model.
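A minimal sketch of what that direction could look like, using plain `transformers` (the checkpoint name and call pattern here are illustrative assumptions, not the final SentenceTransformers design):

```python
from transformers import AutoProcessor, CLIPModel
from PIL import Image

# Illustrative checkpoint; any CLIP-style model with a processor would do.
processor = AutoProcessor.from_pretrained("openai/clip-vit-base-patch32")
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")

# The processor accepts whatever the user provides: text, images, or both.
inputs = processor(
    text=["two dogs in the snow"],
    images=Image.open("two_dogs_in_snow.jpg"),
    return_tensors="pt",
    padding=True,
)

outputs = model(**inputs)
text_emb = outputs.text_embeds    # pooled & projected text embeddings
image_emb = outputs.image_embeds  # pooled & projected image embeddings
```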
We do then have to be careful about what the model returns. For text-based models, we always grab the `last_hidden_state` and then do pooling in a separate Pooling module, but with multimodal systems (CLIP, CLAP) it seems to be more common to rely on the model's own pooling. This certainly simplifies things, as we otherwise have to feed multiple token/patch embeddings to the pooler, sometimes even with different dimensionalities.
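For contrast, a rough sketch of the text-only path (the model name is just an example): the Transformer module returns per-token states, and pooling happens outside the model.

```python
from transformers import AutoTokenizer, AutoModel

# Text-only model: one embedding per token comes back,
# and pooling is done afterwards (mean pooling shown here).
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
model = AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

enc = tokenizer(["two dogs in the snow"], return_tensors="pt")
token_states = model(**enc).last_hidden_state              # (batch, tokens, hidden)

mask = enc["attention_mask"].unsqueeze(-1).float()          # (batch, tokens, 1)
sentence_emb = (token_states * mask).sum(1) / mask.sum(1)   # mean pooling -> (batch, hidden)
```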
I have to be quite wary, as I rely fully on `transformers` here.
Either way, the interface will always remain the same regardless of how it's implemented behind the scenes, and your snippet is correct: you can pass `PIL.Image` instances to `model.encode`.
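For reference, a small end-to-end example of that interface (the example texts are placeholders; `util.cos_sim` is sentence-transformers' cosine-similarity helper):

```python
from sentence_transformers import SentenceTransformer, util
from PIL import Image

model = SentenceTransformer('clip-ViT-B-32')

# Images and texts go through the same encode() call.
img_emb = model.encode(Image.open('two_dogs_in_snow.jpg'))
text_emb = model.encode(['Two dogs in the snow', 'A cat on a table', 'London at night'])

# Cosine similarity between the image and each text.
print(util.cos_sim(img_emb, text_emb))
```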
- Tom Aarsen