Missing documentation for encode for image embedding models
KennethEnevoldsen opened this issue · 2 comments
I can't seem to find the documentation for `encode` when encoding images:
```python
from sentence_transformers import SentenceTransformer
from PIL import Image

model = SentenceTransformer('clip-ViT-B-32')  # Load CLIP model
img_emb = model.encode(Image.open('two_dogs_in_snow.jpg'))  # no documentation for this step
```
I am asking for this because we want to make a compatible interface for image embeddings for mteb.
We are also working on the multimodal interface (e.g. for models like https://huggingface.co/TIGER-Lab/VLM2Vec-Full).
Hello!
Indeed, this is not documented very nicely because I'm considering deprecating the current CLIPModel module in favor of making the much more common Transformer module multimodal.
I did some experiments with this today, and I think there's potential. We would move towards `AutoProcessor` instead of `AutoTokenizer`. We can then feed the tokenizer/processor/feature extractor, etc., with whatever inputs the user has, and then feed that directly into the model.
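A minimal sketch of what that direction could look like, using plain `transformers` (the checkpoint name and call pattern here are illustrative assumptions, not the final SentenceTransformers design):

```python
from transformers import AutoProcessor, CLIPModel
from PIL import Image

# Illustrative checkpoint; any CLIP-style model with a processor would do.
processor = AutoProcessor.from_pretrained("openai/clip-vit-base-patch32")
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")

# The processor accepts whatever the user provides: text, images, or both.
inputs = processor(
    text=["two dogs in the snow"],
    images=Image.open("two_dogs_in_snow.jpg"),
    return_tensors="pt",
    padding=True,
)

outputs = model(**inputs)
text_emb = outputs.text_embeds    # pooled & projected text embeddings
image_emb = outputs.image_embeds  # pooled & projected image embeddings
```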
We do then have to be careful about what the model returns. For text-based models, we always grab the `last_hidden_state` and then do pooling in a separate Pooling module, but with multimodal systems (CLIP, CLAP) it seems to be more common to rely on the model's own pooling. This certainly simplifies things, as we otherwise have to feed multiple token/patch embeddings to the pooler, sometimes even with different dimensionalities.
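For contrast, a rough sketch of the text-only path (the model name is just an example): the Transformer module returns per-token states, and pooling happens outside the model.

```python
from transformers import AutoTokenizer, AutoModel

# Text-only model: one embedding per token comes back,
# and pooling is done afterwards (mean pooling shown here).
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
model = AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

enc = tokenizer(["two dogs in the snow"], return_tensors="pt")
token_states = model(**enc).last_hidden_state              # (batch, tokens, hidden)

mask = enc["attention_mask"].unsqueeze(-1).float()          # (batch, tokens, 1)
sentence_emb = (token_states * mask).sum(1) / mask.sum(1)   # mean pooling -> (batch, hidden)
```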
I have to be quite wary, as I rely fully on `transformers` here.
Either way, the interface will always remain the same regardless of how it's implemented behind the scenes, and your snippet is correct: you can pass `PIL.Image` instances to `model.encode`.
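For reference, a small end-to-end example of that interface (the example texts are placeholders; `util.cos_sim` is sentence-transformers' cosine-similarity helper):

```python
from sentence_transformers import SentenceTransformer, util
from PIL import Image

model = SentenceTransformer('clip-ViT-B-32')

# Images and texts go through the same encode() call.
img_emb = model.encode(Image.open('two_dogs_in_snow.jpg'))
text_emb = model.encode(['Two dogs in the snow', 'A cat on a table', 'London at night'])

# Cosine similarity between the image and each text.
print(util.cos_sim(img_emb, text_emb))
```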
- Tom Aarsen