Minimal working example
Thank you for the amazing work and for releasing the FashionViL model.
I would like to use this model in an image-to-text retrieval setting, but I haven't been able to work out how to extract the features from texts and images.
Could you please provide a minimal working example that takes a text and an image as inputs and returns their features?
In other words, I'm asking for a snippet similar to the following one (but using FashionViL instead of CLIP):
import clip
import torch
from PIL import Image

model, preprocess = clip.load('RN50')
image = preprocess(Image.open('dog.jpg')).unsqueeze(0)
text = 'a photo of a dog'
tokenized_text = clip.tokenize(text)
with torch.no_grad():  # inference only
    image_features = model.encode_image(image)  # note: model, not clip
    text_features = model.encode_text(tokenized_text)
Thanks again for the amazing work!
Hi, sorry for the late reply.
I think the output_dict returned by the forward function in mmf/mmf/models/fashionvil/contrastive.py (lines 32 to 57 at commit d63a31f) is what you need.
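For anyone landing here later, below is a rough, untested sketch of what calling that forward could look like. It assumes the FashionViL contrastive model has already been built and its checkpoint loaded through MMF's config machinery (not shown), and the input field names, image size, tokenizer settings, and output_dict keys are all guesses to be checked against the linked forward and the model config.

# Rough sketch (untested). Assumes `model` is a FashionViL contrastive
# model already built and loaded from a checkpoint via MMF's config
# machinery (not shown here). Field names, image size, and output_dict
# keys are assumptions; check the linked forward and the model config.
import torch
from PIL import Image
from torchvision import transforms
from transformers import BertTokenizer
from mmf.common.sample import Sample, SampleList

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Assumed preprocessing; the actual transforms live in the FashionViL config.
preprocess = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

sample = Sample()
sample.image = preprocess(Image.open('dog.jpg'))

encoded = tokenizer('a photo of a dog', padding='max_length',
                    max_length=75, truncation=True, return_tensors='pt')
sample.input_ids = encoded['input_ids'].squeeze(0)
sample.input_mask = encoded['attention_mask'].squeeze(0)
sample.segment_ids = encoded['token_type_ids'].squeeze(0)

with torch.no_grad():
    output_dict = model(SampleList([sample]))

# Hypothetical key names: inspect output_dict.keys() to find the
# normalized image/text embeddings returned by the contrastive forward.
image_features = output_dict['scores']
text_features = output_dict['targets']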
I assume this issue has been solved. Closed.