BrandonHanx/mmf

Minimal working example


Thank you for the great work and for releasing the FashionViL model.

I would like to use the model in an image-to-text retrieval setting, but I haven't been able to extract features from texts and images.
Could you please provide a minimal working example that takes a text and an image as inputs and returns their features?

In other words, I'm asking for a snippet similar to the following one, but using FashionViL instead of CLIP:

import clip
import PIL.Image

model, preprocess = clip.load('RN50')
image = preprocess(PIL.Image.open('dog.jpg')).unsqueeze(0)
text = 'a photo of a dog'
tokenized_text = clip.tokenize(text)

image_features = model.encode_image(image)
text_features = model.encode_text(tokenized_text)

Thanks again for the amazing work

Hi, sorry for the late reply.

I think the output_dict returned by the forward function below is what you need.

def _forward(self, sample_list: Dict[str, Tensor]) -> Dict[str, Tensor]:
    visual_embeddings, _, _ = self.bert.get_image_embedding(
        sample_list["image"],
        sample_list["visual_embeddings_type"],
    )
    visual_embeddings = visual_embeddings.mean(dim=1)
    visual_embeddings = self.norm_layer(visual_embeddings)

    text_embeddings, _, _ = self.bert.get_text_embedding(
        sample_list["input_ids"],
        sample_list["segment_ids"],
        sample_list["input_mask"],
    )
    # text_embeddings = text_embeddings[:, 0]
    masks = sample_list["input_mask"]
    text_embeddings = text_embeddings * masks.unsqueeze(2)
    text_embeddings = torch.sum(text_embeddings, dim=1) / (
        torch.sum(masks, dim=1, keepdim=True)
    )
    text_embeddings = self.norm_layer(text_embeddings)

    output_dict = {
        "scores": visual_embeddings,
        "targets": text_embeddings,
    }
    return output_dict
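
For completeness, here is a rough sketch of how one might prepare the inputs and call this method to get the two embeddings. This is not an official API of the repo: the image preprocessing (resolution, normalization stats), the tokenizer settings, and the shape assumed for "visual_embeddings_type" are guesses and should be checked against the FashionViL configs and dataset processors; `model` is assumed to be an already-loaded FashionViL contrastive model exposing the _forward shown above.

import torch
from PIL import Image
from torchvision import transforms
from transformers import BertTokenizer

# Assumed preprocessing; check the repo's image transforms for the real values.
preprocess = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
])
image = preprocess(Image.open("dog.jpg")).unsqueeze(0)  # (1, 3, H, W)

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
tokens = tokenizer(
    "a photo of a dog",
    return_tensors="pt",
    padding="max_length",
    max_length=75,        # assumed max text length
    truncation=True,
)

sample_list = {
    "image": image,
    # Assumed: zeros, analogous to BERT token type ids; the expected shape
    # (e.g. one entry per visual token) may differ, see the dataset processors.
    "visual_embeddings_type": torch.zeros(1, dtype=torch.long),
    "input_ids": tokens["input_ids"],
    "segment_ids": tokens["token_type_ids"],
    "input_mask": tokens["attention_mask"],
}

with torch.no_grad():
    out = model._forward(sample_list)  # `model` must be built/loaded beforehand
image_features = out["scores"]   # pooled, normalized visual embedding
text_features = out["targets"]   # pooled, normalized text embedding

Since the _forward shown above only indexes sample_list by key, a plain dict with those five entries is enough for this sketch; inside MMF you would normally pass a SampleList built by the dataset pipeline instead.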

I assume this issue has been solved. Closed.