BrandonHanx/mmf

Minimal working example


Thank you for the great work and for releasing the FashionViL model.

I would like to use the model in an image-to-text retrieval setting, but I haven't been able to extract features from texts and images.
Could you please provide a minimal working example that takes a text and an image as inputs and returns their features?

In other words, I'm asking for a snippet similar to the following one, but using FashionViL instead of CLIP:

import clip
import PIL.Image

model, preprocess = clip.load('RN50')
image = preprocess(PIL.Image.open('dog.jpg')).unsqueeze(0)
text = 'a photo of a dog'
tokenized_text = clip.tokenize(text)

image_features = model.encode_image(image)
text_features = model.encode_text(tokenized_text)

Thanks again for the amazing work

Hi, sorry for the late reply.

I think the output_dict returned by the forward function below is what you need.

def _forward(self, sample_list: Dict[str, Tensor]) -> Dict[str, Tensor]:
    visual_embeddings, _, _ = self.bert.get_image_embedding(
        sample_list["image"],
        sample_list["visual_embeddings_type"],
    )
    visual_embeddings = visual_embeddings.mean(dim=1)
    visual_embeddings = self.norm_layer(visual_embeddings)

    text_embeddings, _, _ = self.bert.get_text_embedding(
        sample_list["input_ids"],
        sample_list["segment_ids"],
        sample_list["input_mask"],
    )
    # text_embeddings = text_embeddings[:, 0]
    masks = sample_list["input_mask"]
    text_embeddings = text_embeddings * masks.unsqueeze(2)
    text_embeddings = torch.sum(text_embeddings, dim=1) / (
        torch.sum(masks, dim=1, keepdim=True)
    )
    text_embeddings = self.norm_layer(text_embeddings)

    output_dict = {
        "scores": visual_embeddings,
        "targets": text_embeddings,
    }
    return output_dict
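
For completeness, here is a rough sketch of how one might prepare the inputs and call this method to get the two embeddings. This is not an official API of the repo: the image preprocessing (resolution, normalization stats), the tokenizer settings, and the shape assumed for "visual_embeddings_type" are guesses and should be checked against the FashionViL configs and dataset processors; `model` is assumed to be an already-loaded FashionViL contrastive model exposing the _forward shown above.

import torch
from PIL import Image
from torchvision import transforms
from transformers import BertTokenizer

# Assumed preprocessing; check the repo's image transforms for the real values.
preprocess = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
])
image = preprocess(Image.open("dog.jpg")).unsqueeze(0)  # (1, 3, H, W)

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
tokens = tokenizer(
    "a photo of a dog",
    return_tensors="pt",
    padding="max_length",
    max_length=75,        # assumed max text length
    truncation=True,
)

sample_list = {
    "image": image,
    # Assumed: zeros, analogous to BERT token type ids; the expected shape
    # (e.g. one entry per visual token) may differ, see the dataset processors.
    "visual_embeddings_type": torch.zeros(1, dtype=torch.long),
    "input_ids": tokens["input_ids"],
    "segment_ids": tokens["token_type_ids"],
    "input_mask": tokens["attention_mask"],
}

with torch.no_grad():
    out = model._forward(sample_list)  # `model` must be built/loaded beforehand
image_features = out["scores"]   # pooled, normalized visual embedding
text_features = out["targets"]   # pooled, normalized text embedding

Since the _forward shown above only indexes sample_list by key, a plain dict with those five entries is enough for this sketch; inside MMF you would normally pass a SampleList built by the dataset pipeline instead.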

I assume this issue has been solved. Closed.