fartashf/vsepp

single caption query

wingz1 opened this issue · 4 comments

This code works quite well. Thanks for sharing it.
I'm wondering: do you have any code snippets showing how one might take a new caption query (i.e., a raw string), run it through a trained VSE++ model to get a single caption embedding, and then search for matching images that have been mapped into the joint space with the same model?
It's easy to do the comparison once numpy arrays for the caption and image embeddings in the joint space are created, but it's not clear how to use your model with a brand new caption query, or with a set of CNN image features that are not part of a complete COCO/Flickr/etc. train or test set with corresponding caption/image pairs.
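For reference, by "the comparison" I just mean something like the following, where img_embs (N x D) and cap_emb (D,) are already L2-normalized embeddings in the joint space:

import numpy as np

def rank_images(cap_emb, img_embs, top_k=5):
    # With L2-normalized embeddings, cosine similarity is just a dot product.
    sims = img_embs.dot(cap_emb)
    # Indices of the top_k best-matching images, most similar first.
    return np.argsort(-sims)[:top_k]

The part I'm missing is how to produce cap_emb and img_embs in the first place for data outside the standard splits.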
Thanks for any tips. I'd prefer not to rewrite everything if you already have some additional tools for this.

I don't have any particular script for that purpose. But you can look at the function encode_data to get an idea:

def encode_data(model, data_loader, log_step=10, logging=print):

encode_data gets the input from data_loader and encodes all images and captions given by that loader. It's probably easiest to write a special data loader class that handles your data. For that, take a look at data.py.
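Alternatively, for a single raw caption you could skip the data loader and call the text encoder directly. Something like this untested sketch; it mirrors the tokenization in data.py, and model.txt_enc plus the callable vocab are what I recall from model.py and vocab.py, so double-check against your checkout:

import nltk
import torch
from torch.autograd import Variable

def encode_caption(model, caption, vocab):
    # Tokenize the raw string the same way data.py does.
    tokens = nltk.tokenize.word_tokenize(str(caption).lower())
    ids = [vocab('<start>')] + [vocab(t) for t in tokens] + [vocab('<end>')]
    # Batch of one caption; the lengths list is required by the GRU encoder.
    captions = Variable(torch.LongTensor([ids]), volatile=True)
    if torch.cuda.is_available():
        captions = captions.cuda()
    # The text encoder returns an L2-normalized joint-space embedding.
    cap_emb = model.txt_enc(captions, [len(ids)])
    return cap_emb.data.cpu().numpy().squeeze()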

@wingz1 Were you able to do it? Any snippets or tips?

I have a similar task at hand: I want to use COCO captions to retrieve the top-k images.

Yes, actually. I added a caption2emb(model, mycaption, vocab) function to evaluate.py.

@wingz1 Oh, that's great to know. Is it available anywhere to have a look? It would help me a lot.