fartashf/vsepp

questions on dataset construction

muaz1994 opened this issue · 3 comments

Hi. Thanks for your code.
1- May I ask why you are including the start and end tokens when constructing the caption? Since you only want to encode the caption, there seems to be no need for them. As far as I know, start and end tokens are only needed when generating text (such as image captioning, neural machine translation, etc.), but in your case you just want to encode. Or does it have to do with how the evaluation metric is calculated?
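To make the question concrete, the construction I mean looks roughly like this (a sketch, not your exact code; vocab is assumed to be callable and map a token string to an index):

    import nltk
    import torch

    def build_caption_tensor(caption, vocab):
        # Tokenize, then wrap the word indices in <start>/<end> markers.
        # vocab is assumed to return an integer index per token.
        tokens = nltk.tokenize.word_tokenize(str(caption).lower())
        ids = [vocab('<start>')]
        ids.extend(vocab(t) for t in tokens)
        ids.append(vocab('<end>'))
        return torch.Tensor(ids)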

2- I also have a question about the data loader. In this part:

        if self.images.shape[0] != self.length:
            self.im_div = 5
        else:
            self.im_div = 1
        # the development set for coco is large and so validation would be slow
        if data_split == 'dev':
            self.length = 5000

I understand that for the training and test splits, you are replicating each image 5 times (the number of captions per image). However, for the 'dev' split (validation after training), you are specifying a length of 5000 only. For Flickr30k that would still be correct (since we have 1000 validation images * 5 captions), but for COCO the actual validation set with the replication is 25K, and you are loading only a portion of it. According to how the data loader works, it will generate indices according to the length of the dataset specified in __len__. Therefore, for the COCO dev set it will generate 5000 indices, and with images[i//5] this will retrieve only 1000 of the original COCO validation images. So my question is: is that the right thing to do? What if the other samples are better? This could lead to a low validation score when it should be high.
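To spell out my reading of the indexing, it works roughly like this (a simplified sketch with made-up shapes, not the exact loader code):

    # Simplified sketch of the indexing in question.
    # For COCO dev: ~5000 precomputed image features, ~25000 captions.
    class PrecompDatasetSketch:
        def __init__(self, images, captions, data_split):
            self.images = images              # shape (5000, feat_dim) for COCO dev
            self.captions = captions          # 5 captions per image -> 25000
            self.length = len(self.captions)
            self.im_div = 5 if self.images.shape[0] != self.length else 1
            if data_split == 'dev':
                self.length = 5000            # sampler only draws indices 0..4999

        def __getitem__(self, index):
            img_id = index // self.im_div     # 0..999, i.e. first 1000 images only
            return self.images[img_id], self.captions[index]

        def __len__(self):
            return self.length                # 5000 for COCO dev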

About start and end tokens, I guess in the current implementation they are not really useful. They would be useful if the implementation of the sentence encoder couldn't handle variable lengths in a mini-batch, or if it weren't parallelized over a mini-batch. But that's not the case here with the PyTorch tools used.
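For example, with pack_padded_sequence a GRU can consume variable-length captions in a single mini-batch without any special markers; a minimal sketch with made-up sizes:

    import torch
    import torch.nn as nn
    from torch.nn.utils.rnn import pack_padded_sequence

    vocab_size, embed_dim, hidden_dim = 1000, 300, 512
    embed = nn.Embedding(vocab_size, embed_dim)
    gru = nn.GRU(embed_dim, hidden_dim, batch_first=True)

    captions = torch.randint(1, vocab_size, (4, 12))  # padded batch of 4 captions
    lengths = torch.tensor([12, 9, 7, 5])             # true lengths, sorted descending

    packed = pack_padded_sequence(embed(captions), lengths, batch_first=True)
    _, last_hidden = gru(packed)                      # last_hidden: (1, 4, hidden_dim)
    caption_embeddings = last_hidden.squeeze(0)       # one vector per caption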

It is true that the validation set effectively contains only 1000 images. But the small validation set is only used for early stopping, so at worst we may not choose the best model for the test set.

@fartashf Thanks a lot for your reply. I have one more question and I would really appreciate a reply. I have a deeper network for embedding the visual and text features, based on your code (I just added more layers). Consider it to be 3 layers for each modality. I am using a ReLU activation between all these layers, except for the last one, whose output is the input to the similarity function. I found that the results are extremely poor (0.1, 0.5, 0.3 for R@1, R@5, R@10), but it works well when I remove the ReLU activations between the layers. Any idea why the non-linearity harms image-text retrieval?
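To be concrete, the head I mean looks roughly like this (a sketch with made-up sizes; ReLU between the hidden layers only, nothing on the output, which is then L2-normalized before the similarity):

    import torch.nn as nn
    import torch.nn.functional as F

    # Sketch of the deeper projection head described above (made-up sizes).
    # ReLU sits between the hidden layers only; the last layer has no
    # activation because its output feeds the cosine similarity.
    class DeepEmbedding(nn.Module):
        def __init__(self, in_dim=2048, hid_dim=1024, embed_dim=1024):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(in_dim, hid_dim), nn.ReLU(),
                nn.Linear(hid_dim, hid_dim), nn.ReLU(),
                nn.Linear(hid_dim, embed_dim),        # no ReLU on the output
            )

        def forward(self, x):
            return F.normalize(self.net(x), p=2, dim=1)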

Maybe consider these changes:

  • Try the sum loss first. Max loss is not always good at the beginning of the optimization (see the loss sketch after this list).
  • Make sure initialization is done correctly; weights should be neither too small nor too large.
  • Make sure fine-tuning is not set.
  • Try adding layers to only one modality. See if the problem is with training both together.
  • Try adding normalization before each linear layer.
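Regarding the first point, a minimal sketch of the difference between the sum and max-violation versions of the hinge ranking loss (illustrative margin; scores is an image-caption similarity matrix with matched pairs on the diagonal; not necessarily the exact code in this repo):

    import torch
    import torch.nn as nn

    class ContrastiveLossSketch(nn.Module):
        # max_violation=False: sum over all negatives (sum loss).
        # max_violation=True: keep only the hardest negative (max loss).
        def __init__(self, margin=0.2, max_violation=False):
            super().__init__()
            self.margin = margin
            self.max_violation = max_violation

        def forward(self, scores):
            # scores[i, j]: similarity of image i and caption j
            diagonal = scores.diag().view(scores.size(0), 1)
            d1 = diagonal.expand_as(scores)      # positive score per image row
            d2 = diagonal.t().expand_as(scores)  # positive score per caption column

            cost_s = (self.margin + scores - d1).clamp(min=0)   # caption retrieval
            cost_im = (self.margin + scores - d2).clamp(min=0)  # image retrieval

            # zero out the diagonal so positives do not contribute
            mask = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
            cost_s = cost_s.masked_fill(mask, 0)
            cost_im = cost_im.masked_fill(mask, 0)

            if self.max_violation:
                cost_s = cost_s.max(1)[0]
                cost_im = cost_im.max(0)[0]
            return cost_s.sum() + cost_im.sum()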