fartashf/vsepp

Batch formation potentially causing false negatives

wpeebles opened this issue · 1 comment

Since each image in MS-COCO has 5 captions, I believe that when a batch is formed in data.py, two or more of the images in that batch may be identical (they will just be paired with different captions). Since the ContrastiveLoss implementation assumes that only the diagonal of the scores matrix represents scores for aligned image-caption pairs, doesn't this mean that images and captions that are aligned in the dataset can be treated as unaligned when computing/backpropagating the loss? Here is an example to illustrate the idea:

Consider a batch size of 128. Perhaps the 5th and 19th images selected in the batch are identical (the 5th and 19th captions selected are different, but describe the same image). In the scores matrix in the forward method of ContrastiveLoss, the (5, 5) and (19, 19) entries will be correctly treated as scores for aligned embeddings. However, the (5, 19) and (19, 5) entries will be incorrectly treated as scores for unaligned embeddings.
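To make the assumption concrete, here is a minimal sketch of the sum-based contrastive loss structure I am referring to (not a verbatim copy of this repo's ContrastiveLoss), where only the diagonal is treated as aligned:

```python
import torch

# Minimal sketch of a sum-based contrastive loss over a batch
# (not the exact code from this repo).
def contrastive_loss(scores, margin=0.2):
    # scores[i, j] = similarity between image i and caption j
    diagonal = scores.diag().view(-1, 1)
    d1 = diagonal.expand_as(scores)      # aligned score for each image (row)
    d2 = diagonal.t().expand_as(scores)  # aligned score for each caption (column)
    cost_s = (margin + scores - d1).clamp(min=0)   # captions as negatives
    cost_im = (margin + scores - d2).clamp(min=0)  # images as negatives
    # The diagonal is cleared, so (5, 5) and (19, 19) contribute nothing...
    mask = torch.eye(scores.size(0), dtype=torch.bool)
    cost_s = cost_s.masked_fill(mask, 0)
    cost_im = cost_im.masked_fill(mask, 0)
    # ...but (5, 19) and (19, 5) are still penalized as "unaligned",
    # even if images 5 and 19 are identical.
    return cost_s.sum() + cost_im.sum()
```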

Did I misunderstand anything in the code? If not, I believe this would affect the cost_s portion of ContrastiveLoss but not the cost_im portion.

This is a simplification that, in practice, does not hurt training at the scale of MS-COCO.

In particular, if you look at the loss over the whole mini-batch, the incorrect terms cancel out. In your example, the loss term for (i_5, c_19) says "I want image i_5 to be closer to c_5 than to c_19", while the term for (i_19, c_5), which is in fact (i_5, c_5), says "I want i_5 to be closer to c_19 than to c_5". The gradients from these two terms are theoretically exact opposites of each other and would cancel out. This is true for both portions of the loss.
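A quick toy check of the cancellation, with hypothetical similarity values (not code from the repo):

```python
import torch

margin = 0.2
# Hypothetical similarities for the duplicated image i_5 (= i_19):
s_aligned = torch.tensor(0.6, requires_grad=True)  # s(i_5, c_5)
s_other = torch.tensor(0.5, requires_grad=True)    # s(i_5, c_19)

# Row 5: "i_5 should be closer to c_5 than to c_19"
term_a = (margin + s_other - s_aligned).clamp(min=0)
# Row 19 (i_19 == i_5): "i_5 should be closer to c_19 than to c_5"
term_b = (margin + s_aligned - s_other).clamp(min=0)

(term_a + term_b).backward()
# When both hinges are active (|s_aligned - s_other| < margin),
# the gradients cancel exactly:
print(s_aligned.grad, s_other.grad)  # tensor(0.) tensor(0.)
```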

Such a simplification would hurt, though, if the probability of sampling such opposing pairs were high. In that case, the gradient from one mini-batch would be accumulated over only a few effective examples, and hence the variance of the gradient estimate would be high.

Just as a simple test, I tried using a mask to filter out such opposing terms, but it did not help.
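For reference, the masking would look roughly like this; `image_ids` is a hypothetical per-batch list of COCO image ids, and this is a sketch, not the exact code I ran:

```python
import torch

def mask_opposing_terms(cost, image_ids):
    # cost: the per-pair hinge cost matrix (cost_s or cost_im).
    # image_ids: hypothetical COCO image id for each item in the batch.
    ids = torch.as_tensor(image_ids)
    same_image = ids.unsqueeze(0) == ids.unsqueeze(1)
    # Off-diagonal entries that share an image are the "opposing" terms.
    opposing = same_image & ~torch.eye(len(ids), dtype=torch.bool)
    return cost.masked_fill(opposing, 0)
```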

Besides, this simplification has typically been made in previous work. Take a look at these, for example:
https://github.com/ryankiros/visual-semantic-embedding/blob/master/homogeneous_data.py
https://github.com/ryankiros/visual-semantic-embedding/blob/master/datasets.py