UKPLab/sentence-transformers

Modify GISTEmbedLoss and CachedGISTEmbedLoss to automatically prevent duplicate positives from being considered negatives

tomaarsen opened this issue · 2 comments

Hello!

The (Cached)GISTEmbedLoss classes mask away certain in-batch negatives because they might actually be positives. In GISTEmbedLoss this happens right here:

# Find which samples cannot be used as negatives because they are
# more similar to the query than the assigned positive as deemed by the guide model.
# For these samples, we mask them with -inf to basically ignore their contribution to
# the loss.
ap_sim[guided_ap_sim > guided_sim] = -torch.inf
aa_sim[guided_aa_sim > guided_sim] = -torch.inf
pp_sim[guided_pp_sim > guided_sim] = -torch.inf

and in CachedGISTEmbedLoss here:

# Find which samples cannot be used as negatives because they are
# more similar to the query than the assigned positive as deemed by the guide model.
# For these samples, we mask them with -inf to basically ignore their contribution to
# the loss.
ap_sim[guided_ap_sim > guided_sim] = -torch.inf
aa_sim[guided_aa_sim > guided_sim] = -torch.inf
pp_sim[guided_pp_sim > guided_sim] = -torch.inf
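For context, here is a rough, self-contained sketch of how these tensors relate to each other. The names follow the snippets above, but treat the details (cosine similarity, random embeddings) as an approximation rather than the exact library code:

import torch
import torch.nn.functional as F

batch_size, dim = 4, 8
# Embeddings from the model being trained (random placeholder values).
anchors = F.normalize(torch.randn(batch_size, dim), dim=-1)
positives = F.normalize(torch.randn(batch_size, dim), dim=-1)
# Embeddings of the same texts from the frozen guide model.
guide_anchors = F.normalize(torch.randn(batch_size, dim), dim=-1)
guide_positives = F.normalize(torch.randn(batch_size, dim), dim=-1)

# Pairwise similarity matrices, trained model and guide model.
ap_sim = anchors @ positives.T            # anchor-positive
aa_sim = anchors @ anchors.T              # anchor-anchor
pp_sim = positives @ positives.T          # positive-positive
guided_ap_sim = guide_anchors @ guide_positives.T
guided_aa_sim = guide_anchors @ guide_anchors.T
guided_pp_sim = guide_positives @ guide_positives.T

# Guide similarity of each anchor to its own assigned positive, kept as a
# column vector so it broadcasts row-wise against the full similarity matrices.
guided_sim = guided_ap_sim.diagonal().view(-1, 1)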

However, consider a scenario with (anchor, positive) pairs where the same positive text occurs multiple times in the batch. This is quite harmful: that text is then used both as a positive and as an in-batch negative for the same anchor. (Cached)GISTEmbedLoss should be able to detect this, because for such a duplicate the guided_pp_sim entry will be exactly equal to guided_sim. So, I think we can safely replace pp_sim[guided_pp_sim > guided_sim] = -torch.inf with pp_sim[guided_pp_sim >= guided_sim] = -torch.inf to automatically prevent duplicate positives from being treated as in-batch negatives.
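To make the edge case concrete, here is a small toy example with made-up guide similarities, where positive 2 is the exact same text as positive 0; only the comparison operator differs between the two maskings:

import torch

# Guide similarities for a batch of 3 pairs (hypothetical values).
guided_sim = torch.tensor([[0.9], [0.8], [0.9]])   # anchor i vs. its own positive
guided_pp_sim = torch.tensor([
    [1.0, 0.3, 0.9],   # positive 0 vs. positives 0, 1, 2; note [0, 2] equals guided_sim[0]
    [0.3, 1.0, 0.3],
    [0.9, 0.3, 1.0],
])
pp_sim = torch.rand(3, 3)  # trained-model similarities that feed the loss

# Current masking with '>': the duplicate at [0, 2] is NOT masked, because its
# guide similarity equals (rather than exceeds) guided_sim[0].
current = pp_sim.clone()
current[guided_pp_sim > guided_sim] = -torch.inf

# Proposed masking with '>=': the duplicate is masked as well.
proposed = pp_sim.clone()
proposed[guided_pp_sim >= guided_sim] = -torch.inf

print(current[0, 2])   # finite -> still treated as an in-batch negative
print(proposed[0, 2])  # -inf   -> no longer contributes to the loss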

I haven't made this PR myself yet, because it will require some experimentation/testing to verify that it doesn't accidentally hurt performance. However, conceptually it should only improve models.

cc @avsolatorio

  • Tom Aarsen

There may be some things I don't know well, but shouldn't duplicates between anchors also be removed, as well as duplicates between anchors and positives?

Removing duplicate anchors might indeed be smart: if they're included in the same batch, then the positive from the other (but identical) anchor will be used as a negative right now.
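Purely as a sketch (not something that exists in the library), such deduplication could be done as a preprocessing step on the training pairs, so that identical anchors never end up in the same batch:

def deduplicate_by_anchor(pairs: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Keep only the first (anchor, positive) pair for each distinct anchor text."""
    seen = set()
    unique_pairs = []
    for anchor, positive in pairs:
        if anchor not in seen:
            seen.add(anchor)
            unique_pairs.append((anchor, positive))
    return unique_pairs

pairs = [
    ("how tall is the eiffel tower?", "The Eiffel Tower is about 330 m tall."),
    ("how tall is the eiffel tower?", "It stands roughly 330 metres high."),  # duplicate anchor
    ("what is the capital of france?", "Paris is the capital of France."),
]
print(deduplicate_by_anchor(pairs))  # only one pair per distinct anchor remains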

Duplicates between anchors and positives shouldn't matter too much, I think: the loss for such a sample is 0, so the model won't learn anything from it.

  • Tom Aarsen