Modify GISTEmbedLoss and CachedGISTEmbedLoss to automatically remove duplicate positives from being considered negatives
tomaarsen opened this issue · 2 comments
Hello!
The (Cached)GISTEmbedLoss classes mask away certain in-batch negatives as they might actually be positives right here:
sentence-transformers/sentence_transformers/losses/CachedGISTEmbedLoss.py
Lines 263 to 269 in f012ab3
and here:
sentence-transformers/sentence_transformers/losses/GISTEmbedLoss.py
Lines 132 to 138 in f012ab3
However, consider a scenario with (anchor, positive) pairs, where the same positive
text occurs multiple times in the batch. This is quite bad, as this sample is now used both as a positive and as an in-batch negative. However, (Cached)GISTEmbedLoss should be able to detect this, as the guided_pp_sim
and the guided_sim
will be identical here. So, I think we can safely replace pp_sim[guided_pp_sim > guided_sim] = -torch.inf
with pp_sim[guided_pp_sim >= guided_sim] = -torch.inf
to automatically prevent duplicate positives from being labeled as in-batch negatives.
I haven't made this PR myself yet, because it will require some experimentation/testing to see if this doesn't accidentally hurt performance. However, conceptually it should only improve models.
cc @avsolatorio
- Tom Aarsen
There may be some things I don't know well, but shouldn't I delete the duplicate between Anchors and also delete the duplicate between Anchors and Positives?
Removing duplicate anchors might indeed be smart: if they're included in the same batch, then the positive from the other (but identical) anchor will be used as a negative right now.
Duplicates between anchors and positives shouldn't matter too much I think: the loss for that sample is 0, so it won't learn from it.
- Tom Aarsen