Modify GISTEmbedLoss and CachedGISTEmbedLoss to automatically remove duplicate positives from being considered negatives

Hello!

The (Cached)GISTEmbedLoss classes mask away certain in-batch negatives as they might actually be positives right here:

sentence-transformers/sentence_transformers/losses/CachedGISTEmbedLoss.py

Lines 263 to 269 in f012ab3

    
           # Find which samples cannot be used as negatives because they are 
        
           # more similar to the query than the assigned positive as deemed by the guide model. 
        
           # For these samples, we mask them with -inf to basically ignore their contribution to 
        
           # the loss. 
        
           ap_sim[guided_ap_sim > guided_sim] = -torch.inf 
        
           aa_sim[guided_aa_sim > guided_sim] = -torch.inf 
        
           pp_sim[guided_pp_sim > guided_sim] = -torch.inf

and here:

sentence-transformers/sentence_transformers/losses/GISTEmbedLoss.py

Lines 132 to 138 in f012ab3

    
           # Find which samples cannot be used as negatives because they are 
        
           # more similar to the query than the assigned positive as deemed by the guide model. 
        
           # For these samples, we mask them with -inf to basically ignore their contribution to 
        
           # the loss. 
        
           ap_sim[guided_ap_sim > guided_sim] = -torch.inf 
        
           aa_sim[guided_aa_sim > guided_sim] = -torch.inf 
        
           pp_sim[guided_pp_sim > guided_sim] = -torch.inf

However, consider a scenario with (anchor, positive) pairs, where the same positive text occurs multiple times in the batch. This is quite bad, as this sample is now used both as a positive and as an in-batch negative. However, (Cached)GISTEmbedLoss should be able to detect this, as the guided_pp_sim and the guided_sim will be identical here. So, I think we can safely replace pp_sim[guided_pp_sim > guided_sim] = -torch.inf with pp_sim[guided_pp_sim >= guided_sim] = -torch.inf to automatically prevent duplicate positives from being labeled as in-batch negatives.

I haven't made this PR myself yet, because it will require some experimentation/testing to see if this doesn't accidentally hurt performance. However, conceptually it should only improve models.

cc @avsolatorio

Tom Aarsen

There may be some things I don't know well, but shouldn't I delete the duplicate between Anchors and also delete the duplicate between Anchors and Positives?

Removing duplicate anchors might indeed be smart: if they're included in the same batch, then the positive from the other (but identical) anchor will be used as a negative right now.

Duplicates between anchors and positives shouldn't matter too much I think: the loss for that sample is 0, so it won't learn from it.

Tom Aarsen

	# Find which samples cannot be used as negatives because they are
	# more similar to the query than the assigned positive as deemed by the guide model.
	# For these samples, we mask them with -inf to basically ignore their contribution to
	# the loss.
	ap_sim[guided_ap_sim > guided_sim] = -torch.inf
	aa_sim[guided_aa_sim > guided_sim] = -torch.inf
	pp_sim[guided_pp_sim > guided_sim] = -torch.inf