vid-koci/bert-commonsense

two bugs in the code

YeDeming opened this issue · 8 comments

Hi Vid,
As we discussed over email, I'm recording the bugs here.

bert-commonsense/main.py

Lines 71 to 72 in 2fb1754

masked_lm_loss = loss_fct(prediction_scores.permute(0,2,1), masked_lm_labels)
return torch.mean(masked_lm_loss,1)

The summed loss should be divided by the number of [MASK] tokens. In the current implementation, the larger the number of [MASK] tokens, the larger the mean loss. I am not sure whether the length of the correct answer introduces a bias on the test set.
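To illustrate the bias with made-up numbers (a sketch, not the repository's tensors): positions that are not [MASK] contribute 0 to the per-position loss, and torch.mean divides by the full sequence length, so the candidate with more [MASK] tokens gets the larger mean even when its per-token loss is identical.

import torch
# per-position losses for two candidates over a sequence of length 6 (hypothetical values)
short_answer = torch.tensor([[0.7, 0.7, 0.0, 0.0, 0.0, 0.0]])  # 2 [MASK] tokens
long_answer  = torch.tensor([[0.7, 0.7, 0.7, 0.7, 0.0, 0.0]])  # 4 [MASK] tokens
print(torch.mean(short_answer, 1))  # tensor([0.2333]) -> sum 1.4 divided by 6, not by 2
print(torch.mean(long_answer, 1))   # tensor([0.4667]) -> larger only because it has more masks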

loss = loss_1 + args.alpha_param * torch.max(torch.zeros(loss_1.size(),device=device),torch.ones(loss_1.size(),device=device)*args.beta_param + loss_1 - loss_2.mean())

The .mean() after loss_2 is superfluous.
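For reference, the line without the extra .mean() would look something like this (a sketch only, reusing the names from main.py and just dropping the .mean()):

loss = loss_1 + args.alpha_param * torch.max(
    torch.zeros(loss_1.size(), device=device),
    torch.ones(loss_1.size(), device=device) * args.beta_param + loss_1 - loss_2,
)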

Hi,
Thanks for your comments. For now, I will leave the code as it is, for reproducibility and consistency with the trained models. I will try to revisit and re-train the models in the future, when my computational resources allow it.

Hello, could you elaborate on this issue?

If I understand correctly, you mean that an answer with more subtokens would have less chance of being selected as the right answer. So basically what you are suggesting is

 return torch.mean(masked_lm_loss,1) / number_of_masks

?

Thanks in advance!

That's correct. And the second bug is a redundant .mean() after loss_2 at the end of the line.

It should be just - loss_2, without the .mean(): .mean() averages over the batch, which shouldn't happen at this stage.
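For concreteness, with made-up per-example losses for the two candidates:

import torch
loss_1 = torch.tensor([0.2, 1.5])  # hypothetical per-example losses for one candidate
loss_2 = torch.tensor([0.9, 0.1])  # hypothetical per-example losses for the other candidate
print(loss_2.mean())     # tensor(0.5000) -> a scalar: every example gets compared to the batch average
print(loss_1 - loss_2)   # tensor([-0.7000, 1.4000]) -> the intended element-wise comparison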

return torch.mean(masked_lm_loss,1) / number_of_masks

@vid-koci @xiaoouwang
Just to clarify, should number_of_masks be calculated with torch.count_nonzero(masked_lm_labels==-1, axis=1)?

@mhillebrand I believe it should be masked_lm_labels>-1 rather than masked_lm_labels==-1, since -1 marks exactly the positions we are not interested in, and those are ignored by the CrossEntropy loss.

Ah, of course. Line 72 should be:

return torch.sum(masked_lm_loss, 1) / torch.count_nonzero(masked_lm_labels>-1, axis=1)
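As a quick, self-contained sanity check of that line with dummy tensors (hypothetical shapes and values, assuming loss_fct is CrossEntropyLoss(ignore_index=-1, reduction='none')):

import torch

batch, seq_len, vocab = 2, 8, 30522
prediction_scores = torch.randn(batch, seq_len, vocab)
masked_lm_labels = torch.full((batch, seq_len), -1, dtype=torch.long)
masked_lm_labels[0, :3] = 7  # 3 [MASK] positions in the first example
masked_lm_labels[1, :5] = 7  # 5 [MASK] positions in the second example

loss_fct = torch.nn.CrossEntropyLoss(ignore_index=-1, reduction='none')
masked_lm_loss = loss_fct(prediction_scores.permute(0, 2, 1), masked_lm_labels)  # (batch, seq_len)
per_example = torch.sum(masked_lm_loss, 1) / torch.count_nonzero(masked_lm_labels > -1, dim=1)
print(per_example.shape)  # torch.Size([2]); each entry is now a mean over that example's [MASK] positions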

Interesting. The accuracy drops by over 5% after fixing these two bugs.