yoshitomo-matsubara/torchdistill

Why use `log_softmax` instead of `softmax`?

nguyenvulong opened this issue · 1 comment

The same question has been asked here and here. These repositories (which I think you already know) are other attempts to implement knowledge distillation algorithms.

Could you please explain why this code uses `log_softmax` instead of `softmax`?

```python
def forward(self, student_output, teacher_output, targets=None, *args, **kwargs):
    soft_loss = super().forward(torch.log_softmax(student_output / self.temperature, dim=1),
                                torch.softmax(teacher_output / self.temperature, dim=1))
    if self.alpha is None or self.alpha == 0 or targets is None:
        return soft_loss
    hard_loss = self.cross_entropy_loss(student_output, targets)
    return self.alpha * hard_loss + self.beta * (self.temperature ** 2) * soft_loss
```
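
For readers landing here, the snippet above can be made self-contained roughly as follows. This is only a sketch, not the exact torchdistill code: it assumes the loss subclasses `nn.KLDivLoss` (which is what the `super().forward` call implies), and the class name `KDLossSketch` and the constructor arguments are illustrative.

```python
import torch
from torch import nn


class KDLossSketch(nn.KLDivLoss):
    """Soft KL-divergence term plus an optional hard cross-entropy term."""

    def __init__(self, temperature, alpha=None, beta=1.0, reduction='batchmean'):
        super().__init__(reduction=reduction)
        self.temperature = temperature
        self.alpha = alpha
        self.beta = beta
        self.cross_entropy_loss = nn.CrossEntropyLoss()

    def forward(self, student_output, teacher_output, targets=None, *args, **kwargs):
        # nn.KLDivLoss expects its input already in log-space, hence
        # log_softmax for the student and plain softmax for the teacher.
        soft_loss = super().forward(torch.log_softmax(student_output / self.temperature, dim=1),
                                    torch.softmax(teacher_output / self.temperature, dim=1))
        if self.alpha is None or self.alpha == 0 or targets is None:
            return soft_loss
        hard_loss = self.cross_entropy_loss(student_output, targets)
        return self.alpha * hard_loss + self.beta * (self.temperature ** 2) * soft_loss


# Dummy usage: batch of 8 samples, 10 classes.
student_logits = torch.randn(8, 10)
teacher_logits = torch.randn(8, 10)
labels = torch.randint(0, 10, (8,))
criterion = KDLossSketch(temperature=4.0, alpha=0.5, beta=0.5)
print(criterion(student_logits, teacher_logits, labels))
```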

Hi @nguyenvulong

See `KLDivLoss` in the PyTorch documentation:

> To avoid underflow issues when computing this quantity, this loss expects the argument `input` in the log-space. The argument `target` may also be provided in the log-space if `log_target=True`.
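
A minimal sketch of why that matters: `log_softmax` computes the log-probabilities directly in a numerically stable way, whereas taking `softmax` first and then `log` can underflow to zero and produce `-inf`, which turns the KL term into `inf`. The logit values below are chosen purely to trigger that underflow in float32.

```python
import torch
from torch import nn

kl_div = nn.KLDivLoss(reduction='batchmean')

# A very negative logit makes softmax underflow to 0 in float32.
student_logits = torch.tensor([[0.0, -120.0, 2.0]])
teacher_logits = torch.tensor([[0.5, -1.0, 1.5]])

teacher_probs = torch.softmax(teacher_logits, dim=1)

# Stable: log-probabilities straight from log_softmax (what KLDivLoss expects as input).
stable_input = torch.log_softmax(student_logits, dim=1)
print(stable_input)                          # finite values, roughly -122 for the tiny class
print(kl_div(stable_input, teacher_probs))   # finite loss

# Unstable: softmax underflows to 0, log(0) = -inf, and the loss blows up.
naive_input = torch.log(torch.softmax(student_logits, dim=1))
print(naive_input)                           # contains -inf
print(kl_div(naive_input, teacher_probs))    # inf
```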

Also, please use Discussions above (instead of Issues) for questions.
As explained here, I want to keep Issues mainly for bug reports.