yoshitomo-matsubara/torchdistill

Why use `log_softmax` instead of `softmax`?

nguyenvulong opened this issue · 1 comment

The same question has been asked here and here. These repositories (which I think you already know) are other attempts to implement knowledge distillation algorithms.

Could you please explain why this code uses `log_softmax` instead of `softmax`?

```python
def forward(self, student_output, teacher_output, targets=None, *args, **kwargs):
    soft_loss = super().forward(torch.log_softmax(student_output / self.temperature, dim=1),
                                torch.softmax(teacher_output / self.temperature, dim=1))
    if self.alpha is None or self.alpha == 0 or targets is None:
        return soft_loss
    hard_loss = self.cross_entropy_loss(student_output, targets)
    return self.alpha * hard_loss + self.beta * (self.temperature ** 2) * soft_loss
```
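
For readers landing here, the snippet above can be made self-contained roughly as follows. This is only a sketch, not the exact torchdistill code: it assumes the loss subclasses `nn.KLDivLoss` (which is what the `super().forward` call implies), and the class name `KDLossSketch` and the constructor arguments are illustrative.

```python
import torch
from torch import nn


class KDLossSketch(nn.KLDivLoss):
    """Soft KL-divergence term plus an optional hard cross-entropy term."""

    def __init__(self, temperature, alpha=None, beta=1.0, reduction='batchmean'):
        super().__init__(reduction=reduction)
        self.temperature = temperature
        self.alpha = alpha
        self.beta = beta
        self.cross_entropy_loss = nn.CrossEntropyLoss()

    def forward(self, student_output, teacher_output, targets=None, *args, **kwargs):
        # nn.KLDivLoss expects its input already in log-space, hence
        # log_softmax for the student and plain softmax for the teacher.
        soft_loss = super().forward(torch.log_softmax(student_output / self.temperature, dim=1),
                                    torch.softmax(teacher_output / self.temperature, dim=1))
        if self.alpha is None or self.alpha == 0 or targets is None:
            return soft_loss
        hard_loss = self.cross_entropy_loss(student_output, targets)
        return self.alpha * hard_loss + self.beta * (self.temperature ** 2) * soft_loss


# Dummy usage: batch of 8 samples, 10 classes.
student_logits = torch.randn(8, 10)
teacher_logits = torch.randn(8, 10)
labels = torch.randint(0, 10, (8,))
criterion = KDLossSketch(temperature=4.0, alpha=0.5, beta=0.5)
print(criterion(student_logits, teacher_logits, labels))
```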

Hi @nguyenvulong

See `KLDivLoss` in the PyTorch documentation:

> To avoid underflow issues when computing this quantity, this loss expects the argument `input` in the log-space. The argument `target` may also be provided in the log-space if `log_target=True`.
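
A minimal sketch of why that matters: `log_softmax` computes the log-probabilities directly in a numerically stable way, whereas taking `softmax` first and then `log` can underflow to zero and produce `-inf`, which turns the KL term into `inf`. The logit values below are chosen purely to trigger that underflow in float32.

```python
import torch
from torch import nn

kl_div = nn.KLDivLoss(reduction='batchmean')

# A very negative logit makes softmax underflow to 0 in float32.
student_logits = torch.tensor([[0.0, -120.0, 2.0]])
teacher_logits = torch.tensor([[0.5, -1.0, 1.5]])

teacher_probs = torch.softmax(teacher_logits, dim=1)

# Stable: log-probabilities straight from log_softmax (what KLDivLoss expects as input).
stable_input = torch.log_softmax(student_logits, dim=1)
print(stable_input)                          # finite values, roughly -122 for the tiny class
print(kl_div(stable_input, teacher_probs))   # finite loss

# Unstable: softmax underflows to 0, log(0) = -inf, and the loss blows up.
naive_input = torch.log(torch.softmax(student_logits, dim=1))
print(naive_input)                           # contains -inf
print(kl_div(naive_input, teacher_probs))    # inf
```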

Also, please use Discussions above (instead of Issues) for questions.
As explained here, I want to keep Issues mainly for bug reports.