jadore801120/attention-is-all-you-need-pytorch

How do gradients flow in the cal_loss function in train.py?

InhyeokYoo opened this issue · 0 comments

Hi.

I'm implementing label smoothing by referencing your code, but I don't understand how your cal_loss function makes the gradient flow. How does the gradient flow through the one-hot vector?
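
For reference, this is my understanding of the usual label-smoothing pattern as a minimal, self-contained sketch (toy shapes chosen by me, not your exact code): the smoothed one-hot target is treated as a constant, and the gradient is supposed to flow through log_softmax of the predictions, not through the one-hot tensor itself.

import torch
import torch.nn.functional as F

# toy example (assumed sizes): 2 tokens, vocabulary of 5
pred = torch.randn(2, 5, requires_grad=True)   # logits
gt = torch.tensor([1, 3])                      # ground-truth indices
smoothing, n_class = 0.1, pred.size(1)

# smoothed one-hot target: a constant, so it needs no requires_grad
one_hot = torch.zeros_like(pred).scatter(1, gt.unsqueeze(1), 1)
one_hot = one_hot * (1 - smoothing) + (1 - one_hot) * smoothing / (n_class - 1)

log_prob = F.log_softmax(pred, dim=1)          # the gradient path runs through pred here
loss = -(one_hot * log_prob).sum(dim=1).mean()
loss.backward()
print(pred.grad.shape)                         # torch.Size([2, 5]): the gradient reaches pred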

Below is my implementation. The gradient didn't flow in my model. What is the problem with mine?

# reference: https://github.com/jadore801120/attention-is-all-you-need-pytorch/blob/master/train.py#L38
import torch
import torch.nn as nn

def cal_loss(pred, gt, ignore_idx, smoothing=None):
    '''
    Calculate the cross-entropy loss between the pre-softmax logits and the ground truth (gt).
    param:
        pred: predicted values (logits, before applying softmax)
        gt: ground truth
        ignore_idx: index that is ignored when computing the loss (e.g. the padding index)
        smoothing: if a float is passed, apply label smoothing with that value

    shape:
        pred: [Batch_size, (Seq_len-1), Tgt_vocab_size]
        gt: [Batch_size, (Seq_len-1)]
    '''
    if smoothing is not None:
        confidence = 1 - smoothing
        target_vocab_size = pred.size(2)

        # build a smoothed one-hot target distribution from the ground-truth indices
        one_hot = torch.zeros_like(pred, requires_grad=True).scatter(2, gt.unsqueeze(2), 1) # [Batch_size, (Seq_len-1), Tgt_vocab_size]
        one_hot = one_hot * confidence + (1 - one_hot) * smoothing / (target_vocab_size - 2) # ['SOS', 'PAD']
        log_prob = nn.functional.softmax(one_hot, dim=2)
        
        non_pad_mask = gt.ne(ignore_idx) # where gt != ignore_index
        # NLL loss with smoothed targets: $-\sum_{i} q(i) \log p_{\theta}(i \mid x)$
        loss = - (one_hot * log_prob).sum(dim=2) # [Batch_size, (Seq_len-1)]
        loss = loss.masked_select(non_pad_mask).mean() # keep only non-pad positions (1-D tensor) and average them
    else:
        # flatten to [Batch*(Seq_len-1), Vocab] / [Batch*(Seq_len-1)] as F.cross_entropy expects
        loss = nn.functional.cross_entropy(pred.reshape(-1, pred.size(-1)), gt.reshape(-1),
                                           ignore_index=ignore_idx, reduction='sum')
    return loss
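
For completeness, this is roughly how I checked whether the gradient reaches pred (the dummy shapes and the PAD index of 0 are just assumptions for the test):

# gradient check with dummy tensors (hypothetical shapes / PAD index)
batch_size, seq_len, vocab_size = 2, 4, 10
pred = torch.randn(batch_size, seq_len, vocab_size, requires_grad=True)
gt = torch.randint(1, vocab_size, (batch_size, seq_len))  # avoid the PAD index 0

loss = cal_loss(pred, gt, ignore_idx=0, smoothing=0.1)
loss.backward()
print(pred.grad)  # stays None for me, i.e. the gradient never reaches pred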