tensorflow/text

Inaccurate comment in the Transformer tutorial's data pipeline

applepieiris opened this issue · 0 comments

I read the tutorial on using a Transformer to translate pt to en. In the data pipeline construction part:

MAX_TOKENS=128
def prepare_batch(pt, en):
    pt = tokenizers.pt.tokenize(pt)      # Output is ragged.
    pt = pt[:, :MAX_TOKENS]    # Trim to MAX_TOKENS.
    pt = pt.to_tensor()  # Convert to 0-padded dense Tensor

    en = tokenizers.en.tokenize(en)
    en = en[:, :(MAX_TOKENS+1)]
    en_inputs = en[:, :-1].to_tensor()  # Drop the [END] tokens
    en_labels = en[:, 1:].to_tensor()   # Drop the [START] tokens

    return (pt, en_inputs), en_labels

The comment attached to the line en_inputs = en[:, :-1].to_tensor() is not precise and may confuse beginners. If the English sentence has more than MAX_TOKENS tokens, the last token of en is not [END] (it has already been sliced off), so in that case it is not right to say that en[:, :-1] drops the [END] tokens. I got stuck on this while preparing my own training data.

What actually happens is that, whatever length the sequence has, we only keep the first MAX_TOKENS tokens (MAX_TOKENS + 1 for en before the shift). This means that in some cases the target sequence we feed to the decoder does not end with [END], as the minimal reproduction below shows.
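Here is a minimal sketch of the problem, with made-up token IDs (assume 1 is [START], 2 is [END], the rest are word pieces) and MAX_TOKENS shrunk to 4 so the truncation is easy to see:

import tensorflow as tf

MAX_TOKENS = 4  # shrunk from 128 so the effect is visible

# Hypothetical token IDs: 1 = [START], 2 = [END], others are word pieces.
en = tf.ragged.constant([
    [1, 7, 8, 2],                 # short sentence: fits within the limit
    [1, 7, 8, 9, 10, 11, 12, 2],  # long sentence: gets truncated
])

en = en[:, :(MAX_TOKENS+1)]         # keep at most MAX_TOKENS+1 tokens
en_inputs = en[:, :-1].to_tensor()  # drops [END] only if it survived the slice
en_labels = en[:, 1:].to_tensor()

print(en_labels)
# tf.Tensor(
# [[ 7  8  2  0]
#  [ 7  8  9 10]], shape=(2, 4), dtype=int32)
# The second row was truncated, so its labels never contain [END].

So a more accurate comment would be something like "# Drop the last token (which is [END] only when the sentence was not truncated)".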