Comment in the Transformer implementation tutorial is not accurate
applepieiris opened this issue · 0 comments
I was reading the tutorial on using a Transformer to translate Portuguese (pt) to English (en). In the data pipeline construction part:
```python
MAX_TOKENS = 128

def prepare_batch(pt, en):
    pt = tokenizers.pt.tokenize(pt)      # Output is ragged.
    pt = pt[:, :MAX_TOKENS]              # Trim to MAX_TOKENS.
    pt = pt.to_tensor()                  # Convert to 0-padded dense Tensor
    en = tokenizers.en.tokenize(en)
    en = en[:, :(MAX_TOKENS+1)]
    en_inputs = en[:, :-1].to_tensor()   # Drop the [END] tokens
    en_labels = en[:, 1:].to_tensor()    # Drop the [START] tokens

    return (pt, en_inputs), en_labels
```
The comment attached to the line `en_inputs = en[:, :-1].to_tensor()` is not precise and may confuse beginners. If the English sentence has more than MAX_TOKENS tokens, the last token of `en` after the slice is not [END] (it has already been cut off). In that situation it is not correct to say that `en[:, :-1]` drops the [END] tokens; it simply drops whatever token happens to be last. I got stuck on this while preparing my own training data. What actually happens is that, regardless of the sequence length, only the first MAX_TOKENS (+1 for `en`) tokens are kept, so in some cases the target sequence fed to the decoder does not end with [END] at all.
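To make the behavior concrete, here is a minimal sketch in plain TensorFlow with toy, hypothetical token ids ([START]=2, [END]=3; these are illustration-only values, not the tutorial's real tokenizer output). It shows that a row longer than MAX_TOKENS + 1 loses its [END] token entirely:

```python
import tensorflow as tf

MAX_TOKENS = 5
START, END = 2, 3  # hypothetical ids, for illustration only

# Two tokenized sentences: one that fits, one longer than MAX_TOKENS + 1.
en = tf.ragged.constant([
    [START, 10, 11, END],                  # short: still ends with [END]
    [START, 10, 11, 12, 13, 14, 15, END],  # long: [END] is sliced away below
])

en = en[:, :(MAX_TOKENS + 1)]       # trims the long row; its [END] is gone
en_inputs = en[:, :-1].to_tensor()  # "drops [END]" only for the short row
en_labels = en[:, 1:].to_tensor()

print(en_labels.numpy())
# [[10 11  3  0  0]
#  [10 11 12 13 14]]  <- the truncated row contains no END (3) at all
```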
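One possible workaround (my own sketch, not a fix from the tutorial) is to drop the pairs that do not fit before slicing, so [END] can never be cut off. This assumes the same `tokenizers` object the tutorial builds:

```python
import tensorflow as tf

MAX_TOKENS = 128

def prepare_batch(pt, en):
    pt = tokenizers.pt.tokenize(pt)  # Ragged: one row per sentence.
    en = tokenizers.en.tokenize(en)

    # Keep only pairs where the slices below remove nothing, so every
    # kept `en` row still ends with [END].
    keep = tf.logical_and(
        pt.row_lengths() <= MAX_TOKENS,
        en.row_lengths() <= MAX_TOKENS + 1)
    pt = tf.ragged.boolean_mask(pt, keep)
    en = tf.ragged.boolean_mask(en, keep)

    pt = pt.to_tensor()
    en_inputs = en[:, :-1].to_tensor()  # Now this really drops only [END].
    en_labels = en[:, 1:].to_tensor()   # ...and this really drops [START].
    return (pt, en_inputs), en_labels
```

The trade-off is that over-length examples are discarded rather than truncated; the alternative is to keep the tutorial's truncation and accept that some decoder inputs and labels never contain [END].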