BPE for the NMT example
smith-co opened this issue · 4 comments
❓ Questions and Help
I see that in the given example, BPE is removed during evaluation:
pred_collect.extend(self.remove_bpe(pred_str))
Here is the remove_bpe code:
def remove_bpe(self, str_with_subword):
    if isinstance(str_with_subword, list):
        return [self.remove_bpe(ss) for ss in str_with_subword]
    symbol = "@@ "
    return str_with_subword.replace(symbol, "").strip()
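For context, here is a standalone sketch of what that remove_bpe logic does (same body as the quoted method, just without the class, so it can be run directly). It strips the "@@ " continuation markers that BPE inserts between subword units, recursing over lists of strings:

```python
def remove_bpe(str_with_subword, symbol="@@ "):
    # Lists of predictions are handled element by element.
    if isinstance(str_with_subword, list):
        return [remove_bpe(ss) for ss in str_with_subword]
    # Deleting "@@ " glues each subword back onto the piece that follows it.
    return str_with_subword.replace(symbol, "").strip()

print(remove_bpe("the new@@ est mod@@ el"))   # -> "the newest model"
print(remove_bpe(["hel@@ lo", "wor@@ ld"]))   # -> ["hello", "world"]
```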
But where is BPE applied during vocab building?
Yes. The raw data are tokenized by BPE.
@AlanSwift so for the given example, the raw data is already tokenized by BPE.
In other words, the current example expects the dataset to be tokenized by BPE. What would happen if the dataset is not tokenized by BPE?
@AlanSwift can you please help me with the query?
The current NMT example uses BPE since BPE is good practice and widely used in the NMT community. Technically it is fine to use whole words as your vocabulary in your own application. However, your vocabulary will be large if you do, and performance may be limited by the OOV (out-of-vocabulary) problem.
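To make the OOV point concrete, here is a hypothetical toy illustration (not code from this repo; the vocabularies are made up): with a whole-word vocabulary, any unseen word collapses to <unk>, while BPE-style subwords can still cover it. The segmenter below is a greedy longest-match stand-in; real BPE replays learned merge operations, but the effect on unseen words is the same.

```python
# Toy vocabularies for illustration only.
word_vocab = {"low", "lower", "newest"}
subword_vocab = {"low@@", "low", "est", "er"}

def encode_word_level(token, vocab):
    """Whole-word lookup: unseen words become a single <unk> symbol."""
    return token if token in vocab else "<unk>"

def greedy_subword_segment(word, subwords):
    """Greedy longest-match segmentation into known subword units.
    Non-final pieces carry the '@@' continuation marker, matching the
    '@@ ' convention that remove_bpe strips at evaluation time."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            cand = word[i:j] + ("@@" if j < len(word) else "")
            if cand in subwords:
                pieces.append(cand)
                i = j
                break
        else:
            return None  # no known subword covers this position
    return " ".join(pieces)

print(encode_word_level("lowest", word_vocab))          # -> "<unk>"
print(greedy_subword_segment("lowest", subword_vocab))  # -> "low@@ est"
```

In practice the training data is usually tokenized up front with a tool such as subword-nmt or SentencePiece, so the vocab builder only ever sees subword units.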