graph4ai/graph4nlp

BPE for the NMT example

smith-co opened this issue · 4 comments

Questions and Help

I see that in the given example BPE is removed during evaluation:

 pred_collect.extend(self.remove_bpe(pred_str))

The remove_bpe code:

    def remove_bpe(self, str_with_subword):
        # Strip the "@@ " continuation markers that BPE inserts between subword pieces.
        if isinstance(str_with_subword, list):
            return [self.remove_bpe(ss) for ss in str_with_subword]
        symbol = "@@ "
        return str_with_subword.replace(symbol, "").strip()
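
For illustration, this is roughly what remove_bpe does to a BPE-tokenized prediction (the sentence below is a made-up example, not from the dataset):

    pred_str = "the un@@ predict@@ able weather"
    # Removing the "@@ " markers re-joins the subword pieces into whole words.
    print(pred_str.replace("@@ ", "").strip())  # -> "the unpredictable weather"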

But where are you applying BPE during vocab building?

Yes. The raw data are tokenized by BPE.

@AlanSwift so for the given example the raw data is expected to already be tokenized by BPE.

In other words, the current example expects the dataset to be tokenized by BPE. What would happen if the dataset is not tokenized by BPE?

@AlanSwift can you please help me with this query?

The current NMT example uses BPE since BPE is good practice and is widely used in the NMT community. Technically it is fine to use whole words as your vocabulary in your own application. However, your vocabulary will be large if you do, and performance may be limited by the OOV (out-of-vocabulary) problem.
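
If your dataset is not already BPE-tokenized, a common approach is to pre-process it offline before building the vocab, for example with the subword-nmt package. The following is only a minimal sketch under that assumption (file names like train.en and codes.bpe are hypothetical), not something provided by this repository:

    # Minimal sketch: learn and apply BPE with subword-nmt (pip install subword-nmt).
    # train.en is assumed to be a raw, whitespace-tokenized file, one sentence per line.
    from subword_nmt.learn_bpe import learn_bpe
    from subword_nmt.apply_bpe import BPE

    # 1. Learn a BPE merge table from the raw training text.
    with open("train.en") as infile, open("codes.bpe", "w") as outfile:
        learn_bpe(infile, outfile, num_symbols=10000)

    # 2. Apply the learned merges; subword pieces are marked with the "@@ " separator
    #    that remove_bpe strips out again after decoding.
    with open("codes.bpe") as codes_file:
        bpe = BPE(codes_file)

    with open("train.en") as fin, open("train.bpe.en", "w") as fout:
        for line in fin:
            fout.write(bpe.process_line(line))

With the data pre-segmented this way, the vocabulary built from train.bpe.en stays small, and rare words are split into known subwords instead of becoming OOV tokens.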