graph4ai/graph4nlp

BPE for the NMT example

smith-co opened this issue · 4 comments

Questions and Help

I see that in the given example BPE is removed during evaluation:

 pred_collect.extend(self.remove_bpe(pred_str))

The remove_bpe code:

    def remove_bpe(self, str_with_subword):
        # Strip the "@@ " continuation markers that BPE inserts between subword pieces.
        if isinstance(str_with_subword, list):
            return [self.remove_bpe(ss) for ss in str_with_subword]
        symbol = "@@ "
        return str_with_subword.replace(symbol, "").strip()
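
For illustration, this is roughly what remove_bpe does to a BPE-tokenized prediction (the sentence below is a made-up example, not from the dataset):

    pred_str = "the un@@ predict@@ able weather"
    # Removing the "@@ " markers re-joins the subword pieces into whole words.
    print(pred_str.replace("@@ ", "").strip())  # -> "the unpredictable weather"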

But where are you applying BPE during vocab building?

Yes. The raw data are tokenized by BPE.

@AlanSwift so for the given example the raw data is expected to already be tokenized by BPE.

In other words, the current example expects the dataset to be tokenized by BPE. What would happen if the dataset is not tokenized by BPE?

@AlanSwift can you please help me with this query?

The current NMT example uses BPE since BPE is good practice and is widely used in the NMT community. Technically it is fine to use whole words as your vocabulary in your own application. However, your vocabulary will be large if you do, and performance may be limited by the OOV (out-of-vocabulary) problem.
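
If your dataset is not already BPE-tokenized, a common approach is to pre-process it offline before building the vocab, for example with the subword-nmt package. The following is only a minimal sketch under that assumption (file names like train.en and codes.bpe are hypothetical), not something provided by this repository:

    # Minimal sketch: learn and apply BPE with subword-nmt (pip install subword-nmt).
    # train.en is assumed to be a raw, whitespace-tokenized file, one sentence per line.
    from subword_nmt.learn_bpe import learn_bpe
    from subword_nmt.apply_bpe import BPE

    # 1. Learn a BPE merge table from the raw training text.
    with open("train.en") as infile, open("codes.bpe", "w") as outfile:
        learn_bpe(infile, outfile, num_symbols=10000)

    # 2. Apply the learned merges; subword pieces are marked with the "@@ " separator
    #    that remove_bpe strips out again after decoding.
    with open("codes.bpe") as codes_file:
        bpe = BPE(codes_file)

    with open("train.en") as fin, open("train.bpe.en", "w") as fout:
        for line in fin:
            fout.write(bpe.process_line(line))

With the data pre-segmented this way, the vocabulary built from train.bpe.en stays small, and rare words are split into known subwords instead of becoming OOV tokens.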