glample/fastBPE

question about applybpe

teslacool opened this issue · 4 comments

Hi,

I have a question: why do we need to provide a vocab when applying BPE to the valid/test data sets?

This is not necessary, but it is recommended (especially in the cross-lingual setting, and when you do not share the lookup tables), because otherwise you may end up with unknown words, or with words that are in the vocabulary but whose embeddings are never trained.

For instance, let's say that the word "Obama" is in your English training corpus but never in your French training corpus, and that you train a French->English MT system.

If you learn BPE codes on the concatenated English + French corpora, the codes will include the merges needed to construct "Obama", which means that if you run applybpe on a sentence that contains this word, it will come out as the single token "Obama".

Now, let's say that you have "Obama" in your French test set (but not in your French training set). The French encoder will then try to encode "Obama", but this word is not in your French training vocabulary, so it will be an unknown word. If you use different lookup tables / vocabularies for English and French, this word will not exist at all in your French lookup table, so it will fail. If you share the vocabulary (as in the code), you will have an embedding for "Obama" in your common lookup table (because it is there from English), but this embedding will never be trained because you never saw it in French. So it will fail silently on these words. This is why it is important to specify the vocabulary when you apply the BPE codes: in the above example, "Obama" will then be mapped to something like "Oba@@ ma", where both "Oba@@" and "ma" have a trained embedding in the French lookup table.
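
To make the "Obama" example concrete, here is a minimal Python sketch of the idea. It is not fastBPE's actual implementation: the merge list, the French vocabulary, the `</w>` end-of-word marker and the function names (`bpe_symbols`, `surface`, `restrict`, `apply_bpe`) are all hypothetical and chosen only for illustration. The assumption is that a token missing from the vocabulary gets split back by undoing the merge that created it, until every piece is in the vocabulary or cannot be split further.

```python
END = "</w>"  # assumed end-of-word marker, used only inside this sketch


def bpe_symbols(word, merges):
    """Apply the merges greedily, best-ranked pair first (toy BPE)."""
    symbols = list(word[:-1]) + [word[-1] + END]
    rank = {pair: i for i, pair in enumerate(merges)}
    while len(symbols) > 1:
        candidates = [
            (rank[pair], i)
            for i, pair in enumerate(zip(symbols, symbols[1:]))
            if pair in rank
        ]
        if not candidates:
            break
        _, i = min(candidates)
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
    return symbols


def surface(symbol):
    """Output form: word-final symbols lose END, the others get '@@'."""
    return symbol[:-len(END)] if symbol.endswith(END) else symbol + "@@"


def restrict(symbol, merges, vocab):
    """Undo merges until every piece is in vocab (or is unsplittable)."""
    if surface(symbol) in vocab:
        return [symbol]
    undo = {a + b: (a, b) for a, b in merges}
    if symbol not in undo:
        return [symbol]  # single character: nothing left to undo
    left, right = undo[symbol]
    return restrict(left, merges, vocab) + restrict(right, merges, vocab)


def apply_bpe(word, merges, vocab=None):
    symbols = bpe_symbols(word, merges)
    if vocab is not None:
        symbols = [s for sym in symbols for s in restrict(sym, merges, vocab)]
    return " ".join(surface(s) for s in symbols)


# Hypothetical codes learned on English + French: they can build "Obama".
merges = [("O", "b"), ("Ob", "a"), ("m", "a" + END), ("Oba", "ma" + END)]
# Hypothetical tokens that were actually seen in the French training data.
french_vocab = {"Oba@@", "ma"}

print(apply_bpe("Obama", merges))                # Obama
print(apply_bpe("Obama", merges, french_vocab))  # Oba@@ ma
```

Without the vocabulary, the French test sentence contains the token "Obama", which was never seen in French; with it, the word falls back to pieces that were.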

It's a bit tricky, tell me if it's not clear.

Thanks for your detailed reply.

From your explanation, I think this is necessary when we use the same BPE codes to preprocess our train/valid/test data but different vocabularies / lookup tables. If we also use the same lookup table (e.g., shared embeddings in NMT systems), it is not necessary. And if we used different BPE codes in the previous step, it is also not necessary. Am I correct?

If you have different BPE codes for each language, it is less necessary, but there can still be problematic examples (even if those are very rare and probably wouldn't affect performance in any way). For safety I would recommend always using the vocab option.
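
One way to see whether this matters for a given dataset is to look for tokens in the BPE-encoded valid/test data that never occur in the BPE-encoded training data of the same language. The sketch below is only an illustration on in-memory strings: the helper names (`token_counts`, `untrained_tokens`) and the toy sentences are hypothetical, and it is not part of fastBPE.

```python
from collections import Counter


def token_counts(bpe_lines):
    """Count whitespace-separated tokens in BPE-encoded lines."""
    return Counter(tok for line in bpe_lines for tok in line.split())


def untrained_tokens(train_lines, test_lines):
    """Tokens in the test data that never occur in the training data."""
    seen = token_counts(train_lines)
    return sorted(
        {tok for line in test_lines for tok in line.split() if tok not in seen}
    )


# Toy French data, encoded without a vocabulary restriction.
train_fr = ["le chat dort"]
test_fr = ["Obama dort"]

print(untrained_tokens(train_fr, test_fr))  # ['Obama']
```

An empty list means every test token has been seen during training; anything else is a token whose embedding would be missing or untrained on that side.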

Thanks, I get it.