k2-fsa/icefall

Multilingual model

AlexandderGorodetski opened this issue · 1 comment

Hello guys,

I have to train a multilingual model using my in-house data. I have 10K hours for Lang1 and 5K hours for Lang2.

I wanted to ask you about the BPE algorithm. Since Lang1 has twice as much data, I guess I have to duplicate the Lang2 text, so that it appears twice in the BPE training text and the number of tokens from Lang2 is approximately the same as from Lang1.

And of course I will increase the number of BPE tokens from 500 to 1000.
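
To make the idea concrete, here is a rough sketch of the balancing step I have in mind. It is plain Python, not icefall code: the names `repeat_factor`, `count_words`, and `balance_corpora` are made up for illustration, and counting whitespace-separated words is only an assumed proxy for how much text each language contributes.

```python
from typing import List


def repeat_factor(large_count: int, small_count: int) -> int:
    """Number of copies of the smaller corpus that roughly matches the larger one."""
    if small_count <= 0:
        raise ValueError("the smaller corpus must contain at least one word")
    return max(1, round(large_count / small_count))


def count_words(lines: List[str]) -> int:
    """Count whitespace-separated words (an assumed proxy for corpus size)."""
    return sum(len(line.split()) for line in lines)


def balance_corpora(lang1: List[str], lang2: List[str]) -> List[str]:
    """Repeat the smaller corpus so both languages contribute similar amounts of text."""
    n1, n2 = count_words(lang1), count_words(lang2)
    if n1 >= n2:
        return lang1 + lang2 * repeat_factor(n1, n2)
    return lang1 * repeat_factor(n2, n1) + lang2


if __name__ == "__main__":
    # Toy stand-ins for the transcripts: Lang1 has twice as many words as Lang2.
    lang1_text = ["one two three four"] * 10
    lang2_text = ["uno dos"] * 10
    balanced = balance_corpora(lang1_text, lang2_text)
    print(len(balanced), "lines in the balanced BPE training text")
```

The balanced list would then be written to the text file used for BPE training, with the number of tokens set to 1000 instead of 500.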

Is all this correct?

Thanks a lot,
AlexG.