Multi Lingual model

Question

AlexandderGorodetski opened this issue 5 months ago · 1 comments

AlexandderGorodetski commented 5 months ago

Hello guys,

I need to train a multilingual model using my in-house data. I have 10K hours for Lang1 and 5K hours for Lang2.

I wanted to ask you about the BPE algorithm. Because Lang1 has twice as much data, I guess I have to include the Lang2 text twice, so that the number of tokens from Lang2 is approximately the same as from Lang1.

And of course I will increase the number of tokens from 500 to 1000.

Is all this correct?

Thanks a lot,
AlexG.

Answer 1 · 2024-05-02T12:43:27.000Z

Hi Alex, I'm not sure about the duplication part, but I feel it won't be necessary to duplicate the Lang2 text to match the number of lines of Lang1. Best, Jin
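
To make the duplication step under discussion concrete, here is a minimal sketch in plain Python of how the text for BPE training could be balanced. It is only an illustration: the function names `upsample_factor` and `balance_for_bpe` are hypothetical, and counting whitespace-separated words is an assumption that will not suit languages written without spaces. It does not settle whether the balancing is needed.

```python
def upsample_factor(n_large: int, n_small: int) -> int:
    """Whole number of copies of the smaller corpus that roughly matches the larger one."""
    if n_small <= 0:
        raise ValueError("the smaller corpus is empty")
    return max(1, round(n_large / n_small))


def count_words(lines) -> int:
    """Assumption: size is measured in whitespace-separated words."""
    return sum(len(line.split()) for line in lines)


def balance_for_bpe(lang1_lines, lang2_lines):
    """Return one list of lines in which the smaller corpus is repeated."""
    lang1_lines, lang2_lines = list(lang1_lines), list(lang2_lines)
    n1, n2 = count_words(lang1_lines), count_words(lang2_lines)
    if n1 >= n2:
        return lang1_lines + lang2_lines * upsample_factor(n1, n2)
    return lang1_lines * upsample_factor(n2, n1) + lang2_lines


if __name__ == "__main__":
    # Toy data standing in for the two transcripts (hypothetical).
    lang1 = ["a b c d", "e f g h"]   # 8 words
    lang2 = ["x y", "z w"]           # 4 words
    combined = balance_for_bpe(lang1, lang2)
    print(len(combined), "lines in the balanced BPE training text")
```

The combined lines would then be written to the text file that the BPE trainer reads. This touches only the text used to learn the BPE vocabulary, not the audio used to train the model itself.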
