Multilingual model
AlexandderGorodetski opened this issue · 1 comment
AlexandderGorodetski commented
Hello guys,
I need to train a multilingual model on my in-house data. I have 10K hours for Lang1 and 5K hours for Lang2.
I have a question about the BPE algorithm. Since Lang1 has twice as much data, I assume I should include the Lang2 text twice when training BPE, so that the number of tokens from Lang2 is approximately the same as from Lang1.
I would also increase the number of BPE tokens (the vocabulary size) from 500 to 1000.
Is this correct?
Thanks a lot,
AlexG.
JinZr commented
Hi Alex,
I'm not sure about the duplication part, but I feel like it won't be necessary to duplicate the Lang2 text to match the number of lines in Lang1.
best
jin
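
For reference, the balancing idea from the question could be sketched roughly as below. This is only an illustration, not a recommendation: as noted above, it is unclear whether duplication is needed at all. The function names `repeat_factor` and `balanced_corpus` are hypothetical, and character count is used as a rough stand-in for token count, which is an assumption.

```python
def repeat_factor(major_lines, minor_lines):
    """How many times to repeat the smaller corpus so that its total
    character count roughly matches the larger one (assumption:
    characters are a reasonable proxy for BPE tokens)."""
    major_chars = sum(len(line) for line in major_lines)
    minor_chars = sum(len(line) for line in minor_lines)
    if minor_chars == 0:
        raise ValueError("the smaller corpus is empty")
    # Never drop the smaller corpus entirely: repeat at least once.
    return max(1, round(major_chars / minor_chars))


def balanced_corpus(lang1_lines, lang2_lines):
    """Return Lang1 text plus Lang2 text repeated to roughly match it.
    Assumes Lang2 is the smaller corpus, as in the question."""
    factor = repeat_factor(lang1_lines, lang2_lines)
    return list(lang1_lines) + list(lang2_lines) * factor


# Toy data standing in for the real transcripts.
lang1 = ["aaaa", "bbbb"]   # 8 characters
lang2 = ["cc", "dd"]       # 4 characters
combined = balanced_corpus(lang1, lang2)
```

The combined lines would then be written to a text file and passed to whatever BPE trainer is in use. If that trainer happens to be SentencePiece (an assumption, the thread does not say), the vocabulary size of 1000 from the question would go into its `vocab_size` option with `model_type="bpe"`.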
In reply to AlexandderGorodetski's message of May 2, 2024, 16:55.