about language model
dopc opened this issue · 4 comments
dopc commented
Hey, thanks for the great work and for sharing it.
I want to ask about https://github.com/RashadGarayev/TRSpeech-to-text#pre-trained-turkish-model.
- Which corpus did you use?
- As far as I can see, it is a 2-gram model. Is there a 3- or 4-gram model that you can share?
Looking forward to your answer.
Thanks
RashadGarayev commented
Open-source dataset: https://commonvoice.mozilla.org/en/datasets (select the Turkish language)
dopc commented
Thanks for your answer.
But I was asking about the textual language model and its corpus.
For the language model, I used kenlm:
lmplz -o 2 < vocabulary > text.arpa
build_binary text.arpa lm.binary
In this command, I am asking about vocabulary and text.arpa or lm.binary.
Many thanks.
RashadGarayev commented
You can scrape texts from the Internet. You need a minimum of 10 thousand sentences.
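As a rough illustration of that step, here is a minimal sketch of turning raw scraped text into a one-sentence-per-line corpus file of the kind lmplz reads on standard input. The function names, the sentence-splitting regex, and the cleaning rules are all assumptions for illustration, not the pipeline used for the pre-trained model, and the normalization may need adjusting to match the alphabet the acoustic model expects.

```python
import re

# Hypothetical cleaning rules: split on sentence-final punctuation,
# then strip everything that is not a letter, digit, apostrophe, or space.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")
NON_TEXT = re.compile(r"[^\w\s']", re.UNICODE)


def turkish_lower(text):
    # str.lower() maps "I" to "i", but Turkish needs "I" -> "ı" and "İ" -> "i".
    return text.replace("I", "ı").replace("İ", "i").lower()


def build_corpus(raw_text):
    """Split raw text into normalized sentences, dropping duplicates."""
    seen = set()
    sentences = []
    for chunk in SENTENCE_END.split(raw_text):
        cleaned = NON_TEXT.sub(" ", turkish_lower(chunk))
        cleaned = " ".join(cleaned.split())  # collapse repeated whitespace
        if cleaned and cleaned not in seen:
            seen.add(cleaned)
            sentences.append(cleaned)
    return sentences


def write_corpus(sentences, path):
    """Write one sentence per line, UTF-8 encoded."""
    with open(path, "w", encoding="utf-8") as f:
        for sentence in sentences:
            f.write(sentence + "\n")
```

The resulting file could then take the place of vocabulary in the lmplz command, with -o 3 or -o 4 instead of -o 2 for a higher-order model. On a very small corpus lmplz may fail to estimate discounts, in which case its --discount_fallback option can help.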
dopc commented
okay, thanks so much!