RashadGarayev/TRSpeech-to-text

about language model

dopc opened this issue · 4 comments

dopc commented

Hey, thanks for the great work and for sharing it.

I want to ask about https://github.com/RashadGarayev/TRSpeech-to-text#pre-trained-turkish-model.

  • Which corpus did you use?
  • As far as I can see, it is a 2-gram model. Is there a 3- or 4-gram model that you could share?

Looking forward to your answer.
Thanks

Open-source dataset: https://commonvoice.mozilla.org/en/datasets (select the Turkish language).

dopc commented

Thanks for your answer.
But I was asking about the textual language model and its corpus.

For the language model, I used KenLM:

lmplz -o 2 < vocabulary > text.arpa
build_binary text.arpa lm.binary
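A minimal sketch of how the two KenLM steps in the reply above could be scripted for a higher order, such as the 3-gram asked about earlier. It assumes the `lmplz` and `build_binary` executables are on `PATH` and that the input file holds one sentence per line; the helper names `kenlm_commands` and `build_lm` are hypothetical and not part of this repository.

```python
import shutil
import subprocess
from typing import List, Tuple


def kenlm_commands(order: int, arpa_path: str, binary_path: str) -> Tuple[List[str], List[str]]:
    """Return the argument lists for the two KenLM steps (hypothetical helper)."""
    # lmplz reads the corpus on stdin and writes the ARPA file to stdout.
    lmplz_cmd = ["lmplz", "-o", str(order)]
    # build_binary converts the ARPA file into KenLM's binary format.
    build_cmd = ["build_binary", arpa_path, binary_path]
    return lmplz_cmd, build_cmd


def build_lm(corpus_path: str, order: int = 3,
             arpa_path: str = "text.arpa", binary_path: str = "lm.binary") -> None:
    """Run both steps; raises if the KenLM tools are not installed."""
    lmplz_cmd, build_cmd = kenlm_commands(order, arpa_path, binary_path)
    for tool in (lmplz_cmd[0], build_cmd[0]):
        if shutil.which(tool) is None:
            raise RuntimeError(f"{tool} not found on PATH; build KenLM first")
    # Equivalent to: lmplz -o ORDER < corpus > text.arpa
    with open(corpus_path, "rb") as src, open(arpa_path, "wb") as dst:
        subprocess.run(lmplz_cmd, stdin=src, stdout=dst, check=True)
    # Equivalent to: build_binary text.arpa lm.binary
    subprocess.run(build_cmd, check=True)
```

Calling `build_lm("vocabulary", order=3)` would, under these assumptions, produce a 3-gram `text.arpa` and `lm.binary` from the same input file used in the original command.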

In this command, vocabulary and text.arpa or lm.binary?

Many thanks.

You can scrape text from the Internet. You need at least 10 thousand sentences.
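As a rough illustration of the advice above, scraped text could be turned into a one-sentence-per-line file before it is passed to `lmplz`. This is only a sketch under stated assumptions: the cleaning rules (lowercasing with Turkish dotted and dotless i, dropping punctuation and digits) are not confirmed by the author, and `turkish_lower`, `to_corpus_lines`, and `MIN_SENTENCES` are hypothetical names.

```python
import re
from typing import List

MIN_SENTENCES = 10_000  # rule of thumb taken from the reply above


def turkish_lower(text: str) -> str:
    """Lowercase with Turkish casing: 'I' becomes 'ı' and 'İ' becomes 'i'."""
    # Map both capitals explicitly, because str.lower() alone would turn
    # 'I' into 'i' and 'İ' into 'i' plus a combining dot.
    return text.replace("İ", "i").replace("I", "ı").lower()


def to_corpus_lines(raw_text: str) -> List[str]:
    """Split raw text into cleaned sentences, one per list entry."""
    lines = []
    # Naive sentence split on end punctuation followed by whitespace, or on newlines.
    for sentence in re.split(r"[.!?]+\s+|\n+", raw_text):
        cleaned = turkish_lower(sentence)
        cleaned = re.sub(r"[^\w\s']", " ", cleaned)  # drop punctuation, keep apostrophes
        cleaned = re.sub(r"\d+", " ", cleaned)       # drop digits
        cleaned = " ".join(cleaned.split())          # collapse whitespace
        if cleaned:
            lines.append(cleaned)
    return lines


def has_enough_sentences(lines: List[str]) -> bool:
    """Check the corpus against the suggested minimum size."""
    return len(lines) >= MIN_SENTENCES
```

Writing the returned lines to a file with `"\n".join(lines)` would give an input in the one-sentence-per-line layout that `lmplz` reads from standard input.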

dopc commented

okay, thanks so much!