This repo will contain a list of useful resources for Mongolian NLP. Feel free to contribute.
DATASET
LJSpeech-like male voice TTS dataset created from the Mongolian Bible
- used in tugstugi/pytorch-dc-tts
- use dl_and_preprop_dataset.py to download the audio files
DATASET
Eduge news classification dataset provided by Bolorsoft LLC
- used to train the Eduge.mn production news classifier
- 75K news articles in 9 categories: урлаг соёл (arts and culture), эдийн засаг (economy), эрүүл мэнд (health), хууль (law), улс төр (politics), спорт (sports), технологи (technology), боловсрол (education) and байгал орчин (environment)
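For training a classifier on this dataset, the category names usually have to be mapped to integer ids first. The following is a minimal sketch of such a mapping, using the nine category names listed above. The id order and the helper name `encode_label` are arbitrary choices for illustration, not something defined by the dataset.

```python
# Category names copied from the list above; the integer ids are an arbitrary choice.
EDUGE_CATEGORIES = [
    "урлаг соёл", "эдийн засаг", "эрүүл мэнд", "хууль", "улс төр",
    "спорт", "технологи", "боловсрол", "байгал орчин",
]

LABEL_TO_ID = {name: i for i, name in enumerate(EDUGE_CATEGORIES)}
ID_TO_LABEL = {i: name for name, i in LABEL_TO_ID.items()}


def encode_label(label):
    """Collapse extra whitespace and map a category name to its integer id."""
    return LABEL_TO_ID[" ".join(label.split())]
```

An unknown category name raises `KeyError`, which makes labeling mistakes visible early.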
DATASET
11-11.mn government agency complaint dataset
- 80K entries in 5 categories: санал хүсэлт (suggestion), гомдол (complaint), шүүмжлэл (criticism), талархал (gratitude) and өргөдөл (petition)
DATASET
online news corpus
- 700 million words
- opendata.burtgel.gov.mn
DATASET
220K Mongolian personal names
DATASET
90K Mongolian clan/family names
DATASET
192K Mongolian company names
DATASET
250 most frequent Mongolian words from Mongolian news, books and Wikipedia articles (670M words in total, 2M unique words)
- these words can also be used as stop words
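Using the frequent-word list as stop words could look roughly like the sketch below. The three entries in `STOP_WORDS` are hypothetical placeholders, not taken from the actual list; in practice the set would be loaded from the published 250-word file. The tokenizer is a simple regex split and is also an assumption.

```python
import re

# Hypothetical placeholder entries; load the real 250 words from the published list.
STOP_WORDS = {"нь", "энэ", "юм"}


def remove_stop_words(text, stop_words=STOP_WORDS):
    """Lowercase the text, split it into word tokens, and drop stop words."""
    tokens = re.findall(r"\w+", text.lower())
    return [t for t in tokens if t not in stop_words]
```

For example, `remove_stop_words("Энэ нь шинэ ном юм.")` keeps only the tokens that are not in the set.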
PYTORCH
tugstugi/pytorch-dc-tts
DEMO
Colab online demo
DATASET
LJSpeech-like male voice dataset created from the Mongolian Bible
TF
tugstugi/Tacotron-2, a fork of Rayhane-mamah/Tacotron-2 adapted for the Mongolian Bible dataset
DEMO
Colab online demo
DEMO
Speaker adaptation Colab online demo for the former Mongolian president Elbegdorj. The Tacotron model trained on the 5-hour Mongolian Bible dataset was fine-tuned on a 10-minute dataset created from a speech by Elbegdorj.
DEMO
HMM TTS online demo of the Mongolian National University
- 1x male and 2x female voices
DEMO
Another TTS online demo (possibly HMM-based) from “Мон Спийч Ай Ти” ХХК
- 1x male and 1x female voice
DEMO
Tacotron2 TTS online demo of Ikon.MN
- 1x female voice (35h)
DEMO
HMM-based TTS online demo of Inner Mongolia University
- 1x female voice
PRODUCT
NVDA/HTS screen reader developed by the Innovation Development Center for the blind
- 1x female voice (Mongolian National University voice)
MODEL
5-gram binary LM generated by KenLM on an uncleaned 670M-word corpus
- it can be used with mozilla/DeepSpeech:
  ./generate_trie alphabet.txt mn_5gram.binary trie
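Outside of DeepSpeech, a KenLM model can also be used to score sentences. KenLM reports log10 probabilities, so a per-token perplexity can be derived as in the sketch below. The assumption here is that the score comes from the `kenlm` Python bindings, e.g. `kenlm.Model("mn_5gram.binary").score(sentence)`, which include the end-of-sentence token by default. The helper name `perplexity` is illustrative.

```python
def perplexity(log10_prob, sentence):
    """Convert a sentence-level log10 probability into per-token perplexity.

    log10_prob is assumed to be the total log10 probability of the
    whitespace-tokenized sentence, including the end-of-sentence token.
    """
    n_tokens = len(sentence.split()) + 1  # +1 for the end-of-sentence token
    return 10.0 ** (-log10_prob / n_tokens)
```

Lower perplexity means the sentence is closer to what the language model saw in its training corpus.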
TF/PYTORCH
tugstugi/mongolian-bert, pretrained Mongolian BERT models
- trained by tugstugi, enod and sharavsambuu
- nabar sponsored 5x TPUs.
PYTORCH
tugstugi/mongolian-speech-recognition
- single voice demo
DEMO
Cyrillic to Mongolian script converter demo of Inner Mongolia University
DEMO
Mongolian script OCR demo of Inner Mongolia University
PYTORCH
tugstugi/bichig2cyrillic, a converter from Mongolian script to Cyrillic and back
PYTORCH
Mongolian script OCR (to be released)
PYTORCH
tugstugi/forced_aligner, a Mongolian forced alignment tool using Rayhane-mamah/Tacotron-2 and readbeyond/aeneas
DEMO
Colab online demo