/burvec

Word Embeddings for Low Resource Languages: The Case of Buryat

Learning Word Embeddings for Low Resource Languages: The Case of Buryat

Word vector representations have been studied extensively on large text datasets. However, only a few studies analyze semantic representations for low-resource languages, particularly when only a small corpus is available. In most cases, low-resource languages lack traditional natural language processing tools such as lemmatizers and stemmers. In this study, we introduce a methodology for building word embeddings for low-resource languages. The methodology consists of defining accurate preprocessing steps, applying a language-independent stemmer, and introducing techniques for building word vector representations. In addition, we propose a simple word embedding evaluation scheme that can be easily adapted to any language. Using this methodology, we trained word embeddings for the Buryat language. We have made the source code and the resulting word embeddings publicly available to promote further research.
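To make the preprocessing and counting steps concrete, here is a minimal sketch of a Unicode-aware tokenizer followed by window-based co-occurrence counting, the kind of statistic that count-based methods such as GloVe and SVD start from. It is not the preprocessing used in the paper: the regular expression, the function names `tokenize` and `cooccurrence_counts`, and the toy sentences are all assumptions made for illustration.

```python
import re
from collections import Counter

# Assumption: a token is a run of letters in any script (digits and
# punctuation are dropped). The paper's actual rules may differ.
TOKEN_RE = re.compile(r"[^\W\d_]+")


def tokenize(text):
    """Lowercase the text and return its letter-only tokens."""
    return TOKEN_RE.findall(text.lower())


def cooccurrence_counts(sentences, window=2):
    """Count (word, context) pairs within a symmetric window per sentence."""
    counts = Counter()
    for sentence in sentences:
        tokens = tokenize(sentence)
        for i, word in enumerate(tokens):
            start = max(0, i - window)
            stop = min(len(tokens), i + window + 1)
            for j in range(start, stop):
                if j != i:
                    counts[(word, tokens[j])] += 1
    return counts


# Toy input only; a real corpus would be read from files.
toy_sentences = ["Буряад хэлэн, 2018!", "a b c d"]
counts = cooccurrence_counts(toy_sentences, window=2)
for pair, n in sorted(counts.items()):
    print(pair, n)
```

A larger `window` treats more distant words as context, so the counts become denser.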

Buryat Language Embeddings:

|     | 2 | 5 | 10 |
| --- | --- | --- | --- |
| 50  | CBOW, SG, GloVe, SVD | CBOW, SG, GloVe, SVD | CBOW, SG, GloVe, SVD |
| 100 | CBOW, SG, GloVe, SVD | CBOW, SG, GloVe, SVD | CBOW, SG, GloVe, SVD |
| 500 | CBOW, SG, GloVe, SVD | CBOW, SG, GloVe, SVD | CBOW, SG, GloVe, SVD |

Erzya Language Embeddings:

|     | 2 | 5 | 10 |
| --- | --- | --- | --- |
| 50  | CBOW, SG, GloVe, SVD | CBOW, SG, GloVe, SVD | CBOW, SG, GloVe, SVD |
| 100 | CBOW, SG, GloVe, SVD | CBOW, SG, GloVe, SVD | CBOW, SG, GloVe, SVD |
| 500 | CBOW, SG, GloVe, SVD | CBOW, SG, GloVe, SVD | CBOW, SG, GloVe, SVD |

Komi Language Embeddings:

|     | 2 | 5 | 10 |
| --- | --- | --- | --- |
| 50  | CBOW, SG, GloVe, SVD | CBOW, SG, GloVe, SVD | CBOW, SG, GloVe, SVD |
| 100 | CBOW, SG, GloVe, SVD | CBOW, SG, GloVe, SVD | CBOW, SG, GloVe, SVD |
| 500 | CBOW, SG, GloVe, SVD | CBOW, SG, GloVe, SVD | CBOW, SG, GloVe, SVD |

Files for evaluation: bxr (Buryat), myv (Erzya), kv (Komi)
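A quick way to sanity-check a trained embedding is to look at nearest neighbours by cosine similarity. The sketch below assumes the vectors are stored in the word2vec text format (one word per line followed by its components, with an optional `<vocab_size> <dim>` header). That format is an assumption, as the storage format of the embedding and evaluation files is not stated here, and this is not the evaluation scheme proposed in the paper. The helper names and the toy vectors are hypothetical.

```python
import io
import math


def load_text_vectors(handle):
    """Parse word2vec-style text lines into a {word: [floats]} dict."""
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        # Skip blank lines and the optional "<vocab_size> <dim>" header.
        # (Assumes vectors have at least two dimensions.)
        if len(parts) <= 2:
            continue
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest(vectors, word, k=3):
    """Return the k words most similar to `word`, best first."""
    query = vectors[word]
    scored = [(other, cosine(query, vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy data standing in for a downloaded embedding file.
toy_file = io.StringIO("3 3\nw1 1.0 0.0 0.0\nw2 0.9 0.1 0.0\nw3 0.0 1.0 0.0\n")
vectors = load_text_vectors(toy_file)
print(nearest(vectors, "w1", k=2))
```

With a real file, `toy_file` would be replaced by `open(path, encoding="utf-8")`, where `path` points to one of the downloaded embeddings.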

Contact

For any questions, please contact vaskoncv@gmail.com

Cite

@inproceedings{konovalov2018learning,
  title={Learning word embeddings for low resource languages: the case of Buryat},
  author={Konovalov, VP and Tumunbayarova, ZB},
  booktitle={Komp'juternaja Lingvistika i Intellektual'nye Tehnologii},
  pages={331--341},
  year={2018}
}