awesome-baltic-nlp

A collection of Natural Language Processing resources for the languages of the Baltic states (Latvian, Lithuanian and Estonian).

Table of Contents

General

Latvian Language

Datasets

Tools and models

Word-Embeddings

  • FastText pre-trained word vectors (1): bin, text. The word vectors were trained on Common Crawl and Wikipedia using fastText. See the documentation at fasttext.cc.
  • FastText pre-trained word vectors (2): bin+text, text. The word vectors were trained on Wikipedia using fastText. See the documentation at fasttext.cc.
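
The text-format files can be read without any extra library. The following is a minimal sketch, assuming the usual fastText `.vec` layout: a header line with the vocabulary size and the dimension, then one word followed by its values per line. The names `read_vec` and `cosine` and the toy vectors are illustrative and are not part of fastText.

```python
import io
import math
from typing import Dict, Iterable, List, Optional


def read_vec(lines: Iterable[str], limit: Optional[int] = None) -> Dict[str, List[float]]:
    """Parse fastText text-format vectors from an iterable of lines."""
    it = iter(lines)
    # Header line: "<vocabulary size> <dimension>"
    dim = int(next(it).split()[1])
    vectors: Dict[str, List[float]] = {}
    for line in it:
        if limit is not None and len(vectors) >= limit:
            break  # stop early; the real files are large
        parts = line.rstrip().split(" ")
        word, values = parts[0], parts[1:]
        if len(values) != dim:
            continue  # skip rows that do not match the declared dimension
        vectors[word] = [float(v) for v in values]
    return vectors


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Toy data in the same layout; the values are made up, not real embeddings.
sample = io.StringIO("3 2\nrīga 1.0 0.0\nliepāja 0.8 0.6\njūra 0.0 1.0\n")
vectors = read_vec(sample)
print(sorted(vectors))
print(cosine(vectors["rīga"], vectors["liepāja"]))
```

For a downloaded file, pass an open file handle instead of the toy sample, for example `read_vec(open(path, encoding="utf-8"), limit=50000)`, where `path` points to the unpacked text file. The `limit` argument keeps memory use manageable on a laptop.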

Relevant organisations working on NLP research, projects and products

  • AI Lab at the University of Latvia

Notable researchers developing Latvian NLP / NLU tools, datasets and more

  • Dr. Comp. Sc. Inguna Skadiņa -- Publications -- CV

^ back to top ^

Lithuanian Language

Datasets

Tools and models

Part-of-Speech tagging and dependency parsing

  • spaCy Lithuanian multi-task CNN, trained on the UD Lithuanian ALKSNIS treebank and the TokenMill.lt news corpus. It assigns context-specific token vectors, POS tags, dependency parses and named entities. Three model variants and the label scheme are described in the documentation.
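
A short sketch of how such a pipeline is typically used from Python. It assumes the model is installed under the spaCy package name `lt_core_news_sm` (for example via `python -m spacy download lt_core_news_sm`). The helper names `token_rows`, `entity_rows` and `demo` are illustrative only.

```python
from typing import Iterable, List, Tuple


def token_rows(doc: Iterable) -> List[Tuple[str, str, str, str]]:
    """Return (text, POS tag, dependency label, head text) for each token."""
    return [(t.text, t.pos_, t.dep_, t.head.text) for t in doc]


def entity_rows(doc) -> List[Tuple[str, str]]:
    """Return (text, label) for each named entity found in the document."""
    return [(e.text, e.label_) for e in doc.ents]


def demo(text: str = "Vilnius yra Lietuvos sostinė.") -> None:
    """Run the pipeline on one sentence; requires spaCy and the model package."""
    import spacy  # imported here so the helpers above work without spaCy installed

    # Assumption: the small Lithuanian model is installed under this name.
    nlp = spacy.load("lt_core_news_sm")
    doc = nlp(text)
    for row in token_rows(doc):
        print(row)
    print(entity_rows(doc))
```

Calling `demo()` prints one row per token followed by the detected entities. The exact tags and labels depend on the installed model version, so none are shown here.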

Word-Embeddings

  • FastText pre-trained word vectors (1): bin, text. The word vectors were trained on Common Crawl and Wikipedia using fastText. See the documentation at fasttext.cc.
  • FastText pre-trained word vectors (2): bin+text, text. The word vectors were trained on Wikipedia using fastText. See the documentation at fasttext.cc.
  • Also available for the Samogitian language: bin+text, text
  • Polyglot Lithuanian word embeddings (scroll down in the table): polyglot embeddings

Other

  • Rasa NLU COVID model: an open-source model for building an AI assistant that helps disseminate information about COVID-19, how to stay safe, and where to seek help.

^ back to top ^

Estonian Language

Word-Embeddings

  • FastText pre-trained word vectors (1): bin, text. The word vectors were trained on Common Crawl and Wikipedia using fastText. See the documentation at fasttext.cc.
  • FastText pre-trained word vectors (2): bin+text, text. The word vectors were trained on Wikipedia using fastText. See the documentation at fasttext.cc.
  • Polyglot Estonian word embeddings (scroll down in the table): polyglot embeddings

^ back to top ^