THIS IS A WORK IN PROGRESS
Re-implementation of the system described in this paper, a POS tagger designed for low-resource languages. The goal is to make it available and usable for anyone facing low-resource issues in NLP.
MSETagger
POS tagging for low-resource languages, using specialized MorphoSyntactic Embeddings
Dependencies
The tagger is built on top of yaset for the Bi-LSTM tagger part and mimick to compute embeddings of out-of-vocabulary forms (OOVs).
Requirements
- SBT
- a way to create virtual environments for Python
- some data for your favorite language
Installation
- clone the repo
- create two virtual environments, one for Python 2 and one for Python 3 (sorry, yaset is in Python 3 and mimick in Python 2...), and run pip install -r requirements(2|3).txt in each
- compile the Scala code with sbt compile
Usage
All options are set in the application.conf file, which includes:
- env section
  - path to the two python interpreters
- tokenizer section
  - strategy is either a basic whitespace tokenizer or a call to an external command
  - if strategy is external, you need to specify a command to be called
- embeddings section
  - file is the path to/from which embeddings are saved
  - oovfile lists the forms for which embeddings are to be generated
  - ndim is the number of dimensions of the embeddings
  - ws is the window size
  - nboccmin is the minimum number of occurrences of a form to have its embeddings computed by MSE (otherwise it will be computed as an OOV)
- yaset.model is a zip file containing all the data created/needed by yaset
- corpora section defines the paths to the corpus files (to be read and/or written)
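
To make the tokenizer strategy and the nboccmin threshold more concrete, here is a minimal Python sketch. It is not the tagger's actual code. The function names tokenize and split_by_frequency are hypothetical, and two behaviours are assumptions: that an external command reads raw text on stdin and prints whitespace-separated tokens on stdout, and that a form seen exactly nboccmin times still counts as frequent.

```python
import shlex
import subprocess
from collections import Counter


def tokenize(text, strategy="basic", command=None):
    """Tokenize `text` with the basic or external strategy (hypothetical helper)."""
    if strategy == "basic":
        # Basic strategy: split on runs of whitespace.
        return text.split()
    if strategy == "external":
        if not command:
            raise ValueError("the external strategy needs a command")
        # Assumption: the command reads raw text on stdin and prints
        # whitespace-separated tokens on stdout.
        result = subprocess.run(
            shlex.split(command),
            input=text,
            capture_output=True,
            text=True,
            check=True,
        )
        return result.stdout.split()
    raise ValueError(f"unknown tokenizer strategy: {strategy!r}")


def split_by_frequency(tokens, nboccmin):
    """Separate forms seen at least nboccmin times from rarer ones.

    Assumption: the rarer forms are the ones handled as OOVs.
    """
    counts = Counter(tokens)
    frequent = sorted(form for form, n in counts.items() if n >= nboccmin)
    rare = sorted(form for form, n in counts.items() if n < nboccmin)
    return frequent, rare


tokens = tokenize("the cat saw the dog and the bird")
frequent, rare = split_by_frequency(tokens, nboccmin=2)
print(frequent)  # forms seen at least nboccmin times
print(rare)      # forms below the threshold, candidates for oovfile
```

With a higher nboccmin, more forms fall below the threshold and would be handled as OOVs, which is the trade-off this option controls.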