otomi-morph-segmenter

Automatic gloss generator for Toluca Otomí

Primary language: Jupyter Notebook. License: MIT.

Automatic gloss generator for the Otomí language

Dependencies

Dependency Manager

  • poetry

Python packages

  • scikit-learn
  • jupyter
  • jupyterlab
  • matplotlib
  • pandas
  • python-crfsuite

Installation

$ poetry install

Notebooks and examples

Experimental environments

Training pipelines are available inside the notebooks/ folder. Each notebook can be executed and reproduced cell by cell.

  • linearCRF: This setting uses all the available information. The features are listed in the first cell of the notebooks.
  • POSLess: This setting excludes the POS tags.
  • HMMLike: This setting uses the minimum information, i.e. the current letter and the immediately preceding one. We use this name because the configuration has access to information similar to that of an HMM, but uses CRFs to build the models.
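
As an illustration of the HMMLike setting, the sketch below builds one feature dictionary per letter using only the current letter and the one before it. The function names, the feature keys (`letter`, `prev_letter`, `BOS`) and the placeholder input are assumptions made for this example; the features actually used are the ones listed in the notebooks.

```python
def hmmlike_features(letters, i):
    """Features for position i: the current letter and the preceding one."""
    features = {
        "bias": 1.0,
        "letter": letters[i],
    }
    if i > 0:
        features["prev_letter"] = letters[i - 1]
    else:
        # No preceding letter: mark the beginning of the sequence instead.
        features["BOS"] = True
    return features


def sequence_to_features(letters):
    """One feature dictionary per letter, in order."""
    return [hmmlike_features(letters, i) for i in range(len(letters))]


# Placeholder input, not real corpus data.
feats = sequence_to_features("abc")
```

Dictionaries of this shape (string, boolean or numeric values) are a feature format that python-crfsuite accepts for each item of a sequence.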

Examples

Inside the notebooks/ folder there is a notebook with the suffix _ejemplos.ipynb for each experimental environment. Those notebooks are useful for seeing the pre-trained models in action.

Baseline: HMMLike

  • L1 = 0.0
  • L2 = 0.0
  • Max iterations = 50
  • model name: HMMLike_baseline_k_[1-3].crfsuite
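
A minimal sketch of how these baseline values might be passed to python-crfsuite. `c1`, `c2` and `max_iterations` are the library's parameter names for L1, L2 and the iteration limit; the helper names and the reading of `k` as a fold number between 1 and 3 are assumptions, and the notebooks remain the reference for the real training code.

```python
BASELINE_PARAMS = {
    "c1": 0.0,             # L1 regularization coefficient
    "c2": 0.0,             # L2 regularization coefficient
    "max_iterations": 50,  # upper bound on training iterations
}


def model_name(k):
    """File name for the model of fold k (assumed to range from 1 to 3)."""
    return f"HMMLike_baseline_k_{k}.crfsuite"


def train_baseline(X_train, y_train, k):
    """Train one CRF on (feature sequence, label sequence) pairs and save it."""
    # Imported here so the helpers above can be used without the library.
    import pycrfsuite

    trainer = pycrfsuite.Trainer(verbose=False)
    for xseq, yseq in zip(X_train, y_train):
        trainer.append(xseq, yseq)
    trainer.set_params(BASELINE_PARAMS)
    path = model_name(k)
    trainer.train(path)
    return path
```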

Preprocessing

Corpus cleaning

  • Remove duplicate lines
    • $ sort -u corpus > corpus_uniq
  • Show the duplicate lines
    • $ diff --color corpus_sort corpus_uniq
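
For use inside a notebook, the same cleanup can be approximated in Python. This is a sketch, not part of the repository: the function names are hypothetical, and Python orders lines by code point whereas `sort` follows the locale, so the ordering may differ.

```python
from collections import Counter


def unique_sorted(lines):
    """Rough equivalent of `sort -u`: sorted lines without duplicates."""
    return sorted(set(lines))


def duplicated(lines):
    """Map each line that occurs more than once to its number of occurrences."""
    counts = Counter(lines)
    return {line: n for line, n in counts.items() if n > 1}


# Placeholder lines, not real corpus data.
sample = ["b", "a", "b", "c"]
uniq = unique_sorted(sample)
dups = duplicated(sample)
```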

Conventions

Character substitutions

To work around encoding/decoding problems with python-crfsuite, we substitute the following Otomí characters:

  • u̱ -> μ
  • a̱ -> α
  • e̱ -> ε
  • i̱ -> ι
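
The substitution can be sketched as a pair of helpers. Two assumptions are made here: the underlined vowels are stored as a base letter followed by U+0331 (combining macron below), and the replacements are the Greek small letters mu, alpha, epsilon and iota. The names `encode` and `decode` are hypothetical.

```python
MACRON_BELOW = "\u0331"  # combining macron below

# Two-code-point Otomí vowel -> single Greek letter
SUBSTITUTIONS = {
    "u" + MACRON_BELOW: "\u03bc",  # Greek small letter mu
    "a" + MACRON_BELOW: "\u03b1",  # Greek small letter alpha
    "e" + MACRON_BELOW: "\u03b5",  # Greek small letter epsilon
    "i" + MACRON_BELOW: "\u03b9",  # Greek small letter iota
}


def encode(text):
    """Replace each underlined vowel with its single-character stand-in."""
    for source, target in SUBSTITUTIONS.items():
        text = text.replace(source, target)
    return text


def decode(text):
    """Restore the original Otomí characters."""
    for source, target in SUBSTITUTIONS.items():
        text = text.replace(target, source)
    return text
```

Plain `str.replace` is used rather than `str.translate` because each source character spans two code points. After encoding, every substituted vowel is a single character, which keeps letter-level features aligned with letter-level tags.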

Pipeline

Architecture

  1. Get the glossed corpus
  2. Text preprocessing
  3. Build the feature list for each letter in the sentences
  4. Split the data into training and test sets
  5. Train and build the models
  6. Generate tags and evaluate performance
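
Step 4 could look like the sketch below, which produces k non-overlapping test folds with the standard library only. It is not the repository's splitting code: the function name is hypothetical, the notebooks may rely on scikit-learn instead, and `k=3` is a guess based on the `k_[1-3]` pattern in the baseline model name.

```python
import random


def k_fold_splits(sequences, k=3, seed=0):
    """Yield (fold_number, train, test) with k disjoint test folds."""
    items = list(sequences)
    random.Random(seed).shuffle(items)  # reproducible shuffle
    folds = [items[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [s for j, fold in enumerate(folds) if j != i for s in fold]
        yield i + 1, train, test


# Placeholder items standing in for glossed sentences.
data = [f"sentence_{n}" for n in range(6)]
splits = list(k_fold_splits(data, k=3))
```

Each sentence lands in exactly one test fold, so every model is evaluated on data it did not see during training.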