
This is a survey of morpheme segmentation techniques, including two baselines (BertTokenizer, Morfessor 2.0) and two supervised neural models (LSTM, Transformer).


Morpheme Segmentation with LSTM and Transformers

Morpheme segmentation is the process of separating words into their fundamental units of meaning. For example:

  • foundationalism → found+ation+al+ism

This project is a reproduction of the 2nd- and 1st-place systems from the 2022 SIGMORPHON competition on word-level morpheme segmentation. These systems, from Ben Peters and Andre Martins, are DeepSPIN-2, a recurrent neural network (LSTM) model with character-level tokenization, and DeepSPIN-3, a transformer-based model that uses an entmax loss function and ULM (unigram language model) subword tokenization.
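
As a quick illustration of what the entmax loss provides (this is not code from this repository; it assumes PyTorch and the standalone entmax package are installed), the 1.5-entmax mapping produces sparse output distributions, so the model can assign exactly zero probability to implausible symbols:

# Illustration only: the sparse entmax-1.5 mapping underlying DeepSPIN-3's loss.
# Assumes `pip install torch entmax`; not code from this repository.
import torch
from entmax import entmax15

logits = torch.tensor([[2.0, 1.0, 0.1, -1.0]])
probs = entmax15(logits, dim=-1)
print(probs)  # unlike softmax, low-scoring classes can receive exactly zero mass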

Organization

This repository is organized as follows:

baseline/
baseline/bert       # simple BertTokenizer generator and evaluator
baseline/morfessor  # simple Morfessor 2.0 trainer, generator, and evaluator
deepspin            # flexible implementation of DeepSPIN-2 and DeepSPIN-3 with fairseq, including the LSTM architecture described above.
yoyodyne            # a basic implementation using yoyodyne - https://github.com/CUNY-CL/yoyodyne
lstm                # a work-in-progress LSTM architecture built with basic PyTorch (for academic purposes).
The baseline/ directory has two scripts for generating baseline segmentations: one uses a pretrained BertTokenizer (baseline/bert) and the other uses Morfessor 2.0 (baseline/morfessor), an unsupervised utility that is not pretrained.
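
As a rough conceptual sketch of the BertTokenizer baseline (the actual script in baseline/bert may differ; the model name and helper function below are illustrative), a pretrained WordPiece tokenizer splits a word into subword pieces, which can be rewritten in the @@-separated format used by the competition data:

# Sketch only: the repo's baseline/bert script may differ in details.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")  # illustrative model choice

def baseline_segment(word: str) -> str:
    pieces = tokenizer.tokenize(word)                              # e.g. ['foundation', '##alism']
    morphs = [p[2:] if p.startswith("##") else p for p in pieces]  # strip WordPiece continuation markers
    return " @@".join(morphs)                                      # e.g. 'foundation @@alism'

print(baseline_segment("foundationalism"))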

In the case of DeepSPIN-2 and DeepSPIN-3, the original implementations were written by Ben Peters, but the scripts in this repository streamline their usage and decouple tokenization from the model architecture. This makes it possible to explore a transformer architecture with character-level encoding, or an LSTM architecture with subword tokenization, and thus to determine whether subword tokenization is a crucial ingredient in DeepSPIN-3's high performance. Spoiler alert: it accounts for only 0.2% of the F-score.
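
For intuition, decoupling tokenization means the same word can be encoded either character by character (DeepSPIN-2 style) or with a learned ULM subword vocabulary (DeepSPIN-3 style). The sketch below is illustrative only; the file names and the use of the sentencepiece package are assumptions, not the repository's exact pipeline:

# Illustrative only: file names and the sentencepiece usage are assumptions.
def char_tokenize(word: str) -> str:
    # Character-level encoding (DeepSPIN-2 style): one symbol per character.
    return " ".join(word)

print(char_tokenize("tanításokért"))  # 't a n í t á s o k é r t'

# A ULM (unigram language model) subword vocabulary (DeepSPIN-3 style) could
# instead be trained with sentencepiece on the task's word list, e.g.:
#   import sentencepiece as spm
#   spm.SentencePieceTrainer.train(input="train.words", model_prefix="ulm",
#                                  vocab_size=1000, model_type="unigram")
#   sp = spm.SentencePieceProcessor(model_file="ulm.model")
#   print(sp.encode("tanításokért", out_type=str))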

The Data

Here is a sample of the training data for Hungarian:

tanításokért	tanít @@ás @@ok @@ért	110
Algériánál	Algéria @@nál	100
metélőhagymába	metélő @@hagyma @@ba	101
fülésztől	fül @@ész @@től	110

After training, the model is expected to receive just the first column (the unsegmented word) and separate it into morphemes, delimited by the @@ separator. The final column, which can also be used for training, contains 3 bits that represent the types of morphology involved (inflection, derivation, and compounding); it is currently unused.
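
Since the training files are plain tab-separated text, a loader is only a few lines. The sketch below is illustrative (the function name and return format are mine, not the repository's):

# Illustrative loader for the tab-separated training data shown above.
def read_examples(path: str):
    examples = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            word, segmentation, category = line.rstrip("\n").split("\t")
            morphemes = segmentation.split(" @@")    # e.g. ['tanít', 'ás', 'ok', 'ért']
            flags = tuple(int(b) for b in category)  # (inflection, derivation, compounding)
            examples.append((word, morphemes, flags))
    return examples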

Setup

  • Make sure to clone this repository with --recurse-submodules so that you also get the competition data.
  • After creating a virtual environment with python -m venv <name> or conda create ..., install the necessary Python libraries with pip install -r requirements.txt from the root of this repository.

Training

Please refer to the READMEs within each subdirectory for training and evaluation instructions for each individual architecture.