Morphological segmentation

Experimenting with supervised morphological segmentation as a seq2seq problem.

Currently two supervised models are supported: seq2seq and LSTM (baseline).

Setup

pip install -r requirements.txt
python setup.py install

Input data

The training scripts (train.py) take the training input either as their only positional argument or, if no positional argument is given, from standard input. Gzip files are supported. The training data must have one sample per line, with the input and output sequences separated by a TAB.

Example:

autót	autó t
ablakokat	ablak ok at
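
The format above can be parsed with a few lines of Python. This is only a minimal sketch of the input convention, not the code used in train.py; the helper names `open_input` and `read_samples` are hypothetical, and detecting gzip by the `.gz` extension is an assumption.

```python
import gzip
import io
import sys


def open_input(path=None):
    """Return a text stream: stdin if no path is given, gzip-aware otherwise.

    Assumption: gzip input is recognized by a ".gz" file extension.
    """
    if path is None:
        return sys.stdin
    if path.endswith(".gz"):
        return gzip.open(path, "rt", encoding="utf-8")
    return open(path, encoding="utf-8")


def read_samples(stream):
    """Yield (input, output) pairs from TAB-separated lines, one sample per line."""
    for line in stream:
        line = line.rstrip("\n")
        if not line:
            continue  # skip empty lines
        source, target = line.split("\t", 1)
        yield source, target


# The two example lines from above, read from an in-memory stream.
example = io.StringIO("autót\tautó t\nablakokat\tablak ok at\n")
samples = list(read_samples(example))
```

With a real file, `read_samples(open_input("training_data.gz"))` would yield the same kind of pairs; calling `target.split(" ")` on the output side gives the individual segments.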

The inference scripts (inference.py) also read from standard input and expect one sample per line.

seq2seq

The seq2seq source code is located in the morph_seg/seq2seq directory. It uses TensorFlow's legacy_seq2seq.

Training your own model

cat training_data | python morph_seg/seq2seq/train.py --save-test-output test_output --save-model model_directory --cell-size 64 --result-file results.tsv

This trains a seq2seq model. Arguments not given on the command line take the defaults listed in train.py:

argument	default	explanation
`save-test-output`	`None`	Save the model's output on the test set (randomly sampled)
`save-model`	`None`	Save the model and everything else needed for inference. This must be an existing directory.
`result-file`	`None`	Save the experiment's configuration and the result statistics.
`cell-type`	`LSTM`	Use LSTM or GRU cells.
`cell-size`	16	Number of LSTM/GRU cells to use.
`layers`	1	Number of layers.
`embedding-size`	20	Dimension of embedding.
`early-stopping-threshold`	0.001	Stop training when the validation loss changes by no more than this threshold for N steps, where N is `early-stopping-patience`.
`early-stopping-patience`	10	Number of steps (N) for which the validation loss must stay within the threshold before training stops.

Note that the first three arguments default to None, so nothing is written to file unless they are specified. They are independent of each other: any of them can be left out.
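
The interplay of the two early stopping arguments can be illustrated with a small check. This is a sketch of one plausible reading of the table (stop once the validation loss has changed by no more than the threshold for N consecutive steps), not the logic actually implemented in train.py; the function name `should_stop` is hypothetical.

```python
def should_stop(val_losses, threshold=0.001, patience=10):
    """Return True if the last `patience` step-to-step changes in the
    validation loss were all within `threshold`.

    Assumption: "change" means the absolute difference between
    consecutive validation losses.
    """
    if len(val_losses) <= patience:
        return False  # not enough history yet
    recent = val_losses[-(patience + 1):]
    return all(abs(b - a) <= threshold for a, b in zip(recent, recent[1:]))


# A loss that keeps improving by 0.1 per step should not trigger a stop,
# while a loss that has plateaued for more than 10 steps should.
improving = [2.0 - 0.1 * i for i in range(15)]
plateau = improving + [improving[-1]] * 10
```

Here `should_stop(improving)` is false and `should_stop(plateau)` is true with the default threshold and patience.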

Using your model for inference

train.py saves everything needed for inference to the directory specified by the save-model argument. Inference can be run like this:

cat test_data | python morph_seg/seq2seq/inference.py --model-dir your_saved_model

Note that samples longer than the maximum length in the training data will be trimmed from their beginning.
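
As an illustration of the trimming, here is a minimal sketch. It assumes that "trimmed from their beginning" means the leading characters are dropped and the end of the sample is kept; inference.py may behave differently. The function name `trim_to_max_length` is hypothetical.

```python
def trim_to_max_length(sample, max_length):
    """Drop leading characters so that at most `max_length` remain.

    Assumption: the end of the sample is kept and the beginning is cut.
    """
    if len(sample) <= max_length:
        return sample
    return sample[len(sample) - max_length:]


trimmed = trim_to_max_length("ablakokat", 5)
untouched = trim_to_max_length("autót", 10)
```

Under this assumption a 9-character sample cut to 5 characters keeps only its last 5 characters, while samples that already fit are returned unchanged.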

LSTM

The LSTM source code is located in the morph_seg/sequence_tagger directory. It uses Keras's LSTM and GRU modules, and its usage is essentially identical to that of the seq2seq model above.

Training your own model

cat training_data | python morph_seg/sequence_tagger/train.py --save-test-output test_output --save-model model_directory --cell-size 64 --result-file results.tsv

Using your model for inference

cat test_data | python morph_seg/sequence_tagger/inference.py --model-dir your_saved_model
