This repository contains code for preprocessing data, training a model, and inferring word segment boundaries in Thai text with a bi-directional recurrent neural network. The model achieves a precision of 99.04%, a recall of 99.31%, and an F1 score of 99.18%. Please see the blog post for a detailed description of the model.
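For readers unfamiliar with how such scores are derived for word segmentation, the following is a minimal sketch. It assumes that the metrics are computed per character on word-boundary labels (1 when a character starts a word, 0 otherwise); the README does not state the exact evaluation protocol, and the function name `boundary_scores` is hypothetical.

```python
def boundary_scores(gold, pred):
    """Return (precision, recall, f1) for binary word-boundary labels.

    Assumption: each label is 1 if the character starts a word, else 0.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")

    # Count true positives, false positives and false negatives on boundaries.
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1


# Toy example: one spurious boundary and one missed boundary.
gold = [1, 0, 0, 1, 0, 1]
pred = [1, 0, 1, 1, 0, 0]
precision, recall, f1 = boundary_scores(gold, pred)
```

In this toy case two of the three predicted boundaries are correct and two of the three true boundaries are found, so all three scores equal 2/3.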
- Python 3.4
- TensorFlow 1.4
- NumPy 1.13
- scikit-learn 0.18
- preprocess.py: Preprocess corpus for model training
- train.py: Train the Thai word segmentation model
- predict_example.py: Example usage of the model to segment Thai words
- saved_model: Pretrained model weights
- thainlplib/labeller.py: Methods for preprocessing the corpus
- thainlplib/model.py: Methods for training the model
Note that the InterBEST 2009 corpus is not included, but can be downloaded from the NECTEC website.
To try the prediction demo, run python3 predict_example.py.
To preprocess the data and train the model, put the data files under the data directory, then run python3 preprocess.py followed by python3 train.py.
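Preprocessing for this kind of model generally means turning segmented text into per-character training targets. The sketch below illustrates that idea under stated assumptions: the corpus marks word boundaries with a `|` delimiter, and each character is labelled 1 when it starts a word and 0 otherwise. The function names `label_boundaries` and `apply_boundaries` are hypothetical and are not taken from thainlplib/labeller.py.

```python
def label_boundaries(segmented, delimiter="|"):
    """Convert delimiter-separated text into (raw_text, labels).

    Assumption: labels[i] is 1 if raw_text[i] starts a word, else 0.
    """
    chars = []
    labels = []
    for word in segmented.split(delimiter):
        # Empty pieces (e.g. from a trailing delimiter) contribute nothing.
        for i, ch in enumerate(word):
            chars.append(ch)
            labels.append(1 if i == 0 else 0)
    return "".join(chars), labels


def apply_boundaries(text, labels):
    """Inverse step: split raw text into words using boundary labels."""
    words = []
    for ch, label in zip(text, labels):
        if label == 1 or not words:
            words.append(ch)      # start a new word
        else:
            words[-1] += ch       # extend the current word
    return words


# Three Thai words ("I eat rice") separated by the assumed delimiter.
text, labels = label_boundaries("ฉัน|กิน|ข้าว")
words = apply_boundaries(text, labels)
```

Here `labels` has one entry per Unicode code point, so Thai vowel and tone marks count as separate characters. A trained model would predict a sequence like `labels` for unsegmented input, and a step like `apply_boundaries` would recover the words.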
- Jussi Jousimo
- Natsuda Laokulrat
- Ben Carr
GPL 3.0
Copyright (c) Sertis Co., Ltd., 2017