The goal of this project is to train a model based on Bidirectional LSTM to separate chinese words in a sentence.
The dataset used for the training was the concatenation of four different datasets: AS (Traditional Chinese), CITYU (Traditional Chinese), MSR (Simplified Chinese) and PKU (Simplified Chinese).
The training was done using a Google Compute Engine instance running a Tesla K80 GPU.
- Generate dictionary
python preprocess.py [resources_path] [sentence_size]
- Train
python train.py [resources_path] [sentence_size]
- Predict
python train.py [input_path] [output_path] [resources_path]
- Score
python train.py [prediction_file] [gold_file]