ID-CNN-CWS

Source code and corpora for the paper "Iterated Dilated Convolutions for Chinese Word Segmentation".

It implements the following 4 models for CWS:

Dependencies

Both CPU and GPU are supported. GPU training is about 10 times faster than CPU training.

Run the following script to convert the corpora to TensorFlow datasets.

$ ./scripts/make.sh

To train a model and evaluate it on the test set, run:

$ ./scripts/run.sh $dataset $model

For example:

$ ./scripts/run.sh pku cnn

This trains a cnn model on the pku dataset, then evaluates it on the test set.

To enable the CRF layer, append --viterbi to the command, e.g.

$ ./scripts/run.sh pku cnn --viterbi
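To compare results with and without the CRF layer, the two commands above can be wrapped in a small helper. This is only a sketch: `run_pair` and the `DRY_RUN` variable are hypothetical names introduced here, not part of the repository's scripts. By default it only prints the commands it would run.

```shell
#!/bin/sh
# Sketch: run one dataset/model pair twice, without and with the CRF layer.
# run_pair and DRY_RUN are illustrative names, not provided by this repo.
# DRY_RUN defaults to 1 (print only); set DRY_RUN=0 to actually execute.

run_pair() {
    dataset=$1
    model=$2
    for flag in "" "--viterbi"; do
        # $flag is left unquoted on purpose so the empty case adds no argument.
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo ./scripts/run.sh "$dataset" "$model" $flag
        else
            ./scripts/run.sh "$dataset" "$model" $flag
        fi
    done
}

run_pair pku cnn
```

Run it from the repository root with `DRY_RUN=0` once the printed commands look right.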

Corpora are from SIGHAN05, converted to Simplified Chinese via HanLP. Note that the SIGHAN datasets should only be used for research purposes.
Model implementations are adapted from https://github.com/iesl/dilated-cnn-ner by Emma Strubell.