
The first-place solution (closed and semi-open tracks) in the NLPCC 2016 Chinese Weibo Segmentation shared task


nlpcc2016-chinese-weibo-segmentation

A Long Short-Term Memory (LSTM) based model for the NLPCC 2016 shared task on Chinese Weibo word segmentation. The model placed first in the closed and semi-open tracks. For more details, refer to our paper "Recurrent Neural Word Segmentation with Tag Inference".

Requirements

  • Theano
  • Lasagne

Notes

  • The original dataset for this task must be requested by filling out an Agreement Form, so only a few examples are provided here.
  • Once the original dataset is obtained, convert it from the space-separated format to the BMES tagging format (B = beginning of a word, M = middle, E = end, S = single-character word).
  • To get the unsupervised features, use the CistSegment scripts by Wu et al. (2014).
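
The conversion from space-separated text to BMES tags can be sketched roughly as follows. This is a minimal illustration, not the preprocessing code used for the paper: the function names `words_to_bmes` and `line_to_bmes` are hypothetical, and the exact input file layout expected by ccl_nlpcc.py is not specified here, so adapt the output format as needed.

```python
def words_to_bmes(words):
    """Return (character, tag) pairs for a list of segmented words.

    Tags: B = first character of a multi-character word, M = inner
    character, E = last character, S = single-character word.
    """
    pairs = []
    for word in words:
        if len(word) == 1:
            pairs.append((word, "S"))
        else:
            pairs.append((word[0], "B"))
            pairs.extend((ch, "M") for ch in word[1:-1])
            pairs.append((word[-1], "E"))
    return pairs


def line_to_bmes(line):
    """Convert one whitespace-segmented sentence to (character, tag) pairs.

    Assumption: words in the input line are separated by whitespace,
    as in the space-separated format mentioned in the notes above.
    """
    return words_to_bmes(line.split())
```

For example, `line_to_bmes("我 喜欢 自然语言处理")` tags 我 as S, 喜 and 欢 as B and E, and the six characters of 自然语言处理 as B, M, M, M, M, E.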

Run

  • Prepare the data (see Notes).
  • Run the script ccl_nlpcc.py.
  • Run the script chunkvec_inference.py.

Citation

If you use this software, please cite our paper.

@InProceedings{zhou2016lstmtaginference,
  Title                    = {Recurrent neural word segmentation with tag inference},
  Author                   = {Qianrong Zhou and Long Ma and Zhenyu Zheng and Yue Wang and Xiaojie Wang},
  Booktitle                = {Proceedings of The Fifth Conference on Natural Language Processing and Chinese Computing \& The Twenty Fourth International Conference on Computer Processing of Oriental Languages},
  Year                     = {2016}
}