This is the implementation of Joint Chinese Word Segmentation and Part-of-speech Tagging via Two-way Attentions of Auto-analyzed Knowledge at ACL2020.
You can e-mail Yuanhe Tian at yhtian@uw.edu or Guimin Chen at chenguimin@chuangxin.com if you have any questions.
If you use or extend our work, please cite our paper at ACL2020.
@inproceedings{tian-etal-2020-joint,
title = "Joint Chinese Word Segmentation and Part-of-speech Tagging via Two-way Attentions of Auto-analyzed Knowledge",
author = "Tian, Yuanhe and Song, Yan and Ao, Xiang and Xia, Fei and Quan, Xiaojun and Zhang, Tong and Wang, Yonggang",
booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics",
month = jul,
year = "2020",
address = "Online",
pages = "8286--8296",
}
Our code works with the following environment.
python=3.6
pytorch=1.1
To run Stanford CoreNLP Toolkit, you need
Java 8
To run Berkeley Neural Parser, you need
tensorflow==1.13.1
benepar[cpu]
cython
Note that Berkeley Neural Parser does not support TensorFlow 2.0.
You can refer to their websites for more information.
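As a quick sanity check on the TensorFlow constraint above, here is a minimal sketch of a version test. The helper names `parse_version` and `tensorflow_ok` are hypothetical and not part of this repository; the only fact taken from this README is that Berkeley Neural Parser does not support TensorFlow 2.0.

```python
def parse_version(text):
    """Turn a version string such as '1.13.1' into a tuple of ints, e.g. (1, 13, 1).

    Non-numeric suffixes (for example 'rc0') are ignored.
    """
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if digits == "":
            break
        parts.append(int(digits))
    return tuple(parts)


def tensorflow_ok(version_text):
    """Return True if the given TensorFlow version is below 2.0."""
    version = parse_version(version_text)
    return len(version) > 0 and version[0] < 2
```

With TensorFlow installed, you could call `tensorflow_ok(tf.__version__)` after `import tensorflow as tf` before running the Berkeley Neural Parser.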
In our paper, we use BERT (paper) and ZEN (paper) as the encoder.
For BERT, please download pre-trained BERT-Base Chinese from Google or from HuggingFace. If you download it from Google, you need to convert the model from TensorFlow version to PyTorch version.
For ZEN, you can download the pre-trained model from here.
For TwASP, you can download the models we trained in our experiments from here.
Run run_sample.sh to train a model on the small sample data under the sample_data folder.
We use CTB5, CTB6, CTB7, CTB9, and Universal Dependencies 2.4 (UD) in our paper.
To obtain and pre-process the data, go to the data_preprocessing directory and run getdata.sh. This script will download and process the official data from UD. For CTB5 (LDC05T01), CTB6 (LDC07T36), CTB7 (LDC10T07), and CTB9 (LDC2016T13), you need to obtain the official data yourself and then put the raw data folder under the data_preprocessing directory.
The script will also download the Stanford CoreNLP Toolkit v3.9.2 (SCT) and Berkeley Neural Parser (BNP) to obtain the auto-analyzed syntactic knowledge. You can refer to their websites for more information.
All processed data will appear in the data directory, organized by dataset. Each dataset folder contains files with the same names as those under the sample_data directory.
You can find the command lines to train and test a model on a specific dataset with the part-of-speech (POS) knowledge from Stanford CoreNLP Toolkit v3.9.2 (SCT) in run.sh.
Here are some important parameters:
- --do_train: train the model
- --do_test: test the model
- --use_bert: use BERT as encoder
- --use_zen: use ZEN as encoder
- --bert_model: the directory of the pre-trained BERT/ZEN model
- --use_attention: use two-way attention
- --source: the toolkit to be used (stanford or berkeley)
- --feature_flag: use pos, chunk, or dep knowledge
- --model_name: the name of the model to save
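To show how the parameters above fit together, here is a minimal sketch that assembles a training argument list. The function `build_train_args` is hypothetical, and it assumes that value-taking flags are passed as `--flag value`. It only covers the flags listed above; run.sh remains the authoritative reference for the full command, including the script name and any further arguments.

```python
import shlex

# Allowed values as described in the parameter list above.
VALID_SOURCES = {"stanford", "berkeley"}
VALID_FEATURES = {"pos", "chunk", "dep"}
VALID_ENCODERS = {"bert", "zen"}


def build_train_args(bert_model, model_name, source="stanford",
                     feature_flag="pos", encoder="bert", use_attention=True):
    """Return a list of command-line arguments for a training run (illustrative only)."""
    if source not in VALID_SOURCES:
        raise ValueError("source must be one of %s" % sorted(VALID_SOURCES))
    if feature_flag not in VALID_FEATURES:
        raise ValueError("feature_flag must be one of %s" % sorted(VALID_FEATURES))
    if encoder not in VALID_ENCODERS:
        raise ValueError("encoder must be one of %s" % sorted(VALID_ENCODERS))

    args = ["--do_train"]
    # Exactly one encoder flag is emitted.
    args.append("--use_bert" if encoder == "bert" else "--use_zen")
    args += ["--bert_model", bert_model]
    if use_attention:
        args.append("--use_attention")
    args += ["--source", source,
             "--feature_flag", feature_flag,
             "--model_name", model_name]
    return args


def to_shell(args):
    """Quote each argument so the list can be pasted after a script name in a shell."""
    return " ".join(shlex.quote(a) for a in args)
```

For example, `to_shell(build_train_args("/path/to/bert", "demo"))` yields a string you could append to the training command taken from run.sh.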
run_sample.sh contains the command line to segment and tag the sentences in an input file (./sample_data/sentence.txt).
Here are some important parameters:
- --do_predict: segment and tag the sentences using a pre-trained TwASP model.
- --input_file: the file containing the sentences to be segmented and tagged. Each line contains one sentence; you can refer to a sample input file for the input format.
- --output_file: the path of the output file. Words are separated by a space; POS labels are attached to the resulting words by an underscore ("_").
- --eval_model: the pre-trained TwASP model to be used to segment and tag the sentences in the input file.
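The output format described above (space-separated words, each with its POS label attached by "_") can be read back with a few lines of Python. This is a minimal sketch; `parse_tagged_sentence` is a hypothetical helper, not part of this repository, and the example sentence is made up for illustration.

```python
def parse_tagged_sentence(line):
    """Split one output line into (word, POS label) pairs.

    The label is taken after the LAST underscore, so a word that itself
    contains "_" is still handled.
    """
    pairs = []
    for token in line.split():
        word, sep, tag = token.rpartition("_")
        if not sep:
            raise ValueError("token has no POS label: %r" % token)
        pairs.append((word, tag))
    return pairs


def words_only(line):
    """Return just the segmented words of one output line."""
    return [word for word, _ in parse_tagged_sentence(line)]


example = "我_PN 爱_VV 北京_NR"
print(parse_tagged_sentence(example))
# [('我', 'PN'), ('爱', 'VV'), ('北京', 'NR')]
```

Splitting on the last underscore is a design choice that keeps the label intact even if a token contains more than one underscore.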
To run a pre-trained TwASP model, you need to install SCT and BNP to obtain the auto-analyzed syntactic knowledge. See data_preprocessing for more information on downloading the two toolkits.
- Regular maintenance
You can leave comments in the Issues section if you want us to implement any functions.
You can check our updates at updates.md.