Code for ICASSP2023: Ancient Chinese Word Segmentation and Part-of-Speech Tagging Using Distant Supervision (Word Alignment)
We propose a novel data augmentation method for ancient Chinese word segmentation (WSG) and POS tagging that uses word alignment over a parallel corpus.
1、Get the raw unlabelled data:
(1) Download the data from https://github.com/NiuTrans/Classical-Modern;
(2) Rename the folder “双语数据” to “bitext” and place it under folder "ACDS";
(3) Get modern Chinese sentences ("src") and ancient Chinese sentences ("tgt"):
python make_data.py
(4) Shuffle the modern Chinese sentences and the ancient Chinese sentences to get "src_shuf" and "tgt_shuf":
python shuffle_data.py
(5) Use LTP (https://github.com/HIT-SCIR/ltp) to get the segmented and POS-tagged modern Chinese data ("src_shuf_seg" and "src_shuf_pos"), and split the ancient Chinese data into single characters ("tgt_shuf_seg"):
python make_segpos_data.py
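Steps (4) and (5) can be illustrated with a minimal sketch. It is not the code of shuffle_data.py or make_segpos_data.py; the function names `shuffle_parallel` and `split_chars` and the `seed` parameter are hypothetical, and the sketch only assumes that the two files are line-aligned, so both sides must be shuffled with the same permutation.

```python
import random


def shuffle_parallel(src_lines, tgt_lines, seed=0):
    """Shuffle two line-aligned lists with one shared permutation."""
    if len(src_lines) != len(tgt_lines):
        raise ValueError("src and tgt must have the same number of lines")
    order = list(range(len(src_lines)))
    random.Random(seed).shuffle(order)  # same order is applied to both sides
    return [src_lines[i] for i in order], [tgt_lines[i] for i in order]


def split_chars(sentence):
    """Turn a sentence into space-separated single characters."""
    return " ".join(ch for ch in sentence if not ch.isspace())


# Toy parallel data (illustrative only, not taken from the corpus).
src = ["学习并且时常复习它", "有朋友从远方来"]
tgt = ["學而時習之", "有朋自遠方來"]

src_shuf, tgt_shuf = shuffle_parallel(src, tgt, seed=1)
tgt_shuf_seg = [split_chars(line) for line in tgt_shuf]
```

Using a single permutation keeps every modern sentence paired with its ancient counterpart, which the later alignment step depends on.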
2、Download the alignment tool giza:
(1) Download giza-py from https://github.com/sillsdev/giza-py and put it in "ACDS" as a folder named "giza";
(2) Configure giza-py as described in https://github.com/sillsdev/giza-py.
3、Get the alignment probabilities:
Compute the alignment probabilities between the segmented modern Chinese words and the single-character-split ancient Chinese sentences. The output is the file "alignment".
cd giza
python giza.py --source ../src_shuf_seg --target ../tgt_shuf_seg --alignments ../alignment --model ibm4 --m1 10 --mh 10 --m3 10 --m4 10 --include-probs
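The "alignment" file can then be read line by line. The sketch below assumes, without confirmation from this repository, that each line holds whitespace-separated tokens of the form `i-j:p` (source index, target index, probability), i.e. Pharaoh-style pairs with an appended probability. Check the actual output of giza-py before relying on it; `parse_alignment_line` is a hypothetical helper name.

```python
def parse_alignment_line(line):
    """Parse tokens like '0-1:0.85' into (src_index, tgt_index, prob) triples.

    The 'i-j:p' token format is an assumption; prob is None when a token
    carries no ':p' suffix.
    """
    triples = []
    for token in line.split():
        pair, _, prob = token.partition(":")
        src_idx, tgt_idx = pair.split("-")
        triples.append((int(src_idx), int(tgt_idx), float(prob) if prob else None))
    return triples


example = parse_alignment_line("0-0:0.9 0-1:0.75 1-2:0.5")
```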
Then get the labelled ancient Chinese data "tgt.shuf.seg_pos":
cd ..
python align-pos_tag_ltp.py
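The idea behind this labelling step can be sketched as follows. This is an illustrative simplification, not the logic of align-pos_tag_ltp.py: each ancient character takes its highest-probability aligned modern word, consecutive characters that map to the same modern word are merged into one ancient word, and that word inherits the modern word's POS tag. The function name `project_labels` and the placeholder tag "UNK" for unaligned characters are assumptions.

```python
def project_labels(src_pos, tgt_chars, links, unaligned_tag="UNK"):
    """Project POS tags from modern words onto ancient characters.

    src_pos:   POS tag of each modern word, in order.
    tgt_chars: ancient sentence as a list of single characters.
    links:     iterable of (src_index, tgt_index, prob) triples.
    Returns a list of (ancient_word, pos) pairs.
    """
    # For every ancient character keep only its most probable modern word.
    best = {}
    for s, t, p in links:
        if t not in best or p > best[t][1]:
            best[t] = (s, p)

    words = []
    prev_src = None
    for t, ch in enumerate(tgt_chars):
        s = best[t][0] if t in best else None
        if s is not None and s == prev_src:
            words[-1][0] += ch  # same modern word: extend the current ancient word
        else:
            tag = src_pos[s] if s is not None else unaligned_tag
            words.append([ch, tag])
        prev_src = s
    return [(w, tag) for w, tag in words]


# Toy example: two modern words (noun, adjective) aligned to three characters.
labelled = project_labels(
    src_pos=["n", "a"],
    tgt_chars=["邦", "國", "強"],
    links=[(0, 0, 0.8), (0, 1, 0.9), (1, 2, 0.7)],
)
```

In this sketch unaligned characters become single-character words with the placeholder tag; a real pipeline would likely filter or post-process such cases.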
After getting the augmented data "tgt.shuf.seg_pos" from word alignment, you can train SIKU-RoBERTa (https://huggingface.co/SIKU-BERT/sikuroberta) by following this repository (https://github.com/farlit/The-first-ancient-Chinese-word-segmentation-and-part-of-speech-tagging-code-and-analysis).
The augmented training dataset is "tgt.shuf.seg_pos"; the annotated training dataset is "zuozhuan_train_utf8"; the validation datasets are "EvaHan_testa_gold" and "EvaHan_testb_gold".
The annotated dataset and validation datasets are from https://github.com/RacheleSprugnoli/LT4HALA/tree/master/2022/data_and_doc.
Our code is based on https://github.com/hemingkx/WordSeg; we thank the authors for their work.