This repository stores the code for the paper *Chinese Grammatical Error Correction Based on Knowledge Distillation*, built on HuggingFace🤗. [arXiv]
Pretrain: Wikipedia
Finetune: NLPCC2018 Dataset | HSK Dynamic Composition Corpus (HSK 动态作文语料)
- Tool: sentencepiece
- Preprocess: Run `./pretrain/data/get_corpus.py`, which collects the bilingual data used to build the training, dev, and test sets. The data are saved in `corpus.src` and `corpus.trg`, with one sentence per line.
- Word segmentation model training: Run `./pretrain/tokenizer/tokenize.py`, which calls the `sentencepiece.SentencePieceTrainer.Train()` method to train our word segmentation model (see the sketch after this list). After training, `src.model`, `src.vocab`, `trg.model`, and `trg.vocab` are saved in `./pretrain/tokenizer`. The `.model` files are the word segmentation models we need, and the `.vocab` files are the vocabularies.
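For reference, here is a minimal sketch of that tokenizer-training step, assuming the file layout above; the vocabulary size and other options are illustrative and may differ from those actually used in `tokenize.py`:

```python
import sentencepiece as spm

# Train a source-side model on corpus.src (one sentence per line);
# this produces src.model and src.vocab in ./pretrain/tokenizer.
spm.SentencePieceTrainer.Train(
    input='./pretrain/data/corpus.src',
    model_prefix='./pretrain/tokenizer/src',
    vocab_size=32000,           # assumption: not specified in this README
    character_coverage=0.9995,  # common choice for Chinese text
)

# Load the trained model and segment a sentence.
sp = spm.SentencePieceProcessor(model_file='./pretrain/tokenizer/src.model')
print(sp.encode('今天天气很好', out_type=str))
```

The target-side model (`trg.model`, `trg.vocab`) is trained the same way on `corpus.trg`.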
We use the open-source transformer-pytorch code developed by Harvard.
This repo was tested on Python 3.6+ and PyTorch 1.5.1. The main requirements are:
- tqdm
- pytorch >= 1.5.1
- sacrebleu >= 1.4.14
- sentencepiece >= 0.1.94
To set up the environment quickly, run:
`pip install -r requirements.txt`
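A `requirements.txt` consistent with the list above would look roughly like this (the exact pins in the repo may differ):

```
tqdm
torch>=1.5.1
sacrebleu>=1.4.14
sentencepiece>=0.1.94
```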
Hyperparameters can be modified in `./pretrain/config.py`.
- This code supports multi-GPU training. You should modify the `device_id` list in `config.py` and `os.environ['CUDA_VISIBLE_DEVICES']` in `main.py` to use your own GPUs, as sketched below.
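A sketch of those two settings, assuming two GPUs; the surrounding code in `config.py` and `main.py` is not shown here:

```python
# In ./pretrain/config.py: GPUs used for data-parallel training.
device_id = [0, 1]

# In ./pretrain/main.py: make the same GPUs visible to PyTorch
# before any CUDA call.
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0,1'
```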
To start pre-training, please run:
`python ./pretrain/main.py`
The training log is saved in `./pretrain/experiment/train.log`, and the translation results for the test set are saved in `./pretrain/experiment/output.txt`.
To start fine-tuning and knowledge distillation, please run:
`python ./finetune/train.py`
`python ./pretrain/distillation.py`
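Conceptually, the distillation step trains a student model to match the teacher's output distribution. Below is a minimal sketch of the standard soft-label distillation loss; the function name, temperature, and weighting are illustrative and are not taken from `distillation.py`:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Mix the soft-target KL term with the usual cross-entropy on gold labels."""
    # Soften both distributions with the temperature.
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    # Scale the KL term by T^2 so gradient magnitudes stay comparable.
    kd = F.kl_div(soft_student, soft_teacher, reduction='batchmean') * temperature ** 2
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce
```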
@article{xia2022chinese,
title={Chinese grammatical error correction based on knowledge distillation},
author={Xia, Peng and Zhou, Yuechi and Zhang, Ziyan and Tang, Zecheng and Li, Juntao},
journal={arXiv preprint arXiv:2208.00351},
year={2022}
}
If you run into any other problems while working on your own project, feel free to open an issue or email me 😊~