/KD-CGEC

Code for Chinese grammatical error correction based on knowledge distillation

Primary LanguagePythonMIT LicenseMIT

KD-CGEC

arXiv arXiv Github stars Maintenance PR's Welcome

This repository stores the code for paper Chinese grammatical error correction based on knowledge distillation based on HuggingFace🤗. [arXiv]

Data

Pretrain: Wikipedia

Finetune: NLPCC2018 Dataset | HSK 动态作文语料

Data Process

Word Segmentation

  • Toolsentencepiece
  • Preprocess:Run ./pretrain/data/get_corpus.py , in which we will get bilingual data to build our training, dev and testing set. The data will be saved in corpus.src and corpus.trg, with one sentence in each line.
  • Word segmentation model training: Run ./pretrain/tokenizer/tokenize.py, in which the sentencepiece.SentencePieceTrainer.Train() mothed is called to train our word segmentation model. After training, src.modelsrc.vocabtrg.model and trg.vocab will be saved in ./pretrain/tokenizer. .model is the word segmentation model we need and .vocab is the vocabulary.

Transformer Model

We use the open-source code transformer-pytorch developmented by Harvard.

Requirements

This repo was tested on Python 3.6+ and PyTorch 1.5.1. The main requirements are:

  • tqdm
  • pytorch >= 1.5.1
  • sacrebleu >= 1.4.14
  • sentencepiece >= 0.1.94

To get the environment settled quickly, run:

pip install -r requirements.txt

Usage

Pretrain

Hyperparameters can be modified in ./pretrain/config.py.

  • This code supports MultiGPU training. You should modify device_id list in config.py and os.environ['CUDA_VISIBLE_DEVICES'] in main.py to use your own GPUs.

To start training, please run:

python ./pretrain/main.py

The training log is saved in ./pretrain/experiment/train.log, and the translation results of testing dataset is in ./pretrain/experiment/output.txt.

Finetune

To start training, please run:

python ./finetune/train.py

Distillation

python ./pretrain/distillation.py

Reference

@article{xia2022chinese,
  title={Chinese grammatical error correction based on knowledge distillation},
  author={Xia, Peng and Zhou, Yuechi and Zhang, Ziyan and Tang, Zecheng and Li, Juntao},
  journal={arXiv preprint arXiv:2208.00351},
  year={2022}
}

For any other problems you meet when doing your own project, welcome to issuing or sending emails to me 😊~