Team Kakao&Brain's Grammatical Error Correction System for the ACL 2019 BEA Shared Task

Primary language: Python. License: Apache License 2.0.

helo_word

A Neural Grammatical Error Correction System Built on Better Pre-training and Sequential Transfer Learning

Code accompanying Team Kakao&Brain's submission to the ACL 2019 BEA Workshop Shared Task.
(helo_word is our informal team name.)

Paper: https://arxiv.org/abs/1907.01256

ACL Anthology: https://www.aclweb.org/anthology/papers/W/W19/W19-4423/

Authors

YJ Choe^, Jiyeon Ham^, Kyubyong Park^, Yeoil Yoon^

^Equal contribution.

Installation

Requires Python 3.

# apt-get packages (required for hunspell & pattern)
apt-get update
apt-get install libhunspell-dev libmysqlclient-dev -y

# pip packages
pip install --upgrade pip
pip install --upgrade -r requirements.txt
python -m spacy download en

# custom fairseq (fork of 0.6.1 with gec modifications)
pip install --editable fairseq

# errant
git clone https://github.com/chrisjbryant/errant

# pattern3 (see https://www.clips.uantwerpen.be/pages/pattern for any installation issues)
pip install pattern3
python -c "import site; print(site.getsitepackages())"
# ['PATH_TO_SITE_PACKAGES']
cp tree.py PATH_TO_SITE_PACKAGES/pattern3/text/
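
The last two steps (finding the site-packages directory, then copying tree.py into pattern3/text/) can also be scripted. The following is a minimal sketch, not part of the repository: the function names `pattern3_text_dir` and `install_tree_patch` are hypothetical, and it assumes pattern3 is installed in one of the directories reported by `site.getsitepackages()`.

```python
import shutil
import site
from pathlib import Path


def pattern3_text_dir(site_packages_dirs):
    """Return the first existing pattern3/text directory, or None."""
    for sp in site_packages_dirs:
        candidate = Path(sp) / "pattern3" / "text"
        if candidate.is_dir():
            return candidate
    return None


def install_tree_patch(tree_py="tree.py", site_packages_dirs=None):
    """Copy the patched tree.py into the installed pattern3 package.

    Hypothetical helper: mirrors the manual `cp` step above.
    """
    if site_packages_dirs is None:
        site_packages_dirs = site.getsitepackages()
    target = pattern3_text_dir(site_packages_dirs)
    if target is None:
        raise FileNotFoundError("pattern3/text not found; is pattern3 installed?")
    # shutil.copy returns the path of the newly created file.
    return Path(shutil.copy(tree_py, target))


# Usage, run from the repository root after `pip install pattern3`:
#     print(install_tree_patch())
```

Searching every reported directory avoids pasting PATH_TO_SITE_PACKAGES by hand when the interpreter reports more than one location.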

Download & Preprocess Data

python preprocess.py

Restricted Track

  • Prepare data for the restricted track
    python prepare.py --track 1
  • Pre-train
    • Training automatically creates a checkpoint directory.
    • Replace {ckpt_dir} with its path.
    • Replace {ngpu} with the number of GPUs used for training.
    python train.py --track 1 --train-mode pretrain --model base --ngpu {ngpu}
    python evaluate.py --track 1 --subset valid --ckpt-dir {ckpt_dir}
  • Train
    • Evaluation automatically creates an output directory.
    • Replace {prev_model_output_dir} with the output directory of the previous model.
    python train.py --track 1 --train-mode train --model base --ngpu {ngpu} \
        --lr 1e-4 --max-epoch 40 --reset --prev-model-output-dir {prev_model_output_dir}
    python evaluate.py --track 1 --subset valid --ckpt-dir {ckpt_dir}
  • Fine-tune
    • Replace {prev_model_output_fpath} with the path to the best validation report.
    • error_type_control.py will then give you a list of error types to remove.
    • Replace {remove_error_type_lst} with that list.
    python train.py --track 1 --train-mode finetune --model base --ngpu {ngpu} \
        --lr 5e-5 --max-epoch 80 --reset --prev-model-output-dir {prev_model_output_dir}
    python evaluate.py --track 1 --subset valid --ckpt-dir {ckpt_dir}
    python error_type_control.py --report {prev_model_output_fpath} \
        --max_error_types 10 --n_simulations 1000000
    python evaluate.py --track 1 --subset test --ckpt-fpath {ckpt_fpath} \
        --remove-unk-edits --remove-error-type-lst {remove_error_type_lst} \
        --apply-rerank --preserve-spell --max-edits 7 
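
The flags --max_error_types and --n_simulations suggest a random search over subsets of error types to drop. The sketch below illustrates that idea only; it is not the code in error_type_control.py. It assumes the validation report provides per-type TP/FP/FN counts, that dropping a type turns its true positives into false negatives, and that subsets are scored by overall F0.5. The function names and the toy counts are made up for illustration.

```python
import random


def f_beta(tp, fp, fn, beta=0.5):
    """F-beta from raw counts; returns 0.0 when undefined."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    if p + r == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * p * r / (b2 * p + r)


def score_after_removal(counts, removed):
    """Overall F0.5 if every edit of a type in `removed` is dropped.

    A dropped type contributes no TP or FP; its former TPs become FNs.
    """
    tp = fp = fn = 0
    for etype, (t, f, n) in counts.items():
        if etype in removed:
            fn += t + n
        else:
            tp += t
            fp += f
            fn += n
    return f_beta(tp, fp, fn)


def random_search(counts, max_error_types=10, n_simulations=1000, seed=0):
    """Sample random subsets of error types and keep the best-scoring one."""
    rng = random.Random(seed)
    types = sorted(counts)
    best_removed = frozenset()
    best_score = score_after_removal(counts, best_removed)
    for _ in range(n_simulations):
        k = rng.randint(1, min(max_error_types, len(types)))
        removed = frozenset(rng.sample(types, k))
        score = score_after_removal(counts, removed)
        if score > best_score:
            best_removed, best_score = removed, score
    return sorted(best_removed), best_score


# Made-up counts for illustration: error type -> (TP, FP, FN)
toy_counts = {
    "R:PREP": (50, 20, 40),
    "R:SPELL": (30, 5, 10),
    "M:PUNCT": (5, 40, 20),
    "U:DET": (25, 10, 15),
}
baseline = score_after_removal(toy_counts, frozenset())
removed, score = random_search(toy_counts, max_error_types=2, n_simulations=200)
print(removed, round(baseline, 4), round(score, 4))
```

In the toy counts, M:PUNCT has far more false positives than true positives, so dropping it raises precision more than it costs in recall. This is the kind of trade-off such a search is meant to find.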

Low Resource Track

  • Prepare data for the low resource track
    python prepare.py --track 3
  • Pre-train
    python train.py --track 3 --train-mode pretrain --model base --ngpu {ngpu}
    python evaluate.py --track 3 --subset valid --ckpt-dir {ckpt_dir}
  • Train
    python train.py --track 3 --train-mode finetune --model base --ngpu {ngpu} \
        --max-epoch 40 --prev-model-output-dir {prev_model_output_dir} 
    python evaluate.py --track 3 --subset valid --ckpt-dir {ckpt_dir}
    python evaluate.py --track 3 --subset test --ckpt-fpath {ckpt_fpath} \
        --remove-unk-edits --remove-error-type-lst {remove_error_type_lst} \
        --apply-rerank --preserve-spell --max-edits 7 
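
The train.py invocations in both tracks share one shape and differ only in a few flags. As a rough sketch (not part of the repository), the stages can be written down as data and rendered into command lines. `STAGES`, `train_command`, and the placeholder values `PREV_OUT` and `ngpu=2` are hypothetical; only the flags themselves come from the commands above.

```python
import shlex

# Stage settings copied from the commands above; nothing else is assumed
# about train.py. The dictionary keys are names used only in this sketch.
STAGES = {
    1: [  # restricted track
        {"train_mode": "pretrain"},
        {"train_mode": "train", "lr": "1e-4", "max_epoch": 40, "reset": True},
        {"train_mode": "finetune", "lr": "5e-5", "max_epoch": 80, "reset": True},
    ],
    3: [  # low resource track
        {"train_mode": "pretrain"},
        {"train_mode": "finetune", "max_epoch": 40},
    ],
}


def train_command(track, stage, ngpu, prev_model_output_dir=None):
    """Build the train.py argument list for one stage."""
    cmd = ["python", "train.py", "--track", str(track),
           "--train-mode", stage["train_mode"],
           "--model", "base", "--ngpu", str(ngpu)]
    if "lr" in stage:
        cmd += ["--lr", stage["lr"]]
    if "max_epoch" in stage:
        cmd += ["--max-epoch", str(stage["max_epoch"])]
    if stage.get("reset"):
        cmd.append("--reset")
    if stage["train_mode"] != "pretrain":
        # Every stage after pre-training starts from a previous model.
        if prev_model_output_dir is None:
            raise ValueError("later stages need the previous model output directory")
        cmd += ["--prev-model-output-dir", prev_model_output_dir]
    return cmd


# Print the low resource track commands with placeholder values.
first = train_command(3, STAGES[3][0], ngpu=2)
second = train_command(3, STAGES[3][1], ngpu=2, prev_model_output_dir="PREV_OUT")
for cmd in (first, second):
    print(shlex.join(cmd))  # shlex.join requires Python 3.8+
```

Each evaluate.py call still has to run between stages, since its output directory is what the next stage receives as --prev-model-output-dir.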

A Note on fairseq

We ran our Transformer models with fairseq-0.6.1, to which we made several modifications, including our own implementation of the copy-augmented Transformer model. All of our modifications are listed in fairseq/MODIFICATIONS.md.
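
For readers unfamiliar with copy-augmented models, the general idea is that the output distribution mixes a generation distribution over the vocabulary with a copy distribution over the source tokens. The sketch below shows only that mixing step in plain Python. It is not the implementation in our fairseq fork; the function name, the toy numbers, and the fixed `alpha` are assumptions made for illustration (in a real model the balancing factor is typically predicted at each decoding step).

```python
def copy_augmented_distribution(gen_probs, copy_attention, source_ids, alpha):
    """Mix a generation distribution with a copy distribution.

    gen_probs: probabilities over the vocabulary (sums to 1)
    copy_attention: attention weights over source positions (sums to 1)
    source_ids: vocabulary id of the token at each source position
    alpha: balancing factor in [0, 1]; the weight given to copying
    """
    mixed = [(1.0 - alpha) * p for p in gen_probs]
    # Scatter the copy weights onto the vocabulary ids of the source tokens.
    for weight, token_id in zip(copy_attention, source_ids):
        mixed[token_id] += alpha * weight
    return mixed


# Toy example: a vocabulary of 5 ids and a source sentence with ids [3, 1, 3].
gen = [0.1, 0.2, 0.3, 0.3, 0.1]
attn = [0.5, 0.25, 0.25]
mixed = copy_augmented_distribution(gen, attn, [3, 1, 3], alpha=0.5)
print(mixed)
```

Because most tokens in a grammatical error correction output are unchanged from the input, giving the model a direct path for copying source tokens is a natural fit for the task.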

Citation

If you use our code for research, please cite our work as:

@inproceedings{choe-etal-2019-neural,
    title = "A Neural Grammatical Error Correction System Built On Better Pre-training and Sequential Transfer Learning",
    author = "Choe, Yo Joong  and
      Ham, Jiyeon  and
      Park, Kyubyong  and
      Yoon, Yeoil",
    booktitle = "Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications",
    month = aug,
    year = "2019",
    address = "Florence, Italy",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/W19-4423",
    pages = "213--227",
    abstract = "Grammatical error correction can be viewed as a low-resource sequence-to-sequence task, because publicly available parallel corpora are limited. To tackle this challenge, we first generate erroneous versions of large unannotated corpora using a realistic noising function. The resulting parallel corpora are subsequently used to pre-train Transformer models. Then, by sequentially applying transfer learning, we adapt these models to the domain and style of the test set. Combined with a context-aware neural spellchecker, our system achieves competitive results in both restricted and low resource tracks in ACL 2019 BEA Shared Task. We release all of our code and materials for reproducibility.",
}