The code for training and testing state-of-the-art models for grammatical error correction is from the official PyTorch implementation of the following paper:
> GECToR – Grammatical Error Correction: Tag, Not Rewrite
> Grammarly
> 15th Workshop on Innovative Use of NLP for Building Educational Applications (co-located with ACL 2020)
It is mainly based on AllenNLP and transformers.
The following command installs all necessary packages:
```
pip install -r requirements.txt
```
The project was tested using Python 3.7.
All the public GEC datasets used in the paper can be downloaded from here.
Synthetically created datasets can be generated/downloaded in errorify
To train the model, the data has to be preprocessed and converted to a special format with the command:
```
python utils/preprocess_data.py -s SOURCE -t TARGET -o OUTPUT_FILE
```
- Example:
```
python utils/preprocess_data.py -s data/synthetic/train_incorr_sentences.txt -t data/synthetic/train_corr_sentences.txt -o data/synthetic/train_data.txt
```
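The preprocessing step aligns each source sentence with its correction and labels every source token with an edit tag. A minimal sketch of the idea, assuming a simplified `$KEEP`/`$REPLACE` scheme for 1:1 alignments only (the real `utils/preprocess_data.py` uses a much richer set of token-level transformations, including appends, deletes, and g-transformations):

```python
def tag_tokens(source, target):
    """Toy tag extraction for equal-length sentence pairs (illustrative only)."""
    src, tgt = source.split(), target.split()
    assert len(src) == len(tgt), "this sketch handles 1:1 alignments only"
    tags = []
    for s, t in zip(src, tgt):
        # keep the token if it matches the correction, otherwise replace it
        tags.append("$KEEP" if s == t else "$REPLACE_" + t)
    return tags

print(tag_tokens("He go to school", "He goes to school"))
# ['$KEEP', '$REPLACE_goes', '$KEEP', '$KEEP']
```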
| Pretrained encoder | Confidence bias | Min error prob | CoNLL-2014 (test) | BEA-2019 (test) |
|---|---|---|---|---|
| BERT [link] | 0.10 | 0.41 | 63.0 | 67.6 |
| RoBERTa [link] | 0.20 | 0.50 | 64.0 | 71.5 |
| XLNet [link] | 0.35 | 0.66 | 65.3 | 72.4 |
| RoBERTa + XLNet | 0.24 | 0.45 | 66.0 | 73.7 |
| BERT + RoBERTa + XLNet | 0.16 | 0.40 | 66.5 | 73.6 |
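The last two rows of the table are ensembles; a common way to combine taggers is to average their per-token tag distributions before picking a tag. A hedged sketch of that combination (the repo's exact ensembling details may differ):

```python
import numpy as np

def ensemble_probs(prob_list):
    """Average per-token tag distributions across models.

    prob_list: list of arrays, each of shape (num_tokens, num_tags).
    """
    return np.mean(np.stack(prob_list), axis=0)

# two toy models' distributions over 2 tags for 2 tokens
m1 = np.array([[0.9, 0.1], [0.2, 0.8]])
m2 = np.array([[0.7, 0.3], [0.4, 0.6]])
print(ensemble_probs([m1, m2]))
# [[0.8 0.2]
#  [0.3 0.7]]
```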
To train the model, simply run:
```
python train.py --train_set TRAIN_FILE --dev_set DEV_FILE \
    --model_dir MODEL_DIR --vocab_path VOCAB_PATH
```
- Example:
```
python train.py --train_set data/synthetic/train_data.txt --dev_set data/synthetic/dev_data.txt --model_dir model/ --vocab_path model/vocabulary/
```
There are many parameters to specify; among them:
- `cold_steps_count`: the number of epochs during which only the last linear layer is trained
- `transformer_model` {bert,distilbert,gpt2,roberta,transformerxl,xlnet,albert}: the model encoder
- `tn_prob`: probability of getting sentences with no errors; helps to balance precision/recall
- `pieces_per_token`: maximum number of subwords per token; helps avoid CUDA out-of-memory errors
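The `tn_prob` knob can be pictured as a sampling filter over the training pairs: errorful pairs are always used, while error-free (source equals target) pairs are kept only with probability `tn_prob`. A hypothetical sketch (`keep_pair` and its signature are illustrative, not the repo's API):

```python
import random

def keep_pair(source, target, tn_prob, rng=random.random):
    """Decide whether to include a sentence pair in training (illustrative)."""
    if source != target:
        return True          # errorful pairs are always kept
    return rng() < tn_prob   # error-free pairs kept with probability tn_prob

# errorful pair survives even with tn_prob = 0
assert keep_pair("He go home", "He goes home", 0.0)
```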
In our experiments we had 98/2 train/dev split.
All parameters that we use for training and evaluation are described here.
To run your model on an input file, use the following command:
```
python predict.py --model_path MODEL_FILE [MODEL_FILE ...] \
    --vocab_path VOCAB_PATH --input_file INPUT_FILE \
    --output_file OUTPUT_FILE
```
- Example:
```
python predict.py --model_path model/model.th --vocab_path model/vocabulary/ --input_file data/synthetic/test_incorr_sentences.txt --output_file data/synthetic/test_pred.txt
```
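Prediction is iterative: the tagger applies its predicted token-level edits and re-tags its own output until no edits remain or a round limit is reached. A simplified sketch of that loop, with a hypothetical `predict_tags` callable standing in for the real model:

```python
def iterative_correct(tokens, predict_tags, max_rounds=5):
    """Apply predicted edits repeatedly until the tagger keeps everything."""
    for _ in range(max_rounds):
        tags = predict_tags(tokens)
        if all(t == "$KEEP" for t in tags):
            break  # nothing left to correct
        new_tokens = []
        for tok, tag in zip(tokens, tags):
            if tag == "$DELETE":
                continue                              # drop the token
            if tag.startswith("$REPLACE_"):
                new_tokens.append(tag[len("$REPLACE_"):])  # substitute
            else:
                new_tokens.append(tok)                # keep as-is
        tokens = new_tokens
    return tokens

# toy "model": delete a word that repeats its predecessor
def dedup_tagger(tokens):
    return ["$DELETE" if i and t == tokens[i - 1] else "$KEEP"
            for i, t in enumerate(tokens)]

print(iterative_correct("the the cat sat".split(), dedup_tagger))
# ['the', 'cat', 'sat']
```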
Among the parameters:
- `min_error_probability`: minimum error probability (as in the paper)
- `additional_confidence`: confidence bias (as in the paper)
- `special_tokens_fix`: needed to reproduce some of the reported results of pretrained models
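The first two knobs act at inference time: a bias is added to the probability of the `$KEEP` tag, and a sentence is left untouched when its maximum predicted error probability falls below the threshold. A simplified sketch (the tag layout and function name are assumptions, not the repo's internals):

```python
def apply_knobs(token_probs, keep_index, confidence_bias, min_error_prob):
    """Return per-token tag indices, or None to leave the sentence unchanged.

    token_probs: per-token tag distributions (lists of floats).
    """
    biased = []
    for dist in token_probs:
        d = list(dist)
        d[keep_index] += confidence_bias  # push predictions toward $KEEP
        biased.append(d)
    # a token's error probability = 1 - its (biased) $KEEP probability
    max_error = max(1.0 - d[keep_index] for d in biased)
    if max_error < min_error_prob:
        return None  # below threshold: skip the sentence
    return [max(range(len(d)), key=d.__getitem__) for d in biased]

# columns: [$KEEP, $REPLACE_x]; with bias 0.2 the max error prob is 0.35 < 0.41
probs = [[0.6, 0.4], [0.45, 0.55]]
assert apply_knobs(probs, 0, 0.2, 0.41) is None
```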
For evaluation use M^2Scorer and ERRANT.
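Both scorers report span-based F_0.5, which weights precision twice as heavily as recall. For reference, the formula from counts of true positives, false positives, and false negatives:

```python
def f_beta(tp, fp, fn, beta=0.5):
    """F_beta score from edit counts; beta=0.5 favors precision over recall."""
    if tp == 0:
        return 0.0
    p = tp / (tp + fp)  # precision
    r = tp / (tp + fn)  # recall
    return (1 + beta**2) * p * r / (beta**2 * p + r)

# e.g. 50 correct edits, 25 spurious edits, 50 missed edits:
print(round(f_beta(50, 25, 50), 4))
# 0.625
```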
If you find this work useful for your research, please cite our paper:
```
@inproceedings{omelianchuk-etal-2020-gector,
    title = "{GECT}o{R} {--} Grammatical Error Correction: Tag, Not Rewrite",
    author = "Omelianchuk, Kostiantyn and
      Atrasevych, Vitaliy and
      Chernodub, Artem and
      Skurzhanskyi, Oleksandr",
    booktitle = "Proceedings of the Fifteenth Workshop on Innovative Use of NLP for Building Educational Applications",
    month = jul,
    year = "2020",
    address = "Seattle, WA, USA → Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.bea-1.16",
    pages = "163--170",
    abstract = "In this paper, we present a simple and efficient GEC sequence tagger using a Transformer encoder. Our system is pre-trained on synthetic data and then fine-tuned in two stages: first on errorful corpora, and second on a combination of errorful and error-free parallel corpora. We design custom token-level transformations to map input tokens to target corrections. Our best single-model/ensemble GEC tagger achieves an F{\_}0.5 of 65.3/66.5 on CONLL-2014 (test) and F{\_}0.5 of 72.4/73.6 on BEA-2019 (test). Its inference speed is up to 10 times as fast as a Transformer-based seq2seq GEC system.",
}
```