GECToR – Grammatical Error Correction: Tag, Not Rewrite
This repository provides code for training and testing state-of-the-art models for grammatical error correction with the official PyTorch implementation of the following paper:
GECToR – Grammatical Error Correction: Tag, Not Rewrite
Kostiantyn Omelianchuk, Vitaliy Atrasevych, Artem Chernodub, Oleksandr Skurzhanskyi
Grammarly
15th Workshop on Innovative Use of NLP for Building Educational Applications (co-located with ACL 2020)
It is mainly based on AllenNLP and transformers.
Installation
The following command installs all necessary packages:
pip install -r requirements.txt
The project was tested using Python 3.7.
Datasets
All the public GEC datasets used in the paper can be downloaded from here.
Synthetically created datasets can be generated/downloaded here.
To train the model, the data has to be preprocessed and converted to a special format with the command:
python utils/preprocess_data.py -s SOURCE -t TARGET -o OUTPUT_FILE
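Before running the preprocessing command, it can help to check that the two input files line up. The sketch below assumes that SOURCE and TARGET are plain-text files with one sentence per line, where line i of TARGET is the corrected version of line i of SOURCE; this layout is an assumption and is not stated in this README. `check_parallel` is a hypothetical helper and not part of the repository.

```python
def check_parallel(source_lines, target_lines):
    """Return the number of sentence pairs, or raise if the two sides differ in length."""
    if len(source_lines) != len(target_lines):
        raise ValueError(
            "source has %d lines but target has %d"
            % (len(source_lines), len(target_lines))
        )
    return len(source_lines)


# Toy in-memory data standing in for the contents of SOURCE and TARGET.
source = ["She go to school every day .", "I has a apple ."]
target = ["She goes to school every day .", "I have an apple ."]

n_pairs = check_parallel(source, target)
print(n_pairs)
```

With real files, the two lists would come from reading SOURCE and TARGET line by line.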
Pretrained models
Pretrained encoder | Confidence bias | Min error prob | CoNLL-2014 (test) | BEA-2019 (test) |
---|---|---|---|---|
BERT [link] | 0.10 | 0.41 | 63.0 | 67.6 |
RoBERTa [link] | 0.20 | 0.50 | 64.0 | 71.5 |
XLNet [link] | 0.35 | 0.66 | 65.3 | 72.4 |
RoBERTa + XLNet | 0.24 | 0.45 | 66.0 | 73.7 |
BERT + RoBERTa + XLNet | 0.16 | 0.40 | 66.5 | 73.6 |
Train model
To train the model, simply run:
python train.py --train_set TRAIN_SET --dev_set DEV_SET \
--model_dir MODEL_DIR
There are many parameters to specify, among them:
cold_steps_count - the number of epochs during which only the last linear layer is trained
transformer_model {bert,distilbert,gpt2,roberta,transformerxl,xlnet,albert} - model encoder
tn_prob - probability of getting sentences with no errors; helps to balance precision/recall
pieces_per_token - maximum number of subwords per token; helps to avoid CUDA out-of-memory errors
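As a rough illustration, the parameters above can be appended to the training command. The sketch below assumes each one is passed as a `--name value` flag, in the same style as `--train_set`. The file names and parameter values are placeholders rather than recommended settings, and `build_train_command` is a hypothetical helper that is not part of this repository.

```python
import shlex


def build_train_command(train_set, dev_set, model_dir, **params):
    """Assemble the train.py argument list; extra params become --name value pairs."""
    cmd = [
        "python", "train.py",
        "--train_set", train_set,
        "--dev_set", dev_set,
        "--model_dir", model_dir,
    ]
    for name, value in params.items():
        cmd += ["--" + name, str(value)]
    return cmd


cmd = build_train_command(
    "train.txt", "dev.txt", "models/roberta",  # placeholder paths
    transformer_model="roberta",
    cold_steps_count=2,    # placeholder value
    tn_prob=0,             # placeholder value
    pieces_per_token=5,    # placeholder value
)
# Print a shell-safe version of the command for copy-pasting.
print(" ".join(shlex.quote(part) for part in cmd))
```

The resulting list can also be passed directly to `subprocess.run(cmd, check=True)`.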
In our experiments we had 98/2 train/dev split.
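A 98/2 split like this one can be produced with a few lines of Python. This is a minimal sketch, assuming the preprocessed data has one example per line and that a random split is acceptable; `split_train_dev`, the seed, and the toy data are illustrative and do not come from the repository.

```python
import random


def split_train_dev(lines, dev_fraction=0.02, seed=0):
    """Shuffle a copy of the examples and hold out dev_fraction of them as the dev set."""
    shuffled = list(lines)
    random.Random(seed).shuffle(shuffled)
    dev_size = max(1, round(len(shuffled) * dev_fraction))
    return shuffled[dev_size:], shuffled[:dev_size]


# Toy examples standing in for the lines of a preprocessed file.
examples = ["example %d" % i for i in range(1000)]
train, dev = split_train_dev(examples)
print(len(train), len(dev))
```

The two resulting lists would then be written to the files passed as TRAIN_SET and DEV_SET.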
Model inference
To run your model on an input file, use the following command:
python predict.py --model_path MODEL_PATH [MODEL_PATH ...] \
--vocab_path VOCAB_PATH --input_file INPUT_FILE \
--output_file OUTPUT_FILE
Among the parameters:
min_error_probability - minimum error probability (as in the paper)
additional_confidence - confidence bias (as in the paper)
special_tokens_fix - used to reproduce some reported results of pretrained models
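To give an intuition for the first two parameters, here is a simplified sketch of how a confidence bias and a minimum error probability can act on the tagger output. It follows the description in the paper: the bias is added to the probability of the `$KEEP` tag, and a sentence is left unchanged when its error probability is below the threshold. This is not the code in this repository; `choose_tags`, the data layout, and the toy probabilities are all assumptions made for illustration.

```python
KEEP = "$KEEP"


def choose_tags(tag_probs, error_probs,
                additional_confidence=0.0, min_error_probability=0.0):
    """Pick one tag per token.

    tag_probs: one dict per token mapping tag -> probability.
    error_probs: one error-detection probability per token.
    """
    # Sentence-level threshold: if no token looks erroneous enough, change nothing.
    if max(error_probs) < min_error_probability:
        return [KEEP] * len(tag_probs)

    tags = []
    for probs in tag_probs:
        biased = dict(probs)
        # The confidence bias makes "leave the token as is" more likely to win.
        biased[KEEP] = biased.get(KEEP, 0.0) + additional_confidence
        tags.append(max(biased, key=biased.get))
    return tags


# Toy probabilities for a two-token input.
tag_probs = [
    {"$KEEP": 0.45, "$REPLACE_goes": 0.55},
    {"$KEEP": 0.90, "$DELETE": 0.10},
]
error_probs = [0.60, 0.05]

no_bias = choose_tags(tag_probs, error_probs)
with_bias = choose_tags(tag_probs, error_probs, additional_confidence=0.2)
with_threshold = choose_tags(tag_probs, error_probs, min_error_probability=0.7)
print(no_bias, with_bias, with_threshold)
```

In this toy case, raising either value suppresses the only edit, which is how the two parameters trade recall for precision.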
For evaluation use M^2Scorer and ERRANT.
Citation
If you find this work useful for your research, please cite our paper:
@misc{omelianchuk2020gector,
title={GECToR -- Grammatical Error Correction: Tag, Not Rewrite},
author={Kostiantyn Omelianchuk and Vitaliy Atrasevych and Artem Chernodub and Oleksandr Skurzhanskyi},
year={2020},
eprint={2005.12592},
archivePrefix={arXiv},
primaryClass={cs.CL}
}