/UnifiedGEC

Repository of the paper "UnifiedGEC: Integrating Grammatical Error Correction Approaches for Multi-languages with a Unified Framework"

Primary LanguagePythonApache License 2.0Apache-2.0

UnifiedGEC: Integrating Grammatical Error Correction Approaches for Multi-languages with a Unified Framework

English | 简体中文

UnifiedGEC is an open-source, GEC-oriented framework, which integrates 5 GEC models of different architecture and 7 GEC datasets across different languages. The sturcture of our framework is shown in the picture. It provides abstract classes of dataset, dataloader, evaluator, model and trainer, allowing users to implement their own modules. This ensures excellent extensibility.

Our framework is user-friendly, and users can train a model on a dataset with a single command. Moreover, users are able to deal with low-resource tasks with our proposed data augmentation module, or use given prompts to conduct experiments on LLMs.

Characterisic

  • User-friendly: UnifiedGEC provides users with a convenient way to use our framework. They can start training or inference easily with a command line specifying the model and dataset they need to use. They can also adjust parameters, or launch data augmentation or prompt modules through a single line of command.
  • Modularized and extensible: UnifiedGEC consists of modules including dataset, dataloader, config and so on, and provides users with abstract classses of these modules. Users are allowed to implement their own modules through these classes.
  • Comprehensive: UnifiedGEC has integrated 3 Seq2Seq models, 2 Seq2Edit models, 2 Chinese datasets, 2 English datasets and 3 datasets of other languages. We have conducted experiments on these datasets and evaluated the performance of integerated models, which provides users with a more comprehensive understanding of GEC tasks and models.

Architecture

Complete structure of UnifiedGEC:

.
|-- gectoolkit  # main code of UnifiedGEC
    |-- config  # internal config and implementation of Config Class
    |-- data    # Abstract Class of Dataset and Dataloader, and implementation of GEC Dataloader
    |-- evaluate    # Abstract Class of Evaluator and implementation of GEC Evaluator
    |-- llm     # prompts for LLMs
    |-- model   # Abstract Class of Model and code of integrated models
    |-- module  # reusable modules (e.g., Transformer Layer)
    |-- properties  # detailed external config of each model
    |-- trainer # Abstract Class of Trainer and implementation of SupervisedTrainer
    |-- utils   # other tools used in our framework
    |-- quick_start.py      # code for launching the framework
|-- log         # logs of training process
|-- checkpoint  # results and checkpoints of training process
|-- dataset     # preprocessed datasets in JSON format
|-- augmentation    # data augmentation module
    |-- data    # dependencies of error patterns
    |-- noise_pattern.py    # code of error patterns
    |-- translation.py      # code of back-translation
|-- evaluation  # evaluation module
    |-- m2scorer    # M2Scorer, for NLPCC18、CoNLL14、FCE
    |-- errant      # ERRANT, for AKCES、Falko-MERLIN、Cowsl2h
    |-- cherrant    # ChERRANT, for MuCGEC
    |-- convert.py  # script for convert output JSON file into corresponding format
|-- run_gectoolkit.py       # code for launching the framework

Models

We integrated 5 GEC models in our framework, which can be divided into two categories: Seq2Seq models and Seq2Edit models, as shown in table:

type model reference
Seq2Seq Transformer (Vaswani et al., 2017)
T5 (Xue et al., 2021)
SynGEC (Zhang et al., 2022)
Seq2Edit Levenshtein Transformer (Gu et al., 2019)
GECToR (Omelianchuk et al., 2020)

Datasets

We integrated 7 datasets of different languages in our framework, including Chinese, English, Spanish, Czech and German:

dataset language reference
FCE English (Yannakoudakis et al., 2011)
CoNLL14 English (Ng et al., 2014)
NLPCC18 Chinese (Zhao et al., 2018)
MuCGEC Chinese (Zhang et al., 2022)
COWSL2H Spanish (Yamada et al., 2020)
Falko-MERLIN German (Boyd et al., 2014)
AKCES-GEC Czech (Náplava et al., 2019)

Datasets integrated in UnifiedGEC are in JSON format:

[
    {
        "id": 0,
        "source_text": "My town is a medium size city with eighty thousand inhabitants .",
        "target_text": "My town is a medium - sized city with eighty thousand inhabitants ."
    }
]

Our preprocessed datasets can download here.

Quick Start

Installation

We use Python 3.8 in our experiments. Please install allennlp 1.3.0 first, then install other dependencies:

pip install allennlp==1.3.0
pip install -r requirements.txt

Note: Errors may occur while installing jsonnet with pip. Users are suggested to use conda install jsonnet to finish installation.

Usage

Please create directories for logs and checkpoints before using our framework:

mkdir log
mkdir checkpoint

Users can launch our framework through command line:

python run_gectoolkit.py -m $MODEL_NAME -d $DATASET_NAME

Refer to ./gectoolkit/config/config.json for parameters related to training process, such as number of epochs, learning rate. Refer to ./gectoolkit/properties/models/ for detailed parameters of each model.

Models except Transformer require pre-trained models, so please download them and store them in the corresponding model directory under ./gectoolkit/properties/models/. We provide download links for some of pre-trained models, and users can also download them from Huggingface.

UnifiedGEC also support adjusting parameters via command line:

python run_gectoolkit.py -m $MODEL_NAME -d $DATASET_NAME --learning_rate $LR

Adding New Datasets

Our framework allows users to add new datasets. The new dataset folder dataset_name/ should include three json files of train set, valid set and test set, and users need to place the folder in the dataset/ directory:

dataset
    |-- dataset_name
        |-- trainset.json
        |-- validset.json
        |-- testset.json

After that, users also need to add a configuration file dataset_name.json in the gectoolkit/properties/dataset directory, and the contents of the file can refer to other files in the same directory.

Data Augmentation Module

We provide users with two data augmentation methods (for Chinese and English):

  • error patterns: add noises randomly to sentences
  • back-translation: translate sentences into the other language, and then translate back to origin language

Users can use augment in command line to use our data augmentation module, and noise and translation are available values:

python run_gectoolkit.py -m $MODEL_NAME -d $DATASET_NAME --augment noise

Upon first use, our framework will generate augmented data and save the datasets in the local file, and the back-translation method requires a certain amount of time. UnifiedGEC will use generated data directly while subsequent executions.

Prompts

We also provide prompts for LLMs (for Chinese and English), including zero-shot prompts and few-shot prompts.

Users can use prompts with use_llm in command lines,and specify the number of in-context learning examples with argument example_num.

python run_gectoolkit.py -m $MODEL_NAME -d $ DATASET_NAME --use_llm --example_num $EXAMPLE_NUM

Model name used here should be those from huggingface, such as Qwen/Qwen-7B-chat.

Evaluation

We integrate mainstream evaluation tools for GEC tasks in our evaluation module, including M2Scorer, ERRANT and ChERRANT. Additionally, we also provide scripts for converting and ground truth of some datasets. During the process of training, UnifiedGEC calculate micro-level PRF for the results of models, so if users want to evaluate models in a macro way, they can use this evaluation module.

First, users should use our provided script to convert outputs of the models to the format required by scorers:

python convert.py --predict_file $PREDICT_FILE --dataset $DATASET

Correspondence between datasets and scorers:

数据集 评估工具
CoNLL14、FCE、NLPCC18 M2Scorer
AKCES-GEC、Falko-MERLIN、COWSL2H ERRANT
MuCGEC ChERRANT

M2Scorer

Official repository: https://github.com/nusnlp/m2scorer

For English datasets (CoNLL14、FCE),use M2scorer directly for evaluation:

cd m2scorer
m2scorer/m2scorer predict.txt m2scorer/conll14.gold

For Chinese datasets (NLPCC18),pkunlp tools for segmentation is required. We also provide converting scripts:

cd m2scorer
python pkunlp/convert_output.py --input_file predict.txt --output_file seg_predict.txt
m2scorer/m2scorer seg_predict.txt m2scorer/nlpcc18.gold

ERRANT

Official repository: https://github.com/chrisjbryant/errant

Usage is referenced from official repository:

cd errant
errant_parallel -orig source.txt -cor target.txt -out ref.m2
errant_parallel -orig source.txt -cor predict.txt -out hyp.m2
errant_compare -hyp hyp.m2 -ref ref.m2

ChERRANT

Official repository: https://github.com/HillZhang1999/MuCGEC

Usage is referenced from official repository:

cd cherrant/ChERRANT
python parallel_to_m2.py -f ../hyp.txt -o hyp.m2 -g char
python compare_m2_for_evaluation.py -hyp hyp.m2 -ref ref.m2

Experiment Results

Models

There are 5 models and 7 datasets across different languages integrated in UnifiedGEC, and there is the best performance of implemented models on Chinese and English datasets:

model dataset
CoNLL14(EN) FCE(EN) NLPCC18(ZH) MuCGEC(ZH)
P R F0.5 P R F0.5 P R F0.5 P R F0.5
Levenshtein Transformer 13.5 12.6 13.3 6.3 6.9 6.4 12.6 8.5 10.7 6.6 6.4 6.6
GECToR 52.3 21.7 40.8 36.0 20.7 31.3 30.9 20.9 28.2 33.5 19.1 29.1
Transformer 24.1 15.5 21.7 20.8 15.9 19.6 22.3 20.8 22.0 19.7 9.2 16.0
T5 36.6 39.5 37.1 29.2 29.4 29.3 32.5 21.1 29.4 30.2 14.4 24.8
SynGEC 50.6 51.8 50.9 59.5 52.7 58.0 36.0 36.8 36.2 22.3 26.2 23.6

The best performance of implemented models on datasets of other languages:

model dataset
AKCES-GEC(CS) Falko-MERLIN(DE) COWSL2H
P R F0.5 P R F0.5 P R F0.5
Levenshtein Transformer 4.4 5.0 4.5 2.3 4.2 2.5 1.9 2.3 2.0
GECToR 46.8 8.9 25.3 50.8 20.5 39.2 24.4 12.9 20.7
Transformer 44.4 23.6 37.8 33.1 18.7 28.7 11.8 15.0 12.3
T5 52.5 40.5 49.6 47.4 50.0 47.9 53.7 39.1 49.9
SynGEC 21.9 27.6 22.8 32.2 33.4 32.4 9.3 18.8 10.3

Data Augmentation

We conduct experiments on NLPCC18 and CoNLL14 datasets, and simulate low-resource cases by choosing 10% data from datasets (F0.5/delta F0.5):

model data augmentation methods dataset
CoNLL14 NLPCC18
F0.5 delta F0.5 delta
Levenshtein Transformer w/o augmentation 9.5 - 6.0 -
w/ error patterns 6.4 -3.1 4.9 -1.1
w/ back-translation 12.5 3.0 5.9 -0.1
GECToR w/o augmentation 14.2 - 17.4 -
w/ error patterns 15.1 0.9 19.9 2.5
w/ back-translation 16.7 2.5 19.4 2.0
Transformer w/o augmentation 12.6 - 9.5 -
w/ error patterns 14.5 1.9 9.9 0.4
w/ back-translation 16.6 4.0 10.4 0.9
T5 w/o augmentation 31.7 - 26.3 -
w/ error patterns 32.0 0.3 27.0 0.7
w/ back-translation 32.2 0.5 24.1 -2.2
SynGEC w/o augmentation 47.7 - 32.4 -
w/ error patterns 48.2 0.5 34.9 2.5
w/ back-translation 47.7 0.0 34.6 2.2

Prompts

We use Qwen1.5-14B-chat and Llama2-7B-chat and conduct experiments on NLPCC18 and CoNLL14 datasets (P/R/F0.5):

Setting Dataset
CoNLL14 NLPCC18
P R F0.5 P R F0.5
zero-shot 48.8 49.1 48.8 24.7 38.3 26.6
few-shot 50.4 50.2 50.4 24.8 39.8 26.8

License

UnifiedGEC uses Apache 2.0 License.