UnifiedGEC: Integrating Grammatical Error Correction Approaches for Multi-languages with a Unified Framework

UnifiedGEC is an open-source, GEC-oriented framework that integrates 5 GEC models with different architectures and 7 GEC datasets across different languages. The structure of our framework is shown in the picture. It provides abstract classes for the dataset, dataloader, evaluator, model and trainer, so users can implement their own modules and extend the framework easily.

Our framework is user-friendly: users can train a model on a dataset with a single command. Users can also handle low-resource tasks with our data augmentation module, or run experiments on LLMs with the provided prompts.

Characteristics

User-friendly: UnifiedGEC is convenient to use. Users can start training or inference with a single command that specifies the model and dataset. They can also adjust parameters, or launch the data augmentation or prompt modules, with a single command.
Modularized and extensible: UnifiedGEC consists of modules such as dataset, dataloader and config, and provides abstract classes for these modules. Users can implement their own modules by subclassing them.
Comprehensive: UnifiedGEC integrates 3 Seq2Seq models, 2 Seq2Edit models, 2 Chinese datasets, 2 English datasets and 3 datasets of other languages. We have conducted experiments on these datasets and evaluated the performance of the integrated models, which gives users a more comprehensive understanding of GEC tasks and models.

Architecture

Complete structure of UnifiedGEC:

.
|-- gectoolkit  # main code of UnifiedGEC
    |-- config  # internal config and implementation of Config Class
    |-- data    # Abstract Class of Dataset and Dataloader, and implementation of GEC Dataloader
    |-- evaluate    # Abstract Class of Evaluator and implementation of GEC Evaluator
    |-- llm     # prompts for LLMs
    |-- model   # Abstract Class of Model and code of integrated models
    |-- module  # reusable modules (e.g., Transformer Layer)
    |-- properties  # detailed external config of each model
    |-- trainer # Abstract Class of Trainer and implementation of SupervisedTrainer
    |-- utils   # other tools used in our framework
    |-- quick_start.py      # code for launching the framework
|-- log         # logs of training process
|-- checkpoint  # results and checkpoints of training process
|-- dataset     # preprocessed datasets in JSON format
|-- augmentation    # data augmentation module
    |-- data    # dependencies of error patterns
    |-- noise_pattern.py    # code of error patterns
    |-- translation.py      # code of back-translation
|-- evaluation  # evaluation module
    |-- m2scorer    # M2Scorer, for NLPCC18, CoNLL14, FCE
    |-- errant      # ERRANT, for AKCES-GEC, Falko-MERLIN, COWSL2H
    |-- cherrant    # ChERRANT, for MuCGEC
    |-- convert.py  # script for converting the output JSON file into the format required by each scorer
|-- run_gectoolkit.py       # code for launching the framework

Models

We integrated 5 GEC models in our framework, which can be divided into two categories, Seq2Seq models and Seq2Edit models, as shown in the table:

type	model	reference
Seq2Seq	Transformer	(Vaswani et al., 2017)
Seq2Seq	T5	(Xue et al., 2021)
Seq2Seq	SynGEC	(Zhang et al., 2022)
Seq2Edit	Levenshtein Transformer	(Gu et al., 2019)
Seq2Edit	GECToR	(Omelianchuk et al., 2020)

Datasets

We integrated 7 datasets of different languages in our framework, including Chinese, English, Spanish, Czech and German:

dataset	language	reference
FCE	English	(Yannakoudakis et al., 2011)
CoNLL14	English	(Ng et al., 2014)
NLPCC18	Chinese	(Zhao et al., 2018)
MuCGEC	Chinese	(Zhang et al., 2022)
COWSL2H	Spanish	(Yamada et al., 2020)
Falko-MERLIN	German	(Boyd et al., 2014)
AKCES-GEC	Czech	(Náplava et al., 2019)

Datasets integrated in UnifiedGEC are in JSON format:

[
    {
        "id": 0,
        "source_text": "My town is a medium size city with eighty thousand inhabitants .",
        "target_text": "My town is a medium - sized city with eighty thousand inhabitants ."
    }
]
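
As a minimal sketch of working with this format, the snippet below checks that every record carries the three keys shown above. The helper name validate_records and its specific checks are illustrative assumptions and are not part of UnifiedGEC.

```python
import json

# Keys taken from the JSON example above.
REQUIRED_KEYS = {"id", "source_text", "target_text"}


def validate_records(records):
    """Return a list of human-readable problems found in a list of GEC records."""
    problems = []
    seen_ids = set()
    for index, record in enumerate(records):
        missing = REQUIRED_KEYS - set(record)
        if missing:
            problems.append(f"record {index}: missing keys {sorted(missing)}")
            continue
        if record["id"] in seen_ids:
            problems.append(f"record {index}: duplicate id {record['id']}")
        seen_ids.add(record["id"])
        for key in ("source_text", "target_text"):
            if not isinstance(record[key], str) or not record[key].strip():
                problems.append(f"record {index}: {key} is empty or not a string")
    return problems


sample = json.loads("""
[
    {
        "id": 0,
        "source_text": "My town is a medium size city with eighty thousand inhabitants .",
        "target_text": "My town is a medium - sized city with eighty thousand inhabitants ."
    }
]
""")
print(validate_records(sample))  # an empty list means no problems were found
```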

Our preprocessed datasets can be downloaded here.

Quick Start

Installation

We use Python 3.8 in our experiments. Please install allennlp 1.3.0 first, then install the other dependencies:

pip install allennlp==1.3.0
pip install -r requirements.txt

Note: Errors may occur while installing jsonnet with pip. In that case, we suggest installing it with conda install jsonnet instead.

Usage

Please create directories for logs and checkpoints before using our framework:

mkdir log
mkdir checkpoint

Users can launch our framework through command line:

python run_gectoolkit.py -m $MODEL_NAME -d $DATASET_NAME

Refer to ./gectoolkit/config/config.json for parameters related to the training process, such as the number of epochs and the learning rate. Refer to ./gectoolkit/properties/models/ for the detailed parameters of each model.

All models except Transformer require pre-trained models, so please download them and store them in the corresponding model directory under ./gectoolkit/properties/models/. We provide download links for some of the pre-trained models, and users can also download them from Hugging Face.

UnifiedGEC also supports adjusting parameters via the command line:

python run_gectoolkit.py -m $MODEL_NAME -d $DATASET_NAME --learning_rate $LR

Adding New Datasets

Our framework allows users to add new datasets. The new dataset folder dataset_name/ should contain three JSON files for the training, validation and test sets, and the folder should be placed in the dataset/ directory:

dataset
    |-- dataset_name
        |-- trainset.json
        |-- validset.json
        |-- testset.json

After that, users also need to add a configuration file dataset_name.json to the gectoolkit/properties/dataset directory. Its contents can be modeled on the other files in the same directory.
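
The folder layout above can be produced with a short script. The following is a sketch under stated assumptions: the function name write_splits, the split ratios, and the choice to restart ids at 0 in each file are hypothetical and are not prescribed by UnifiedGEC. Only the three file names and the record keys come from this README.

```python
import json
import random
import tempfile
from pathlib import Path


def write_splits(pairs, dataset_dir, valid_ratio=0.1, test_ratio=0.1, seed=0):
    """Shuffle (source, target) pairs and write trainset/validset/testset JSON files."""
    pairs = list(pairs)
    random.Random(seed).shuffle(pairs)
    n_valid = int(len(pairs) * valid_ratio)
    n_test = int(len(pairs) * test_ratio)
    splits = {
        "validset.json": pairs[:n_valid],
        "testset.json": pairs[n_valid:n_valid + n_test],
        "trainset.json": pairs[n_valid + n_test:],
    }
    dataset_dir = Path(dataset_dir)
    dataset_dir.mkdir(parents=True, exist_ok=True)
    sizes = {}
    for filename, split in splits.items():
        # Assumption: ids restart at 0 in each file.
        records = [
            {"id": i, "source_text": src, "target_text": tgt}
            for i, (src, tgt) in enumerate(split)
        ]
        with open(dataset_dir / filename, "w", encoding="utf-8") as f:
            json.dump(records, f, ensure_ascii=False, indent=4)
        sizes[filename] = len(records)
    return sizes


# Demo in a temporary directory; in practice the target would be dataset/dataset_name.
with tempfile.TemporaryDirectory() as tmp:
    toy_pairs = [(f"source sentence {i}", f"target sentence {i}") for i in range(20)]
    print(write_splits(toy_pairs, Path(tmp) / "dataset" / "dataset_name"))
```

The configuration file in gectoolkit/properties/dataset still has to be written by hand, since its fields are not described here.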

Data Augmentation Module

We provide users with two data augmentation methods (for Chinese and English):

error patterns: randomly add noise to sentences
back-translation: translate sentences into the other language, and then translate them back into the original language

Users can enable the data augmentation module with the --augment argument on the command line. The available values are noise and translation:

python run_gectoolkit.py -m $MODEL_NAME -d $DATASET_NAME --augment noise

On first use, our framework generates the augmented data and saves it to local files; the back-translation method takes a certain amount of time. Subsequent runs use the generated data directly.
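
To illustrate the idea behind the error-pattern method, here is a minimal sketch that corrupts the clean side of each pair with random token-level noise. It is not the code in augmentation/noise_pattern.py: the three operations (delete, duplicate, swap), the noise_prob parameter, and the function names are assumptions chosen for illustration, and the sketch assumes whitespace-tokenized text as in the JSON example.

```python
import random


def add_noise(sentence, rng, noise_prob=0.15):
    """Randomly delete, duplicate, or swap whitespace-separated tokens."""
    tokens = sentence.split()
    noisy = []
    i = 0
    while i < len(tokens):
        if rng.random() < noise_prob:
            operation = rng.choice(["delete", "duplicate", "swap"])
            if operation == "delete":
                i += 1
                continue
            if operation == "duplicate":
                noisy.extend([tokens[i], tokens[i]])
                i += 1
                continue
            if operation == "swap" and i + 1 < len(tokens):
                noisy.extend([tokens[i + 1], tokens[i]])
                i += 2
                continue
        # No noise applied (or a swap at the last token): keep the token.
        noisy.append(tokens[i])
        i += 1
    return " ".join(noisy)


def augment(records, seed=0, noise_prob=0.15):
    """Build synthetic (noisy source, clean target) pairs from the clean target side."""
    rng = random.Random(seed)
    return [
        {
            "id": record["id"],
            "source_text": add_noise(record["target_text"], rng, noise_prob),
            "target_text": record["target_text"],
        }
        for record in records
    ]


records = [{"id": 0, "source_text": "", "target_text": "She goes to school every day ."}]
print(augment(records, seed=1, noise_prob=0.5))
```

Fixing the seed makes the generated data reproducible, which matches the behaviour described above of generating the data once and reusing it.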

Prompts

We also provide prompts for LLMs (for Chinese and English), including zero-shot prompts and few-shot prompts.

Users can enable prompts with the --use_llm argument on the command line, and specify the number of in-context learning examples with the --example_num argument:

python run_gectoolkit.py -m $MODEL_NAME -d $DATASET_NAME --use_llm --example_num $EXAMPLE_NUM

The model name used here should be one from Hugging Face, such as Qwen/Qwen-7B-chat.
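
The difference between zero-shot and few-shot prompting can be sketched as follows. The instruction wording and the build_prompt helper are hypothetical and do not reproduce the prompts shipped in gectoolkit/llm; only the idea of prepending example_num in-context examples comes from the text above.

```python
def build_prompt(source_text, examples=(), example_num=0):
    """Build a zero-shot (example_num=0) or few-shot correction prompt."""
    # Hypothetical instruction; the framework's own prompt wording may differ.
    lines = [
        "Correct the grammatical errors in the following sentence. "
        "Output only the corrected sentence."
    ]
    # Each in-context example is one (erroneous input, corrected output) pair.
    for example in list(examples)[:example_num]:
        lines.append(f"Input: {example['source_text']}")
        lines.append(f"Output: {example['target_text']}")
    lines.append(f"Input: {source_text}")
    lines.append("Output:")
    return "\n".join(lines)


examples = [
    {
        "source_text": "My town is a medium size city with eighty thousand inhabitants .",
        "target_text": "My town is a medium - sized city with eighty thousand inhabitants .",
    }
]
print(build_prompt("He go to school yesterday .", examples, example_num=1))
```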

Evaluation

We integrate mainstream evaluation tools for GEC tasks in our evaluation module, including M2Scorer, ERRANT and ChERRANT. We also provide conversion scripts and the ground truth of some datasets. During training, UnifiedGEC calculates micro-level P/R/F for model outputs, so users who want macro-level evaluation can use this evaluation module.
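
For readers unfamiliar with the two aggregation styles, the sketch below contrasts them on edit counts. It reads "micro" as summing true positives, false positives and false negatives over all sentences before scoring, and "macro" as averaging per-sentence scores. This is one common reading and is an assumption here, as are the toy counts and the convention for empty denominators. It does not claim to reproduce the exact behaviour of UnifiedGEC or of the scorers.

```python
def f_beta(p, r, beta=0.5):
    """F-beta score from precision and recall; beta=0.5 weights precision more."""
    if p + r == 0:
        return 0.0
    return (1 + beta**2) * p * r / (beta**2 * p + r)


def prf(tp, fp, fn, beta=0.5):
    """Precision, recall and F-beta from edit counts.

    Assumed convention: precision is 1.0 when nothing was proposed and
    recall is 1.0 when there was nothing to find.
    """
    p = tp / (tp + fp) if tp + fp else 1.0
    r = tp / (tp + fn) if tp + fn else 1.0
    return p, r, f_beta(p, r, beta)


def micro_prf(counts, beta=0.5):
    """Sum the counts over all sentences, then compute P/R/F once."""
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    return prf(tp, fp, fn, beta)


def macro_prf(counts, beta=0.5):
    """Compute P/R/F per sentence, then average each score."""
    scores = [prf(tp, fp, fn, beta) for tp, fp, fn in counts]
    n = len(scores)
    return tuple(sum(s[k] for s in scores) / n for k in range(3))


# Hypothetical per-sentence (TP, FP, FN) edit counts.
counts = [(2, 0, 2), (0, 2, 0)]
print("micro:", micro_prf(counts))
print("macro:", macro_prf(counts))
```

The same f_beta function relates the P, R and F0.5 columns in the result tables: for example, P = 52.3 and R = 21.7 give F0.5 of about 40.8.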

First, users should use our provided script to convert outputs of the models to the format required by scorers:

python convert.py --predict_file $PREDICT_FILE --dataset $DATASET

Correspondence between datasets and scorers:

dataset	scorer
CoNLL14, FCE, NLPCC18	M2Scorer
AKCES-GEC, Falko-MERLIN, COWSL2H	ERRANT
MuCGEC	ChERRANT

M2Scorer

Official repository: https://github.com/nusnlp/m2scorer

For English datasets (CoNLL14, FCE), use M2Scorer directly for evaluation:

cd m2scorer
m2scorer/m2scorer predict.txt m2scorer/conll14.gold

For Chinese datasets (NLPCC18), the pkunlp tool is required for word segmentation. We also provide a conversion script:

cd m2scorer
python pkunlp/convert_output.py --input_file predict.txt --output_file seg_predict.txt
m2scorer/m2scorer seg_predict.txt m2scorer/nlpcc18.gold

ERRANT

Official repository: https://github.com/chrisjbryant/errant

Usage follows the official repository:

cd errant
errant_parallel -orig source.txt -cor target.txt -out ref.m2
errant_parallel -orig source.txt -cor predict.txt -out hyp.m2
errant_compare -hyp hyp.m2 -ref ref.m2
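
The .m2 files produced by the commands above store each sentence as an S line followed by A lines, one per edit, with fields separated by |||. As a minimal sketch for inspecting such files, the snippet below parses this layout and applies one annotator's edits to recover the corrected sentence. The helper names and the sample text are illustrative assumptions, and the snippet is not part of ERRANT or UnifiedGEC.

```python
def parse_m2(text):
    """Parse M2 text into a list of (source_tokens, edits) tuples."""
    sentences = []
    for block in text.strip().split("\n\n"):
        lines = [line for line in block.split("\n") if line.strip()]
        if not lines or not lines[0].startswith("S "):
            continue
        source_tokens = lines[0][2:].split()
        edits = []
        for line in lines[1:]:
            if not line.startswith("A "):
                continue
            # Fields: span, error type, correction, ..., annotator id.
            fields = line[2:].split("|||")
            start, end = (int(x) for x in fields[0].split())
            edits.append({
                "start": start,
                "end": end,
                "type": fields[1],
                "correction": fields[2],
                "annotator": int(fields[-1]),
            })
        sentences.append((source_tokens, edits))
    return sentences


def apply_edits(source_tokens, edits, annotator=0):
    """Apply one annotator's edits to the source tokens, right to left."""
    tokens = list(source_tokens)
    chosen = [e for e in edits if e["annotator"] == annotator and e["type"] != "noop"]
    # Going right to left keeps earlier token offsets valid.
    for edit in sorted(chosen, key=lambda e: (e["start"], e["end"]), reverse=True):
        tokens[edit["start"]:edit["end"]] = edit["correction"].split()
    return " ".join(tokens)


sample = """S This are a sentence .
A 1 2|||R:VERB:SVA|||is|||REQUIRED|||-NONE-|||0

S Fine sentence .
A -1 -1|||noop|||-NONE-|||REQUIRED|||-NONE-|||0
"""
for source_tokens, edits in parse_m2(sample):
    print(apply_edits(source_tokens, edits))
```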

ChERRANT

Official repository: https://github.com/HillZhang1999/MuCGEC

Usage follows the official repository:

cd cherrant/ChERRANT
python parallel_to_m2.py -f ../hyp.txt -o hyp.m2 -g char
python compare_m2_for_evaluation.py -hyp hyp.m2 -ref ref.m2

Experiment Results

Models

UnifiedGEC integrates 5 models and 7 datasets across different languages. The best performance of the implemented models on the Chinese and English datasets is as follows:

model	dataset
	CoNLL14(EN)			FCE(EN)			NLPCC18(ZH)			MuCGEC(ZH)
	P	R	F0.5	P	R	F0.5	P	R	F0.5	P	R	F0.5
Levenshtein Transformer	13.5	12.6	13.3	6.3	6.9	6.4	12.6	8.5	10.7	6.6	6.4	6.6
GECToR	52.3	21.7	40.8	36.0	20.7	31.3	30.9	20.9	28.2	33.5	19.1	29.1
Transformer	24.1	15.5	21.7	20.8	15.9	19.6	22.3	20.8	22.0	19.7	9.2	16.0
T5	36.6	39.5	37.1	29.2	29.4	29.3	32.5	21.1	29.4	30.2	14.4	24.8
SynGEC	50.6	51.8	50.9	59.5	52.7	58.0	36.0	36.8	36.2	22.3	26.2	23.6

The best performance of implemented models on datasets of other languages:

model	dataset
	AKCES-GEC(CS)			Falko-MERLIN(DE)			COWSL2H(ES)
	P	R	F0.5	P	R	F0.5	P	R	F0.5
Levenshtein Transformer	4.4	5.0	4.5	2.3	4.2	2.5	1.9	2.3	2.0
GECToR	46.8	8.9	25.3	50.8	20.5	39.2	24.4	12.9	20.7
Transformer	44.4	23.6	37.8	33.1	18.7	28.7	11.8	15.0	12.3
T5	52.5	40.5	49.6	47.4	50.0	47.9	53.7	39.1	49.9
SynGEC	21.9	27.6	22.8	32.2	33.4	32.4	9.3	18.8	10.3

Data Augmentation

We conduct experiments on the NLPCC18 and CoNLL14 datasets, and simulate low-resource settings by sampling 10% of the data from each dataset (F0.5 / delta F0.5):

model	data augmentation methods	dataset
		CoNLL14		NLPCC18
		F0.5	delta	F0.5	delta
Levenshtein Transformer	w/o augmentation	9.5	-	6.0	-
	w/ error patterns	6.4	-3.1	4.9	-1.1
	w/ back-translation	12.5	3.0	5.9	-0.1
GECToR	w/o augmentation	14.2	-	17.4	-
	w/ error patterns	15.1	0.9	19.9	2.5
	w/ back-translation	16.7	2.5	19.4	2.0
Transformer	w/o augmentation	12.6	-	9.5	-
	w/ error patterns	14.5	1.9	9.9	0.4
	w/ back-translation	16.6	4.0	10.4	0.9
T5	w/o augmentation	31.7	-	26.3	-
	w/ error patterns	32.0	0.3	27.0	0.7
	w/ back-translation	32.2	0.5	24.1	-2.2
SynGEC	w/o augmentation	47.7	-	32.4	-
	w/ error patterns	48.2	0.5	34.9	2.5
	w/ back-translation	47.7	0.0	34.6	2.2

Prompts

We use Qwen1.5-14B-chat and Llama2-7B-chat to conduct experiments on the NLPCC18 and CoNLL14 datasets (P/R/F0.5):

Setting	Dataset
	CoNLL14			NLPCC18
	P	R	F0.5	P	R	F0.5
zero-shot	48.8	49.1	48.8	24.7	38.3	26.6
few-shot	50.4	50.2	50.4	24.8	39.8	26.8

License

UnifiedGEC is released under the Apache 2.0 License.
