dqqcasia/cgec

Python

CGEC

Codes for our work on Chinese Grammatical Error Correction.

CGEC Dataset

NLPCC 2018

Split	# Sent	# Token src.	# Token tgt.
Train	1.2 M	23.7 M	25.0 M
Dev	5.000	84.5 K	92.9 K
Test	2.000	58.9 K	-

CGEC Performance

System	Method	P	R	F_0.5
YouDao (Kai et. al. 2018)	35.24	18.64	29.91
AliGM (Zhou et. al. 2018)	41.00	13.75	29.36
BLCU (Ren et. al. 2018)	47.63	12.56	30.57
PKU (Zhao et. al. 2020)	44.36	22.18	36.97
CASIA	CSeq2Edit	43.98	22.65	37.01

CSC Dataset

SIGHAN CSC Dataset

test set	# sent	# chars	# spelling erros (chars)	character set	genre
SIGHAN 2015	1,100	33,711	715	traditional	second-language learning
SIGHAN 2014	1,062	53,114	792	traditional	second-language learning
SIGHAN 2013	2,000	143,039	1,641	traditional	second-language learning

training set	# sent	# chars	# spelling erros (chars)	character set	genre
SIGHAN 2015	3,174	95,220	4,237	traditional	second-language learning
SIGHAN 2014	6,526	324,342	10,087	traditional	second-language learning
SIGHAN 2013	350	17,220	350	traditional	second-language learning

Synthetic training dataset

corpus	# sent	# chars	# spelling erros (chars)	character set	genre
Synthetic training dataset (Wang et. al. 2018)	271,329	12M	382,702	simplified	news

CSC Performance

System	False Positive Rate	Detection Accuracy	Detection Precision	Detection Recall	Detection F1	Correction Accuracy	Correction Precision	Correction Recall	Correction F1
CAS (Zhang et. al. 2015)	11.6	70.1	80.3	53.3	64.0	69.2	79.7	51.5	62.5
FASPell(Hong et. al. 2019)	-	74.2	67.6	60.0	63.5	73.7	66.6	59.1	62.6
Confusion-set(Wang et. al. 2019)	-	-	66.8	73.1	69.8	-	71.5	59.5	64.9
Soft-Masked BERT(Zhang et. al. 2020)	-	80.9	73.7	73.2	73.5	77.4	66.7	66.2	66.4