google-research-datasets/C4_200M-synthetic-dataset-for-grammatical-error-correction

This dataset contains synthetic training data for grammatical error correction. The corpus is generated by corrupting clean sentences from C4 using a tagged corruption model. Stahlberg and Kumar (2021) describe the approach and the dataset in more detail: https://www.aclweb.org/anthology/2021.bea-1.4/
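
As a rough illustration of the idea of tag-conditioned corruption, the sketch below pairs each clean sentence with a corrupted copy whose error type is selected by a tag. It is a toy, rule-based stand-in and not the corruption model from the paper or any code from this repository: the function names `corrupt` and `make_pairs` and the tag names `DROP`, `SWAP`, and `DUP` are all hypothetical.

```python
import random

# Hypothetical tag names, chosen for illustration only.
TAGS = ("DROP", "SWAP", "DUP")


def corrupt(sentence: str, tag: str, rng: random.Random) -> str:
    """Return a corrupted copy of `sentence`; the kind of error depends on `tag`."""
    if tag not in TAGS:
        raise ValueError(f"unknown tag: {tag}")
    words = sentence.split()
    if len(words) < 2:
        # Too short to corrupt in this toy version; return unchanged.
        return sentence
    if tag == "DROP":
        # Delete one randomly chosen word.
        del words[rng.randrange(len(words))]
    elif tag == "SWAP":
        # Swap one randomly chosen pair of adjacent words.
        i = rng.randrange(len(words) - 1)
        words[i], words[i + 1] = words[i + 1], words[i]
    else:  # "DUP"
        # Duplicate one randomly chosen word in place.
        i = rng.randrange(len(words))
        words.insert(i, words[i])
    return " ".join(words)


def make_pairs(clean_sentences, tags=TAGS, seed=0):
    """Build (corrupted source, clean target) pairs, one random tag per sentence."""
    rng = random.Random(seed)
    return [(corrupt(s, rng.choice(tags), rng), s) for s in clean_sentences]
```

Calling `make_pairs(["the cat sat on the mat"])` yields a list with one (corrupted, clean) tuple, which is the source/target shape a correction model would train on. Passing a fixed `seed` keeps the toy output reproducible.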

Python · CC-BY-4.0

Issues

Dataset
#9 opened a year ago by saramoeini20
4
Any plan to release extra code?
#8 opened 3 years ago by lxsyz
1
Dataset used to train the corruption model
#7 opened 3 years ago by GokulNC
1
Synthetic datasets for other languages (e.g., German, Spanish, French)
#6 opened 3 years ago by BogdanDidenko
1
Please collaborate.
#1 opened 3 years ago by MariasStory
0
Questions about reproducing the results
#4 opened 3 years ago by MichaelCaohn
2
Reduction of computational requirements for generating C4_200M
#2 opened 4 years ago by palasso
10