/doc-gec

Scripts for document-level grammatical error correction.

Primary LanguagePythonMIT LicenseMIT

Document-level Grammatical Error Correction

This repository contains the scripts used to generate the document-level evaluation data in:

Zheng Yuan and Christopher Bryant. 2021. Document-level grammatical error correction. In Proceedings of the Sixteenth Workshop on Innovative Use of NLP for Building Educational Applications. Kyiv, Ukraine.

Overview

We evaluate on the FCE, CoNLL-2014 and BEA-2019 shared task datasets.

The FCE and BEA-2019 datasets are available from the BEA-2019 shared task website, while the CoNLL-2014 data is available from the CoNLL-2014 shared task website.

Direct download links:

FCE: download
CoNLL-2014: download
BEA-2019: download

Preprocessing

The specific files we use are located in:
FCE: fce/json/fce.test.json
CoNLL-2014: conll14st-test-data/noalt/official-2014.[01].sgml
BEA-2019: wi+locness/json/[ABCN].dev.json

Since the main json_to_doc_m2.py script takes a json file as input, the first step is to convert the CoNLL-2014 sgml to json format. This can be achieved using the following command:

python3 sgml_to_json.py <sgml_dir> -out conll2014.test.json

Where <sgml_dir> is the path to a directory that contains both official-2014.0.sgml and official-2014.1.sgml files. Note that this script also filters a small number of rare edits (e.g. edits longer than 40 characters).

For BEA-2019, the only preprocessing is to combine the different level json files into a single file:

cat [ABCN].dev.json > ABCN.dev.json

There is no preprocesing for the FCE.

Usage

The main json_to_doc_m2.py script takes a BEA-2019 style json file as input and produces an M2 file as output. It is a variant of the json_to_m2.py script released in the BEA-2019 shared task. The only prerequisite is ERRANT. We used ERRANT v2.1 in our paper. The command is run as follows:

python3 json_to_doc_m2.py <json_file> -out <output_m2> -gold -docs

Where <json_file> is the input json file outlined in the Preprocessing step, and <output_m2> is the name of the output M2 file. The -gold flag ensures the output file uses human annotated edits (rather than -auto edits extracted by ERRANT), while the -docs flag generates the document-level M2 file. If the -docs flag is not specified, the command will produce a sentence-level M2 file.