EstGEC-L2 is the subset of the Estonian Grammatical Error Correction Corpus (EstGEC) that contains L2 learner writings error-annotated in the M2 format.
This subcorpus currently consists of 263 texts and 3,790 sentences retrieved from the Estonian Interlanguage Corpus compiled at Tallinn University. The texts include narrative/descriptive and argumentative writings as well as informal and formal letters, and they represent various proficiency levels. The EstGEC-L2 material has been divided into a test set and a development (dev) set that can be used for evaluating and improving Estonian automated correction tools. The test set comprises 2,029 sentences and the dev set 1,761 sentences, distributed across the proficiency levels as follows:
- A2 – 937 (495 in test set);
- B1 – 963 (504 in test set);
- B2 – 1,091 (534 in test set);
- C1 – 796 (495 in test set).
Previously, the texts had been manually error-tagged in the CoNLL-U format, with the error type, scope, and correction recorded in the field for miscellaneous token attributes. The annotation has been converted to the M2 format (the conversion script can be found here) using an adapted version of the ERRANT tagset. Whereas the previous format was limited to one error annotation per sentence, the M2 version adds up to two new annotation versions. Counting both annotation phases, each text has been reviewed by at least three annotators.
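To illustrate the target format, the following is a minimal sketch of reading one M2 block in Python. It assumes the standard M2 layout (an `S` line with whitespace-separated tokens, followed by `A` lines whose fields are separated by `|||` and end with the annotator ID). The sample sentence, the `Edit` class, and the function names are invented for illustration and are not taken from the corpus or the conversion script. The `apply_edits` helper further assumes that the chosen edits do not overlap.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Edit:
    start: int        # index of the first token covered by the edit
    end: int          # index one past the last covered token
    error_type: str   # e.g. "R:SPELL"
    correction: str   # replacement text; empty for a deletion
    annotator: int    # annotation version the edit belongs to


def parse_m2_block(block: str) -> Tuple[List[str], List[Edit]]:
    """Parse one M2 block: an 'S' line followed by zero or more 'A' lines."""
    tokens: List[str] = []
    edits: List[Edit] = []
    for line in block.strip().splitlines():
        if line.startswith("S "):
            tokens = line[2:].split()
        elif line.startswith("A "):
            fields = line[2:].split("|||")
            start, end = map(int, fields[0].split())
            edits.append(Edit(start, end, fields[1], fields[2], int(fields[-1])))
    return tokens, edits


def apply_edits(tokens: List[str], edits: List[Edit], annotator: int = 0) -> List[str]:
    """Apply one annotator's edits right to left (assumes no overlapping spans)."""
    out = list(tokens)
    chosen = [e for e in edits if e.annotator == annotator and e.start >= 0]
    for e in sorted(chosen, key=lambda e: e.start, reverse=True):
        out[e.start:e.end] = e.correction.split()
    return out


# Hypothetical block built around the spelling example "soobib -> sobib".
SAMPLE = """S See soobib mulle .
A 1 2|||R:SPELL|||sobib|||REQUIRED|||-NONE-|||0
"""

tokens, edits = parse_m2_block(SAMPLE)
print(apply_edits(tokens, edits))  # ['See', 'sobib', 'mulle', '.']
```

Applying edits from right to left keeps the token offsets of the remaining edits valid while the list changes length.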
There are 12 main and 18 combined error types in the error classification (see Tables 1 and 2). The prefix indicates whether a word, phrase or punctuation mark should be replaced ('R:'), is missing ('M:') or is unnecessary ('U:'). In our tagset, we do not distinguish the part of speech (POS) of the replaced, added or deleted word. For example, all word choice errors are indicated by the tag 'R:LEX'. This has helped to reduce the complexity of the error categorization, while allowing us to classify all errors and avoid the 'OTHER' tag. Distinguishing POS would yield numerous edit and POS combinations, since the edit types often overlap (e.g., spelling errors co-occur with inflection and word choice errors) and the POS of the original word and its replacement can differ.
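The tag structure described above can be sketched in Python as follows. This is an illustrative decomposition, not part of the EstGEC tooling: the function names are hypothetical, and the sketch assumes that `NOM:FORM` and `VERB:FORM` are the only categories written with two colon-separated parts.

```python
# Edit operations signalled by the tag prefix.
OPERATIONS = {"R": "replace", "M": "missing", "U": "unnecessary"}

# Categories that are followed by ":FORM" in the tagset (assumption of this sketch).
TWO_PART = {"NOM", "VERB"}


def split_tag(tag: str):
    """Split an error tag into its operation and its list of error categories."""
    op, *parts = tag.split(":")
    if op not in OPERATIONS:
        raise ValueError(f"unknown prefix in tag: {tag!r}")
    categories = []
    i = 0
    while i < len(parts):
        # Keep "NOM:FORM" and "VERB:FORM" together as one category.
        if parts[i] in TWO_PART and i + 1 < len(parts) and parts[i + 1] == "FORM":
            categories.append(parts[i] + ":FORM")
            i += 2
        else:
            categories.append(parts[i])
            i += 1
    return OPERATIONS[op], categories


def is_combined(tag: str) -> bool:
    """A combined error type carries more than one category."""
    return len(split_tag(tag)[1]) > 1


print(split_tag("R:WS:NOM:FORM:SPELL"))  # ('replace', ['WS', 'NOM:FORM', 'SPELL'])
print(is_combined("R:VERB:FORM"))        # False
print(is_combined("R:LEX:WO"))           # True
```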
Another important difference from the English M2 annotation is that we allow overlapping error scopes if a token-level error occurs within a word order error, e.g., when one of the words contains a spelling error. Therefore, it is possible to detect token-level corrections even if word order has not been edited.
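As a rough sketch of what overlapping scopes mean in practice, the Python snippet below pairs each word order edit with the token-level edits whose span lies inside it. The edit tuples, the function names, and the rule that any tag containing a `WO` component counts as a word order edit are assumptions made for this illustration, not the logic of the adapted scorer.

```python
def is_word_order(tag: str) -> bool:
    """Treat any tag with a 'WO' component as a word order edit (assumption)."""
    return "WO" in tag.split(":")


def nested_token_edits(edits):
    """Map each word order edit to the token-level edits inside its span.

    `edits` is a list of (start, end, error_type) tuples from one annotation
    version, with `end` exclusive as in M2.
    """
    result = {}
    for wo in edits:
        if not is_word_order(wo[2]):
            continue
        result[wo] = [
            e for e in edits
            if not is_word_order(e[2]) and wo[0] <= e[0] and e[1] <= wo[1]
        ]
    return result


# Hypothetical edits: a word order error over tokens 1-2 whose second token
# is also misspelled, plus an unrelated punctuation edit later in the sentence.
edits = [(1, 3, "R:WO"), (2, 3, "R:SPELL"), (4, 5, "R:PUNCT")]
print(nested_token_edits(edits))
# {(1, 3, 'R:WO'): [(2, 3, 'R:SPELL')]}
```

Because the spelling edit keeps its own span, a system that fixes the spelling but leaves the word order untouched can still be credited for the token-level correction.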
Furthermore, orthography errors have been divided into capitalization and whitespace errors. Inflection errors are marked as nominal (noun, adjective, pronoun and numeral) or verb form errors without further distinction, i.e., these cover case, number, agreement, tense, mood and other errors in the choice of inflected form.
Table 1. Main error types
Error tag | Meaning | Example |
R:SPELL | Spelling error | soobib -> sobib |
R:CASE | Capitalization error | Juuli -> juuli |
R:WS | Whitespace error | igalpool -> igal pool |
R:NOM:FORM | Nominal form error | kallis -> kallid (Sing -> Plur) |
R:VERB:FORM | Verb form error | tegeleb -> tegeles (Pres -> Past) |
R:LEX | Word choice error | ilusasti -> ilus (ADV -> ADJ) |
R:PUNCT | Punctuation choice error | Kohtumiseni. -> Kohtumiseni! |
R:WO | Word order error | üldse polnud -> polnud üldse |
M:LEX | Missing word(s) | See väga ilus linn -> See on väga ilus linn |
U:LEX | Unnecessary word(s) | auto välimus on punane -> auto on punane |
U:PUNCT | Unnecessary punctuation | laupäeval, kell 10 -> laupäeval kell 10 |
Table 2. Combined error types
Error tag | Meaning | Example |
R:SPELL:CASE | Spelling and capitalization error | Vannalinnas -> vanalinnas |
R:WS:SPELL | Whitespace and spelling error | liimik koht -> lemmikkoht |
R:WS:CASE | Whitespace and capitalization error | Kontserdi majas -> kontserdimajas |
R:WS:NOM:FORM | Whitespace and nominal form error | kogupäev -> kogu päeva (Nom -> Gen) |
R:WS:NOM:FORM:SPELL | Whitespace, nominal form and spelling error | politika uudiseid -> poliitikauudised (Par -> Nom) |
R:WS:NOM:FORM:CASE | Whitespace, nominal form and capitalization error | cv online -> CV-Online’i (Nom -> Gen) |
R:NOM:FORM:SPELL | Nominal form and spelling error | ekskursioni ~ ekskursiooni -> ekskursioonile (Gen/Par -> All) |
R:NOM:FORM:CASE | Nominal form and capitalization error | tartu -> Tartut (Nom -> Par) |
R:NOM:FORM:SPELL:CASE | Nominal form, spelling and capitalization error | Sobrad ~ Sõbrad -> sõpradega (Nom -> Com) |
R:VERB:FORM:SPELL | Verb form and spelling error | kaisin ~ käisin -> käin (Past -> Pres) |
R:VERB:FORM:SPELL:CASE | Verb form, spelling and capitalization error | jstume ~ istume -> Istusime (Pres -> Past) |
R:LEX:SPELL | Word choice and spelling error | laksin ~ läksin -> käisin |
R:LEX:CASE | Word choice and capitalization error | võimalikult -> Võimalik (ADV -> ADJ) |
R:LEX:NOM:FORM | Word choice and nominal form error | muusikaid -> muusikastiilid (Par -> Nom) |
R:LEX:VERB:FORM | Word choice and verb form error | (mina) oli -> (mina) käisin (3rd person -> 1st person) |
R:LEX:WO | Word choice error affects word order | läbi interneti -> interneti kaudu |
R:LEX:WS | Word choice and whitespace error | oma teist -> teineteist |
R:WO:NOM:FORM | Word order error affects the choice of nominal form | pealinn Islandil -> Islandi pealinn (Ade -> Gen) |
- The dataset has been used to evaluate the GEC toolkit developed jointly by the language technology groups of the University of Tartu and Tallinn University. The L1 subset of the EstGEC corpus is being annotated at the University of Tartu.
- The M2 Scorer adapted for EstGEC can be found here.
- Conference presentations: