UD English-ESL/TLE is a collection of 5,124 English as a Second Language (ESL) sentences (97,681 words), manually annotated with POS tags and dependency trees in the Universal Dependencies formalism. Each sentence is annotated in both its original and its error-corrected form. The annotations follow the standard English UD guidelines, along with a set of supplementary guidelines for ESL. The dataset represents upper-intermediate level adult English learners from 10 native language backgrounds, with over 500 sentences for each native language. The sentences were randomly drawn from the Cambridge Learner Corpus First Certificate in English (FCE) corpus. The treebank is split randomly into a training set of 4,124 sentences, a development set of 500 sentences and a test set of 500 sentences. Further information is available at esltreebank.org.
Exam questions: dataset/UD_English-ESL/fce-released-dataset/prompts/[folders]/doc[number].xml
Learner answers: dataset/UD_English-ESL/fce-released-dataset/dataset/[folders]/doc[number].xml
Each XML file contains the textual answers for 2 exams written by an English learner. The following are the attribute tags:
- language: native language of the learner
- age: age range of the learner
- score: ??
For each exam:
- question_number
- exam_score
- coded_answer: text content of answer (with tags of FCE error codes)
Details of the exams and of the FCE error code tags can be found in dataset/UD_English-ESL/fce-released-dataset/dataset/README
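As a rough illustration of the fields listed above, the sketch below reads the learner attributes and the answers from one document using Python's standard `xml.etree.ElementTree`. The inline sample document and its values are hypothetical, and the way its elements are nested is an assumption; only the tag names (`language`, `age`, `score`, `question_number`, `exam_score`, `coded_answer`) and the `<ns type="..."><i>...</i><c>...</c></ns>` error markup are taken from this README. Elements are looked up by name with `iter()` so that the code does not rely on the exact nesting.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical sample; the real files may nest these elements differently.
SAMPLE = """<learner>
  <language>Russian</language>
  <age>21-25</age>
  <score>21.0</score>
  <answer1>
    <question_number>1</question_number>
    <exam_score>2.3</exam_score>
    <coded_answer><p>I was <ns type="S"><i>shoked</i><c>shocked</c></ns> because I had <ns type="S"><i>alredy</i><c>already</c></ns> spoken with them.</p></coded_answer>
  </answer1>
</learner>"""


def first_text(root, tag):
    """Return the stripped text of the first element named `tag`, or None."""
    for elem in root.iter(tag):
        return (elem.text or "").strip()
    return None


def read_learner(xml_string):
    """Collect learner attributes and per-answer error counts from one document."""
    root = ET.fromstring(xml_string)
    learner = {tag: first_text(root, tag) for tag in ("language", "age", "score")}

    question_numbers = [(e.text or "").strip() for e in root.iter("question_number")]
    exam_scores = [(e.text or "").strip() for e in root.iter("exam_score")]
    coded_answers = list(root.iter("coded_answer"))

    answers = []
    for number, exam_score, coded in zip(question_numbers, exam_scores, coded_answers):
        # Count FCE error codes from the `type` attribute of every <ns> tag.
        errors = Counter(ns.get("type", "unknown") for ns in coded.iter("ns"))
        answers.append({
            "question_number": number,
            "exam_score": exam_score,
            "errors": dict(errors),
        })
    learner["answers"] = answers
    return learner


info = read_learner(SAMPLE)
print(info)
```

To read a real file, the same function could be fed the file's contents, for example `read_learner(open(path, encoding="utf-8").read())`, provided the actual tag layout matches these assumptions.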
The labeled dataset is built in CoNLL-U format.
Original sentences:
- dataset/UD_English-ESL/data/en_esl-ud-train.conllu
- dataset/UD_English-ESL/data/en_esl-ud-dev.conllu
- dataset/UD_English-ESL/data/en_esl-ud-test.conllu
Corrected sentences:
- dataset/UD_English-ESL/data/corrected/en_cesl-ud-train.conllu
- dataset/UD_English-ESL/data/corrected/en_cesl-ud-dev.conllu
- dataset/UD_English-ESL/data/corrected/en_cesl-ud-test.conllu
The following are the attributes for each word in a sentence:
['id', 'form', 'lemma', 'upostag', 'xpostag', 'feats', 'head', 'deprel', 'deps', 'misc']
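- id: index of the word in the sentence (starting at 1)
- form: word form (the surface token)
- lemma: lemma of the word form ("_" when not annotated, as in the example below)
- upostag: universal POS tag
- xpostag: language-specific POS tag (Penn Treebank tags in the examples here)
- feats: morphological features
- head: id of the word's syntactic head (0 for the root)
- deprel: dependency relation between the word and its head
- deps: enhanced dependency graph, as a list of head-deprel pairs
- misc: any other annotation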
("_" means null)
Example of the representation of word:
([('id', 1), ('form', 'I'), ('lemma', '_'), ('upostag', 'PRON'), ('xpostag', 'PRP'), ('feats', None), ('head', 3), ('deprel','nsubj'), ('deps', None), ('misc', None)])
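To make the mapping from a raw CoNLL-U line to the representation above concrete, here is a minimal standard-library sketch. It is not the CoNLL-U Parser's implementation; it only assumes that token lines carry ten tab-separated columns, as the CoNLL-U format specifies, and the sample line is reconstructed from the example above. The function name `parse_token_line` is hypothetical.

```python
FIELDS = ['id', 'form', 'lemma', 'upostag', 'xpostag',
          'feats', 'head', 'deprel', 'deps', 'misc']


def parse_token_line(line):
    """Split one tab-separated CoNLL-U token line into a field dict."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} columns, got {len(values)}")
    token = dict(zip(FIELDS, values))
    # Integer columns; multiword-token ranges such as "1-2" are not handled.
    token['id'] = int(token['id'])
    token['head'] = int(token['head'])
    # Treat "_" as missing for the optional annotation columns only,
    # mirroring the example above, where the lemma stays '_'.
    for key in ('feats', 'deps', 'misc'):
        if token[key] == '_':
            token[key] = None
    return token


line = "1\tI\t_\tPRON\tPRP\t_\t3\tnsubj\t_\t_"
token = parse_token_line(line)
print(token)
```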
To use the data loader, you first need to install the CoNLL-U Parser built by Emil Stenström (published on PyPI as conllu).
The following example shows how to use data_loader:
import data_loader
meta_list, data_list = data_loader.load_data(load_train=True, load_dev=True, load_test=True)
train_meta, train_meta_corrected, \
dev_meta, dev_meta_corrected, \
test_meta, test_meta_corrected = meta_list
train_data, train_data_corrected, \
dev_data, dev_data_corrected, \
test_data, test_data_corrected = data_list
train_meta.head()
id | doc_id | sent | errors | native_language | age_range | score |
---|---|---|---|---|---|---|
1 | doc2664 | I was <ns type="S"><i>shoked</i><c>shocked</c>... | {'S': 2, 'RV': 1} | Russian | 21-25 | 21.0 |
2 | doc648 | I am very sorry to say it was definitely not a... | {'MT': 1, 'RT': 1} | French | 26-30 | 38.0 |
3 | doc1081 | Of course, I became aware of her feelings sinc... | {'AGQ': 1} | Spanish | 16-20 | 36.0 |
4 | doc724 | I also suggest that more plays and films shoul... | {'FV': 1, 'RV': 1} | Japanese | 21-25 | 33.0 |
5 | doc567 | Although my parents were very happy <ns type="... | {'FD': 1, 'RT': 1, 'RJ': 1, 'MT': 1} | Spanish | 31-40 | 34.0 |
train_data[0]
id | form | lemma | upostag | xpostag | feats | head | deprel | deps | misc | meta_id |
---|---|---|---|---|---|---|---|---|---|---|
1 | I | _ | PRON | PRP | None | 3 | nsubj | None | None | 1 |
2 | was | _ | VERB | VBD | None | 3 | cop | None | None | 1 |
3 | shoked | _ | ADJ | JJ | None | 0 | root | None | None | 1 |
4 | because | _ | SCONJ | IN | None | 8 | mark | None | None | 1 |
5 | I | _ | PRON | PRP | None | 8 | nsubj | None | None | 1 |
6 | had | _ | AUX | VBD | None | 8 | aux | None | None | 1 |
7 | alredy | _ | ADV | RB | None | 8 | advmod | None | None | 1 |
8 | spoken | _ | VERB | VBN | None | 3 | advcl | None | None | 1 |
9 | with | _ | ADP | IN | None | 10 | case | None | None | 1 |
10 | them | _ | PRON | PRP | None | 8 | nmod | None | None | 1 |
11 | and | _ | CONJ | CC | None | 8 | cc | None | None | 1 |
12 | I | _ | PRON | PRP | None | 14 | nsubj | None | None | 1 |
13 | had | _ | AUX | VBD | None | 14 | aux | None | None | 1 |
14 | taken | _ | VERB | VBN | None | 8 | conj | None | None | 1 |
15 | two | _ | NUM | CD | None | 16 | nummod | None | None | 1 |
16 | autographs | _ | NOUN | NNS | None | 14 | dobj | None | None | 1 |
17 | . | _ | PUNCT | . | None | 3 | punct | None | None | 1 |
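The head and deprel columns together encode the dependency tree. As an illustrative sketch, independent of data_loader, the code below copies the id, form, head and deprel values from the table above into plain tuples, groups each word under its head, and prints the tree starting from the root (head 0). The names `ROWS`, `children` and `print_tree` are hypothetical.

```python
from collections import defaultdict

# (id, form, head, deprel) copied from the train_data[0] table above.
ROWS = [
    (1, "I", 3, "nsubj"), (2, "was", 3, "cop"), (3, "shoked", 0, "root"),
    (4, "because", 8, "mark"), (5, "I", 8, "nsubj"), (6, "had", 8, "aux"),
    (7, "alredy", 8, "advmod"), (8, "spoken", 3, "advcl"),
    (9, "with", 10, "case"), (10, "them", 8, "nmod"), (11, "and", 8, "cc"),
    (12, "I", 14, "nsubj"), (13, "had", 14, "aux"), (14, "taken", 8, "conj"),
    (15, "two", 16, "nummod"), (16, "autographs", 14, "dobj"),
    (17, ".", 3, "punct"),
]

forms = {tok_id: form for tok_id, form, _, _ in ROWS}

# Map each head id to the list of (dependent id, relation) pairs.
children = defaultdict(list)
for tok_id, _, head, deprel in ROWS:
    children[head].append((tok_id, deprel))


def print_tree(head=0, depth=0):
    """Print the dependents of `head` recursively, indented by depth."""
    for tok_id, deprel in children[head]:
        print("  " * depth + f"{forms[tok_id]} ({deprel})")
        print_tree(tok_id, depth + 1)


root_id = children[0][0][0]                       # the word whose head is 0
sentence = " ".join(forms[i] for i in sorted(forms))
print(sentence)
print_tree()
```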
Dumped files are written under ./preprocessed/[name]/:
- meta.csv: same format as the variable "train_meta" shown above
- [number].csv: same format as the variable "train_data[0]" shown above
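The dumped files can be read back without data_loader. The sketch below uses Python's standard `csv` module and assumes that each file starts with a header row naming the columns shown in the tables above; whether the dump also contains an index column is not stated here, so treat the column names as assumptions. The helper `read_dump` is hypothetical, and the demo writes a small temporary file with a shortened set of columns rather than touching ./preprocessed/[name]/.

```python
import csv
import tempfile
from pathlib import Path


def read_dump(path):
    """Read one dumped CSV file into a list of row dicts (all values are strings)."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


# Demo on a temporary file; real files would live under ./preprocessed/[name]/.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "meta.csv"
    demo.write_text(
        "id,doc_id,native_language,age_range,score\n"
        "1,doc2664,Russian,21-25,21.0\n",
        encoding="utf-8",
    )
    rows = read_dump(demo)

print(rows[0]["native_language"])
```

Because `csv.DictReader` returns every value as a string, numeric columns such as score need an explicit conversion (for example `float(row["score"])`) before use.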