CNIT623-Native-Language-Identification-On-English-Learner-Dataset: A Jupyter Notebook repository from tomelf

Dataset

Introduction

UD English-ESL/TLE is a collection of 5,124 English as a Second Language (ESL) sentences (97,681 words), manually annotated with POS tags and dependency trees in the Universal Dependencies formalism. Each sentence is annotated both in its original and error-corrected forms. The annotations follow the standard English UD guidelines, along with a set of supplementary guidelines for ESL. The dataset represents upper-intermediate level adult English learners from 10 native language backgrounds, with over 500 sentences for each native language. The sentences were randomly drawn from the Cambridge Learner Corpus First Certificate in English (FCE) corpus. The treebank is split randomly into a training set of 4,124 sentences, a development set of 500 sentences and a test set of 500 sentences. Further information is available at esltreebank.org.

File format

Raw data

Exam questions: dataset/UD_English-ESL/fce-released-dataset/prompts/[folders]/doc[number].xml

Learner answers: dataset/UD_English-ESL/fce-released-dataset/dataset/[folders]/doc[number].xml

Each XML file contains the textual answers for two exams written by an English learner. It carries the following attribute tags:

language: native language of the learner
age: age range of the learner
score: ??

For each exam:

question_number
exam_score
coded_answer: text content of answer (with tags of FCE error codes)

Details of the exams and of the FCE error code tags can be found in dataset/UD_English-ESL/fce-released-dataset/dataset/README
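As a rough illustration of reading these tags, the sketch below parses a small hand-written XML string with the standard library. Only the tag names come from the list above. The element nesting, the sample values and the helper names (first_text, read_learner) are assumptions made for the example, so check them against a real doc[number].xml before relying on them.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified document. The nesting is assumed; only the
# tag names are taken from the attribute list in this README.
SAMPLE = """
<learner>
  <language>Russian</language>
  <age>21-25</age>
  <score>21.0</score>
  <answer1>
    <question_number>1</question_number>
    <exam_score>2.3</exam_score>
    <coded_answer>I was <ns type="S"><i>shoked</i><c>shocked</c></ns>.</coded_answer>
  </answer1>
</learner>
"""


def first_text(root, tag):
    """Return the stripped text of the first <tag> element found at any depth."""
    node = next(root.iter(tag), None)
    if node is None or node.text is None:
        return None
    return node.text.strip()


def read_learner(xml_string):
    """Collect learner-level tags and one record per exam answer."""
    root = ET.fromstring(xml_string)
    record = {tag: first_text(root, tag) for tag in ("language", "age", "score")}
    record["answers"] = []
    # Treat any element that has a <coded_answer> child as one exam answer,
    # so the code does not depend on the exact parent tag names.
    for node in root.iter():
        coded = node.find("coded_answer")
        if coded is None:
            continue
        record["answers"].append({
            "question_number": node.findtext("question_number"),
            "exam_score": node.findtext("exam_score"),
            # Keep the error-code markup by serializing the element as XML.
            "coded_answer": ET.tostring(coded, encoding="unicode").strip(),
        })
    return record


learner = read_learner(SAMPLE)
print(learner["language"], learner["age"], len(learner["answers"]))
```

Searching by tag name at any depth keeps the sketch tolerant of wrapper elements that the real files may contain.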

Labeled data in CoNLL-U format

The labeled dataset is built in CoNLL-U format.

Original sentences:

dataset/UD_English-ESL/data/en_esl-ud-train.conllu
dataset/UD_English-ESL/data/en_esl-ud-dev.conllu
dataset/UD_English-ESL/data/en_esl-ud-test.conllu

Corrected sentences:

dataset/UD_English-ESL/data/corrected/en_cesl-ud-train.conllu
dataset/UD_English-ESL/data/corrected/en_cesl-ud-dev.conllu
dataset/UD_English-ESL/data/corrected/en_cesl-ud-test.conllu

The following are the attributes for each word in a sentence:
['id', 'form', 'lemma', 'upostag', 'xpostag', 'feats', 'head', 'deprel', 'deps', 'misc']

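id: index of the word in the sentence (starting at 1)
form: word form or punctuation symbol
lemma: lemma or stem of the word form
upostag: universal POS tag
xpostag: language-specific POS tag
feats: list of morphological features
head: id of the word's syntactic head (0 for the root)
deprel: Universal Dependencies relation to the head
deps: enhanced dependency graph, as a list of head-deprel pairs
misc: any other annotation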

("_" means null)

Example representation of a word:
([('id', 1), ('form', 'I'), ('lemma', '_'), ('upostag', 'PRON'), ('xpostag', 'PRP'), ('feats', None), ('head', 3), ('deprel','nsubj'), ('deps', None), ('misc', None)])
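To make the mapping from a CoNLL-U line to this representation concrete, here is a minimal standard-library sketch. It is not the CoNLL-U Parser's implementation, and the function name parse_token_line is hypothetical. It assumes plain integer ids and heads, as in the example above, and does not handle multiword token ranges or comment lines.

```python
FIELDS = ["id", "form", "lemma", "upostag", "xpostag",
          "feats", "head", "deprel", "deps", "misc"]


def parse_token_line(line):
    """Split one tab-separated CoNLL-U token line into a field dictionary."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(FIELDS):
        raise ValueError("expected %d columns, got %d" % (len(FIELDS), len(values)))
    token = dict(zip(FIELDS, values))
    # Assumption: ids and heads are plain integers in this sketch.
    token["id"] = int(token["id"])
    token["head"] = int(token["head"])
    # "_" marks an empty value; map it to None for the optional columns,
    # mirroring the example representation above (lemma is left as-is).
    for key in ("feats", "deps", "misc"):
        if token[key] == "_":
            token[key] = None
    return token


line = "1\tI\t_\tPRON\tPRP\t_\t3\tnsubj\t_\t_"
token = parse_token_line(line)
print(token)
```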

Data Loader

To use the data loader, first install the CoNLL-U Parser built by Emil Stenström (the conllu package).

The following is an example of using data_loader:

import data_loader

meta_list, data_list = data_loader.load_data(load_train=True, load_dev=True, load_test=True)

train_meta, train_meta_corrected, \
dev_meta, dev_meta_corrected, \
test_meta, test_meta_corrected = meta_list

train_data, train_data_corrected, \
dev_data, dev_data_corrected, \
test_data, test_data_corrected = data_list

train_meta.head()

id	doc_id	sent	errors	native_language	age_range	score
1	doc2664	I was <ns type="S"><i>shoked</i><c>shocked</c>...	{'S': 2, 'RV': 1}	Russian	21-25	21.0
2	doc648	I am very sorry to say it was definitely not a...	{'MT': 1, 'RT': 1}	French	26-30	38.0
3	doc1081	Of course, I became aware of her feelings sinc...	{'AGQ': 1}	Spanish	16-20	36.0
4	doc724	I also suggest that more plays and films shoul...	{'FV': 1, 'RV': 1}	Japanese	21-25	33.0
5	doc567	Although my parents were very happy <ns type="...	{'FD': 1, 'RT': 1, 'RJ': 1, 'MT': 1}	Spanish	31-40	34.0
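The errors column above holds a count per FCE error code. One way such counts could be derived from the tagged sent text is sketched below. This is an illustration under assumptions, not necessarily how data_loader computes the column; the function name count_error_codes and the shortened sample sentence are made up for the example.

```python
import re
from collections import Counter

# Matches the opening tag of an error annotation and captures its code.
NS_OPEN = re.compile(r'<ns type="([^"]+)">', re.IGNORECASE)


def count_error_codes(coded_sentence):
    """Count how often each error code appears in a tagged sentence."""
    return dict(Counter(NS_OPEN.findall(coded_sentence)))


# Hypothetical, shortened sentence in the same markup as the sent column.
sample = ('I was <ns type="S"><i>shoked</i><c>shocked</c></ns> because I had '
          '<ns type="S"><i>alredy</i><c>already</c></ns> spoken with them.')
print(count_error_codes(sample))  # {'S': 2}
```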

train_data[0]

id	form	lemma	upostag	xpostag	feats	head	deprel	deps	misc	meta_id
1	I	_	PRON	PRP	None	3	nsubj	None	None	1
2	was	_	VERB	VBD	None	3	cop	None	None	1
3	shoked	_	ADJ	JJ	None	0	root	None	None	1
4	because	_	SCONJ	IN	None	8	mark	None	None	1
5	I	_	PRON	PRP	None	8	nsubj	None	None	1
6	had	_	AUX	VBD	None	8	aux	None	None	1
7	alredy	_	ADV	RB	None	8	advmod	None	None	1
8	spoken	_	VERB	VBN	None	3	advcl	None	None	1
9	with	_	ADP	IN	None	10	case	None	None	1
10	them	_	PRON	PRP	None	8	nmod	None	None	1
11	and	_	CONJ	CC	None	8	cc	None	None	1
12	I	_	PRON	PRP	None	14	nsubj	None	None	1
13	had	_	AUX	VBD	None	14	aux	None	None	1
14	taken	_	VERB	VBN	None	8	conj	None	None	1
15	two	_	NUM	CD	None	16	nummod	None	None	1
16	autographs	_	NOUN	NNS	None	14	dobj	None	None	1
17	.	_	PUNCT	.	None	3	punct	None	None	1
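The head and deprel columns encode the dependency tree: each row points at the id of its head, and 0 marks the root. The sketch below turns the sentence shown above into (head word, relation, dependent word) triples. It uses plain tuples copied from the table rather than the object returned by data_loader, and the name dependency_arcs is hypothetical.

```python
# (id, form, head, deprel) for each token of the sentence shown above.
TOKENS = [
    (1, "I", 3, "nsubj"), (2, "was", 3, "cop"), (3, "shoked", 0, "root"),
    (4, "because", 8, "mark"), (5, "I", 8, "nsubj"), (6, "had", 8, "aux"),
    (7, "alredy", 8, "advmod"), (8, "spoken", 3, "advcl"),
    (9, "with", 10, "case"), (10, "them", 8, "nmod"), (11, "and", 8, "cc"),
    (12, "I", 14, "nsubj"), (13, "had", 14, "aux"), (14, "taken", 8, "conj"),
    (15, "two", 16, "nummod"), (16, "autographs", 14, "dobj"),
    (17, ".", 3, "punct"),
]


def dependency_arcs(tokens):
    """Return (head form, relation, dependent form) for every token."""
    forms = {tok_id: form for tok_id, form, _, _ in tokens}
    forms[0] = "ROOT"  # head id 0 is the artificial root node
    return [(forms[head], deprel, form) for _, form, head, deprel in tokens]


arcs = dependency_arcs(TOKENS)
root_word = next(dep for head, _, dep in arcs if head == "ROOT")
print(root_word)  # shoked
print(arcs[9])    # ('spoken', 'nmod', 'them')
```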

Dumped files are stored under ./preprocessed/[name]/.