Probabilistic vs Deep Learning based approaches for narrow domain NER in Spanish

Experiments from NER task in Spanish language using two corpora: CoNLL-2002 and Mexican news

Data

The datasets used are described as follows:

CoNLL-2002 dataset

The CoNLL-2002 dataset is in the language Spanish and it has four classes under tagging schema IOB (Inside/Outside/Beginning). A Spanish sentence example with IOB schema is:

El presidente Andrés Manuel López Obrador estará viajando a la ciudad de Puebla de Zaragoza este Miércoles .

PERSON: Andrés-PER-B Manuel-PER-I López-PER-I Obrador-PER-I
LOCATION: Puebla-LOC-B de-LOC-I Zaragoza-LOC-I

No.	Class	IOB Schema	Description
1	ORG	B-ORG, I-ORG	Organizations names
2	PER	B-PER, I-PER	People names
3	LOC	B-LOC, I-LOC	Location names
4	MISC	B-MISC, I-MISC	Miscellaneous

The CoNLL-2002 dataset was split into four partitions.

No.	Partition	Original	Split	Tokens	Tags
1	Test A	1915	2177	9646	8
2	Test B	1517	1848	9086	8
3	Train	8323	9947	26099	8
4	Ensemble	11755	13972	31405	8

In headings, Original are the original sentences length. Split were splitted to length 50 (tokens).

CoNLL-2002 dataset distribution

The plot shows the tags/classes distribution from dataset under IOB schema. In the chart, the x-axis shows the four classes and the y-axis describe the number of entities per class.

CoNLL-2002 Sentences histogram

The histogram is used to identify the sentences-length from the dataset.

Mx-news dataset

The Mx-news dataset was built on the political news domain using 250 documents, it is in the language Spanish and it has seventeen classes. One annotator used tagging schema IOBES (Inside/Outside/Beginning/End/Single) to labeled this corpus. Using the previous Spanish sentence example with IOBES schema:

El presidente Andrés Manuel López Obrador estará viajando a la ciudad de Puebla de Zaragoza este Miércoles .

PERSON: Andrés-PER-B Manuel-PER-I López-PER-I Obrador-PER-E
LOCATION: Puebla-LOC-B de-LOC-I Zaragoza-LOC-E
DATE: Miércoles-DAT-S

No.	Class	IOB Schema	Description
1	PER	B-PER, I-PER, E-PER, S-PER	People names, aliases and abbreviations
2	ORG	B-ORG, I-ORG, E-ORG, S-ORG	Organizations, institutions
3	DAT	B-DAT, I-DAT, E-DAT, S-DAT	Dates on different formats
4	TIT	B-TIT, I-TIT, E-TIT, S-TIT	Title or position of persons
5	GPE	B-GPE, I-GPE, E-GPE, S-GPE	Country names, states, cities, municipalities
6	PEX	B-PEX, I-PEX, E-PEX, S-PEX	Political party names, aliases and abbreviations
7	TIM	B-TIM, I-TIM, E-TIM, S-TIM	Time expresions
8	FAC	B-FAC, I-FAC, E-FAC, S-FAC	Facility names
9	EVT	B-EVT, I-EVT, E-EVT, S-EVT	Event names
10	ADD	B-ADD, I-ADD, E-ADD, S-ADD	Addresses expressions, URLs and Twitter users
11	MNY	B-MNY, I-MNY, E-MNY, -----	Monetary amounts
12	DOC	B-DOC, I-DOC, E-DOC, S-DOC	Documents, laws, rules
13	PRO	B-PRO, I-PRO, E-PRO, S-PRO	Product names, brands, application names
14	PRC	B-PRC, I-PRC, E-PRC, -----	Percentage expressions
15	DEM	B-DEM, -----, E-DEM, S-DEM	Geographical or racial origin of people
16	AGE	B-AGE, I-AGE, E-AGE, -----	People age
17	LOC	B-LOC, I-LOC, E-LOC, S-LOC	Locations about regions, rivers, lakes

The Mx-news dataset was split into four partitions.

No.	Partition	Original	Split	Tokens	Tags
1	Split I	1295	1666	7628	63
2	Split II	1295	1677	7726	63
3	Split III	1297	1661	7664	63
4	Ensemble	3888	5004	13273	65

In headings, Original are the original sentences length. Split were splitted to length 50 (tokens).

Mx-news dataset distribution (seventeen classes)

The plot shows the tags/classes distribution from dataset under IOBES schema. In the chart, the x-axis shows the seventeen classes and the y-axis describe the number of entities per class.

Mx-news dataset distribution (seventeen classes)

Mx-news dataset was reduced to four classes: ORG, PER, LOC, MISC (the fourteen left) is labeled under IOB (Inside/Beginning/Out) schema. The plot shows the tags/classes distribution from dataset under IOB schema. In the chart, the x-axis shows the four classes and the y-axis describe the number of entities per class.

Mx-news Sentences histogram

The histogram is used to identify the sentences-length from the dataset. The Natural Language Toolkit was used to split sentences.

ELMo word embeddings

The ELMo word embeddings were trained using elmoformanylangs Python library.

Download CoNLL-2002 ELmo embeddings

Download Mx-news ELMo embeddings

Requirements

The Python libraries used:

elmoformanylangs==0.0.2
graphviz==0.10.1
Keras==2.2.4
Keras-Applications==1.0.7
keras-contrib==2.0.8
Keras-Preprocessing==1.0.9
matplotlib==3.0.3
numpy==1.16.2
seqeval==0.0.10
scikit-learn==0.20.3
sklearn-crfsuite==0.3.6
tensorflow==1.13.1

orlandxrf/spanish-ner

Probabilistic vs Deep Learning based approaches for narrow domain NER in Spanish

Data

CoNLL-2002 dataset

CoNLL-2002 dataset distribution

CoNLL-2002 Sentences histogram

Mx-news dataset

Mx-news dataset distribution (seventeen classes)

Mx-news dataset distribution (seventeen classes)

Mx-news Sentences histogram

ELMo word embeddings

Requirements