Datasets for Building the Indonesian NER
(Dataset untuk Membangun Named Entity Recognizer (NER) untuk Bahasa Indonesia)
This repository contains resources from a project named Modified DBpedia Entities Expansion (MDEE) (Alfina et al., 2017). We share:
- Three NER datasets used in the experiments explained in the paper (in the main folder), each consisting of 20,000 sentences, along with the gold standard.
- Revised versions of the three NER datasets in the main folder (in the "revised-20k" folder).
- The original names in Indonesian DBpedia (in the "original-dbpedia" folder).
- Two versions of DBpedia explained in the paper (in the "expanded-dbpedia" folder): MDEE and MDEE_Gazetteer.
- A dataset of 48,957 sentences named SINGGALANG (in the "singgalang" folder). We used the expanded DBpedia of MDEE_Gazetteer to label this dataset.
The NER Datasets
The datasets conform to the dataset format of Stanford NER.
Four labels are used (three named entity classes plus "O"):
- "Person" for person names
- "Place" for place names
- "Organisation" for organization names
- "O" for others
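As a rough illustration of working with this format, the following Python sketch groups labeled lines into sentences and counts the labels. It assumes the common Stanford NER training layout of one `token<TAB>label` pair per line with a blank line between sentences; that layout is an assumption here and should be checked against the actual files and their .prop settings. The function names `read_sentences` and `label_counts` are hypothetical, and the sample tokens are made up for the example.

```python
from collections import Counter
from typing import Iterable, List, Tuple

Sentence = List[Tuple[str, str]]


def read_sentences(lines: Iterable[str]) -> List[Sentence]:
    """Group "token<TAB>label" lines into sentences.

    Assumption: columns are tab-separated and a blank line ends a sentence.
    """
    sentences: List[Sentence] = []
    current: Sentence = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # Blank line: close the sentence collected so far, if any.
            if current:
                sentences.append(current)
                current = []
            continue
        parts = line.split("\t")
        if len(parts) < 2:
            raise ValueError(f"expected token<TAB>label, got: {line!r}")
        current.append((parts[0], parts[1]))
    if current:
        sentences.append(current)
    return sentences


def label_counts(sentences: List[Sentence]) -> Counter:
    """Count how many tokens carry each label."""
    return Counter(label for sentence in sentences for _, label in sentence)


if __name__ == "__main__":
    # Hand-made sample for illustration only, not taken from the datasets.
    sample = [
        "Joko\tPerson\n",
        "Widodo\tPerson\n",
        "lahir\tO\n",
        "di\tO\n",
        "Surakarta\tPlace\n",
        "\n",
        "Universitas\tOrganisation\n",
        "Indonesia\tOrganisation\n",
    ]
    parsed = read_sentences(sample)
    print(len(parsed), dict(label_counts(parsed)))
```

To inspect a real file, the same function can be called on an open file handle, for example `read_sentences(open("20k-mdee.txt", encoding="utf-8"))`, assuming the file is UTF-8 encoded.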
List of datasets in the main folder:
1. Dataset created using the original DEE (Alfina et al., 2016), file name: 20k-dee.txt, with properties file: 20k-dee.prop
2. Dataset created using Modified DEE (Alfina et al., 2017), file name: 20k-mdee.txt, with properties file: 20k-mdee.prop
3. Dataset created using Modified DEE plus gazetteer (Alfina et al., 2017), file name: 20k-mdee-gazz.txt, with properties file: 20k-mdee-gazz.prop
4. A gold standard created by Luthfi et al. (2014)
Each version of the NER dataset consists of 20,000 sentences from Indonesian-language Wikipedia articles that were labeled automatically.
The SINGGALANG dataset
We provide a new NER dataset in this repository, named SINGGALANG. The specifications of this dataset are:
- The number of sentences: 48,957
- Generated using the expanded DBpedia of MDEE_Gazetteer (the best of the three expanded DBpedia versions)
How to cite these works
The datasets may be used for free, but if you publish a paper or other publication that uses them, please cite these publications:
- The DEE corpus:
- The MDEE corpus:
- The Gold Standard
How to create an NER model using the dataset
We suggest using the Stanford NER library.
The steps to create an NER model with the Stanford NER library are as follows:
- Download Stanford NER
- Download the dataset and its properties file (the file with the .prop extension)
- Use the Stanford NER classifier to create the model.
For example:
java -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -prop 20k-mdee.prop
I recommend increasing the Java heap size so that training does not run out of memory. Add an option such as "-Xmx1024m" to the command, for example:
java -Xmx1024m -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -prop 20k-mdee.prop
If this still doesn't work, increase the value, for example to "-Xmx8000m". This worked for me.
Let's say this step creates an NER model file named "idner-model-20k-mdee.ser.gz".
- Create or use a testing dataset. Let's say the file name is "testing.txt".
- Evaluate the NER model using the Stanford NER library.
For example:
java -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier idner-model-20k-mdee.ser.gz -testFile testing.txt
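The steps above leave open how a testing dataset might be created. One possible approach, sketched below in Python, is to hold out a random portion of the labeled sentences and to assemble the same `java` commands programmatically. This is only a sketch under stated assumptions: sentences are assumed to be separated by blank lines, and the helper names `split_blocks`, `holdout_split`, and `ner_command` are hypothetical and not part of Stanford NER. If a held-out portion is used, the model should be trained only on the remaining sentences, which means pointing the properties file at that training portion.

```python
import random
import re
from typing import List, Optional, Sequence, Tuple


def split_blocks(text: str) -> List[str]:
    """Split file content into sentence blocks.

    Assumption: sentences are separated by one or more blank lines.
    """
    return [b.strip("\n") for b in re.split(r"\n\s*\n", text) if b.strip()]


def holdout_split(blocks: Sequence[str], test_fraction: float = 0.1,
                  seed: int = 13) -> Tuple[List[str], List[str]]:
    """Shuffle sentence blocks with a fixed seed and hold out a test portion."""
    shuffled = list(blocks)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def ner_command(jar: str, args: Sequence[str],
                heap_mb: Optional[int] = None) -> List[str]:
    """Build a CRFClassifier command line like the ones shown in the steps."""
    cmd = ["java"]
    if heap_mb is not None:
        # Optional heap size flag, e.g. -Xmx1024m.
        cmd.append(f"-Xmx{heap_mb}m")
    cmd += ["-cp", jar, "edu.stanford.nlp.ie.crf.CRFClassifier", *args]
    return cmd


if __name__ == "__main__":
    # Hand-made sample for illustration only, not taken from the datasets.
    sample = "Budi\tPerson\npergi\tO\n\nJakarta\tPlace\nramai\tO\n\nDia\tO\ntiba\tO\n"
    train, test = holdout_split(split_blocks(sample), test_fraction=0.34)
    print(len(train), "training sentences,", len(test), "testing sentences")

    train_cmd = ner_command("stanford-ner.jar", ["-prop", "20k-mdee.prop"], heap_mb=1024)
    test_cmd = ner_command(
        "stanford-ner.jar",
        ["-loadClassifier", "idner-model-20k-mdee.ser.gz", "-testFile", "testing.txt"],
    )
    print(" ".join(train_cmd))
    print(" ".join(test_cmd))
```

The held-out blocks can be written to "testing.txt" with `"\n\n".join(test) + "\n"`, and each command list can be passed to `subprocess.run` once Java and the jar file are available. A fixed seed keeps the split reproducible between runs.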