We have created the largest human-annotated Named Entity Recognition (NER) data set for Nepali available to date. Highlights:
- EverestNER covers five named entity types: Person Name, Location, Organization, Event and Date.
- EverestNER provides high-quality annotations produced under clear annotation guidelines.
- EverestNER contains 24,587 entities and 308,353 tokens across 15,798 sentences.
- We split the EverestNER data set into EverestNER-train and EverestNER-test. These standard data sets, therefore, become the first benchmark data sets for evaluating Nepali NER systems.
- We report a comprehensive evaluation of state-of-the-art neural and Transformer models using these data sets. This is the first study to apply a BERT model to named entity recognition for Nepali.
- We also discuss the remaining challenges for discovering NEs for Nepali (see our paper below).
Data | Articles | Sentences | Tokens | Avg. Sent. Len. | LOC | ORG | PER | EVT | DAT |
---|---|---|---|---|---|---|---|---|---|
Train | 847 | 13,848 | 268,741 | 19.40 | 5,148 | 4,756 | 7,707 | 312 | 3,394 |
Test | 149 | 1,950 | 39,612 | 20.31 | 809 | 715 | 1,115 | 59 | 572 |
Total | 996 | 15,798 | 308,353 | 19.51 | 5,957 | 5,471 | 8,822 | 371 | 3,966 |
The EverestNER data set is divided into train (EverestNER-train) and test (EverestNER-test) sets. Each set has character-level as well as token-level annotations. Please read our paper for more information.
- Character Level
  - Train: EverestNER-train-char.txt
  - Test: EverestNER-test-char.txt
- Token Level
  - Train: EverestNER-train-bio.txt
  - Test: EverestNER-test-bio.txt
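As a rough illustration of working with the token-level files, the sketch below parses BIO-tagged lines into sentences and counts entities by type. The file layout is an assumption, not something this README specifies: one token per line with the tag as the last whitespace-separated field, a blank line between sentences, and tags of the form `B-PER` / `I-PER` / `O`. The helper names `read_bio` and `count_entities` are hypothetical. Check the actual files and the paper before relying on it.

```python
from typing import Dict, Iterable, List, Tuple

Sentence = List[Tuple[str, str]]


def read_bio(lines: Iterable[str]) -> List[Sentence]:
    """Parse BIO-tagged lines into sentences of (token, tag) pairs.

    ASSUMED layout: one token per line, tag in the last
    whitespace-separated field, blank line between sentences.
    """
    sentences: List[Sentence] = []
    current: Sentence = []
    for raw in lines:
        line = raw.strip()
        if not line:
            # A blank line closes the current sentence, if any.
            if current:
                sentences.append(current)
                current = []
            continue
        parts = line.split()
        current.append((parts[0], parts[-1]))
    if current:
        sentences.append(current)
    return sentences


def count_entities(sentences: List[Sentence]) -> Dict[str, int]:
    """Count entities per type; each B- tag starts exactly one entity."""
    counts: Dict[str, int] = {}
    for sentence in sentences:
        for _, tag in sentence:
            if tag.startswith("B-"):
                label = tag[2:]
                counts[label] = counts.get(label, 0) + 1
    return counts


# Tiny made-up sample in the assumed layout (not taken from the data set).
sample = [
    "काठमाडौं B-LOC",
    "मा O",
    "",
    "राम B-PER",
    "शर्मा I-PER",
]
sentences = read_bio(sample)
print(len(sentences), count_entities(sentences))
```

To run it on a real file, pass an open file handle, for example `read_bio(open("EverestNER-train-bio.txt", encoding="utf-8"))`, assuming the file matches the layout described above.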
Model comparison on EverestNER-test. The models are (a) a rule-based baseline, (b) BLSTM-CRF, and (c) multilingual BERT:
Model | Pre. | Rec. | F1-micro |
---|---|---|---|
Baseline (Rule-based) | 0.71 | 0.55 | 0.62 |
BLSTM-CRF-wc.ft | 0.89 | 0.74 | 0.81 |
BERT-bbmu | 0.87 | 0.84 | 0.85 |
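The F1 column above can be sanity-checked from the precision and recall columns using the standard harmonic-mean definition, F1 = 2PR / (P + R). The short sketch below does this for the three rows; it works from the rounded table values, so it only checks consistency to two decimals and is not the evaluation script used in the paper. The function name `f1_score` is chosen here for illustration.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1-micro), copied from the table above.
reported = {
    "Baseline (Rule-based)": (0.71, 0.55, 0.62),
    "BLSTM-CRF-wc.ft": (0.89, 0.74, 0.81),
    "BERT-bbmu": (0.87, 0.84, 0.85),
}

for model, (precision, recall, f1_reported) in reported.items():
    f1_computed = round(f1_score(precision, recall), 2)
    print(f"{model}: computed {f1_computed:.2f}, reported {f1_reported:.2f}")
```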
Per-entity performance of the best-performing model (BERT-bbmu):
Entity | Pre. | Rec. | F1 | Support |
---|---|---|---|---|
PER | 0.90 | 0.85 | 0.88 | 1115 |
LOC | 0.85 | 0.80 | 0.82 | 809 |
ORG | 0.85 | 0.83 | 0.84 | 715 |
EVT | 0.46 | 0.42 | 0.44 | 59 |
DAT | 0.91 | 0.91 | 0.91 | 572 |
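The per-entity rows can be related back to the micro-averaged scores of BERT-bbmu. Micro-averaging pools true positives, predictions, and gold entities over all types, so the approximate counts can be recovered from each row: true positives are roughly recall × support, and predicted entities are roughly true positives ÷ precision. The sketch below is an approximation from the rounded table values, not the authors' evaluation code, and `micro_precision_recall` is a name introduced here for illustration.

```python
from typing import Dict, Tuple

# (precision, recall, support) per entity type, copied from the table above.
per_entity: Dict[str, Tuple[float, float, int]] = {
    "PER": (0.90, 0.85, 1115),
    "LOC": (0.85, 0.80, 809),
    "ORG": (0.85, 0.83, 715),
    "EVT": (0.46, 0.42, 59),
    "DAT": (0.91, 0.91, 572),
}


def micro_precision_recall(
    rows: Dict[str, Tuple[float, float, int]]
) -> Tuple[float, float]:
    """Estimate micro-averaged precision and recall from per-type rows."""
    true_positives = 0.0
    predicted = 0.0
    gold = 0
    for precision, recall, support in rows.values():
        tp = recall * support          # recall = TP / gold entities
        true_positives += tp
        if precision > 0:
            predicted += tp / precision  # precision = TP / predicted entities
        gold += support
    return true_positives / predicted, true_positives / gold


micro_p, micro_r = micro_precision_recall(per_entity)
print(f"micro precision ~ {micro_p:.2f}, micro recall ~ {micro_r:.2f}")
```

With the table values this gives roughly 0.87 precision and 0.84 recall, in line with the BERT-bbmu row of the model comparison. It also shows why the weak EVT scores barely affect the micro average: EVT accounts for only 59 of the 3,270 test entities.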
For non-commercial purposes only. For commercial use, permission must be obtained from the authors and the relevant parties. See the contact address below.
Unless required by applicable law or agreed to in writing, the software and data distributed here are provided on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
If you use the EverestNER data set, please cite our publication:
@inproceedings{niraula2022named,
title={Named Entity Recognition for Nepali: Data Sets and Algorithms},
author={Niraula, Nobal and Chapagain, Jeevan},
booktitle={The International FLAIRS Conference Proceedings},
volume={35},
year={2022}
}
Feel free to contact nobal @AT nowalab .DOT com for any inquiries regarding this work.
Nepali Shabdakosh - https://nepalishabdakosh.com