/en-worldwide-newswire

An English NER dataset built from foreign newswire

Primary LanguagePython

The English Worldwide Newswire dataset, as introduced in Do "English" Named Entity Recognizers Work Well on Global Englishes? (EMNLP 2023) by Shan et. al. https://arxiv.org/abs/2404.13465

Alex Shan (azshan@cs.stanford.edu) is the correspondence author and maintainer of this repository.

This dataset is composed of ~1100 news articles from around the world, sourced from non-Western newswire. This dataset is specifically designed to exclude Western sourced texts and focuses on uncommon contexts of the English language. We encourage authors to benchmark their English NER models on this dataset to explore the efficacy of modern models on unseen contexts. Below is a detailed breakdown of article origins.

Overview of dataset

  • 1075 hand-annotated English newswire articles from local sources around the world (bucketed into Asia, Africa, Latin America, the Middle East, and Indigenous Commonwealth (Oceania + Canada)).
  • 700,000 tokens
  • Created in collaboration with Datasaur NLP (https://datasaur.ai) and MLTwist (https://mltwist.com)
  • 9 class labels: Date, Person, Location, Facility, Organization, Miscellaneous, Money, NORP, and Product. A more detailed overview of the definition for each class can be found in the appendix of the ArXiv paper.
  • BIOES format: We also tag each token with its class and position, denoting whether the token is the start, intermediate, or end of a named entity.

To process the dataset, check out StanfordNLP's Stanza library which contains the dataset preparation script: https://github.com/stanfordnlp/stanza/blob/main/stanza/utils/datasets/ner/prepare_ner_dataset.py

South America: 94
Argentina 20
Bolivia 3
Chile 12
Colombia 10
Ecuador 10
Guyana 3
Paraguay 13
Peru 10
Uruguay 5
Venezuela 8
Central and North America: 178
Costa Rica 20
Cuba 15
El Salvador 20
Honduras 14
Mexico 29
Nicaragua 20
Panama 20
Indigenous Canadian 40
Africa: 265
General 65
Pan-Africa 20
Algeria 20
Ghana 20
Kenya 23
Mauritius 20
Egypt 22
Ethiopia 9
Namibia 28
South Africa 38
Asia: 347
General 14
China 104
Japan 15
India 71
Korea 37
Taiwan 26
Malaysia 11
Bangladesh 31
Thailand 27
Mongolia 11
Middle East: 167
Oman 12
Jordan 21
Israel 20
Iran 16
UAE 17
Saudi Arabia 27
Pakistan 2
Qatar 16
Kuwait 36
Oceania: 48
Indigenous Australia 28
Indigenous New Zealand 20

Repo organization

Inside the original_articles directory, you can find the complete collection of our raw text data before the labeling process. In the procesed_annotated directory, you may find the complete collection of our annotations of the original data. Within the directory, you will find .tsv files containing BIOES-format labeled data. In the other directories, you may find the complete collection of our annotations and annotator metadata computed on the Datasaur platform. To access the labeled data, use the REVIEW subdirectory of each folder. Each line delimits a separate token that is tab-delimited between its text and corresponding label. The file names come in the form of <country>_<newswire_company>_<id>.txt.tsv. To understand where these countries are within the geographic buckets, refer to the /regions.txt file for each prefix conversion.

If you use this dataset, please use the following citation:

@inproceedings{Shan_2023,
   title={Do “English” Named Entity Recognizers Work Well on Global Englishes?},
   url={http://dx.doi.org/10.18653/v1/2023.findings-emnlp.788},
   DOI={10.18653/v1/2023.findings-emnlp.788},
   booktitle={Findings of the Association for Computational Linguistics: EMNLP 2023},
   publisher={Association for Computational Linguistics},
   author={Shan, Alexander and Bauer, John and Carlson, Riley and Manning, Christopher},
   year={2023} }