gazetteer-collection

Overview

This repository contains a set of gazetteers used to improve the performance of a neural named entity recognition system by adding input features that indicate whether a word is part of a name. The system is described in two papers:

  • Chan Hee Song, Dawn Lawrie, Tim Finin and James Mayfield, Gazetteer Generation for Neural Named Entity Recognition, Proceedings of the 34th International FLAIRS Conference, AAAI Press, May 2020. LINK

  • Chan Hee Song, Dawn Lawrie, Tim Finin and James Mayfield, Improving Neural Named Entity Recognition with Gazetteers, arXiv forthcoming, 2020. LINK

The gazetteer files were generated by searching Wikidata via SPARQL queries sent to the public query server to retrieve both canonical names (e.g., Johns Hopkins University) and aliases (e.g., JHU, Johns Hopkins, Hopkins) in each of the languages studied. The first step was to construct a mapping from our project’s 15 target types to Wikidata’s fine-grained type system. Our types included four common core types (person, organization, geopolitical entity (GPE), location) and eleven additional types (airport, chemical, commercial organization, computer hardware/software, event, facility, government building, money, political organization, title, vehicle, weapon).

The mapping for some types was simple: person corresponds to Wikidata’s Q5 and vehicle to Q42889. Others had a complex mapping that eliminated Wikidata subtypes that seemed too specialized (e.g., lunar craters and ice rumples from Wikidata’s geographic object) or allowed us to retrieve more entity names given the public server’s one-minute query timeout.
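As a rough illustration of the retrieval step described above, the sketch below builds a SPARQL query string for the names or aliases of entities of a given type. It is not the project's actual query: the `TYPE_TO_QIDS` table and the `build_query` function are hypothetical, only the Q5 and Q42889 identifiers come from the text, and the use of `rdfs:label` for canonical names and `skos:altLabel` for aliases is an assumption based on how Wikidata normally exposes them. The prefixes `wd:`, `wdt:`, `rdfs:` and `skos:` are predefined on the public Wikidata query service.

```python
# Hypothetical type-to-class table; only Q5 (person) and Q42889 (vehicle)
# are taken from the description above.
TYPE_TO_QIDS = {
    "person": ["Q5"],
    "vehicle": ["Q42889"],
}


def build_query(qids, lang, aliases=False, limit=None):
    """Return a SPARQL query for names (or aliases) of entities of the given classes."""
    # Assumption: canonical names are labels, aliases are alternative labels.
    name_predicate = "skos:altLabel" if aliases else "rdfs:label"
    values = " ".join(f"wd:{q}" for q in qids)
    lines = [
        "SELECT DISTINCT ?item ?name WHERE {",
        f"  VALUES ?class {{ {values} }}",
        # instance of (P31) the class or any of its subclasses (P279)
        "  ?item wdt:P31/wdt:P279* ?class .",
        f"  ?item {name_predicate} ?name .",
        f'  FILTER(LANG(?name) = "{lang}")',
        "}",
    ]
    if limit is not None:
        lines.append(f"LIMIT {limit}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_query(TYPE_TO_QIDS["person"], "en", limit=100))
```

A query over a large class with the subclass path can easily exceed the one-minute timeout, which is one reason a more selective mapping or a `LIMIT` may be needed in practice.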

The initial name lists were filtered by type-dependent regular expressions to delete names we thought to be unhelpful (e.g., Francis of Assisi as a person because historical figures are unlikely to be mentioned in our targeted genres), remove Wikipedia artifacts (e.g., parentheticals), and eliminate punctuation, names that were too short or too long, and duplicate names. Although one could say that these changes bias the gazetteers, there is no reason not to engineer a gazetteer in a way that is most helpful for the data. Wikidata is still being used in an automated way since we are relying on available labels.
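The filtering steps above might look roughly like the following sketch. The patterns, the length thresholds, and the `clean_names` function are all assumptions made for illustration, not the expressions actually used; in particular, "eliminate punctuation" is read here as stripping punctuation characters from a name rather than discarding the name.

```python
import re

# Assumed patterns, for illustration only.
PAREN = re.compile(r"\s*\([^)]*\)")   # parenthetical artifacts
PUNCT = re.compile(r"[^\w\s]")        # any punctuation character

# Hypothetical type-dependent reject patterns, e.g. "X of Y" person names.
REJECT = {"person": re.compile(r"\bof\b", re.IGNORECASE)}


def clean_names(names, entity_type, min_len=2, max_len=60):
    """Filter and normalize a list of names, keeping first occurrences only."""
    seen = set()
    kept = []
    pattern = REJECT.get(entity_type)
    for name in names:
        if pattern and pattern.search(name):
            continue                      # judged unhelpful for this type
        name = PAREN.sub("", name)        # drop parentheticals
        name = PUNCT.sub(" ", name)       # strip punctuation
        name = " ".join(name.split())     # normalize whitespace
        if not (min_len <= len(name) <= max_len):
            continue                      # too short or too long
        if name in seen:
            continue                      # duplicate
        seen.add(name)
        kept.append(name)
    return kept


if __name__ == "__main__":
    raw = ["Francis of Assisi", "Paris (Texas)", "Paris", "A", "John Smith"]
    print(clean_names(raw, "person"))
```

Keeping the reject patterns in a per-type table mirrors the idea that the filters are type-dependent: a rule that suits person names need not apply to locations or organizations.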

We produced additional lists for Russian using a custom script that generates type-sensitive inflected and familiar forms of canonical names and aliases. For an extreme example, the Russian name for the person Vladimir Vladimirovich Putin (Владимир Владимирович Путин) produces more than 100 variations. The result is a collection of 96 gazetteer files with more than 16M entity names: 4.2M for English, 2.1M for Russian and 584K for Chinese, with an additional 8.7M Russian names produced by our morphological scripts. We kept the gazetteers for canonical names, aliases, and inflected forms separate to facilitate experimentation.

Content

  • gazetteers.tgz: a tar file of the gazetteers, organized by type, language and whether the entries come from Wikidata names, Wikidata aliases or (for Russian) inflected forms of Wikidata names.

  • samples: a directory with a file for each gazetteer containing a random sample of 20 names

  • code: a directory with some of the scripts used to generate the gazetteers
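To inspect the archive without unpacking it to disk, something like the sketch below may help. It assumes that each gazetteer file is UTF-8 text with one name per line, which is not stated above, and the `iter_gazetteers` function is a hypothetical helper rather than part of the code directory.

```python
import tarfile


def iter_gazetteers(archive_path):
    """Yield (member_name, names) for each regular file in a gzipped tar archive."""
    with tarfile.open(archive_path, "r:gz") as tar:
        for member in tar:
            if not member.isfile():
                continue  # skip directories and links
            with tar.extractfile(member) as handle:
                # Assumption: UTF-8 text, one name per line.
                text = handle.read().decode("utf-8")
            names = [line.strip() for line in text.splitlines() if line.strip()]
            yield member.name, names


if __name__ == "__main__":
    for path, names in iter_gazetteers("gazetteers.tgz"):
        print(path, len(names))
```

Because the function is a generator, only one file's contents are held in memory at a time, which matters for the larger lists.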

For more information

For more information, contact the authors as