Wikidata-TypeRec

This is a dataset for the training of a type classifier on Wikidata entity types. It is derived from Wikidata-Disamb by Cetoli et el. (2019).

The train set has 100,000 samples. The test and the dev set have 10,000 samples each. Each sample has four values:

  • A natural language sentence s
  • A mention m (referring to an entity e)
  • The entity e (to which the mention m refers)
  • The entity type t of the entity e

Dataset Parts

There are three versions for each part of the dataset:

  • small: small part of the dataset for debugging
  • medium: larger part of the dateset for tests and debugging
  • full: full dataset

Entity Types

The types are a predefined set derived from Wikidata Concept Monitor taxonomy:

Wikidata item Type
http://www.wikidata.org/entity/Q215627 person
http://www.wikidata.org/entity/Q163875 cardinal
http://www.wikidata.org/entity/Q838948 work of art
http://www.wikidata.org/entity/Q13442814 article in scholarly journal
http://www.wikidata.org/entity/Q571 book
http://www.wikidata.org/entity/Q618123 geographical feature
http://www.wikidata.org/entity/Q43229 organization
http://www.wikidata.org/entity/Q811979 architectural structure
http://www.wikidata.org/entity/Q16521 taxon
http://www.wikidata.org/entity/Q1656682 event
http://www.wikidata.org/entity/Q83620 thoroughfare
other other