/Analysis-Wikipedia-Entities

Goal: To understand the Wikipedia dataset, especially the entity infoboxes.

Task: Starting from a Wikipedia dump, extract information about the various entity types. The steps are as follows:

1. Given the Wikipedia dump, gather all pages that contain an infobox.
2. Find the set of all entity types that occur on Wikipedia.
3. Find the set of all attributes that can be associated with each entity type.
4. From a few sample values of each attribute, infer its data type as one of the following: string, set of strings, duration, number, set of durations, date, other.
5. Find the units that can be used to express the value of a numeric attribute. E.g., for the “height” attribute of “person” entities, the units could be “cm, inches”.
6. For numeric attributes, find typical ranges (using the most popular unit). E.g., for person entities, the age attribute should have a range of 0-150 years.
7. Merge attributes that are semantically similar but have different names across entities of the same type. E.g., automatically identify that the attribute “birthdate” is the same as “bdate”.
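The first four steps can be sketched in a few lines of standard-library Python. This is only an illustrative sketch, not the code in this repository: the function names (`find_infobox`, `parse_infobox`, `infer_type`) are hypothetical, the input is assumed to be the raw wikitext of a single page already pulled out of the dump, and the type inference covers only three of the seven data types listed above using very rough patterns.

```python
import re


def find_infobox(wikitext):
    """Return the raw text of the first {{Infobox ...}} template, or None."""
    match = re.search(r"\{\{\s*infobox", wikitext, flags=re.IGNORECASE)
    if match is None:
        return None
    start = match.start()
    depth, i = 0, start
    # Walk forward, tracking nested {{ ... }} until the outer template closes.
    while i < len(wikitext) - 1:
        pair = wikitext[i:i + 2]
        if pair == "{{":
            depth += 1
            i += 2
        elif pair == "}}":
            depth -= 1
            i += 2
            if depth == 0:
                return wikitext[start:i]
        else:
            i += 1
    return None  # unbalanced braces


def parse_infobox(infobox):
    """Split an infobox into (entity_type, {attribute: raw_value})."""
    body = infobox[2:-2]  # drop the outer {{ and }}
    parts, current, depth, i = [], [], 0, 0
    while i < len(body):
        pair = body[i:i + 2]
        if pair in ("{{", "[["):
            depth += 1
            current.append(pair)
            i += 2
        elif pair in ("}}", "]]"):
            depth -= 1
            current.append(pair)
            i += 2
        elif body[i] == "|" and depth == 0:
            # Only a top-level pipe separates two attributes.
            parts.append("".join(current))
            current = []
            i += 1
        else:
            current.append(body[i])
            i += 1
    parts.append("".join(current))
    entity_type = re.sub(r"^\s*infobox\s*", "", parts[0], flags=re.IGNORECASE)
    attributes = {}
    for part in parts[1:]:
        if "=" in part:
            key, value = part.split("=", 1)
            attributes[key.strip().lower()] = value.strip()
    return entity_type.strip().lower(), attributes


def infer_type(value):
    """Very rough guess at an attribute's data type from one sample value."""
    if re.search(r"\{\{\s*(birth|death|start|end)[ _]date", value, flags=re.IGNORECASE):
        return "date"
    if re.fullmatch(r"\d{4}-\d{2}-\d{2}", value):
        return "date"
    if re.fullmatch(r"[-+]?\d[\d,]*(\.\d+)?(\s*[A-Za-z%]+)?", value):
        return "number"  # optionally followed by a unit such as "cm"
    return "string"


# Hypothetical sample page, for illustration only.
sample = """Intro text {{Infobox person
| name = Ada Lovelace
| birth_date = {{birth date|1815|12|10}}
| height = 165 cm
| spouse = [[William King-Noel, 1st Earl of Lovelace|William King]]
}} more text"""

entity_type, attributes = parse_infobox(find_infobox(sample))
for name, value in attributes.items():
    print(entity_type, name, infer_type(value))
```

Collecting the `entity_type` values over all pages gives the set for step 2, and collecting attribute names per entity type gives the sets for step 3. Real infoboxes are messier than this sample (HTML comments, references, redirects between template names), so a dedicated wikitext parser may be a better fit for the full dump.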

Primary Language: Python
