Basic cleaning and reformatting of celebrity death data from various sources, including Wikipedia, DH Montgomery, and Kaggle.
The pipeline has been consolidated into a single notebook.
Steps:
- For summary pages of celebrity deaths by month (and year), batch query the Wikipedia API in groups of $n \le 50$: the API allows up to 50 queries at a time, and API etiquette implicitly limits the rate to at most 1 query/second. Each query returns hundreds of entries, so the complete dataset requires tens of batched queries.
- Combine the results of all queries and use regular expressions to parse out the fields, columns, and variables of interest.
- Collect all resulting entry names (each observation should be a person/figure) and batch query again for wiki page size, date of birth, date of death, etc. This second pass covers tens of thousands of entries, so it requires thousands of batched queries.
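
The batching described in the steps above could be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the helper names (`chunked`, `build_params`, `http_get_json`, `fetch_all`), the `User-Agent` string, and the choice of `prop=info` (which reports page length) are assumptions made for the example, and it uses only the Python standard library.

```python
import json
import time
import urllib.parse
import urllib.request

API_URL = "https://en.wikipedia.org/w/api.php"  # English Wikipedia's MediaWiki endpoint
MAX_TITLES = 50     # per-request limit described in the steps above
MIN_INTERVAL = 1.0  # seconds between requests, per the etiquette noted above


def chunked(items, size=MAX_TITLES):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_params(titles):
    """Build the query parameters for one batch; titles are joined with '|'."""
    return {
        "action": "query",
        "format": "json",
        "prop": "info",  # page metadata, including 'length' (page size in bytes)
        "titles": "|".join(titles),
    }


def http_get_json(params):
    """Send one GET request and decode the JSON response."""
    url = API_URL + "?" + urllib.parse.urlencode(params)
    # Hypothetical User-Agent; replace with one that identifies your project.
    request = urllib.request.Request(
        url, headers={"User-Agent": "celebrity-deaths-sketch/0.1"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def fetch_all(titles, send=http_get_json, pause=MIN_INTERVAL):
    """Query every batch in turn, sleeping `pause` seconds between requests."""
    results = []
    for i, batch in enumerate(chunked(titles)):
        if i > 0:
            time.sleep(pause)  # stay at or below one request per interval
        results.append(send(build_params(batch)))
    return results
```

Passing the sender as the `send` argument keeps the batching logic testable without network access; for example, `fetch_all(names, send=my_stub, pause=0)` exercises the chunking with a stub in place of the real request.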