dsfsi/covid19za

NICD Provincial Data Scraper

lrossouw opened this issue · 7 comments

I will probably build a web scraper for the NICD data, most likely using R.

I see it working roughly as follows, running on an hourly cron:

  1. It will pull the covid19za repo and ensure it has all the latest commits.
  2. Read in the relevant CSVs.
  3. Then check for new pages in the format https://www.nicd.ac.za/latest-confirmed-cases-of-covid-19-in-south-africa-%d-%B-%Y/, e.g. https://www.nicd.ac.za/latest-confirmed-cases-of-covid-19-in-south-africa-6-nov-2020/. The check will be done for the last date in the CSV + 1 (to avoid missing days).
  4. It will scrape that page, capturing the three tables (cases, tests, deaths & recoveries); a rough R sketch of steps 3 and 4 follows this list.
  5. Run sanity checks (listed below).
  6. Subject to passing the checks, update the CSVs in the local repo and push directly to this repository. I'm not keen to automate the whole PR process as well.
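Roughly, steps 3 and 4 could look something like this in R, using rvest. The variable names and the exact slug formatting here are illustrative, not final:

```r
library(rvest)

# Day to look for: the last date already in the CSV + 1.
last_csv_date <- as.Date("2020-11-05")  # placeholder; would be read from the CSVs in step 2
next_date     <- last_csv_date + 1

# Build the candidate URL from the date. The exact slug format (abbreviated vs
# full month name, leading zero on the day) may need tweaking to match NICD.
slug <- tolower(format(next_date, "%d-%B-%Y"))
url  <- paste0(
  "https://www.nicd.ac.za/latest-confirmed-cases-of-covid-19-in-south-africa-",
  slug, "/"
)

# Try to fetch the page; if it isn't up yet, do nothing and wait for the next run.
page <- tryCatch(read_html(url), error = function(e) NULL)
if (!is.null(page)) {
  # Capture the tables on the page (cases, tests, deaths & recoveries).
  tables <- html_table(html_elements(page, "table"))
}
```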

Anyone have any concerns?

Sanity checks to stop the process on any of the following (a rough sketch follows the list):

  • Exceptions
  • Non-numeric data
  • Ensure numbers are strictly increasing.
  • Province name checks
  • Unrealistic increases? 10% on cumulative per day?
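Something along these lines, assuming new_row holds the scraped cumulative figures keyed by province code and prev_row the last row already in the CSV (both names are just illustrative):

```r
provinces <- c("EC", "FS", "GP", "KZN", "LP", "MP", "NC", "NW", "WC")

check_row <- function(new_row, prev_row) {
  # Province name check: every expected province must be present.
  if (!all(provinces %in% names(new_row))) stop("Missing or unexpected province names")

  # Non-numeric data: every provincial value must parse as a number.
  vals <- suppressWarnings(as.numeric(new_row[provinces]))
  if (any(is.na(vals))) stop("Non-numeric value in scraped data")

  # Increasing check: cumulative counts must never go down.
  prev <- as.numeric(prev_row[provinces])
  if (any(vals < prev)) stop("Cumulative count decreased")

  # Unrealistic increase: flag anything above ~10% day-on-day growth.
  if (any(vals > prev * 1.10)) stop("Suspiciously large daily increase")

  invisible(TRUE)
}
```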

Thoughts?

I've done this. An example of an automated commit in my fork:
lrossouw@af1390e

I've also just deleted a couple of weeks' data on my fork to see how well it does at updating the data.

My test above successfully processed 2 weeks of data. It stopped 3 times due to NICD site changes, but not once committed incorrect information. The information was identical other than the source URL.

Closed by b0adcaf

Thanks @lrossouw, this is so awesome. We can then reduce the chances of error.

NP. It should post within 15min or so of the page going up on NICD's site.

This is really awesome. Well done.

Thanks. I can also mention that if the process fails on a particular day because NICD changes the URL or the format of the page, someone can still capture that day manually. The scraper will then notice this and move on to the next day.
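The catch-up logic is basically just this; csv_dates (the dates already in the CSV) and scrape_day() are stand-ins for the real variable and the per-day scrape-and-check step:

```r
# Resume from the day after the most recent date already in the CSV,
# whether that row came from the scraper or from a manual capture.
start_day <- max(csv_dates) + 1

if (start_day <= Sys.Date()) {
  days_to_try <- seq(start_day, Sys.Date(), by = "day")
  for (i in seq_along(days_to_try)) {
    scrape_day(days_to_try[i])  # stops the run if any sanity check fails
  }
}
```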