dsfsi/covid19za

NICD Provincial Data Scraper

lrossouw opened this issue · 7 comments

I will probably build a web scraper for the NICD data, most likely using R.

I see it working roughly as follows, running on an hourly cron:

  1. It will pull the covid19za repo and ensure it has all the latest commits.
  2. Read in the relevant CSVs.
  3. Then check for new pages in the format https://www.nicd.ac.za/latest-confirmed-cases-of-covid-19-in-south-africa-%d-%B-%Y/, e.g. https://www.nicd.ac.za/latest-confirmed-cases-of-covid-19-in-south-africa-6-nov-2020/. The check will be done for the last date in the CSV + 1 (to avoid missing days).
  4. It will scrape that page, capturing the three tables (cases, tests, deaths & recoveries); a rough R sketch of steps 3 and 4 follows this list.
  5. Run sanity checks (listed below).
  6. Subject to passing the checks, update the CSVs in the local repo and push directly to this repository. I'm not keen to automate the whole PR process as well.
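Roughly, steps 3 and 4 could look something like this in R, using rvest. The variable names and the exact slug formatting here are illustrative, not final:

```r
library(rvest)

# Day to look for: the last date already in the CSV + 1.
last_csv_date <- as.Date("2020-11-05")  # placeholder; would be read from the CSVs in step 2
next_date     <- last_csv_date + 1

# Build the candidate URL from the date. The exact slug format (abbreviated vs
# full month name, leading zero on the day) may need tweaking to match NICD.
slug <- tolower(format(next_date, "%d-%B-%Y"))
url  <- paste0(
  "https://www.nicd.ac.za/latest-confirmed-cases-of-covid-19-in-south-africa-",
  slug, "/"
)

# Try to fetch the page; if it isn't up yet, do nothing and wait for the next run.
page <- tryCatch(read_html(url), error = function(e) NULL)
if (!is.null(page)) {
  # Capture the tables on the page (cases, tests, deaths & recoveries).
  tables <- html_table(html_elements(page, "table"))
}
```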

Anyone have any concerns?

Sanity checks to stop the process on any of the following (a rough sketch follows the list):

  • Exceptions
  • Non-numeric data
  • Ensure numbers are strictly increasing.
  • Province name checks
  • Unrealistic increases? 10% on cumulative per day?
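Something along these lines, assuming new_row holds the scraped cumulative figures keyed by province code and prev_row the last row already in the CSV (both names are just illustrative):

```r
provinces <- c("EC", "FS", "GP", "KZN", "LP", "MP", "NC", "NW", "WC")

check_row <- function(new_row, prev_row) {
  # Province name check: every expected province must be present.
  if (!all(provinces %in% names(new_row))) stop("Missing or unexpected province names")

  # Non-numeric data: every provincial value must parse as a number.
  vals <- suppressWarnings(as.numeric(new_row[provinces]))
  if (any(is.na(vals))) stop("Non-numeric value in scraped data")

  # Increasing check: cumulative counts must never go down.
  prev <- as.numeric(prev_row[provinces])
  if (any(vals < prev)) stop("Cumulative count decreased")

  # Unrealistic increase: flag anything above ~10% day-on-day growth.
  if (any(vals > prev * 1.10)) stop("Suspiciously large daily increase")

  invisible(TRUE)
}
```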

Thoughts?

I've done this. An example of an automated commit in my fork:
lrossouw@af1390e

I've also just deleted a couple of weeks' data on my fork to see how well it does at updating the data.

My test above successfully processed 2 weeks of data. It stopped 3 times due to NICD site changes, but not once committed incorrect information. The information was identical other than the source URL.

Closed by b0adcaf

Thanks @lrossouw, this is so awesome. We can then reduce the chances of error.

NP. It should post within 15min or so of the page going up on NICD's site.

This is really awesome. Well done.

Thanks. I can also mention that if the process fails on a particular day because NICD changes the URL or the format of the page, someone can still capture that day manually. The scraper will then notice this and move on to the next day.
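The catch-up logic is basically just this; csv_dates (the dates already in the CSV) and scrape_day() are stand-ins for the real variable and the per-day scrape-and-check step:

```r
# Resume from the day after the most recent date already in the CSV,
# whether that row came from the scraper or from a manual capture.
start_day <- max(csv_dates) + 1

if (start_day <= Sys.Date()) {
  days_to_try <- seq(start_day, Sys.Date(), by = "day")
  for (i in seq_along(days_to_try)) {
    scrape_day(days_to_try[i])  # stops the run if any sanity check fails
  }
}
```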