scrapers for the pitch map

Primary language: Python. License: ISC.

NOTE: This project is no longer actively maintained.

COVID World Scrapers

Overview

This project provides a command-line tool for scraping COVID-19 data from countries around the world.

The scrapers target the subset of countries that offer coronavirus data at the level of administrative units (provinces, states, territories within a country).

Organizations such as Johns Hopkins University are a better resource for comprehensive country-wide figures.

Install

  • Install a recent version of Firefox.
  • Download and unpack Geckodriver to a location on the PATH (or update the PATH environment variable to include its location).
  • Install the covid-world-scraper command-line tool:
pip install git+https://github.com/biglocalnews/covid-world-scraper#egg=covid-world-scraper

Use

The covid-world-scraper command-line tool lets you download the current data for one or more countries by supplying their 3-letter ISO country codes.

# List available country scrapers
covid-world-scraper -l

# Run all scrapers at once, sequentially
covid-world-scraper --all

# Run selected countries (Brazil, Germany, Pakistan)
# by passing in one or more 3-letter ISO country codes
covid-world-scraper bra deu pak

# To see other available CLI options
covid-world-scraper --help

By default, data for each country is written to a covid-world-scraper-data folder in a user's home directory. This location can be updated using the --cache-dir flag:

covid-world-scraper --cache-dir=/tmp/some-other-name bra
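
If the tool needs to be driven from a Python script (for example, from a scheduled job), the commands above can be assembled and run with the standard library. This is a minimal sketch, not part of the project: `build_command` and `run_scraper` are hypothetical helper names, and only the flags shown in this README (`--cache-dir`, `--alert`, `--all`) are used.

```python
import subprocess


def build_command(country_codes, cache_dir=None, alert=False):
    """Assemble a covid-world-scraper argument list (hypothetical helper).

    Pass ["--all"] as country_codes to run every scraper.
    """
    cmd = ["covid-world-scraper"]
    if cache_dir:
        cmd.append(f"--cache-dir={cache_dir}")
    if alert:
        cmd.append("--alert")
    cmd.extend(country_codes)
    return cmd


def run_scraper(country_codes, cache_dir=None, alert=False):
    """Run the CLI and raise CalledProcessError if it exits with an error."""
    return subprocess.run(build_command(country_codes, cache_dir, alert), check=True)
```

For example, `run_scraper(["bra"], cache_dir="/tmp/some-other-name")` would issue the same command as the line above, assuming the tool is installed and on the PATH.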

For each country, scrapers download and store one or more file artifacts in a raw directory. These files may be screenshots, HTML, Excel files, etc. Data extracted from these raw sources are stored in a processed directory for each country. Files in both directories are named based on the UTC runtime of the scraper.
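
The UTC-based file names can be produced and parsed with Python's `datetime` module. The sketch below assumes the pattern is `YYYYMMDDTHHMMZ`, inferred from names such as 20200627T0126Z.csv; the function names are hypothetical and this is not necessarily how the scrapers build the names internally.

```python
from datetime import datetime, timezone
from pathlib import Path

# Assumed pattern, inferred from file names like 20200627T0126Z.csv
STAMP_FORMAT = "%Y%m%dT%H%MZ"


def artifact_name(run_time: datetime, extension: str) -> str:
    """Build a file name from a timezone-aware run time (hypothetical helper)."""
    stamp = run_time.astimezone(timezone.utc).strftime(STAMP_FORMAT)
    return f"{stamp}.{extension}"


def parse_artifact_time(path: Path) -> datetime:
    """Recover the UTC run time from a timestamped file name."""
    return datetime.strptime(path.stem, STAMP_FORMAT).replace(tzinfo=timezone.utc)
```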

Below is an example showing file artifacts generated by the Pakistan scraper on two consecutive days in June 2020.

The types of raw files saved for a given country vary widely and reflect the different ways each country posts its data.

covid-world-scraper-data/pak
├── processed
│   ├── 20200627T0126Z.csv
│   └── 20200628T1705Z.csv
└── raw
    ├── 20200627T0126Z.html
    ├── 20200627T0126Z.png
    ├── 20200627T0126Z.txt
    ├── 20200628T1705Z.html
    ├── 20200628T1705Z.png
    └── 20200628T1705Z.txt
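
Because the timestamped names sort chronologically, the most recent processed file for a country can be found by sorting the file names. The following is a rough sketch under a few assumptions: the default cache folder is covid-world-scraper-data in the home directory (as described above), country folders use lower-case codes (as in the pak example), and `latest_processed_csv` is a hypothetical helper, not part of the tool.

```python
from pathlib import Path
from typing import Optional


def latest_processed_csv(country_code: str, cache_dir: Optional[Path] = None) -> Optional[Path]:
    """Return the newest processed CSV for a country, or None if there is none."""
    # Assumed default location, per the README's description of the cache folder
    base = cache_dir if cache_dir is not None else Path.home() / "covid-world-scraper-data"
    processed = base / country_code.lower() / "processed"
    if not processed.is_dir():
        return None
    # Names like 20200628T1705Z.csv sort lexicographically in time order
    candidates = sorted(processed.glob("*.csv"))
    return candidates[-1] if candidates else None
```

With the directory shown above, `latest_processed_csv("pak")` would point at 20200628T1705Z.csv.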

Alerts

The command-line tool can send status alerts about scraper runs to Slack. This requires:

  • Creating a Slack app and integrating it into a workspace
  • Obtaining a Slack App API token
  • Creating environment variables for the API key and target channel

See the Python slackclient docs for details on setting up a Slack app, integrating with a workspace, and obtaining an API key.

# e.g., in ~/.bash_profile or ~/.bashrc
export COVID_WORLD_SLACK_API_KEY=YOUR_API_KEY
export COVID_WORLD_SLACK_CHANNEL=channel-name
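
Before enabling alerts, it can help to confirm that both variables are visible to the process that will run the scrapers. The check below is a small sketch using only the two variable names given above; `missing_slack_settings` is a hypothetical helper and does not contact Slack or validate the key.

```python
import os

# Variable names taken from the export lines above
REQUIRED_VARS = ("COVID_WORLD_SLACK_API_KEY", "COVID_WORLD_SLACK_CHANNEL")


def missing_slack_settings(environ=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_slack_settings()
    if missing:
        print("Missing Slack settings:", ", ".join(missing))
    else:
        print("Slack settings found.")
```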

After completing the above steps, use the --alert command-line option to send Slack alerts when scrapers are run:

# Scrape all countries and send alerts to Slack
covid-world-scraper --alert --all

Credits

This project relies on country code data from the GeoNames project.