- Project overview
- Output
- Development
Data is often not collected by Black communities when it is needed the most. We have compiled a list of the states that have shared data on COVID-19 infections and deaths by race and those that have not. This effort extracts that data from websites to track disparities in COVID-19 deaths and cases for Black people.
The scrapers are written in Python, and call out to binaries for PDF data extraction and OCR.
The default outputs are date-stamped CSV and XLSX files in the `output/` subdirectory.
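As a minimal sketch of how the newest output file might be located, the helper below picks the most recently modified CSV in a directory. The function name `latest_output` is hypothetical and not part of this repo; it relies on modification time because the exact date-stamp pattern of the file names is not assumed here.

```python
from pathlib import Path
from typing import Optional


def latest_output(output_dir: str = "output", suffix: str = ".csv") -> Optional[Path]:
    """Return the most recently modified file with the given suffix, or None."""
    candidates = [p for p in Path(output_dir).glob(f"*{suffix}") if p.is_file()]
    if not candidates:
        return None
    # Modification time is used so that no file-name pattern has to be assumed.
    return max(candidates, key=lambda p: p.stat().st_mtime)
```

Passing `suffix=".xlsx"` would select the newest XLSX file instead.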
Feature Name | Description |
---|---|
Location | The geographic entity for which this row provides data. These can be states, counties, or cities. |
Date published | The date as of which the underlying data was published by the reporting entity. |
Date/time of data pull | The date/time at which the D4BL team ran the code to retrieve the data. |
Total Cases | The number of confirmed COVID-19 cases reported for the location. |
Total Deaths | The number of deaths attributed to COVID-19 reported for the location. |
Count Cases Black/AA | The number of confirmed COVID-19 cases corresponding to “Black or African American” or “Non-Hispanic Black” reported for the location. |
Count Deaths Black/AA | The number of confirmed COVID-19 deaths corresponding to “Black or African American” or “Non-Hispanic Black” reported for the location. |
Percentage of Cases Black/AA | The percentage of COVID-19 cases (of those with race reported) corresponding to “Black or African American” or “Non-Hispanic Black”. |
Percentage of Deaths Black/AA | The percentage of COVID-19 deaths (of those with race reported) corresponding to “Black or African American” or “Non-Hispanic Black”. |
Percentage includes unknown race? | Logical (True/False) indicator of whether the Percentage of Cases Black/AA field includes COVID-19 cases with race/ethnicity unknown. |
Percentage includes Hispanic Black? | Logical (True/False) indicator of whether the Black/AA percentage fields include Hispanic Black individuals. |
Count Cases Known Race | The number of cases in which race was reported and, hence, “known”. |
Count Deaths Known Race | The number of deaths in which race was reported and, hence, “known”. |
Percentage of Black/AA population (Census data) | The percentage of “Black or African American alone” individuals for the region, computed using 2013-2018 American Community Survey fields B02001_003E and B02001_001E. |
Note: older output files may not include all of the fields.
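The percentage fields above can be related to the count fields with a short sketch. This is an illustration under stated assumptions, not the project's actual computation: it assumes the CSV column headers match the feature names in the table, it uses the known-race count as the denominator, and the sample row and Census values are made up. The helper names (`to_int`, `percentage`, `black_case_percentage`) are hypothetical.

```python
import csv
import io
from typing import Optional

# Made-up sample row for illustration only; these are not real figures.
SAMPLE_CSV = """Location,Count Cases Black/AA,Count Cases Known Race
Examplestate,250,1000
"""


def to_int(value: Optional[str]) -> Optional[int]:
    """Convert a CSV cell to int, returning None for missing or blank cells."""
    if value is None or value.strip() == "":
        return None
    return int(float(value))


def percentage(part: Optional[int], whole: Optional[int]) -> Optional[float]:
    """Return 100 * part / whole, or None when it cannot be computed."""
    if part is None or whole is None or whole == 0:
        return None
    return round(100.0 * part / whole, 2)


def black_case_percentage(row: dict) -> Optional[float]:
    # row.get() tolerates older output files that lack a column.
    return percentage(
        to_int(row.get("Count Cases Black/AA")),
        to_int(row.get("Count Cases Known Race")),
    )


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
print(black_case_percentage(rows[0]))  # 25.0

# The Census share is the same ratio applied to the two ACS fields.
census = {"B02001_003E": 120, "B02001_001E": 1000}  # hypothetical values
print(percentage(census["B02001_003E"], census["B02001_001E"]))  # 12.0
```

Returning `None` rather than raising keeps rows from older output files, which may not include every field, from breaking a batch computation.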
Ensure Python 3.8 is installed.
Fork and clone the repository. Then change directory to the root of the repo (`./COVID19_tracker_data_extraction`). The subsequent steps need to be run from there.
If you do not already have `pip`, install it:
curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py
python get-pip.py
Note: A virtual environment is a recommended way to keep the packages for this repo isolated. You can choose to use a different environment manager such as `conda`, or even install the packages globally if you prefer.
pip install virtualenv
For example,
virtualenv d4blcovid19tracker
source d4blcovid19tracker/bin/activate
Adding a shell function to enter this environment quickly can be helpful. For example, in your `~/.zshrc` or `~/.bashrc`:
enter_d4bl() {
cd /path/to/COVID19_tracker_data_extraction/workflow/python
source /path/to/d4blcovid19tracker/bin/activate
}
pip install -r requirements.txt
We provide a script wrapping `brew` to install the required non-Python binaries on Macs.
./setup_mac.sh
For Linux distributions that use `apt` and `snap`, you can install the prerequisites with these commands:
apt install tesseract-ocr
apt install chromium-browser
apt install chromium-chromedriver
snap install chromium
apt install openssl
We use `pre-commit` to lint and format files added to your local git index on a `git commit`. This runs before the commit takes place, so if there are errors, the commit will not take place.
pre-commit install
From the `workflow/python` subdirectory, the main script is `run_scrapers.py`.
There are quite a few options:
$ python run_scrapers.py --help
usage: run_scrapers.py [-h] [--list_scrapers] [--work_dir DIR] [--output FILE] [--log_file FILE] [--log_level LEVEL] [--no_log_to_stderr]
[--stderr_log_level {CRITICAL,ERROR,WARNING,INFO,DEBUG}] [--google_api_key KEY] [--github_access_token KEY] [--census_api_key KEY]
[--enable_beta_scrapers] [--start_date START_DATE] [--end_date END_DATE]
[SCRAPER [SCRAPER ...]]
Run some or all scrapers
positional arguments:
SCRAPER List of scrapers to run, or all if omitted
optional arguments:
-h, --help show this help message and exit
--list_scrapers List the known scraper names
--work_dir DIR Write working outputs to subdirectories of DIR.
--output FILE Write output to FILE (must be -, or have csv or xlsx extension)
--log_file FILE Write logs to FILE
--log_level LEVEL Set log level for the log_file to LEVEL
--no_log_to_stderr Disable logging to stderr.
--stderr_log_level {CRITICAL,ERROR,WARNING,INFO,DEBUG}
Set log level for stderr to LEVEL
--google_api_key KEY Provide a key for accessing Google APIs.
--github_access_token KEY
Provide a token for accessing Github APIs.
--census_api_key KEY Provide a key for accessing Census APIs.
--enable_beta_scrapers
Include beta scrapers when not specifying scrapers manually.
--start_date START_DATE
If set, acquire data starting on the specified date in ISO format.
--end_date END_DATE If set, acquire data through the specified date in ISO format, inclusive.
Depending on the scrapers invoked, you need to provide keys. Here are links on how to register for them:
- Google API key: Required for `Colorado`.
- Github access token: Required for `NewYorkCity`.
- Census API key: Recommended for all scrapers.
There are no beta scrapers at this time, and the date range options are not broadly implemented yet.
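For scripted runs, the options above can be assembled into an argument list. The sketch below is one possible wrapper, not part of the repo: `build_command` is a hypothetical helper, and it uses only flags that appear in the help text (`--output`, `--census_api_key`, `--start_date`, `--end_date`).

```python
import sys
from datetime import date
from typing import List, Optional, Sequence


def build_command(
    scrapers: Sequence[str] = (),
    output: Optional[str] = None,
    census_api_key: Optional[str] = None,
    start_date: Optional[date] = None,
    end_date: Optional[date] = None,
) -> List[str]:
    """Assemble an argv list for run_scrapers.py using flags from its help text."""
    cmd = [sys.executable, "run_scrapers.py"]
    if output is not None:
        cmd += ["--output", output]
    if census_api_key is not None:
        cmd += ["--census_api_key", census_api_key]
    if start_date is not None:
        # date.isoformat() yields YYYY-MM-DD, the ISO format the help text asks for.
        cmd += ["--start_date", start_date.isoformat()]
    if end_date is not None:
        cmd += ["--end_date", end_date.isoformat()]
    # Positional scraper names go last; omitting them runs all scrapers.
    cmd += list(scrapers)
    return cmd


print(build_command(["Alabama", "Georgia"], output="results.csv")[1:])
# ['run_scrapers.py', '--output', 'results.csv', 'Alabama', 'Georgia']
```

The resulting list could then be passed to `subprocess.run(cmd, cwd="workflow/python", check=True)` from the repo root. Building a list rather than a shell string avoids quoting problems with API keys.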
The currently implemented scrapers are:
$ python run_scrapers.py --list_scrapers
Known scrapers:
Alabama
Alaska
Arizona
Arkansas
California
CaliforniaLosAngeles
CaliforniaSanDiego
CaliforniaSanFrancisco
Colorado
Connecticut
Delaware
Florida
FloridaMiamiDade
FloridaOrange
Georgia
Hawaii
Idaho
Illinois
Indiana
Iowa
Kansas
Kentucky
Louisiana
Maine
Maryland
Massachusetts
Michigan
Minnesota
Mississippi
Missouri
Montana
Nebraska
Nevada
NewHampshire
NewMexico
NewYork
NewYorkCity
NorthCarolina
NorthDakota
Ohio
Oklahoma
Oregon
Pennsylvania
RhodeIsland
SouthCarolina
SouthDakota
Tennessee
Texas
TexasBexar
Utah
Vermont
Virginia
Washington
WashingtonDC
WestVirginia
Wisconsin
WisconsinMilwaukee
Wyoming
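If the scraper list is needed programmatically, the `--list_scrapers` output can be parsed with a few lines. This is a sketch that assumes the output has the shape shown above (a `Known scrapers:` header followed by one name per line); `parse_scraper_list` is a hypothetical helper, not a function in the repo.

```python
from typing import List


def parse_scraper_list(text: str) -> List[str]:
    """Extract scraper names from `--list_scrapers` output."""
    names = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and the "Known scrapers:" header.
        if not line or line.endswith(":"):
            continue
        names.append(line)
    return names


sample = "Known scrapers:\n  Alabama\n  Alaska\n  Arizona\n"
print(parse_scraper_list(sample))  # ['Alabama', 'Alaska', 'Arizona']
```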