- 2021-04-02 This repository is now no longer maintained. What does this mean?
- Some of the states, which required manual extraction, will no longer be updated.
- The automatic extraction for the rest of the states will continue to run daily, with results pushed to the update-data branch. However, we do not guarantee the accuracy of these data, as they are no longer checked.
- The processed data, gathering all sources across the states, is no longer updated.
- 2021-01-28 Version 1 Release This is the release related to our upcoming peer-reviewed age paper, where we use age-specific mobility data to estimate the epidemic in the USA by accounting for age-specific heterogeneity.
One may directly get:
- the age-specific mortality data used in the paper here
- the crude estimates of the COVID-19 cases and mortality across common age strata here
The user may directly find the latest update of the age-specific mortality by date, age and location in
data/processed/latest/DeathsByAge_US.csv
We aim to update the data at least once a week. The data set currently includes 44 U.S. states and 2 metropolitan areas. The locations are listed in the table below.
The easiest way to reproduce the results is to use Docker. A Dockerfile is in the repository.
Run:
sudo apt-get install docker.io # Debian/Ubuntu package name. On macOS you can use something like brew. In any case,
# you need to install Docker on your machine
docker build -t usaage .
docker run --rm -t -d --name usaage_container -v "$(pwd)":/code usaage
This will keep a docker container running in the background, which you can inspect using docker ps.
Now all development can be done in the container, and you can edit the code locally as usual (changes are synced to the container because the -v flag shares the folder). For convenience, you might want to use the Remote-SSH extension in VS Code. You can also attach a shell to the container using docker exec -it usaage_container /bin/bash.
You can check that everything works by running make all in the container.
The code is divided into 2 parts: First, the extraction of the COVID-19 mortality counts data from Department of Health websites. Second, the processing of the extracted data to create a complete time series of age-specific COVID-19 mortality counts for every location.
- Python version >= 3.6.1
- Python libraries:
  - fitz
  - PyMuPDF
  - pandas
  - pyjson
  - beautifulsoup4
  - requests
  - selenium
- R version >= 4.0.2
- R libraries:
  - data.table
  - ggplot2
  - scales
  - gridExtra
  - tidyverse
  - rjson
  - readxl
  - reshape2
To extract, run
$ make files
This will get you the latest data in data/$DATE.
To process, run
$ Rscript scripts/process.data.R
This will get you a CSV file for every state with variables age, date, daily.deaths and (state) code in data/processed/$DATE/.
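As a minimal sketch of how a processed file might be consumed, the snippet below reads a CSV with the variables named above (age, date, daily.deaths, code) and accumulates cumulative deaths per age band. It uses only the Python standard library. The function name and the sample rows are illustrative assumptions, not part of the repository, and the sketch assumes daily.deaths is always numeric.

```python
import csv
import io
from collections import defaultdict


def cumulative_deaths_by_age(csv_file):
    """Return {age_band: [(date, cumulative_deaths), ...]} from a processed CSV.

    Assumes the columns age, date, daily.deaths and code, with ISO dates
    (YYYY-MM-DD) so that sorting the strings sorts the days.
    """
    rows = sorted(csv.DictReader(csv_file), key=lambda r: (r["age"], r["date"]))
    running = defaultdict(float)
    series = defaultdict(list)
    for row in rows:
        running[row["age"]] += float(row["daily.deaths"])
        series[row["age"]].append((row["date"], running[row["age"]]))
    return dict(series)


# Made-up sample rows, for illustration only ("XX" is a placeholder code).
SAMPLE = """age,date,daily.deaths,code
0-19,2020-05-01,0,XX
0-19,2020-05-02,1,XX
20-49,2020-05-01,2,XX
20-49,2020-05-02,3,XX
"""

result = cumulative_deaths_by_age(io.StringIO(SAMPLE))
```

To run it on a real file, pass an open file handle, for example `open(path, newline="")`, instead of the in-memory sample.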
The main entry point is make files. make files will execute the files task in Makefile, which currently is composed only of the script ./download_files.sh. This script performs the following steps:
- Set a date, $date, in the local environment.
- Create new folders in data and pdfs for the $date.
- Run the following scripts:
  - scripts/age_extraction.py to extract the locations for which data are available in CSV, XLSX or JSON format.
  - a series of GET requests to the web API. They download CSVs made available by the DoH directly.
  - scripts/extraction_try.py, which downloads data that are in webpage, XLSX or PDF format.
  - python scripts/get_nm.py to get New Mexico data.
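The first two steps can be sketched in Python as follows. This is not the content of download_files.sh, only an illustration of the folder layout it is described as creating. The function name is hypothetical, and the ISO date format (YYYY-MM-DD) is an assumption.

```python
import datetime
import pathlib
import tempfile


def make_dated_dirs(root, date=None):
    """Create <root>/data/<date> and <root>/pdfs/<date>, returning both paths.

    `date` defaults to today's date in ISO format (an assumption about the
    format used by the real script).
    """
    date = date or datetime.date.today().isoformat()
    created = []
    for parent in ("data", "pdfs"):
        path = pathlib.Path(root) / parent / date
        path.mkdir(parents=True, exist_ok=True)  # safe to re-run on the same day
        created.append(path)
    return created


# Demonstration in a throwaway directory, so nothing is written to the repo.
with tempfile.TemporaryDirectory() as tmp:
    made = make_dated_dirs(tmp, "2021-01-28")
    names = [str(p.relative_to(tmp)).replace("\\", "/") for p in made]
    all_exist = all(p.is_dir() for p in made)
```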
Depending on the data format made available by the DoH, we do the following:
- PDFs: We use fitz to read the data within the PDFs and save them in JSON or CSV format.
- CSVs, XLSX, JSON: We download the data directly.
- Static webpages (HTML): We save the HTML, extract the data using BeautifulSoup, and save them in JSON format.
- Dynamic webpages (dashboards): We use selenium to render the webpage and switch to the right page. Then, if the data are stored in the source code, we find their path or CSS selector, extract them and save them in JSON format. Otherwise, if the webpage can be saved as a PDF, we use BeautifulSoup to download the webpage in PDF format and fitz to extract the data from the PDF. If neither option is available, we take a screenshot of the webpage and extract the data manually.
- Screenshots/PNGs: To record the data published in the dynamic webpages
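To illustrate the static-webpage case, here is a minimal sketch that turns an HTML table into JSON records. The repository uses BeautifulSoup for this step. The sketch deliberately swaps in the standard library's html.parser so that it runs without extra dependencies. The class name, function name and sample table are all hypothetical.

```python
import json
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None outside a row
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def html_table_to_json(html):
    """Use the first row as the header and return the other rows as JSON."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    if not parser.rows:
        return "[]"
    header, *body = parser.rows
    return json.dumps([dict(zip(header, row)) for row in body])


# Made-up page fragment, for illustration only.
SAMPLE_HTML = (
    "<table><tr><th>Age</th><th>Deaths</th></tr>"
    "<tr><td>0-19</td><td>1</td></tr>"
    "<tr><td>20-49</td><td>12</td></tr></table>"
)
records = json.loads(html_table_to_json(SAMPLE_HTML))
```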
We reconstruct time series for every location and age band, therefore all extracted data need to have the same age bands. If the DoH changes the reported age bands at time $t$ and:
- the old age bands can be used to find the new age bands, then we find the mortality counts by the old age bands for all data from $t$ before processing.
- the old age bands cannot be used to find the new age bands, then we truncate the time series: $t$ becomes the first day of the time series and all data extracted before $t$ are ignored.
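The rule above can be sketched as follows, under one reading of it: counts reported in the new bands are summed back into the old bands whenever every new band nests inside exactly one old band, and the function returns None otherwise, which signals that the time series should be truncated at $t$. All function names and the band label format ("0-17", "85+") are assumptions made for this example.

```python
def parse_band(label):
    """Turn '0-17' into (0, 17) and '85+' into (85, None)."""
    if label.endswith("+"):
        return int(label[:-1]), None
    low, high = label.split("-")
    return int(low), int(high)


def contains(outer, inner):
    """True if band `inner` lies entirely within band `outer`."""
    (o_lo, o_hi), (i_lo, i_hi) = outer, inner
    if i_lo < o_lo:
        return False
    if o_hi is None:  # open-ended outer band, e.g. 85+
        return True
    return i_hi is not None and i_hi <= o_hi


def regroup(new_counts, old_bands):
    """Sum counts reported by new bands into the old bands.

    Returns None when some new band does not nest inside exactly one old
    band, i.e. when the old bands cannot be recovered.
    """
    totals = {band: 0 for band in old_bands}
    for label, count in new_counts.items():
        parents = [b for b in old_bands
                   if contains(parse_band(b), parse_band(label))]
        if len(parents) != 1:
            return None
        totals[parents[0]] += count
    return totals


# Made-up counts: the finer new bands nest inside the old ones.
recovered = regroup({"0-9": 1, "10-19": 2, "20-49": 5, "50+": 7},
                    ["0-19", "20-49", "50+"])
# Made-up counts: "0-24" straddles two old bands, so regrouping fails.
not_recoverable = regroup({"0-24": 1, "25+": 3}, ["0-19", "20+"])
```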
1. Read the data
   - If a complete time series of age-specific COVID-19 attributable death burden is available:
     - Use only the latest data available.
     - Every state has its own processing function, depending on the data format.
   - If daily snapshots of age-specific COVID-19 attributable death burden are available:
     - Use all data ever extracted.
     - If CSV or XLSX: the state has its own processing function.
     - If JSON: a common processing function is used.
2. Ensure that the cumulative mortality counts are non-decreasing
   - Some DoH updates indicated a decreasing mortality count from one day to the next.
   - In this case, we set the mortality count on the earlier day to match the mortality count on the more recent day.
3. Find daily deaths
   - Some days had missing data, usually because no updates were reported, because the webpage failed, or because the URL of the website had changed.
   - The missing daily mortality counts were imputed, assuming a constant daily increase in the mortality count between days with data.
4. Check that the reconstructed cumulative deaths on the last day match the ones reported in the latest data.
The script that acts as a spine for those four stages is utils/obtain.data.R. Functions for stage 1 are in utils/read.daily-historical.data.R and utils/read.json.data.R. Functions for stages 2 and 3 are in utils/summary_functions.R. The function for stage 4 is in utils/sanity.check.processed.data.R.
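The repository implements these stages in R. As a rough Python illustration of the last three stages (monotonicity fix, imputation of missing days, final check), consider the sketch below. It assumes linear interpolation of the cumulative counts between days with data. All function names are hypothetical and the numbers are made up.

```python
import datetime


def enforce_non_decreasing(cumulative):
    """Walk backwards and lower any earlier count that exceeds a later one."""
    fixed = list(cumulative)
    for i in range(len(fixed) - 2, -1, -1):
        fixed[i] = min(fixed[i], fixed[i + 1])
    return fixed


def fill_missing_days(observed):
    """Interpolate cumulative counts linearly over days without data.

    `observed` maps datetime.date -> cumulative deaths on days with data.
    """
    days = sorted(observed)
    counts = enforce_non_decreasing([observed[d] for d in days])
    filled = {days[0]: float(counts[0])}
    for i in range(1, len(days)):
        gap = (days[i] - days[i - 1]).days
        step = (counts[i] - counts[i - 1]) / gap  # constant daily increase
        for k in range(1, gap + 1):
            day = days[i - 1] + datetime.timedelta(days=k)
            filled[day] = counts[i - 1] + step * k
    return filled


def daily_deaths(filled):
    """Difference consecutive cumulative counts to get daily deaths."""
    days = sorted(filled)
    return {d1: filled[d1] - filled[d0] for d0, d1 in zip(days, days[1:])}


def matches_latest(filled, daily, latest_reported, tol=1e-6):
    """Check that first-day count plus all daily deaths gives the latest total."""
    rebuilt = filled[min(filled)] + sum(daily.values())
    return abs(rebuilt - latest_reported) <= tol


def d(day):
    return datetime.date(2020, 5, day)


# Made-up snapshots: the count drops on May 3, and May 4-5 are missing.
observed = {d(1): 10, d(2): 14, d(3): 12, d(6): 18}
filled = fill_missing_days(observed)
daily = daily_deaths(filled)
ok = matches_latest(filled, daily, latest_reported=18)
```

In this example the May 2 count is lowered from 14 to 12 to match May 3, and the increase of 6 deaths between May 3 and May 6 is spread evenly over the three days.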
After reconstructing the time series, we make final adjustments for analysis:
- Modify the age band boundaries from the ones declared by the Department of Health, such that they reflect the closest age bands in the set A = { [0-4], [5-9], ..., [75-79], [80-84], [85+] }. For example, age band [0-17] becomes [0-19] and age band [61-65].
- Keep only days that match closely with JHU overall mortality counts.

Both data sets, adjusted and non-adjusted, are available as DeathsByAge_US_adj.csv and DeathsByAge_US.csv.
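A minimal sketch of the two adjustments is given below. It assumes that "closest" means snapping each boundary to the nearest point on the 5-year grid, and that "match closely" means a relative difference below some tolerance. The 10% tolerance, the function names and the sample totals are all hypothetical and not taken from the repository.

```python
def snap_band(low, high=None):
    """Snap an age band to the 5-year grid; high=None means open-ended.

    Lower bounds go to the nearest multiple of 5 (capped at 85) and upper
    bounds to the nearest value of the form 5k - 1. Anything reaching past
    84 is treated as open-ended.
    """
    snapped_low = min((low + 2) // 5 * 5, 85)
    if high is None or snapped_low == 85:
        return snapped_low, None
    snapped_high = (high + 3) // 5 * 5 - 1
    if snapped_high > 84:
        return snapped_low, None
    return snapped_low, snapped_high


def days_matching_jhu(ours, jhu, rel_tol=0.1):
    """Keep days whose all-age total is within rel_tol of the JHU total.

    `ours` and `jhu` map a day to cumulative deaths; rel_tol is a
    hypothetical threshold chosen for this example.
    """
    return [day for day, total in sorted(ours.items())
            if day in jhu
            and abs(total - jhu[day]) <= rel_tol * max(jhu[day], 1)]


# Made-up totals: the second day is too far from the JHU figure.
kept = days_matching_jhu({"2020-06-01": 100, "2020-06-02": 150},
                         {"2020-06-01": 105, "2020-06-02": 110})
```

With this rule, `snap_band(0, 17)` returns `(0, 19)`, which agrees with the [0-17] example above.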
This table includes a complete list of all sources ever used in the data set. We acknowledge and are grateful to U.S. state Departments of Health for making the primary data available at the following sources:
State | Date record start | Link(s) | Notes |
---|---|---|---|
Alabama | 2020-05-03 | link | dashboard updated daily and replaced; no historical archive |
Alaska | 2020-06-09 | link | metadata updated daily and replaced; no historical archive |
Arizona | 2020-05-13 | link | dashboard updated daily and replaced; no historical archive |
California | 2020-05-13 | link | dashboard updated daily and replaced; no historical archive |
Colorado | 2020-03-23 | (1) link until 2020-08-20, (2) link since 2020-08-20 | (1) metadata updated daily; full time series; died in 2020-08-20; (2) dashboard updated daily and replaced; no historical archive |
Connecticut | 2020-04-05 | link | metadata updated daily; full time series |
Delaware | 2020-05-12 | link | dashboard updated daily and replaced; no historical archive |
District of Columbia | 2020-04-13 | link | metadata updated daily; full time series |
Florida | 2020-03-27 | link | daily report; with historical archive |
Hawaii | 2020-09-18 | link | dashboard updated weekly and replaced |
Georgia | 2020-04-27 | link | metadata updated daily and replaced; no historical archive |
Idaho | 2020-05-13 | (1) link, (2) link | dashboard updated daily and replaced; no historical archive ; (1) died on 2020-09-04 |
Illinois | 2020-05-14 | link | dashboard updated daily and replaced; no historical archive |
Indiana | 2020-05-13 | link | dashboard updated daily and replaced; no historical archive |
Iowa | 2020-05-13 | link | dashboard updated daily and replaced; no historical archive |
Kansas | 2020-05-13 | link | dashboard updated Monday, Wednesday and Friday, and replaced; no historical archive |
Kentucky | 2020-05-13 | link | dashboard updated daily and replaced; no historical archive |
Louisiana | 2020-05-12 | link | dashboard updated daily except on Saturday and replaced; no historical archive |
Maine | 2020-03-12 | link | metadata updated daily; full time series |
Maryland | 2020-05-14 | link | dashboard updated daily and replaced; no historical archive |
Massachusetts | 2020-04-20 | (1) link until 2020-08-11, (2) link since | (1) daily report, with historical archive; (2) weekly report, with historical archive |
Michigan | 2020-03-21 | (1) data/req/michigan weekly.csv and (2) link | (1) data requested from the DoH; (2) dashboard updated daily and replaced; no historical archive |
Minnesota | 2020-05-21 | link | weekly report, with historical archive |
Mississippi | 2020-04-27 | link | dashboard updated daily and replaced; no historical archive |
Missouri | 2020-05-13 | (1)link and (2)link | dashboard updated daily and replaced; no historical archive |
Nevada | 2020-06-07 | link | dashboard updated daily and replaced; no historical archive |
New Hampshire | 2020-06-07 | (1)link until 2021-01-08, and (2)link since 2021-01-08 | dashboard updated daily and replaced; no historical archive |
New Jersey | 2020-05-25 | link | dashboard updated daily and replaced; no historical archive |
New Mexico | 2020-05-25 | link | daily written report; with historical archive |
New York City | 2020-04-14 | link, link since 2020-05-18, link since 2020-11-08 | report / CSV updated daily, with historical archive |
North Carolina | 2020-05-20 | link | dashboard updated daily and replaced; no historical archive |
North Dakota | 2020-05-14 | link | dashboard updated daily and replaced; no historical archive |
Oklahoma | 2020-05-13 | link | dashboard updated daily and replaced; no historical archive |
Oregon | 2020-06-05 | link | dashboard updated Monday to Friday and sometimes on Saturday, and replaced; no historical archive |
Pennsylvania | 2020-06-07 | (1)link and (2)link | dashboard updated daily and replaced; no historical archive |
Rhode Island | 2020-06-01 | link | metadata updated weekly and replaced; no historical archive |
South Carolina | 2020-05-14 | link | dashboard updated on Tuesday and Friday; no historical archive |
Tennessee | 2020-04-09 | link | metadata updated daily; full time series |
Texas | 2020-05-06 | (1) link until 2020-09-24, (2) link since 2020-09-24 | metadata updated daily and replaced; no historical archive |
Utah | 2020-06-17 | link | dashboard updated daily and replaced; no historical archive |
Vermont | 2020-05-13 | (1) link until 2020-09-03, (2) link since 2020-09-03 | dashboard updated daily and replaced; no historical archive; (1) does not report mortality by age since 2020-09-03 |
Virginia | 2020-04-21 | link | metadata updated daily; full time series |
Washington | 2020-06-08 | link | dashboard updated daily and replaced; no historical archive |
Wisconsin | 2020-03-15 | (1) link until 2020-10-19, (2) link since 2020-10-19 | metadata updated daily; full time series |
Wyoming | 2020-09-22 | link | dashboard updated daily and replaced; no historical archive |
- Yu Chen - Department of Mathematics, Imperial College London
- Michael Hutchinson - Department of Statistics, Oxford
- Vidoushee Jogarah - Mary Lister McCammon Fellow, Department of Mathematics, Imperial College London
- Mélodie Monod - Department of Mathematics, Imperial College London
- Oliver Ratmann - Department of Mathematics, Imperial College London
- Harrison Zhu - Department of Mathematics, Imperial College London
- Martin McManus - Department of Mathematics, Imperial College London
This data set is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0) by Imperial College London on behalf of its COVID-19 Response Team. Copyright Imperial College London 2020.
Imperial makes no representation or warranty about the accuracy or completeness of the data nor that the results will not constitute an infringement of third-party rights. Imperial accepts no liability or responsibility for any use which may be made of any results, for the results, nor for any reliance which may be placed on any such work or results.
Attribute the data as the "COVID-19 Age specific Mortality Data Repository by the Imperial College London COVID-19 Response Team", and the URLs specified above.
We acknowledge the support of the EPSRC through the EPSRC Centre for Doctoral Training in Modern Statistics and Statistical Machine Learning at Imperial and Oxford.
This research was partly funded by The Imperial College COVID-19 Research Fund.