This repo contains free datasets of historical data related to COVID-19.
A simple visualization tool was built to explore the Open COVID-19 datasets: https://open-covid-19.github.io/explorer/.
The data is available as CSV and JSON files, which are published to GitHub Pages so they can be served directly to JavaScript applications without the need for a proxy to set the correct CORS and content-type headers. Each dataset has a version with all historical data and another with only the latest daily data. The datasets currently available are:
Dataset | CSV URL | JSON URL |
---|---|---|
Data | Latest, Historical | Latest, Historical |
The columns of the main dataset are:
Name | Description | Example |
---|---|---|
Date | ISO 8601 date (YYYY-MM-DD) of the datapoint | 2020-03-21 |
CountryCode | ISO 3166-1 code of the country | CN |
CountryName | American English name of the country | China |
RegionCode | (Optional) ISO 3166-2 code of the region | HB |
RegionName | (Optional) American English name of the region | Hubei |
Confirmed | Total number of cases confirmed after positive test | 67800 |
Deaths | Total number of deaths from a positive COVID-19 case | 3139 |
Latitude | Floating point representing the geographic latitude of the region | 30.9756 |
Longitude | Floating point representing the geographic longitude of the region | 112.2707 |
Population | Total population count of the region | TODO |
For countries where both country-level and region-level data are available, the entry with a null value in the RegionCode and RegionName columns indicates country-level aggregation. Please note that the country-level data and the region-level data sometimes come from different sources, so the sum of all region-level values may not exactly equal the reported country-level value.
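The country-level rows can be separated from the region-level rows by filtering on a null RegionCode. Below is a minimal pandas sketch; the inline sample data (including the China figures) is illustrative, not taken from the live dataset:

```python
import io
import pandas as pd

# Illustrative sample mixing a country-level row (empty RegionCode)
# with a region-level row for Hubei
sample_csv = """Date,CountryCode,CountryName,RegionCode,RegionName,Confirmed,Deaths
2020-03-21,CN,China,,,81305,3259
2020-03-21,CN,China,HB,Hubei,67800,3139
"""

df = pd.read_csv(io.StringIO(sample_csv))

# Rows with a null RegionCode are country-level aggregates
country_level = df[df["RegionCode"].isna()]
region_level = df[df["RegionCode"].notna()]

print(country_level[["CountryName", "Confirmed", "Deaths"]])
```

The same filter works against the published CSV once it is loaded, since empty RegionCode fields are parsed as null values by pandas.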
There is also a short-term forecasting dataset available in the output folder as data_forecast.csv, which has the following columns:
```
ForecastDate,Date,CountryCode,CountryName,RegionCode,RegionName,Estimated,Confirmed
2020-03-21,2020-03-08,AE,United Arab Emirates,,,48.193,45
```
Name | Description | Example |
---|---|---|
ForecastDate | ISO 8601 date (YYYY-MM-DD) of last available datapoint | 2020-03-21 |
Date | ISO 8601 date (YYYY-MM-DD) of the datapoint | 2020-03-25 |
CountryCode | ISO 3166-1 code of the country | CN |
CountryName | American English name of the country | China |
RegionCode | (Optional) ISO 3166-2 code of the region | HB |
RegionName | (Optional) American English name of the region | Hubei |
Estimated | Total number of cases estimated from forecasting model | 66804.567 |
Confirmed | Total number of cases confirmed after positive test | 67800 |
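Since the forecast dataset pairs model estimates with confirmed counts on overlapping dates, the forecast error can be computed directly. This sketch uses the sample row shown above; the Error column is a derived value added here for illustration, not a column of the dataset:

```python
import io
import pandas as pd

# Sample row from data_forecast.csv, as shown above
sample_csv = """ForecastDate,Date,CountryCode,CountryName,RegionCode,RegionName,Estimated,Confirmed
2020-03-21,2020-03-08,AE,United Arab Emirates,,,48.193,45
"""

df = pd.read_csv(io.StringIO(sample_csv))

# Where both values exist, the difference measures the model's error
df["Error"] = df["Estimated"] - df["Confirmed"]

print(df[["Date", "CountryName", "Estimated", "Confirmed", "Error"]])
```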
Please note that the following datasets are maintained only to preserve backwards compatibility, but shouldn't be used in any new projects:
You may also want to load the data and perform your own analysis on it. You can find Jupyter Notebooks in the analysis repository with examples of how to load and analyze the data.
You can even use Google Colab if you want to run your analysis without having to install anything on your computer; simply go to this URL: https://colab.research.google.com/github/open-covid-19/analysis.
The world data comes from the daily reports at the ECDC portal. The XLS file is downloaded and parsed using scrapy and pandas.
Data for Chinese regions and Italy (see #12) comes from the DXY scraped dataset, which is parsed using pandas.
The data is automatically crawled and parsed using the scripts found in the input folder. This is done daily, and as part of the processing some additional columns are added, like region-level coordinates.
Before updating the outputs, data is spot-checked against various sources, including data from local authorities such as Italy's Ministry of Health and the reports from the WHO.
This dataset is heavily inspired by the dataset maintained by Johns Hopkins University. Unfortunately, that dataset has been intermittently experiencing maintenance issues, and many applications depend on this critical data being available in a timely manner. Further, the true sources of data for that dataset are still unclear.
To update the contents of the output folder, first install the dependencies:
```sh
# Install Ghostscript
apt-get install -y ghostscript

# Install Python dependencies
pip install -r requirements.txt
```
Then run the following scripts to update all datasets:
```sh
sh input/update_data.sh
```