
Open COVID-19 Dataset

This repo contains free datasets of historical data related to COVID-19.

Explore the data

A simple visualization tool was built to explore the Open COVID-19 datasets: https://open-covid-19.github.io/explorer/.

[Explorer screenshot]

Use the data

The data is available as CSV and JSON files, which are published on GitHub Pages so they can be served directly to JavaScript applications without the need for a proxy to set the correct CORS and content-type headers. Each dataset has a version with all historical data and another with only the latest daily data. The datasets currently available are:

Dataset | CSV URL            | JSON URL
Data    | Latest, Historical | Latest, Historical
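
As a minimal sketch of loading the data with Python, the snippet below uses pandas (an assumption, not a requirement of the dataset) and expects the historical CSV from the table above to have been saved locally as data.csv:

import pandas as pd

# "data.csv" stands in for the historical CSV linked in the table above.
df = pd.read_csv("data.csv", parse_dates=["Date"])
print(df.tail())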

Understand the data

The columns of the main dataset are:

Name        | Description                                            | Example
Date        | ISO 8601 date (YYYY-MM-DD) of the datapoint            | 2020-03-21
CountryCode | ISO 3166-1 code of the country                         | CN
CountryName | American English name of the country                   | China
RegionCode  | (Optional) ISO 3166-2 code of the region               | HB
RegionName  | (Optional) American English name of the region         | Hubei
Confirmed   | Total number of cases confirmed after a positive test  | 67800
Deaths      | Total number of deaths from a positive COVID-19 case   | 3139
Latitude    | Floating point latitude of the geographic coordinate   | 30.9756
Longitude   | Floating point longitude of the geographic coordinate  | 112.2707
Population  | Total count of humans living in the region             | TODO

For countries where both country-level and region-level data are available, the entry with a null value in the RegionCode and RegionName columns indicates country-level aggregation. Please note that the country-level and region-level data sometimes come from different sources, so adding up all region-level values may not exactly match the reported country-level value.
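
Continuing the loading sketch above, one way to separate the two levels with pandas (column names are taken from the table; the file path is an assumption):

import pandas as pd

# Load as in the earlier sketch; "data.csv" stands in for the historical CSV.
df = pd.read_csv("data.csv", parse_dates=["Date"])

# Country-level aggregates are the rows where RegionCode/RegionName are null.
country_level = df[df["RegionCode"].isna()]

# Region-level rows for one country, e.g. China; their sum may not exactly
# match the country-level figure because the sources can differ.
china_regions = df[(df["CountryCode"] == "CN") & df["RegionCode"].notna()]
print(china_regions.groupby("Date")["Confirmed"].sum().tail())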

Forecasting

There is also a short-term forecasting dataset, available in the output folder as data_forecast.csv. A sample record and its columns are shown below:

ForecastDate,Date,CountryCode,CountryName,RegionCode,RegionName,Estimated,Confirmed
2020-03-21,2020-03-08,AE,United Arab Emirates,,,48.193,45

Name         | Description                                                 | Example
ForecastDate | ISO 8601 date (YYYY-MM-DD) of the last available datapoint  | 2020-03-21
Date         | ISO 8601 date (YYYY-MM-DD) of the datapoint                 | 2020-03-25
CountryCode  | ISO 3166-1 code of the country                              | CN
CountryName  | American English name of the country                        | China
RegionCode   | (Optional) ISO 3166-2 code of the region                    | HB
RegionName   | (Optional) American English name of the region              | Hubei
Estimated    | Total number of cases estimated by the forecasting model    | 66804.567
Confirmed    | Total number of cases confirmed after a positive test       | 67800
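
As a rough sketch of checking the forecast against observed counts (assuming, as the sample record suggests, that dates on or before ForecastDate carry both Estimated and Confirmed values; the local file path is an assumption):

import pandas as pd

# Path is an assumption; data_forecast.csv is published in the output folder.
forecast = pd.read_csv("data_forecast.csv", parse_dates=["ForecastDate", "Date"])

# Rows that already have a Confirmed value let us gauge how well the model fits.
fitted = forecast[forecast["Confirmed"].notna()]
abs_error = (fitted["Estimated"] - fitted["Confirmed"]).abs()
print("Mean absolute error on observed dates:", abs_error.mean())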

Backwards compatibility

Please note that the following datasets are maintained only to preserve backwards compatibility and shouldn't be used in any new projects:

Analyze the data

You may also want to load the data and perform your own analysis on it. You can find Jupyter Notebooks in the analysis repository with examples of how to load and analyze the data.

You can even use Google Colab if you want to run your analysis without having to install anything on your computer; simply go to this URL: https://colab.research.google.com/github/open-covid-19/analysis.

Source of data

The world data comes from the daily reports at the ECDC portal. The XLS file is downloaded and parsed using scrapy and pandas.

Data for Chinese regions and Italy (see #12) comes from the DXY scraped dataset, which is parsed using pandas.

The data is automatically crawled and parsed using the scripts found in the input folder. This is done daily, and some additional columns, such as region-level coordinates, are added as part of the processing.
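
Purely as an illustrative sketch of the kind of enrichment described (this is not the repo's actual scripts; the file names and the coordinate lookup are hypothetical):

import pandas as pd

# Hypothetical inputs: parsed daily records and a region -> coordinates lookup.
records = pd.read_csv("parsed_records.csv")
metadata = pd.read_csv("region_metadata.csv")  # CountryCode, RegionCode, Latitude, Longitude

# Left-join coordinates onto each record by country and region code.
enriched = records.merge(metadata, on=["CountryCode", "RegionCode"], how="left")
enriched.to_csv("data.csv", index=False)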

Before the outputs are updated, the data is spot-checked against various sources, including data from local authorities such as Italy's Ministry of Health and the reports from the WHO.

Why another dataset?

This dataset is heavily inspired by the dataset maintained by Johns Hopkins University. Unfortunately, that dataset has been experiencing intermittent maintenance issues, and a lot of applications depend on this critical data being available in a timely manner. Further, the true sources of data for that dataset are still unclear.

Update the data

To update the contents of the output folder, first install the dependencies:

# Install Ghostscript
apt-get install -y ghostscript
# Install Python dependencies
pip install -r requirements.txt

Then run the following scripts to update all datasets:

sh input/update_data.sh