
Open COVID-19 Dataset

This repo contains free datasets of historical data related to COVID-19.

Explore the data

A simple visualization tool was built to explore the Open COVID-19 datasets: https://open-covid-19.github.io/explorer/.

[Explorer screenshot]

Use the data

The data is available as CSV and JSON files, which are published on GitHub Pages so they can be served directly to JavaScript applications without the need for a proxy to set the correct CORS and content-type headers. Each dataset has a version with all historical data and another with only the latest daily data. The datasets currently available are:

Dataset | CSV URL            | JSON URL
Data    | Latest, Historical | Latest, Historical
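
As a minimal sketch of loading the data with Python, the snippet below uses pandas (an assumption, not a requirement of the dataset) and expects the historical CSV from the table above to have been saved locally as data.csv:

import pandas as pd

# "data.csv" stands in for the historical CSV linked in the table above.
df = pd.read_csv("data.csv", parse_dates=["Date"])
print(df.tail())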

Understand the data

The columns of the main dataset are:

Name        | Description                                            | Example
Date        | ISO 8601 date (YYYY-MM-DD) of the datapoint            | 2020-03-21
CountryCode | ISO 3166-1 code of the country                         | CN
CountryName | American English name of the country                   | China
RegionCode  | (Optional) ISO 3166-2 code of the region               | HB
RegionName  | (Optional) American English name of the region         | Hubei
Confirmed   | Total number of cases confirmed after a positive test  | 67800
Deaths      | Total number of deaths from a positive COVID-19 case   | 3139
Latitude    | Floating point latitude of the geographic coordinate   | 30.9756
Longitude   | Floating point longitude of the geographic coordinate  | 112.2707
Population  | Total count of humans living in the region             | TODO

For countries where both country-level and region-level data are available, the entry with a null value in the RegionCode and RegionName columns indicates country-level aggregation. Please note that the country-level and region-level data sometimes come from different sources, so adding up all region-level values may not exactly match the reported country-level value.
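
Continuing the loading sketch above, one way to separate the two levels with pandas (column names are taken from the table; the file path is an assumption):

import pandas as pd

# Load as in the earlier sketch; "data.csv" stands in for the historical CSV.
df = pd.read_csv("data.csv", parse_dates=["Date"])

# Country-level aggregates are the rows where RegionCode/RegionName are null.
country_level = df[df["RegionCode"].isna()]

# Region-level rows for one country, e.g. China; their sum may not exactly
# match the country-level figure because the sources can differ.
china_regions = df[(df["CountryCode"] == "CN") & df["RegionCode"].notna()]
print(china_regions.groupby("Date")["Confirmed"].sum().tail())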

Forecasting

There is also a short-term forecasting dataset, available in the output folder as data_forecast.csv. A sample record and its columns are shown below:

ForecastDate,Date,CountryCode,CountryName,RegionCode,RegionName,Estimated,Confirmed
2020-03-21,2020-03-08,AE,United Arab Emirates,,,48.193,45

Name         | Description                                                 | Example
ForecastDate | ISO 8601 date (YYYY-MM-DD) of the last available datapoint  | 2020-03-21
Date         | ISO 8601 date (YYYY-MM-DD) of the datapoint                 | 2020-03-25
CountryCode  | ISO 3166-1 code of the country                              | CN
CountryName  | American English name of the country                        | China
RegionCode   | (Optional) ISO 3166-2 code of the region                    | HB
RegionName   | (Optional) American English name of the region              | Hubei
Estimated    | Total number of cases estimated by the forecasting model    | 66804.567
Confirmed    | Total number of cases confirmed after a positive test       | 67800
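
As a rough sketch of checking the forecast against observed counts (assuming, as the sample record suggests, that dates on or before ForecastDate carry both Estimated and Confirmed values; the local file path is an assumption):

import pandas as pd

# Path is an assumption; data_forecast.csv is published in the output folder.
forecast = pd.read_csv("data_forecast.csv", parse_dates=["ForecastDate", "Date"])

# Rows that already have a Confirmed value let us gauge how well the model fits.
fitted = forecast[forecast["Confirmed"].notna()]
abs_error = (fitted["Estimated"] - fitted["Confirmed"]).abs()
print("Mean absolute error on observed dates:", abs_error.mean())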

Backwards compatibility

Please note that the following datasets are maintained only to preserve backwards compatibility and shouldn't be used in any new projects:

Analyze the data

You may also want to load the data and perform your own analysis on it. You can find Jupyter Notebooks in the analysis repository with examples of how to load and analyze the data.

You can even use Google Colab if you want to run your analysis without having to install anything on your computer; simply go to this URL: https://colab.research.google.com/github/open-covid-19/analysis.

Source of data

The world data comes from the daily reports at the ECDC portal. The XLS file is downloaded and parsed using scrapy and pandas.

Data for Chinese regions and Italy (see #12) comes from the DXY scraped dataset, which is parsed using pandas.

The data is automatically crawled and parsed using the scripts found in the input folder. This is done daily, and some additional columns, such as region-level coordinates, are added as part of the processing.
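
Purely as an illustrative sketch of the kind of enrichment described (this is not the repo's actual scripts; the file names and the coordinate lookup are hypothetical):

import pandas as pd

# Hypothetical inputs: parsed daily records and a region -> coordinates lookup.
records = pd.read_csv("parsed_records.csv")
metadata = pd.read_csv("region_metadata.csv")  # CountryCode, RegionCode, Latitude, Longitude

# Left-join coordinates onto each record by country and region code.
enriched = records.merge(metadata, on=["CountryCode", "RegionCode"], how="left")
enriched.to_csv("data.csv", index=False)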

Before the outputs are updated, the data is spot-checked against various sources, including data from local authorities such as Italy's Ministry of Health and the reports from the WHO.

Why another dataset?

This dataset is heavily inspired by the dataset maintained by Johns Hopkins University. Unfortunately, that dataset has been experiencing intermittent maintenance issues, and a lot of applications depend on this critical data being available in a timely manner. Further, the true sources of data for that dataset are still unclear.

Update the data

To update the contents of the output folder, first install the dependencies:

# Install Ghostscript
apt-get install -y ghostscript
# Install Python dependencies
pip install -r requirements.txt

Then run the following scripts to update all datasets:

sh input/update_data.sh