Open COVID-19 Dataset

This repository contains datasets of daily time-series data related to COVID-19, including state/province data for over 30 countries.

Explore the data


A simple visualization tool was built to explore the Open COVID-19 datasets, the Open COVID-19 Explorer:

If you want to see interactive charts with a unique UX, don't miss what @Mahks built using the Open COVID-19 dataset:

You can also check out the great work of @quixote79, a MapBox-powered interactive map site:

Experience clean, clear graphs with smooth animations thanks to the work of @jmullo:

If you are using this data, feel free to open an issue and let us know so we can give you a call-out here.

Use the data

The data is available as CSV and JSON files, which are published on GitHub Pages so they can be served directly to JavaScript applications without a proxy to set the correct CORS and content-type headers. data.csv is available in two versions: one with all historical data and one with only the latest daily data. All other datasets have either historical or latest data, but not both. The datasets available from this project are:

Dataset	CSV URL	JSON URL
Data	Latest, Historical	Latest, Historical
Metadata	Latest	Latest
Minimal	Historical	Historical
Weather	Historical	Historical
Mobility	Historical	Historical
Response	Historical	Historical
Forecast	Latest	Latest
Categories	Historical	Historical

You should use the files linked above rather than fetching anything in the output subfolder through the raw GitHub server, since the files under the output subfolder may change in incompatible ways without prior notice.

You can find several examples in the examples subfolder with code showcasing how to load and analyze the data for several programming environments. If you want the short version, here are a few snippets to get started.

Google Colab

You can use Google Colab if you want to run your analysis without installing anything on your computer. Simply go to this URL: https://colab.research.google.com/github/open-covid-19/data.

R

If you prefer R, then this is all you need to do to load the historical data:

data <- read.csv("https://open-covid-19.github.io/data/data.csv")

Python

In Python, you need to have the package pandas installed to get started:

import pandas
data = pandas.read_csv("https://open-covid-19.github.io/data/data.csv")
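
One pitfall worth guarding against: by default pandas parses the literal string "NA" as a missing value, which would affect any row whose Key or CountryCode is NA (the ISO 3166-1 code for Namibia). The sketch below is one way to avoid that; the helper name read_covid_csv and the sample rows are made up for illustration and are not part of this project.

```python
import io
import pandas as pd

def read_covid_csv(source):
    """Read one of the CSV files; source may be a URL, a path or a file-like object.

    Hypothetical helper, not part of the project.
    """
    return pd.read_csv(
        source,
        keep_default_na=False,  # do not turn the literal string "NA" into a null
        na_values=[""],         # only empty fields count as missing
        parse_dates=["Date"],   # parse the ISO 8601 Date column
    )

# Made-up rows standing in for the real file.
SAMPLE = """Date,Key,Confirmed,Deaths
2020-03-30,NA,11,0
2020-03-30,AU,4460,
"""
sample = read_covid_csv(io.StringIO(SAMPLE))
```

To load the published file instead, pass the URL shown above to read_covid_csv.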

jQuery

The JSON file can be loaded directly using jQuery. This code snippet loads all historical data into the data variable:

$.getJSON("https://open-covid-19.github.io/data/data.json", data => { ... });

PowerShell

You can also use PowerShell to get the latest data for a country directly from the command line. For example, to query the latest data for Australia:

Invoke-WebRequest 'https://open-covid-19.github.io/data/data_latest.csv' | ConvertFrom-Csv | `
    where Key -eq 'AU' | select Date,CountryName,Confirmed,Deaths
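
For comparison, the same query can be sketched in Python with pandas. The function name latest_for_key and the sample rows are made up for illustration; the column names come from the PowerShell example above.

```python
import pandas as pd

def latest_for_key(latest, key):
    """Return the Date, CountryName, Confirmed and Deaths columns for one Key."""
    columns = ["Date", "CountryName", "Confirmed", "Deaths"]
    return latest.loc[latest["Key"] == key, columns]

# Made-up rows standing in for data_latest.csv.
sample = pd.DataFrame({
    "Date": ["2020-03-30", "2020-03-30"],
    "Key": ["AU", "AU_NSW"],
    "CountryName": ["Australia", "Australia"],
    "Confirmed": [4000, 1900],
    "Deaths": [17, 8],
})
australia = latest_for_key(sample, "AU")
```

To run it against the published file, pass pandas.read_csv("https://open-covid-19.github.io/data/data_latest.csv") as the first argument.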

Understand the data

Data

Make sure that you are using the URL linked in the table above and not the raw GitHub file, since the latter is subject to change at any moment. The columns of data.csv are:

Name	Description	Example
Date*	ISO 8601 date (YYYY-MM-DD) of the datapoint	2020-03-21
Key	`CountryCode` if country-level data, otherwise `${CountryCode}_${RegionCode}`	CN_HB
CountryCode	ISO 3166-1 code of the country	CN
CountryName	American English name of the country, subject to change	China
RegionCode	(Optional) ISO 3166-2 or NUTS 2 code of the region	HB
RegionName	(Optional) American English name of the region, subject to change	Hubei
Confirmed**	Total number of cases confirmed after positive test	67800
Deaths**	Total number of deaths from a positive COVID-19 case	3139
Latitude	Floating point representing the geographic coordinate	30.9756
Longitude	Floating point representing the geographic coordinate	112.2707
Population	Total count of humans living in the region	58500000

* Date used is reporting date, which generally lags a day from the actual date and is subject to timezone adjustments. Whenever possible, dates consistent with the ECDC daily reports are used.

** Missing values will be represented as nulls, whereas zeroes are used when a true value of zero is reported. For example, US states where deaths are not being reported have null values.
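
The difference between a null and a zero matters when filtering or aggregating. A minimal sketch with pandas, using made-up rows in place of the real file and assuming empty CSV fields are read as nulls (the pandas default):

```python
import io
import pandas as pd

# Made-up rows; the empty Deaths field for US_NY is a missing value.
SAMPLE_CSV = """Date,Key,Confirmed,Deaths
2020-03-21,US_CA,1000,0
2020-03-21,US_NY,2000,
2020-03-21,US_WA,1500,80
"""
data = pd.read_csv(io.StringIO(SAMPLE_CSV))

# Deaths not reported at all (null) versus reported as zero.
not_reported = data[data["Deaths"].isna()]
reported_zero = data[data["Deaths"] == 0]

# sum() skips nulls; min_count=1 makes an all-null column yield NaN instead of 0.
total_deaths = data["Deaths"].sum(min_count=1)
```

Replacing nulls with zero before aggregating would make unreported regions look like regions with no deaths, so it is usually safer to keep them as nulls.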

The CountryName and RegionName values are subject to change. You may use them for labels in your application, but you should not assume that they will remain the same in future updates. Instead, use CountryCode and RegionCode to perform joins with other data sources or for filtering within your application.
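
As an illustration of the Key format described in the table above, here is a small sketch that builds and splits keys. The helper names make_key and split_key are hypothetical, and split_key assumes that country codes never contain an underscore.

```python
import pandas as pd

def make_key(country_code, region_code=None):
    """CountryCode alone for country-level rows, otherwise CountryCode_RegionCode."""
    if region_code is None or region_code == "" or pd.isna(region_code):
        return country_code
    return f"{country_code}_{region_code}"

def split_key(key):
    """Inverse of make_key: returns (CountryCode, RegionCode or None)."""
    country_code, _, region_code = key.partition("_")
    return country_code, (region_code or None)
```

For example, make_key("CN", "HB") gives "CN_HB", and split_key("AU") gives ("AU", None).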

Metadata

Non-temporal data related to countries and regions. The columns of metadata.csv are:

Name	Description	Example
Key	`CountryCode` if country-level data, otherwise `${CountryCode}_${RegionCode}`	US_CA
CountryCode	ISO 3166-1 code of the country	CN
CountryName	American English name of the country, subject to change	China
RegionCode	(Optional) ISO 3166-2 or NUTS 2 code of the region	HB
RegionName	(Optional) American English name of the region, subject to change	Hubei
Latitude	Floating point representing the geographic coordinate	30.9756
Longitude	Floating point representing the geographic coordinate	112.2707
Population	Total count of humans living in the region	58500000

Minimal

There is a data_minimal.csv with a subset of the columns from data.csv but otherwise identical information:

Name	Description	Example
Date*	ISO 8601 date (YYYY-MM-DD) of the datapoint	2020-03-30
Key	`CountryCode` if country-level data, otherwise `${CountryCode}_${RegionCode}`	US_CA
Confirmed**	Total number of cases confirmed after positive test	6447
Deaths**	Total number of deaths from a positive COVID-19 case	133

* Date used is adjusted reporting date. ECDC reporting date generally lags a day from the actual date. Time zone is used to adjust the date such that it matches local reports.

** Missing values will be represented as nulls, whereas zeroes are used when a true value of zero is reported. For example, US states where deaths are not being reported have null values.
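
Columns that are left out of the minimal dataset can be recovered by joining on Key with the metadata dataset. A sketch with pandas, using made-up rows (the population figures are illustrative only) and a derived column, ConfirmedPer100k, that is not part of any dataset:

```python
import pandas as pd

# Made-up rows standing in for data_minimal.csv and metadata.csv.
minimal = pd.DataFrame({
    "Date": ["2020-03-30", "2020-03-30"],
    "Key": ["US_CA", "CN_HB"],
    "Confirmed": [6447, 67800],
    "Deaths": [133, 3139],
})
metadata = pd.DataFrame({
    "Key": ["US_CA", "CN_HB"],
    "CountryCode": ["US", "CN"],
    "RegionCode": ["CA", "HB"],
    "Population": [39500000, 58500000],
})

# A left join keeps every minimal row even if its key is absent from metadata;
# validate raises an error if metadata has duplicate keys.
enriched = minimal.merge(metadata, on="Key", how="left", validate="many_to_one")
enriched["ConfirmedPer100k"] = enriched["Confirmed"] / enriched["Population"] * 100_000
```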

Weather

Daily weather information from the nearest station, as reported by NOAA. The columns of weather.csv are:

Name	Description	Example
Key	`CountryCode` if country-level data, otherwise `${CountryCode}_${RegionCode}`	US_MI
Date	ISO 8601 date (YYYY-MM-DD) of the datapoint	2020-03-30
Station	Identifier for the weather station	USC00206080
Distance	[kilometers] Distance between the location coordinates and the weather station	28.693
MinimumTemperature*	[celsius] Recorded hourly minimum temperature	1.7
MaximumTemperature*	[celsius] Recorded hourly maximum temperature	19.4
Rainfall*	[millimeters] Rainfall during the entire day	51.0
Snowfall*	[millimeters] Snowfall during the entire day	0.0

* Missing values will be represented as nulls, whereas zeroes are used when a true value of zero is reported.

Mobility

Google's Mobility Reports are presented in CSV form as mobility.csv with the following columns:

Name	Description	Example
Date	ISO 8601 date (YYYY-MM-DD) of the datapoint	2020-03-25
Key	`CountryCode` if country-level data, otherwise `${CountryCode}_${RegionCode}`	US_CA
TransitStations	Percentage change in visits to transit station locations	-15
RetailAndRecreation	Percentage change in visits to retail and recreation locations	-15
GroceryAndPharmacy	Percentage change in visits to grocery and pharmacy locations	-15
Parks	Percentage change in visits to park locations	-15
Residential	Percentage change in visits to residential locations	-15
Workplaces	Percentage change in visits to workplace locations	-15

Response

Summary of a government's response, including a stringency index, collected from the University of Oxford:

Name	Description	Example
Date	ISO 8601 date (YYYY-MM-DD) of the datapoint	2020-03-25
Key	`CountryCode` if country-level data, otherwise `${CountryCode}_${RegionCode}`	US_CA
SchoolClosing	[0-3] Schools are closed	2
WorkplaceClosing	[0-3] Workplaces are closed	2
CancelPublicEvents	[0-3] Public events have been cancelled	2
PublicTransportClosing	[0-3] Public transport is not operational	0
PublicInformationCampaigns	[0-2] Government has launched public information campaigns	1
RestrictionsOnInternalMovement	[0-3] Travel within country is restricted	1
InternationalTravelControls	[0-3] International travel is restricted	3
FiscalMeasures	[USD] Value of fiscal stimuli, including spending or tax cuts	20449287023
MonetaryMeasures	[%] Value of interest rate	-0.75
EmergencyInvestmentInHealthCare	[USD] Emergency funding allocated to healthcare	500000
InvestmentInVaccines	[USD] Emergency funding allocated to vaccine research	100000
TestingFramework	[0-3] Country-wide COVID-19 testing policy	1
ContactTracing	[0-2] Country-wide contact tracing policy	1
StringencyIndex	[0-100] Overall stringency index	71.43

For more information about each field and how the overall stringency index is computed, see the Oxford COVID-19 government response tracker.

Note: Keys which correspond to a region-level datapoint always have the same value as the country-level datapoint, since the tracked government measures are at the country level.
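
Because region-level rows repeat the country-level values, an application that only needs the measures themselves can keep the country-level rows. A sketch with made-up rows, assuming (per the Key definition) that an underscore appears only in region-level keys:

```python
import pandas as pd

# Made-up rows standing in for the response dataset.
response = pd.DataFrame({
    "Date": ["2020-03-25", "2020-03-25", "2020-03-25"],
    "Key": ["US", "US_CA", "US_NY"],
    "StringencyIndex": [71.43, 71.43, 71.43],
})

# Keep only keys without an underscore, i.e. country-level rows.
country_only = response[~response["Key"].str.contains("_", regex=False)]
```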

Forecasting

There is also a short-term forecast dataset available in the output folder as data_forecast.csv, which has the following columns:

Name	Description	Example
ForecastDate	ISO 8601 date (YYYY-MM-DD) of last known datapoint	2020-03-21
Date*	ISO 8601 date (YYYY-MM-DD) of the datapoint	2020-03-25
Key	`CountryCode` if country-level data, otherwise `${CountryCode}_${RegionCode}`	US_CA
Estimated**	Total number of cases estimated from forecasting model	66804.567
Confirmed	Total number of cases confirmed after positive test	67800

* Date used is adjusted reporting date. ECDC reporting date generally lags a day from the actual date. Time zone is used to adjust the date such that it matches local reports.

** An estimate is also provided for dates before the forecast date; it corresponds to the output of the fitted model and is the a priori estimate. True forecast values are those with a Date later than ForecastDate; these are the a posteriori estimates. Another way to tell the two apart is to check whether a given date has a value for both Confirmed and Estimated (a priori) or whether the Confirmed value is null (a posteriori).
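
The two kinds of estimate can be separated with a date comparison. A sketch with pandas on made-up rows; the variable names fitted and predicted are chosen for illustration:

```python
import io
import pandas as pd

# Made-up rows standing in for data_forecast.csv.
SAMPLE = """ForecastDate,Date,Key,Estimated,Confirmed
2020-03-21,2020-03-20,US_CA,950.2,940
2020-03-21,2020-03-21,US_CA,1010.7,1000
2020-03-21,2020-03-22,US_CA,1105.3,
2020-03-21,2020-03-23,US_CA,1212.9,
"""
forecast = pd.read_csv(io.StringIO(SAMPLE), parse_dates=["ForecastDate", "Date"])

# True forecasts have a Date later than ForecastDate; the remaining rows
# are the fitted-model output for dates with known data.
is_forecast = forecast["Date"] > forecast["ForecastDate"]
fitted = forecast[~is_forecast]
predicted = forecast[is_forecast]

# The second criterion from the note: forecast rows have a null Confirmed value.
same_split = forecast["Confirmed"].isna().equals(is_forecast)
```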

Active cases and categories

Another dataset available is data_categories.csv, which has the following columns:

Name	Description	Example
Date*	ISO 8601 date (YYYY-MM-DD) of the datapoint	2020-03-27
Key	`CountryCode` if country-level data, otherwise `${CountryCode}_${RegionCode}`	US_CA
NewCases	Number of reported new cases from previous day	186
NewDeaths	Number of reported new deaths from previous day	0
NewMild**	Number of estimated new mild cases from previous day	148
NewSevere**	Number of estimated new severe cases from previous day	27
NewCritical**	Number of estimated new critical cases from previous day	9
CurrentlyMild**	Number of estimated mild active cases at this date	819
CurrentlySevere**	Number of estimated severe active cases at this date	190
CurrentlyCritical**	Number of estimated critical active cases at this date	66

* Date used is adjusted reporting date. ECDC reporting date generally lags a day from the actual date. Time zone is used to adjust the date such that it matches local reports.

** See the category estimation notebook for a more thorough explanation of what each category represents and how the estimation is done.
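
As a rough illustration of how a per-day column such as NewCases relates to the cumulative Confirmed column in data.csv, the sketch below takes the day-over-day difference per Key. This is an assumption about how the column can be approximated, not a description of how the project computes it, and the rows are made up.

```python
import pandas as pd

# Made-up cumulative counts standing in for data.csv.
data = pd.DataFrame({
    "Date": ["2020-03-25", "2020-03-26", "2020-03-27", "2020-03-26", "2020-03-27"],
    "Key": ["US_CA", "US_CA", "US_CA", "US_NY", "US_NY"],
    "Confirmed": [100, 150, 336, 1000, 1300],
})

# Sort so that each key's rows are in date order, then difference within each key.
data = data.sort_values(["Key", "Date"])
data["NewCases"] = data.groupby("Key")["Confirmed"].diff()
```

The first row of each key has no previous day, so its NewCases value is null rather than zero.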

Notes about the data

For countries where both country-level and region-level data is available, the entry which has a null value for the RegionCode and RegionName columns indicates country-level aggregation. Please note that the country-level data and the region-level data sometimes come from different sources, so the sum of all region-level values may not exactly equal the reported country-level value. See the data loading tutorial for more information.

FR: Region-level confirmed cases for France only include positive results of tests being sent to a subset of all laboratories, therefore the sum of all confirmed cases across regions is significantly lower than the country totals.

PT: Regions reported by Portugal are broken down at the NUTS-2 level, not the usual ISO 3166-2 code reported by most other countries.
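
The gap between country-level totals and the sum of region-level values can be checked directly. A sketch with pandas; the rows and numbers are made up and only illustrate the technique:

```python
import pandas as pd

# Made-up rows: one country-level row (null RegionCode) and two regions.
data = pd.DataFrame({
    "Date": ["2020-03-30", "2020-03-30", "2020-03-30"],
    "Key": ["FR", "FR_ARA", "FR_IDF"],
    "CountryCode": ["FR", "FR", "FR"],
    "RegionCode": [None, "ARA", "IDF"],
    "Confirmed": [44550, 3000, 9000],
})

# Country-level rows are the ones with a null RegionCode.
is_country = data["RegionCode"].isna()
country_totals = data[is_country].set_index(["Date", "CountryCode"])["Confirmed"]
region_sums = (
    data[~is_country].groupby(["Date", "CountryCode"])["Confirmed"].sum(min_count=1)
)

# Align both series on (Date, CountryCode) and compute the gap.
comparison = pd.DataFrame({"CountryTotal": country_totals, "RegionSum": region_sums})
comparison["Difference"] = comparison["CountryTotal"] - comparison["RegionSum"]
```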

Backwards compatibility

Please note that the following datasets are maintained only to preserve backwards compatibility, but shouldn't be used in any new projects:

Contribute

The data from this repository has become increasingly reliant on Wikipedia sources. If you spot an error in the data, or there's a country you would like to include, the best way to contribute to this project is by helping maintain the data on the relevant Wikipedia article. Not only can that data be parsed automatically by this project, but it will also help inform millions of others that receive their information from Wikipedia. See the section below for a direct link to what Wikipedia data is being parsed by this project.

Sources of data

All data in this repository is retrieved automatically. When possible, data is retrieved directly from the relevant authorities, like a country's ministry of health.

Data	Source
Metadata	Wikipedia
Weather	NOAA
Mobility data	https://github.com/pastelsky/covid-19-mobility-tracker
Government response data	Oxford COVID-19 government response tracker
Country-level data	Daily reports from the ECDC portal
Argentina	Wikipedia
Australia	https://covid-19-au.github.io
Bolivia	Wikipedia
Brazil	https://github.com/elhenrico/covid19-Brazil-timeseries
Canada	Department of Health Canada
Chile	Wikipedia
China	DXY COVID-19 dataset
Colombia	Colombia's Ministry of Health
France	https://github.com/cedricguadalupe/FRANCE-COVID-19
Germany	https://github.com/jgehrcke/covid-19-germany-gae
India	Wikipedia
Indonesia	https://catchmeup.id/covid-19
Italy	Italy's Department of Civil Protection
Japan	https://github.com/swsoyee/2019-ncov-japan
Malaysia	Wikipedia
Mexico	https://github.com/carranco-sga/Mexico-COVID-19
Norway	COVID19 EU Data
Pakistan	Wikipedia
Peru	Wikipedia
Poland	COVID19 EU Data
Portugal	https://github.com/dssg-pt/covid19pt-data
Russia	Wikipedia
South Korea	Wikipedia
Spain	Datadista COVID-19 dataset
Sweden	COVID19 EU Data
Switzerland	OpenZH data
United Kingdom	https://github.com/tomwhite/covid-19-uk-data
USA	COVID Tracking Project

The data is automatically scraped and parsed using the scripts found in the input folder. This is done daily, and as part of the processing some additional columns are added, like region-level coordinates.

Before updating the outputs, data is spot-checked using various data sources including data from local authorities like Italy's ministry of health and the reports from WHO.

Why another dataset?

This dataset is heavily inspired by the dataset maintained by Johns Hopkins University. Unfortunately, that dataset has intermittently experienced maintenance issues and a lot of applications depend on this critical data being available in a timely manner. Further, the true sources of data for that dataset are still unclear.

Update the data

To update the contents of the output folder, first install the dependencies:

# Install Ghostscript
apt-get install -y ghostscript
# Install Python dependencies
pip install -r requirements.txt

Then run the following script to update all datasets:

sh input/update_data.sh
