This repository contains datasets of daily time-series data related to COVID-19, including state/province data for over 30 countries.
A simple visualization tool, the Open COVID-19 Explorer, was built to explore the Open COVID-19 datasets.
If you want to see interactive charts with a unique UX, don't miss what @Mahks built using the Open COVID-19 dataset.
You can also check out the great work of @quixote79, a MapBox-powered interactive map site.
Experience clean, clear graphs with smooth animations thanks to the work of @jmullo.
If you are using this data, feel free to open an issue and let us know so we can give you a call-out here.
The data is available as CSV and JSON files, which are published in GitHub
Pages so they can be served directly to JavaScript applications without the
need for a proxy to set the correct CORS and content-type headers.
data.csv has a version with all historical data, and another version with
only the latest daily data. All other datasets have either historical data or
the latest data, but not both. The datasets available from this project are:
You should use the files linked above instead of fetching anything in the output
subfolder via the raw GitHub server, since the files under the output
subfolder are subject to change in incompatible ways with no prior notice.
You can find several examples in the examples subfolder with code showcasing how to load and analyze the data for several programming environments. If you want the short version, here are a few snippets to get started.
You can use Google Colab if you want to run your analysis without having to install anything on your computer. Simply go to this URL: https://colab.research.google.com/github/open-covid-19/data.
If you prefer R, then this is all you need to do to load the historical data:
data <- read.csv("https://open-covid-19.github.io/data/data.csv")
In Python, you need to have the package pandas installed to get started:
import pandas
data = pandas.read_csv("https://open-covid-19.github.io/data/data.csv")
Loading the JSON file using jQuery can be done directly from the output folder.
This code snippet loads all historical data into the data variable:
$.getJSON("https://open-covid-19.github.io/data/data.json", data => { ... });
You can also use PowerShell to get the latest data for a country directly from the command line. For example, to query the latest data for Australia:
Invoke-WebRequest 'https://open-covid-19.github.io/data/data_latest.csv' | ConvertFrom-Csv | `
where Key -eq 'AU' | select Date,CountryName,Confirmed,Deaths
Make sure that you are using the URL linked in the table above and not the raw GitHub file, since the latter is subject to change at any moment. The columns of data.csv are:
Name | Description | Example |
---|---|---|
Date* | ISO 8601 date (YYYY-MM-DD) of the datapoint | 2020-03-21 |
Key | CountryCode if country-level data, otherwise ${CountryCode}_${RegionCode} | CN_HB |
CountryCode | ISO 3166-1 code of the country | CN |
CountryName | American English name of the country, subject to change | China |
RegionCode | (Optional) ISO 3166-2 or NUTS 2 code of the region | HB |
RegionName | (Optional) American English name of the region, subject to change | Hubei |
Confirmed** | Total number of cases confirmed after positive test | 67800 |
Deaths** | Total number of deaths from a positive COVID-19 case | 3139 |
Latitude | Floating point representing the geographic coordinate | 30.9756 |
Longitude | Floating point representing the geographic coordinate | 112.2707 |
Population | Total count of humans living in the region | 58500000 |
* Date used is reporting date, which generally lags a day from the actual date and is subject to timezone adjustments. Whenever possible, dates consistent with the ECDC daily reports are used.
** Missing values will be represented as nulls, whereas zeroes are used when a true value of zero is reported. For example, US states where deaths are not being reported have null values.
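The difference between nulls and zeroes matters when parsing the file. Here is a minimal sketch using only the Python standard library and a small inline sample instead of the real file. The rows, the `US_XX` and `US_YY` keys, and the `parse_count` helper are made up for illustration, and the sketch assumes that nulls appear as empty fields in the CSV:

```python
import csv
import io

# Hypothetical sample in the shape of data.csv; keys and values are illustrative only.
SAMPLE = """Date,Key,Confirmed,Deaths
2020-03-21,CN_HB,67800,3139
2020-03-21,US_XX,120,
2020-03-21,US_YY,5,0
"""

def parse_count(field):
    """Return None for a missing value and an int otherwise (zero stays zero)."""
    field = field.strip()
    # int(float(...)) also accepts values serialized as "67800.0".
    return int(float(field)) if field else None

rows = []
for record in csv.DictReader(io.StringIO(SAMPLE)):
    record["Confirmed"] = parse_count(record["Confirmed"])
    record["Deaths"] = parse_count(record["Deaths"])
    rows.append(record)

# Keep only rows where deaths were actually reported, including true zeroes.
reported = [r for r in rows if r["Deaths"] is not None]
```

If you load the data with pandas instead, `read_csv` already turns empty fields into NaN by default, so the same distinction is preserved without a custom parser.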
The CountryName and RegionName values are subject to change. You may use
them for labels in your application, but you should not assume that they will
remain the same in future updates. Instead, use CountryCode and RegionCode
to perform joins with other data sources or for filtering within your
application.
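As a rough illustration of working with the stable codes, the sketch below builds and splits Key values following the `${CountryCode}_${RegionCode}` pattern described in the table. The helper names `make_key` and `split_key` are hypothetical and not part of this project:

```python
def make_key(country_code, region_code=None):
    """Build a Key: the country code alone, or country and region joined by '_'."""
    return f"{country_code}_{region_code}" if region_code else country_code

def split_key(key):
    """Split a Key into (country_code, region_code); region_code is None for countries."""
    # partition splits on the first underscore only, so the region part is kept whole.
    country_code, _, region_code = key.partition("_")
    return country_code, region_code or None

keys = ["CN", "CN_HB", "US_CA", "AU"]

# Filter on the stable country code instead of the display name.
chinese_keys = [k for k in keys if split_key(k)[0] == "CN"]
```

Filtering this way keeps working even if the CountryName or RegionName labels are later reworded.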
Non-temporal data related to countries and regions. The columns of metadata.csv are:
Name | Description | Example |
---|---|---|
Key | CountryCode if country-level data, otherwise ${CountryCode}_${RegionCode} | US_CA |
CountryCode | ISO 3166-1 code of the country | CN |
CountryName | American English name of the country, subject to change | China |
RegionCode | (Optional) ISO 3166-2 or NUTS 2 code of the region | HB |
RegionName | (Optional) American English name of the region, subject to change | Hubei |
Latitude | Floating point representing the geographic coordinate | 30.9756 |
Longitude | Floating point representing the geographic coordinate | 112.2707 |
Population | Total count of humans living in the region | 58500000 |
There is a data_minimal.csv with a subset of the columns from data.csv but otherwise identical information:
Name | Description | Example |
---|---|---|
Date* | ISO 8601 date (YYYY-MM-DD) of the datapoint | 2020-03-30 |
Key | CountryCode if country-level data, otherwise ${CountryCode}_${RegionCode} | US_CA |
Confirmed** | Total number of cases confirmed after positive test | 6447 |
Deaths** | Total number of deaths from a positive COVID-19 case | 133 |
* Date used is adjusted reporting date. ECDC reporting date generally lags a day from the actual date. Time zone is used to adjust the date such that it matches local reports.
** Missing values will be represented as nulls, whereas zeroes are used when a true value of zero is reported. For example, US states where deaths are not being reported have null values.
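Because data_minimal.csv drops the descriptive columns, one way to recover them is to join on Key with metadata.csv. The following is only a sketch on inline sample rows, not the real files, and the per-100k figure is an example of what the joined data makes possible rather than a column this project publishes:

```python
import csv
import io

# Illustrative rows only, in the shape of data_minimal.csv and metadata.csv.
MINIMAL = """Date,Key,Confirmed,Deaths
2020-03-21,CN_HB,67800,3139
2020-03-30,US_CA,6447,133
"""
METADATA = """Key,CountryCode,CountryName,RegionCode,RegionName,Latitude,Longitude,Population
CN_HB,CN,China,HB,Hubei,30.9756,112.2707,58500000
"""

# Index the metadata by Key for constant-time lookups.
metadata_by_key = {row["Key"]: row for row in csv.DictReader(io.StringIO(METADATA))}

joined = []
for row in csv.DictReader(io.StringIO(MINIMAL)):
    meta = metadata_by_key.get(row["Key"])
    if meta is None:
        continue  # inner join: skip keys that have no metadata in this sample
    merged = {**meta, **row}
    merged["ConfirmedPer100k"] = 100000 * int(row["Confirmed"]) / int(meta["Population"])
    joined.append(merged)
```

In this sample only CN_HB has a metadata row, so the US_CA record is dropped from `joined`.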
Daily weather information from nearest station reported by NOAA. The columns of weather.csv are:
Name | Description | Example |
---|---|---|
Key | CountryCode if country-level data, otherwise ${CountryCode}_${RegionCode} | US_MI |
Date | ISO 8601 date (YYYY-MM-DD) of the datapoint | 2020-03-30 |
Station | Identifier for the weather station | USC00206080 |
Distance | [kilometers] Distance between the location coordinates and the weather station | 28.693 |
MinimumTemperature* | [celsius] Recorded hourly minimum temperature | 1.7 |
MaximumTemperature* | [celsius] Recorded hourly maximum temperature | 19.4 |
Rainfall* | [millimeters] Rainfall during the entire day | 51.0 |
Snowfall* | [millimeters] Snowfall during the entire day | 0.0 |
* Missing values will be represented as nulls, whereas zeroes are used when a true value of zero is reported.
Google's Mobility Reports are presented in CSV form as mobility.csv with the following columns:
Name | Description | Example |
---|---|---|
Date | ISO 8601 date (YYYY-MM-DD) of the datapoint | 2020-03-25 |
Key | CountryCode if country-level data, otherwise ${CountryCode}_${RegionCode} | US_CA |
TransitStations | Percentage change in visits to transit station locations | -15 |
RetailAndRecreation | Percentage change in visits to retail and recreation locations | -15 |
GroceryAndPharmacy | Percentage change in visits to grocery and pharmacy locations | -15 |
Parks | Percentage change in visits to park locations | -15 |
Residential | Percentage change in visits to residential locations | -15 |
Workplaces | Percentage change in visits to workplace locations | -15 |
Summary of a government's response, including a stringency index, collected from University of Oxford:
Name | Description | Example |
---|---|---|
Date | ISO 8601 date (YYYY-MM-DD) of the datapoint | 2020-03-25 |
Key | CountryCode if country-level data, otherwise ${CountryCode}_${RegionCode} | US_CA |
SchoolClosing | [0-3] Schools are closed | 2 |
WorkplaceClosing | [0-3] Workplaces are closed | 2 |
CancelPublicEvents | [0-3] Public events have been cancelled | 2 |
PublicTransportClosing | [0-3] Public transport is not operational | 0 |
PublicInformationCampaigns | [0-2] Government has launched public information campaigns | 1 |
RestrictionsOnInternalMovement | [0-3] Travel within country is restricted | 1 |
InternationalTravelControls | [0-3] International travel is restricted | 3 |
FiscalMeasures | [USD] Value of fiscal stimuli, including spending or tax cuts | 20449287023 |
MonetaryMeasures | [%] Value of interest rate | -0.75 |
EmergencyInvestmentInHealthCare | [USD] Emergency funding allocated to healthcare | 500000 |
InvestmentInVaccines | [USD] Emergency funding allocated to vaccine research | 100000 |
TestingFramework | [0-3] Country-wide COVID-19 testing policy | 1 |
ContactTracing | [0-2] Country-wide contact tracing policy | 1 |
StringencyIndex | [0-100] Overall stringency index | 71.43 |
For more information about each field and how the overall stringency index is computed, see the Oxford COVID-19 government response tracker.
Note: Keys which correspond to a region-level datapoint always have the same value as the country-level datapoint, since the tracked government measures are at the country level.
There is also a short-term forecast dataset available in the output folder as data_forecast.csv, which has the following columns:
Name | Description | Example |
---|---|---|
ForecastDate | ISO 8601 date (YYYY-MM-DD) of last known datapoint | 2020-03-21 |
Date* | ISO 8601 date (YYYY-MM-DD) of the datapoint | 2020-03-25 |
Key | CountryCode if country-level data, otherwise ${CountryCode}_${RegionCode} | US_CA |
Estimated** | Total number of cases estimated from forecasting model | 66804.567 |
Confirmed | Total number of cases confirmed after positive test | 67800 |
* Date used is adjusted reporting date. ECDC reporting date generally lags a day from the actual date. Time zone is used to adjust the date such that it matches local reports.
** An estimate is also provided for dates before the forecast date, which corresponds to the output of the fitted model; this is the a priori estimate. True forecast values are those with a Date later than ForecastDate; these are the a posteriori estimates. Another way to distinguish between a priori and a posteriori estimates is to check whether a given date has a value for both Confirmed and Estimated (a priori) or whether the Confirmed value is null (a posteriori).
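The date rule from the note above can be sketched as follows. The rows are invented for illustration and only mimic the shape of data_forecast.csv; nulls are assumed to be empty fields:

```python
import csv
import io
from datetime import date

# Illustrative rows in the shape of data_forecast.csv; the numbers are made up.
SAMPLE = """ForecastDate,Date,Key,Estimated,Confirmed
2020-03-21,2020-03-20,US_CA,1190.4,1224
2020-03-21,2020-03-21,US_CA,1480.2,1468
2020-03-21,2020-03-22,US_CA,1795.9,
2020-03-21,2020-03-23,US_CA,2170.3,
"""

fitted, forecast = [], []
for row in csv.DictReader(io.StringIO(SAMPLE)):
    # Rows dated after ForecastDate are true forecasts; earlier rows are model fit.
    is_forecast = date.fromisoformat(row["Date"]) > date.fromisoformat(row["ForecastDate"])
    (forecast if is_forecast else fitted).append(row)

# The alternative rule: forecast rows are the ones without a Confirmed value.
forecast_by_null = [r for r in csv.DictReader(io.StringIO(SAMPLE)) if r["Confirmed"] == ""]
```

On this sample both rules select the same two rows. On real data they could disagree if a Confirmed value is missing for some other reason, so the date comparison is probably the safer default.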
Another dataset available is data_categories.csv, which has the following columns:
Name | Description | Example |
---|---|---|
Date* | ISO 8601 date (YYYY-MM-DD) of the datapoint | 2020-03-27 |
Key | CountryCode if country-level data, otherwise ${CountryCode}_${RegionCode} | US_CA |
NewCases | Number of reported new cases from previous day | 186 |
NewDeaths | Number of reported new deaths from previous day | 0 |
NewMild** | Number of estimated new mild cases from previous day | 148 |
NewSevere** | Number of estimated new severe cases from previous day | 27 |
NewCritical** | Number of estimated new critical cases from previous day | 9 |
CurrentlyMild** | Number of estimated mild active cases at this date | 819 |
CurrentlySevere** | Number of estimated severe active cases at this date | 190 |
CurrentlyCritical** | Number of estimated critical active cases at this date | 66 |
* Date used is adjusted reporting date. ECDC reporting date generally lags a day from the actual date. Time zone is used to adjust the date such that it matches local reports.
** See the category estimation notebook for a more thorough explanation of what each category represents and how the estimation is done.
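One plausible reading of NewCases is the day-over-day difference of the cumulative Confirmed column. This README does not state how the project computes it, so treat the sketch below as an assumption; the `daily_new` helper and the sample totals are hypothetical:

```python
def daily_new(cumulative):
    """Differences between consecutive cumulative totals; None where either side is missing."""
    result = []
    for previous, current in zip(cumulative, cumulative[1:]):
        if previous is None or current is None:
            result.append(None)  # a gap in reporting stays a gap, it is not a zero
        else:
            result.append(current - previous)
    return result

# Illustrative cumulative totals for a single Key, already sorted by Date.
confirmed = [100, 130, None, 190, 190]
new_cases = daily_new(confirmed)
```

The result has one entry fewer than the input, since the first day has no previous day to compare against.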
For countries where both country-level and region-level data are available, the
entry which has a null value for the RegionCode and RegionName columns
indicates country-level aggregation. Please note that the country-level data
and the region-level data sometimes come from different sources, so the sum of
all region-level values may not exactly equal the reported country-level value.
See the data loading tutorial for more information.
FR: Region-level confirmed cases for France only include positive results of tests being sent to a subset of all laboratories, therefore the sum of all confirmed cases across regions is significantly lower than the country totals.
PT: Regions reported by Portugal are broken down at the NUTS-2 level, not the usual ISO 3166-2 code reported by most other countries.
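To see how far the regional figures are from the country-level aggregate, you can compare the two directly. This is a sketch on made-up rows: `XX` is a placeholder country code, and the sample assumes a null RegionCode shows up as an empty field:

```python
import csv
import io
from collections import defaultdict

# Illustrative rows only; RegionCode is empty for the country-level aggregate.
SAMPLE = """Date,Key,CountryCode,RegionCode,Confirmed
2020-03-30,XX,XX,,1000
2020-03-30,XX_A,XX,A,400
2020-03-30,XX_B,XX,B,550
"""

country_total = {}
region_sum = defaultdict(int)
for row in csv.DictReader(io.StringIO(SAMPLE)):
    if not row["Confirmed"]:
        continue  # skip missing values rather than counting them as zero
    group = (row["Date"], row["CountryCode"])
    if row["RegionCode"]:
        region_sum[group] += int(row["Confirmed"])
    else:
        country_total[group] = int(row["Confirmed"])

# Positive gap: the country-level source reports more cases than the regions add up to.
gaps = {
    group: country_total[group] - region_sum[group]
    for group in country_total
    if group in region_sum
}
```

A large positive gap, as described for France above, suggests the regional source covers only part of the reported cases.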
Please note that the following datasets are maintained only to preserve backwards compatibility, but shouldn't be used in any new projects:
The data from this repository has become increasingly reliant on Wikipedia sources. If you spot an error in the data, or there's a country you would like to include, the best way to contribute to this project is by helping maintain the data on the relevant Wikipedia article. Not only can that data be parsed automatically by this project, but it will also help inform millions of others who receive their information from Wikipedia. See the section below for a direct link to the Wikipedia data being parsed by this project.
All data in this repository is retrieved automatically. When possible, data is retrieved directly from the relevant authorities, like a country's ministry of health.
The data is automatically scraped and parsed using the scripts found in the input folder. This is done daily, and as part of the processing some additional columns are added, like region-level coordinates.
Before updating the outputs, data is spot-checked using various data sources including data from local authorities like Italy's ministry of health and the reports from WHO.
This dataset is heavily inspired by the dataset maintained by Johns Hopkins University. Unfortunately, that dataset has intermittently experienced maintenance issues and a lot of applications depend on this critical data being available in a timely manner. Further, the true sources of data for that dataset are still unclear.
To update the contents of the output folder, first install the dependencies:
# Install Ghostscript
apt-get install -y ghostscript
# Install Python dependencies
pip install -r requirements.txt
Then run the following scripts to update all datasets:
sh input/update_data.sh