COVID-19 Open-Data

This repository contains datasets of daily time-series data related to COVID-19 for over 20,000 distinct locations around the world. The data is at the spatial resolution of states/provinces for most regions and at county/municipality resolution for many countries such as Argentina, Brazil, Chile, Colombia, Czech Republic, Mexico, Netherlands, Peru, United Kingdom, and USA. All regions are assigned a unique location key, which resolves discrepancies between ISO / NUTS / FIPS codes, etc. The different aggregation levels are:

0: Country
1: Province, state, or local equivalent
2: Municipality, county, or local equivalent
3: Locality which may not follow strict hierarchical order, such as "city" or "nursing homes in X location"

There are multiple types of data:

Outcome data Y(i,t), such as cases, tests, hospitalizations, deaths and recoveries, for region i and time t
Static covariate data X(i), such as population size, health statistics, economic indicators, geographic boundaries
Dynamic covariate data X(i,t), such as mobility, search trends, weather, and government interventions

The data is drawn from multiple sources, as listed below, and stored in separate tables as CSV files grouped by context. Because all tables use consistent geographic (and temporal) keys, they can be easily merged, as is done for the main table.

Table	Keys¹	Content	URL	Source²
Main	`[key][date]`	Flat table with records from (almost) all other tables joined by `date` and/or `key`; see below for more details	main.csv	All tables below
Index	`[key]`	Various names and codes, useful for joining with other datasets	index.csv, index.json	Wikidata, DataCommons, Eurostat
Demographics	`[key]`	Various (current³) population statistics	demographics.csv, demographics.json	Wikidata, DataCommons, WorldBank, WorldPop, Eurostat
Economy	`[key]`	Various (current³) economic indicators	economy.csv, economy.json	Wikidata, DataCommons, Eurostat
Epidemiology	`[key][date]`	COVID-19 cases, deaths, recoveries and tests	epidemiology.csv, epidemiology.json	Various²
Emergency Declarations	`[key][date]`	Government emergency declarations and mitigation policies	lawatlas-emergency-declarations.csv	LawAtlas Project
Geography	`[key]`	Geographical information about the region	geography.csv, geography.json	Wikidata
Health	`[key]`	Health indicators for the region	health.csv, health.json	Wikidata, WorldBank, Eurostat
Hospitalizations	`[key][date]`	Information related to patients of COVID-19 and hospitals	hospitalizations.csv, hospitalizations.json	Various²
Mobility	`[key][date]`	Various metrics related to the movement of people. To download or use the data, you must agree to the Google Terms of Service.	mobility.csv, mobility.json	Google
Search Trends	`[key][date]`	Trends in symptom search volumes due to COVID-19. To download or use the data, you must agree to the Google Terms of Service.	google-search-trends.csv	Google
Government Response	`[key][date]`	Government interventions and their relative stringency	oxford-government-response.csv, oxford-government-response.json	University of Oxford
Weather	`[key][date]`	Dated meteorological information for each region	weather.csv	NOAA
WorldBank	`[key]`	Latest record for each indicator from WorldBank for all reporting countries	worldbank.csv, worldbank.json	WorldBank
By Age	`[key][date]`	Epidemiology and hospitalizations data stratified by age	by-age.csv, by-age.json	Various²
By Sex	`[key][date]`	Epidemiology and hospitalizations data stratified by sex	by-sex.csv, by-sex.json	Various²

¹ key is a unique string for the specific geographical region built from a combination of codes such as ISO 3166, NUTS, FIPS and other local equivalents.
² Refer to the data sources for specifics about each data source and the associated terms of use.
³ Datasets without a date column contain the most recently reported information for each datapoint to date.

For more information about how to use these files see the section about using the data, and for more details about each dataset see the section about understanding the data.

Why another dataset?

There are many other public COVID-19 datasets. However, we believe this dataset is unique in the way that it merges multiple global sources, at a fine spatial resolution, using a consistent set of region keys. We hope this will make it easier for researchers to use. We are also very transparent about the data sources, and the code for ingesting and merging the data is easy to understand and modify.

Explore the data


A simple visualization tool was built to explore the Open COVID-19 datasets, the Open COVID-19 Explorer.
If you want to see interactive charts with a unique UX, don't miss what @Mahks built using the Open COVID-19 dataset.
You can also check out the great work of @quixote79, a MapBox-powered interactive map site.
Experience clean, clear graphs with smooth animations thanks to the work of @jmullo.
Become an armchair epidemiologist with the COVID-19 timeline simulation tool built by @LeviticusMB.
Whether you want an interactive map, compare stats or look at charts, @saadmas has you covered with a COVID-19 Daily Tracking site.
Compare per-million data at Omnimodel thanks to @OmarJay1.
Look at responsive, comprehensive charts thanks to the work of @davidjohnstone.
Reproduction Live lets you track COVID-19 outbreaks in your region and visualise the spread of the virus over time.

Use the data

The data is available as CSV and JSON files, which are published in Google Cloud Storage so they can be served directly to JavaScript applications without needing a proxy to set the correct headers for CORS and content type.

For the purpose of making the data as easy to use as possible, there is a main table which contains the columns of all other tables joined by key and date. However, performance-wise, it may be better to download the data separately and join the tables locally.
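As a minimal sketch of joining tables locally, the snippet below merges a dated table with a static one on the shared `key` column using only the Python standard library. The helper names `read_rows` and `join_on_key` are hypothetical, the `population` column name is an assumption, and the sample rows are made up for illustration; they are not real records.

```python
import csv
import io


def read_rows(csv_text):
    """Parse CSV text into a list of dicts, one per record."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def join_on_key(dated_rows, static_rows):
    """Left-join dated records with static records that share the same key."""
    # Static tables hold one record per key, so a dict lookup is enough.
    static_by_key = {row["key"]: row for row in static_rows}
    joined = []
    for row in dated_rows:
        merged = dict(row)
        # Records without a static counterpart are kept unchanged.
        for column, value in static_by_key.get(row["key"], {}).items():
            merged.setdefault(column, value)
        joined.append(merged)
    return joined


# Made-up samples standing in for epidemiology.csv and demographics.csv.
EPIDEMIOLOGY_SAMPLE = "date,key,total_confirmed\n2020-05-01,AU,6000\n2020-05-01,XX,10\n"
DEMOGRAPHICS_SAMPLE = "key,population\nAU,25000000\n"

joined_rows = join_on_key(read_rows(EPIDEMIOLOGY_SAMPLE), read_rows(DEMOGRAPHICS_SAMPLE))
for row in joined_rows:
    print(row)
```

A left join keeps every dated record even when the static table has no entry for its key, which mirrors how a sparse flat table would look. With pandas installed, a `merge` on `key` would likely achieve the same result in one call.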

Each region has its own version of the main table, so you can pull all the data for a specific region using a single endpoint. The URL for each region is:

Data for key in CSV format: https://storage.googleapis.com/covid19-open-data/v2/${key}/main.csv
Data for key in JSON format: https://storage.googleapis.com/covid19-open-data/v2/${key}/main.json
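The per-region URL pattern above can be wrapped in a small helper. This is only a sketch: `region_main_url` and `fetch_region_rows` are hypothetical names, and the download step assumes network access and a UTF-8 encoded CSV response.

```python
import csv
import io
import urllib.request

BASE_URL = "https://storage.googleapis.com/covid19-open-data/v2"


def region_main_url(key, fmt="csv"):
    """Build the URL of the main table for a single location key."""
    if fmt not in ("csv", "json"):
        raise ValueError("fmt must be 'csv' or 'json'")
    return f"{BASE_URL}/{key}/main.{fmt}"


def fetch_region_rows(key):
    """Download one region's main table and parse it into dicts (needs network)."""
    with urllib.request.urlopen(region_main_url(key)) as response:
        text = response.read().decode("utf-8")
    return list(csv.DictReader(io.StringIO(text)))


print(region_main_url("AU"))
```

Calling `fetch_region_rows("AU")` would then return one dict per record for that key, provided the endpoint is reachable.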

Each table has a full version as well as subsets with only the last day of data. The full version is accessible at the URL described in the table above. The subsets can be found by appending latest to the path. For example, the subsets of the main table are available at the following locations:

Latest: https://storage.googleapis.com/covid19-open-data/v2/latest/main.csv
Time series: https://storage.googleapis.com/covid19-open-data/v2/main.csv

Note that the latest version contains the last non-null record for each key. All of the above listed tables have a corresponding JSON version; simply replace csv with json in the link.
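The path rules described here (optional `latest` segment, `csv` or `json` extension) can be captured in one function. The name `table_url` is hypothetical, and the sketch does not check whether a given table is actually published in the requested format.

```python
BASE_URL = "https://storage.googleapis.com/covid19-open-data/v2"


def table_url(table, latest=False, fmt="csv"):
    """Build the URL of a table, optionally pointing at its 'latest' subset."""
    parts = [BASE_URL]
    if latest:
        # The subset with the last record per key lives under the 'latest' path.
        parts.append("latest")
    parts.append(f"{table}.{fmt}")
    return "/".join(parts)


print(table_url("main", latest=True))
print(table_url("epidemiology", fmt="json"))
```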

If you are trying to use this data alongside your own datasets, you can use the Index table to look up the ISO 3166 / NUTS / FIPS codes, although administrative subdivisions are not consistent among all reporting regions. For example, for intra-country reporting, some EU countries use NUTS2 codes, others use NUTS3 codes, and many use ISO 3166-2 codes.

You can find several examples in the examples subfolder with code showcasing how to load and analyze the data for several programming environments. If you want the short version, here are a few snippets to get started.

BigQuery

This dataset is part of the BigQuery Public Datasets Program, so you may use BigQuery to run SQL queries directly from the online query editor free of charge.

Google Colab

You can use Google Colab if you want to run your analysis without having to install anything on your computer. Simply go to this URL: https://colab.research.google.com/github/GoogleCloudPlatform/covid-19-open-data.

Google Sheets

You can import the data directly into Google Sheets, as long as you stay within the size limits. For instance, the following formula loads the latest epidemiology data into the current sheet:

=IMPORTDATA("https://storage.googleapis.com/covid19-open-data/v2/latest/epidemiology.csv")

Note that Google Sheets has a size limitation, so only data from the latest subfolder can be imported automatically. To work around that, simply download the file and import it via the File menu.

R

If you prefer R, then this is all you need to do to load the epidemiology data:

data <- read.csv("https://storage.googleapis.com/covid19-open-data/v2/epidemiology.csv")

Python

In Python, you need to have the package pandas installed to get started:

import pandas
data = pandas.read_csv("https://storage.googleapis.com/covid19-open-data/v2/epidemiology.csv")

jQuery

The JSON file can be loaded with jQuery directly from the output folder. This code snippet loads the epidemiology table into the data variable:

$.getJSON("https://storage.googleapis.com/covid19-open-data/v2/epidemiology.json", data => { ... });

PowerShell

You can also use PowerShell to get the latest data for a country directly from the command line. For example, to query the latest epidemiology data for Australia:

Invoke-WebRequest 'https://storage.googleapis.com/covid19-open-data/v2/latest/epidemiology.csv' | ConvertFrom-Csv | `
    where key -eq 'AU' | select date,total_confirmed,total_deceased,total_recovered

Understand the data

Make sure that you are using the URLs listed in the table above and not the raw GitHub files. The latter are subject to change at any moment in incompatible ways, and due to the configuration of GitHub's raw file server you may run into caching issues.

Missing values will be represented as nulls, whereas zeroes are used when a true value of zero is reported.
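When parsing the CSV files by hand, the distinction between null and zero needs to be preserved. The sketch below assumes that a null appears as an empty CSV field; `parse_number` and `total_reported` are hypothetical helper names.

```python
def parse_number(value):
    """Convert a raw CSV field to a float, keeping missing values as None."""
    if value is None or value.strip() == "":
        return None  # missing value, not the same as a reported zero
    return float(value)


def total_reported(values):
    """Sum the reported values; return None when nothing was reported at all."""
    reported = [v for v in map(parse_number, values) if v is not None]
    return sum(reported) if reported else None


print(parse_number(""))
print(parse_number("0"))
print(total_reported(["", "0", "3"]))
```

Returning `None` for an all-missing series avoids reporting a total of zero where no data exists.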

For information about each table, see the corresponding documentation linked above.

Main table

Flat table with records from all other tables joined by key and date. See above for links to the documentation for each individual table. Due to technical limitations, not all tables can be included as part of this main table.

Notes about the data

For countries where both country-level and subregion-level data is available, the entry which has a null value for the subregion level columns in the index table indicates upper-level aggregation. For example, if a data point has values {country_code: US, subregion1_code: CA, subregion2_code: null, ...} then that record will have data aggregated at the subregion1 (i.e. state/province) level. If subregion1_code were null, then it would be data aggregated at the country level.

Another way to tell the level of aggregation is the aggregation_level column of the index table. See the schema documentation for more details about how to interpret it.
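The null-based rule described above can be sketched as a small function. The name `infer_aggregation_level` is hypothetical, nulls are assumed to arrive as `None` or an empty string, and level 3 (localities) is deliberately not handled; the aggregation_level column remains the more reliable source.

```python
def infer_aggregation_level(record):
    """Infer the aggregation level (0, 1 or 2) from null subregion codes."""

    def is_null(value):
        # Nulls may arrive as None (JSON) or as an empty string (CSV).
        return value is None or value == ""

    if is_null(record.get("subregion1_code")):
        return 0  # country
    if is_null(record.get("subregion2_code")):
        return 1  # province, state, or local equivalent
    return 2  # municipality, county, or local equivalent


example = {"country_code": "US", "subregion1_code": "CA", "subregion2_code": None}
print(infer_aggregation_level(example))
```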

Please note that the country-level data and the region-level data sometimes come from different sources, so the sum of all region-level values may not exactly equal the reported country-level value. See the data loading tutorial for more information.

Data updates

The data for each table is updated at least daily. Individual tables, for example Epidemiology, have fresher data than the main table and are updated multiple times a day. Each individual data source has its own update schedule and some are not updated at regular intervals; the data tables hosted here only reflect the latest data published by the sources.

Contribute

Technical contributions to the data extraction pipeline are welcome. Take a look at the source directory for more information.

If you spot an error in the data, feel free to open an issue on this repository and we will review it.

If you do something with this data, for example a research paper or work related to visualization or analysis, please let us know!

For Data Owners

We have carefully checked the license and attribution information on each data source included in this repository, and in many cases have contacted the data owners directly to ask how they would like to be attributed.

If you are the owner of a data source included here and would like us to remove data, add or alter an attribution, or add or alter license information, please open an issue on this repository and we will happily consider your request.

Licensing

The output data files are published under the CC BY license. All data is subject to the terms of agreement individual to each data source; refer to the sources of data table for more details. All other code and assets are published under the Apache License 2.0.

Sources of data

All data in this repository is retrieved automatically. When possible, data is retrieved directly from the relevant authorities, like a country's ministry of health.

Data	Source	License and Terms of Use
Metadata	Wikipedia	Terms of Use
Metadata	Eurostat	CC BY
Demographics	Wikidata	CC0
Demographics	DataCommons	Attribution required
Demographics	WorldBank	CC BY
Demographics	WorldPop	CC BY
Economy	Wikidata	CC0
Economy	DataCommons	Attribution required
Economy	WorldBank	CC BY
Geography	Wikidata	CC0
Geography	WorldBank	CC BY
Health	Wikidata	CC0
Health	WorldBank	CC BY
Weather	NOAA	Attribution required, non-commercial use
Google Mobility data	https://www.google.com/covid19/mobility/	Google Terms of Service
Google Search Trends	https://console.cloud.google.com/marketplace/product/bigquery-public-datasets/covid19-search-trends	Google Terms of Service
Emergency declarations and mitigation policies	LawAtlas	CC BY
Government response data	Oxford COVID-19 government response tracker	CC BY
Country-level data	ECDC	Attribution required
Country-level data	Our World in Data	CC BY
Country-level data	WHO	Attribution required
Afghanistan	HDX	CC BY
Argentina (2010 Census)	Instituto Nacional de Estadística y Censos	Public domain
Argentina	Datos Argentina	Public domain
Australia	https://covid-19-au.com/	Attribution required
Austria	COVID19 EU Data	MIT
Bangladesh	http://covid19tracker.gov.bd	Public Domain
Belgium	Belgian institute for health	Attribution required
Brazil	Brazil Ministério da Saúde	Creative Commons Atribuição
Brazil (Rio de Janeiro)	http://www.data.rio/	Dados abertos
Brazil (Ceará)	https://saude.ce.gov.br	Dados abertos
Canada	Department of Health Canada	Attribution required
Canada	COVID-19 Canada Open Data Working Group	CC BY
Chile	Ministerio de Ciencia de Chile	Terms of use
Chile (2017 Census)	Instituto Nacional de Estadística	CC BY
China	DXY COVID-19 dataset	MIT
Colombia	Datos Abiertos Colombia	Attribution required
Czech Republic	Ministry of Health of the Czech Republic	Open Data
Democratic Republic of Congo	HDX	CC BY
Estonia	Health Board of Estonia	Open Data
Finland	Finnish institute for health and welfare	CC BY
France	data.gouv.fr	Open License 2.0
Germany	Robert Koch Institute	Attribution Required
Haiti	HDX	CC BY
Hong Kong	Hong Kong Department of Health	Attribution Required
Israel	Israel Government Data Portal	Attribution Required
Israel (2019 Census)	Central Bureau of Statistics	Attribution Required
India	Wikipedia	Attribution Required
India	Covid 19 India Organisation	CC BY
Indonesia	https://covid19.go.id/peta-sebaran	Public Domain
Indonesia (2020 Census)	Central Bureau of Statistics	Attribution required
Italy	Italy's Department of Civil Protection	CC BY
Iraq	HDX	CC BY
Japan	https://github.com/swsoyee/2019-ncov-japan	MIT
Japan	https://github.com/kaz-ogiwara/covid19	MIT
Libya	HDX	CC BY
Luxembourg	data.public.lu	CC0
Malaysia	Wikipedia	Attribution Required
Mexico	Secretaría de Salud Mexico	Attribution Required
Mexico (2010 Census)	INEGI	Attribution Required
Netherlands	RIVM	Public Domain
New Zealand	Ministry of Health	CC BY
Norway	COVID19 EU Data	MIT
Pakistan	Wikipedia	Attribution Required
Peru	Datos Abiertos Peru	ODC BY
Peru (2017 Census)	INEI	ODC BY
Philippines	Philippines Department of Health	Attribution required
Poland	COVID19 EU Data	MIT
Portugal	COVID-19: Portugal	MIT
Romania	https://github.com/adrianp/covid19romania	CC0
Romania	https://datelazi.ro/	Terms of Service
Russia	https://стопкоронавирус.рф	CC BY
Slovenia	https://www.gov.si	Attribution Required
South Africa	FinMango COVID-19 Data	CC BY
South Korea	Wikipedia	Attribution Required
Spain	Government Authority	Attribution required
Spain (Canary Islands)	Gobierno de Canarias	Attribution required
Spain (Catalonia)	Dades Obertes Catalunya	CC0
Spain (Madrid)	Datos Abiertos Madrid	Attribution required
Sudan	HDX	CC BY
Sweden	Public Health Agency of Sweden	Fair Use
Switzerland	OpenZH data	CC BY
Taiwan	Ministry of Health and Welfare	Attribution Required
Thailand	Ministry of Public Health	Fair Use
Ukraine	National Security and Defense Council of Ukraine	CC BY
United Kingdom	https://github.com/tomwhite/covid-19-uk-data	The Unlicense
United Kingdom	https://coronavirus.data.gov.uk/	Attribution required, Open Government Licence v3.0
USA (2019 Census)	United States Census Bureau	Public Domain
USA	NYT COVID Dataset	Attribution required, non-commercial use
USA	Imperial College of London	CC BY
USA	COVID Tracking Project	CC BY
USA (Alaska)	Alaska Department of Health and Social Services
USA (D.C.)	Government of the District of Columbia	Public Domain
USA (Delaware)	Delaware Health and Social Services	Public Domain
USA (Florida)	Florida Health	Public Domain
USA (Indiana)	Indiana Department of Health	CC BY
USA (Massachusetts)	MCAD COVID-19 Information & Resource Center	Public Domain
USA (New York)	New York City Health Department	Public Domain
USA (San Francisco)	SF Open Data	Public Domain Dedication and License
USA (Texas)	Texas Department of State Health Services	Attribution required
USA (Washington)	Washington State Department of Health	Public Domain
Venezuela	HDX	CC BY

Running the data extraction pipeline

See the source documentation for more technical details.

Acknowledgments and collaborations

This project has been done in collaboration with FinMango, which provided great insights about the impact of the pandemic on the local economies and also helped with research and manual curation of data sources for many regions including South Africa and US states.

Stratified mortality data for US states is provided by Imperial College of London. Please refer to this list of maintainers and contributors for the individual acknowledgements.

The following persons have made significant contributions to this project:

Oscar Wahltinez
Kevin Murphy
Michael Brenner
Matt Lee
Anthony Erlinger
Mayank Daswani
Pranali Yawalkar
Zack Ontiveros
Ruth Alcantara
Donny Cheung

Recommended citation

Please use the following when citing this project as a source of data:

@article{Wahltinez2020,
  author = "O. Wahltinez and others",
  year = 2020,
  title = "COVID-19 Open-Data: curating a fine-grained, global-scale data repository for SARS-CoV-2",
  note = "Work in progress",
  url = {https://goo.gle/covid-19-open-data},
}
