
Publicly available structured COVID-19 data from India, generated automatically from the health bulletins published by states


Covid-19 India Data 🇮🇳


Availability of COVID-19 data is crucial for researchers and policy makers to understand the progression of the pandemic and react to it in real time. Here is a recent plea from researchers in India for urgent access to COVID data collected by government agencies. Individual states and cities in India provide detailed information about the current local situation of COVID-19 in their daily media bulletins. However, such data (usually in the form of PDF documents) is not readily accessible in structured form.

While there are fantastic crowd-sourced efforts underway to curate such data, manual approaches cannot scale to the volume of the data produced over the long term. This project began in anticipation of that outcome; unfortunately, it has already come to pass.

Project Goals

In this project, we use AI-assisted document and image extraction techniques to automate the extraction of such data in structured (SQL) form from the state-level daily health bulletins, and we aim to make this data readily (and freely) available for further research and analysis. The target is to automate the data extraction and curation for each Indian state, so that once the extraction process for a state is complete, we can be on "autopilot" for that state, requiring little to no continued manual curation (other than to respond to changes in schema).
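
To make "structured (SQL) form" concrete, here is a minimal sketch of storing one extracted bulletin row with Python's built-in `sqlite3` module. The table name, column names, and numbers are hypothetical and chosen only for illustration; they are not the project's actual database layout.

```python
import sqlite3

# Hypothetical schema: table and column names are illustrative only,
# not the project's actual database layout.
SCHEMA = """
CREATE TABLE IF NOT EXISTS daily_bulletin (
    state_code    TEXT NOT NULL,
    bulletin_date TEXT NOT NULL,  -- ISO 8601 date string
    new_cases     INTEGER,
    new_deaths    INTEGER,
    PRIMARY KEY (state_code, bulletin_date)
)
"""


def save_record(conn, record):
    """Insert one bulletin row, replacing any earlier row for the same state and date."""
    conn.execute(
        "INSERT OR REPLACE INTO daily_bulletin "
        "(state_code, bulletin_date, new_cases, new_deaths) "
        "VALUES (:state_code, :bulletin_date, :new_cases, :new_deaths)",
        record,
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)

# Made-up numbers, for illustration only.
save_record(conn, {"state_code": "DL", "bulletin_date": "2021-01-01",
                   "new_cases": 10, "new_deaths": 1})
# Re-running the extraction for the same day overwrites the earlier row.
save_record(conn, {"state_code": "DL", "bulletin_date": "2021-01-01",
                   "new_cases": 12, "new_deaths": 1})

rows = conn.execute(
    "SELECT state_code, bulletin_date, new_cases, new_deaths FROM daily_bulletin"
).fetchall()
```

Keying each row on state and date is one possible design choice: it lets an automated job re-process a bulletin without creating duplicates.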

How to Contribute

The following are a few ways to get going. In general, you can pick up any unassigned issue, or issues tagged with help wanted, from the issue board.

✊ Own a State


This is the biggest way you can contribute in the beginning stages of the project. "Owning a state" involves:

  1. Writing the data extraction code for the bulletins of the state. This repository provides the starting code and helper packages to make this as simple as possible. See here for instructions.

  2. Eventually reacting (or helping others react) to additions or changes in the schema of the bulletins put out by that state. The schemas have remained quite stable so far, but changes may show up in a few states as the pandemic evolves.

For the project to succeed, this is the most crucial part. Once the data extraction code for a state is done, the logging of data for that state is automatic, and we can scale up to the rest of the country over time. We hope we can get to good coverage before support for covid19india.org ends on Oct 31. 🤞
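
The two responsibilities of a state owner, extracting fields and noticing schema changes, can be sketched roughly as follows. This is not the repository's helper API: the field labels, the `FIELD_PATTERNS` table, and the `extract_fields` function are all hypothetical, and the sketch assumes the bulletin text has already been pulled out of the PDF.

```python
import re

# Hypothetical field labels: real bulletins differ by state and change over time.
FIELD_PATTERNS = {
    "new_cases": re.compile(r"Positive cases today\s*[:\-]?\s*([\d,]+)", re.IGNORECASE),
    "new_deaths": re.compile(r"Deaths today\s*[:\-]?\s*([\d,]+)", re.IGNORECASE),
}


def extract_fields(bulletin_text):
    """Return (values, missing): parsed integers, plus the fields whose label was not found."""
    values, missing = {}, []
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(bulletin_text)
        if match is None:
            # A label that stops matching is a hint that the bulletin schema changed.
            missing.append(name)
        else:
            values[name] = int(match.group(1).replace(",", ""))
    return values, missing


# Made-up bulletin text, for illustration only.
sample = "Positive cases today: 1,234\nDeaths today - 5\n"
values, missing = extract_fields(sample)

# A reworded bulletin no longer matches, so every field is reported as missing.
changed_values, changed_missing = extract_fields("Cases reported in last 24 hours: 7")
```

Reporting missing fields, rather than silently writing nulls, is one way an automated run could alert the state owner that the bulletin format needs attention.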

😒 Data Cleaning

  1. Remove or otherwise handle missing data for the plots.
  2. Identify possible outliers and errors.
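
As a starting point for both tasks, here is a minimal sketch using only the Python standard library. Forward-filling gaps and flagging values by median absolute deviation (MAD) are just two of many possible choices, and the function names, threshold, and sample numbers are hypothetical.

```python
from statistics import median


def fill_missing(series):
    """Forward-fill None entries with the last known value; leading gaps stay None."""
    filled, last = [], None
    for value in series:
        if value is not None:
            last = value
        filled.append(last)
    return filled


def flag_outliers(series, threshold=5.0):
    """Return indices whose distance from the median exceeds threshold * MAD."""
    known = [v for v in series if v is not None]
    if not known:
        return []
    center = median(known)
    mad = median(abs(v - center) for v in known)
    if mad == 0:
        # All known values are (nearly) identical; nothing can be flagged this way.
        return []
    return [i for i, v in enumerate(series)
            if v is not None and abs(v - center) > threshold * mad]


# Made-up daily counts with one gap and one suspicious spike.
daily_cases = [100, 110, None, 105, 9000, 108]
filled = fill_missing(daily_cases)
suspects = flag_outliers(daily_cases)
```

Flagged values are only candidates for review: a spike may be a data entry error, or a genuine backlog that a state reported on a single day.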

🤓 Analysis

Analyze the data for insights, irregularities, etc. You can publish the results of your analysis in your own papers, blogs, etc. (and link to them from our landing page), or add them directly to our landing page as a new page.

Current state roster

| State | Link to Daily Bulletin | Owner (backend) | Owner (frontend) | Status |
| --- | --- | --- | --- | --- |
| Delhi (DL) | Link | Mayank Agarwal | Tathagata Chakraborti | ✅ COMPLETE (Wiki) |
| West Bengal (WB) | Link | Mayank Agarwal | Tathagata Chakraborti | ✅ COMPLETE (Wiki) |
| Telangana (TG) | Link | Mayank Agarwal | Tathagata Chakraborti | 🚧 IN PROGRESS (#4) |
| Tamil Nadu (TN) | Link | | | ⌛ Own it! (#5) |
| Karnataka (KA) | Link | | | ⌛ Own it! (#6) |
| Kerala (KL) | Link | | | ⌛ Own it! (#7) |
| Madhya Pradesh (MP) | Link | | | ⌛ Own it! (#8) |
| Punjab (PB) | Link | | | ⌛ Own it! (#9) |
| Uttarakhand (UK) | Link1, Link2 | | | ⌛ Own it! (#10) |
| Add new state | | | | |

As you might have noticed, this is an incomplete list of Indian states. Not all states produce this form of data. ☹️ We will continue adding new sources over time. A comprehensive list of sources can be found at https://api.covid19india.org/csv/latest/sources_list.csv

Interested? Join the Community

Join us on Slack.