COVID-19 India JSON-to-CSV Data Extractor
The Datasource
COVID 19 India maintains code and structured-data for COVID-19 statistics, pulled from multiple sources across India with regular updates. This data is made available by them in both JSON and CSV formats through Github.
A comprehensive data-set among these, is the data-all.json
(https://api.covid19india.org/v4/data-all.json) which contains district-wise and state-wise statistics for all districts and states in India, from 1st confirmed case (30th January 2020) till present day. The JSON scheme has date [YYYY-MM-DD]
as the key, with stats in the v4
(https://api.covid19india.org/documentation/v4_data.html) format as values.
The covid-stats.py script
The file covid-stats.py
is a Python3 script usable as a command-line tool. This script can extract specific content of interest from data-all.json
and save it into a CSV file. This CSV file could then be imported into Excel, Google-Sheets etc for further analysis.
The script provides the ability to:
- Extract content either across India, or for a single state in India
- Extract only selected metrics from the
data-all.json
file, and optionally extract within a date-range - Extract the names of all metrics (flattened json-keys) such that a mapping could be created.
- The mapping allows you to then handpick metrics of interest, reorder them and supply custom column names for the CSV.
- The mapping can be stored on disk as a JSON.
- Use a mapping file as a guide/configuration to extract metrics of interest and store it into a CSV.
Prerequisites & Theory
NOTE: This section is verbose to help serve those who aren't familiar with Python 3, CLI or JSON. Advanced users may jump to sections of relevance directly
This script requires Python 3 installed on a system. You'll also require the script covid-stats.py and the data-all.json files downloaded into a common folder on your system.
If you are on Windows, you'll need to run python3 covid-stats.py
from the Windows Command Prompt in the folder where the above files are placed.
If you are on *nix (Linux, MacOS), you'll need to run ./covid-stats.py
from a Terminal. Remember to set execute permissions: chmod +x ./covid-stats.py
Instructions henceforth will use the *nix format (./covid-stats.py
). Windows users; when you see this, use python3 covid-stats.py
.
Viewing CLI switches
You can use the -h
switch to see help, which lists all the switches.
> ./covid-stats.py -h
usage: covid-stats.py [-h] [-i INPUT_FILE] [-u URL] [-n] [-o OUTPUT]
[-x] -s STATE [-c] [-m MAPPING] [--na NA]
[--start START] [--end END]
optional arguments:
-h, --help show this help message and exit
-i INPUT_FILE, --input-file INPUT_FILE
Input json to fetch data from
-u URL, --url URL URL from which to source data
-n, --net Fetch data from a URL
-o OUTPUT, --output OUTPUT
Set output file for CSV (optional) or mapping
JSON (mandatory).
-x, --overwrite Overwrite output file if it exists.
-s STATE, --state STATE
Two character Indian state code for which stats
must be processed (Ex: KA, HP etc). Use 'IN'
for all-India.
-c, --columns Print only column names as a dict
-m MAPPING, --mapping MAPPING
Path to key-mapping file
--na NA Replacement string for 'no-value'
--start START Include data on or after this date (YYYY-MM-DD)
--end END Exclude data beyond this date (YYYY-MM-DD)
Types of operations
At any given time, you can get the script to do one of the two operations:
- Generate a "mapping" file (a JSON file) comprising of all flattened keys for different metrics available in
data-all.json
filtered based on the supplied conditions. You can modify the mapping file, to further suit your convenience. - Generate a CSV file with data extracted from the
data-all.json
file, against filter conditions provided both on command line, and optionally from the "mapping" file.
Sections below talk about each step in detail
Mapping metrics of interest
The data-all.json
has its data in a hierarchical structure. The target format is a CSV (x*y grid without hierarchy)
If the hierarchy had to be flattened such that each metric was represented as a independent column in a CSV, this would result in over 6700+ metrics (states-in-India * metrics-per-state, states-in-India->Districts-in-each-state->metrics-per-district etc). A likely scenario is that a subset of metrics are of interest, and the script could extract only those metrics which are required for analysis.
The script achieves this by generating a "mapping" file for either a state in India, or for the whole of India. A "mapping" file is a JSON file with flattened keys dumped as keys: values
in a new JSON. A flattened key is a path leading to a value from the hierarchy in data-all.json
, represented as a hyphen -
separated string in the "mapping" JSON.
Example:
A nested JSON:
1. "KA" is key at depth-1. It's value is another JSON object with two keys "Bengaluru" and "Total".
2. "KA"->"Bengaluru" is a key at depth-2. It's value is yet another JSON object, with one key "Confirmed"
3. "KA"->"Bengaluru"->"Confirmed" is a key at depth-3, with a value 1000
4. "KA"->"Total" is a key at depth-2, with a value 1000
{
"KA": {
"Bengaluru": {
"Confirmed": 1000
}
"Total": 1000
}
}
A flattened, representation would have:
{
"KA-Bengaluru-Confirmed": 1000,
"KA-Total": 1000
}
^^^ The flattened JSON has all keys at depth-1, with no hierarchy
What is the purpose of mapping
Mapping allows keys of interest to be filtered, ordered and renamed, such that data is extracted only for metrics of interest. When the Mapping is used as reference to extract and write data to into the CSV, it is done based on the order in which it is found in the mapping, using the values as column names.
Example: A mapping generated by the script would look like this:
{
"KA-Bengaluru-Confirmed": "KA-Bengaluru-Confirmed",
"KA-Total": "KA-Total"
}
The keys and values are the same in the generated mapping. However, you are welcome to modify it.
The key on the left is used to identify the metric from the data-all.json
. So, this key must be left as is.
The value on the right is the name you wish to provide to the column in the CSV. So, this could be modified. You may also re-order the keys. The order in the mapping JSON (top-to-bottom) determines the order in the CSV (left-to-right)
Example:
{
"KA-Total": "KA-TOTAL"
"KA-Bengaluru-Confirmed": "KA-BLR-CNF",
}
The mapping above will result in extraction of data KA-Total
and KA-Bengaluru-confirmed
from the data-all.json
. However, in the CSV, they'll be written with KA-TOTAL
as the first column, and KA-BLR-CNF
as the 2nd column (to its right).
Usage
How do the switches work?
The -i
switch is used to supply the path to the data-all.json
on your disk. By default, the script tries look for a data-all.json
in your current working directly. If you have one, then you need not supply this switch. The recommended way to use this tool is to download the data-all.json
ahead of time, place it alongside the script, and use the script locally.
An advanced option is to use
-n
to point the script to download the latest copy ofdata-all.json
from the default COVID-19-India's URL. You can additionally supply a-u
option to override the default URL (likely unnecessary). NOTE: This requires therequests
library to be installed (pip3 install requests
)
The -o
switch is used to define the output file. The output is used for writing both the CSV file as well as the mapping file. By default, the script avoids accidental rewrites, and quits if the output file already exists on disk. Use -x
to allow overwriting.
The -c
switch gets the script to dump a mapping file. When this is not supplied, it is assumed that you wish to write a CSV.
The -s
switch indicates region. At this point, this is restricted to the two character codes of states in India (KA
, KL
, TN
, etc.) or IN
for the whole of India.
The --start
and --end
switches will restrict processing to dates which fall in between the start and end dates (both inclusive).
The --na
switch allows defining a custom "empty" value, when an actual value is not found in the data-all.json
. By default this is "0" as the metrics are mostly numbers.
Usage Examples
NOTE: In both examples -i data-all.json
is optional if you have data-all.json
stored locally in your current folder
1. Generate a mapping for the state of Karnataka
./covid-stats.py -i data-all.json -s KA -c -o KA-BLR.json
Note the -c
being passed here to indicate dumping the mapping columns. The this command fetchs all keys related to Karnataka (KA) and dumps them in flattened form into a mapping file KA-BLR.json
Once this is done, you can manually edit the KA-BLR.json
file, remove entries that you don't require, rename the values to your desired column names, or change the order of the entries. Save this file after your modifications.
A modified KA-BLR.json is available for reference.
2. Use a mapping file to extract data into a CSV
./covid-stats.py -i data-all.json -s KA -m KA-BLR.json --start 2020-09-01 --end 2020-09-10 -o KA-DATA.csv
Note the lack of -c
in the above command, which indicates that a CSV is being written and not a mapping file.
Also, note the -m KA-BLR.json
which tells the script to use the mapping file KA-BLR.json
which was edited in the previous step. Only the metrics of interest from KA-BLR.json
are fetched. Additionally, the data is fetched only for dates 1st September 2020 to 10th September 2020 (both dates inclusive), and written to KA-DATA.csv
The contents of KA-DATA.csv
will be as follows:
"DATE","BLR-CNF","BLR-DEC","BLR-OTH","BLR-REC","BLR-TST","KA-CNF","KA-DEC","KA-OTH","KA-REC","KA-TST"
"2020-09-01","132092","2005","1","91180","892219","351481","5837","19","254626","2979477"
"2020-09-02","135512","2037","1","93563","919200","361341","5950","19","260913","3052794"
"2020-09-03","138701","2066","1","96194","947518","370206","6054","19","268035","3123918"
"2020-09-04","141664","2091","1","97926","978563","379486","6170","19","274196","3197110"
"2020-09-05","144757","2125","1","101152","1009349","389232","6298","19","283298","3273871"
"2020-09-06","147581","2163","1","105692","1038077","398551","6393","19","292873","3348255"
"2020-09-07","150523","2211","1","108642","1060813","404324","6534","19","300770","3393676"
"2020-09-08","153625","2266","1","110972","1060813","412190","6680","19","308573","3461119"
"2020-09-09","157044","2307","1","112536","1112090","421730","6808","19","315433","3531441"
"2020-09-10","160205","2340","1","114208","1137027","430947","6937","19","322454","3586150"
Changelog
0.0.2
- Added version tracking in the code
- Switches '-o/--output' with '-x' now apply to output of both CSV as well as mapping JSON depending on context.
- In the mapping JSON, the custom column names used to be empty strings to allow custom names to be filled in. Now they are the same as the original column names, so, you can rename only those that you need.
- The script prints on console about whether it wrote a CSV or a Mapping JSON.
0.0.1
Initial release of the script