Scrapes data from the UDISE+ dashboard and generates a consolidated file for all states.
- `scrape.js` runs the crawler and downloads the exported XLSX files locally.
- `convert.sh` uses the xlsx2csv script to convert the XLSX files to CSV.
- `aggregate.py` consolidates the data from the individual CSVs into a single CSV.

The final XLSX output is sorted and cleaned manually.
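For context, once everything below is configured, a complete run boils down to three commands:

```sh
node scrape.js       # crawl the dashboard and download the XLSX exports
./convert.sh         # convert the downloaded XLSX files to CSV
python aggregate.py  # merge the individual CSVs into one file
```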
You'll need npm or yarn installed on your machine, along with Python 3.x.
1. Download this directory. Run `yarn` or `npm install` to install the Node dependencies.
2. Download the `xlsx2csv.py` file from here and place it in the root directory.
3. If you're on a UNIX system, run `chmod +x convert.sh` to make the shell script executable.
4. Edit the `scrape.js` file based on your requirements. You'll have to set the CSS selector path for the report you're looking to scrape, and edit the `downloadData` function as per your needs. Be sure to set sufficient timeout values (see the sketches after this list). Then run `node scrape.js`. It'll download the XLSX files into a new folder called `exports`.
5. Edit the path in the `convert.sh` file as per your downloaded report (sketched below), then run `./convert.sh`. It'll create CSV files for all the XLSX files in the same `exports` folder.
6. Edit the path and the final output filename in the `aggregate.py` file (sketched below). You might have to adjust it further based on what the downloaded reports look like. Then run `python aggregate.py`. It'll create a consolidated CSV file combining all the reports into one.
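The editable parts of `scrape.js` might look roughly like the hypothetical sketch below, assuming the crawler is built on Puppeteer; the selector, the dashboard URL, the `exports` download path, and all timeout values are placeholder assumptions to adapt:

```js
// Hypothetical sketch, assuming a Puppeteer-based crawler.
const puppeteer = require('puppeteer');
const path = require('path');

// Placeholder: replace with the CSS selector path of the report's
// export control, copied from the dashboard's DOM.
const REPORT_SELECTOR = '#reportContainer .export-xlsx';

async function downloadData(page) {
  // The dashboard can be slow to render, so use generous timeouts.
  await page.waitForSelector(REPORT_SELECTOR, { timeout: 60000 });
  await page.click(REPORT_SELECTOR);
  // Give the XLSX export time to finish downloading.
  await new Promise((resolve) => setTimeout(resolve, 15000));
}

(async () => {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();

  // Route downloads into ./exports so the later steps can find them.
  const client = await page.target().createCDPSession();
  await client.send('Page.setDownloadBehavior', {
    behavior: 'allow',
    downloadPath: path.resolve(__dirname, 'exports'),
  });

  await page.goto('https://dashboard.udiseplus.gov.in/', {
    waitUntil: 'networkidle2',
    timeout: 120000,
  });
  await downloadData(page);
  await browser.close();
})();
```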
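`convert.sh` itself can be as small as the sketch below, assuming the standard `xlsx2csv.py` command line (`python xlsx2csv.py <infile> <outfile>`) and the default `exports` folder; adjust the path to match your downloaded report:

```sh
#!/bin/sh
# Hypothetical sketch: convert every XLSX in exports/ to a CSV alongside it.
# The exports/ path is an assumption -- edit it to match your report folder.
for f in exports/*.xlsx; do
  python xlsx2csv.py "$f" "${f%.xlsx}.csv"
done
```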
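And a rough shape for `aggregate.py`, assuming all the per-report CSVs share the same header row; the glob pattern and the output filename are placeholders to edit:

```python
# Hypothetical sketch: concatenate all report CSVs into a single file,
# keeping the header row only once. Paths below are placeholders.
import csv
import glob

OUTPUT = "consolidated.csv"  # placeholder output filename

with open(OUTPUT, "w", newline="") as out:
    writer = None
    for src in sorted(glob.glob("exports/*.csv")):
        with open(src, newline="") as f:
            reader = csv.reader(f)
            header = next(reader, None)
            if header is None:
                continue  # skip empty files
            if writer is None:
                writer = csv.writer(out)
                writer.writerow(header)  # write the shared header once
            writer.writerows(reader)  # append this file's data rows
```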
Lastly, open the output of step 6 in Excel and modify it as you like.