This repository contains code analyzing the impact of COVID-19 on rideshare in Chicago. The setup, ingest, and update phases are heavily reliant on Todd W Schneider's excellent work: you can see his repo here and his corresponding blog post here.
WARNING: As of May 2021, the TNP data is ~50GB all in and the ingest process for trip data took several hours on my laptop - not for the faint of heart.
Most raw data comes from the City of Chicago Data Portal:
- TNP Data (Uber/Lyft)
- Spatial Data
Additional Census Tract-level information is pulled from the American Community Survey using Kyle Walker's handy tidycensus package.
This section mirrors most of the steps outlined by Schneider, but my setup differs in a few ways: I use a machine running Windows 10, and the scope of this analysis also includes driver, vehicle, and census data. Your mileage may vary.
- Install Git and GitHub Desktop. If necessary, add additional required utilities like Wget.
  I run the ETL script using Git Bash, but I prefer GitHub Desktop to the command line for everything else. If you are not already running Linux, you will also need Wget to request the TNP files from the Data Portal.
- Install PostgreSQL and PostGIS. Add Postgres to your PATH.
  I recommend downloading PostgreSQL 12.4: at the time of writing, PostGIS and other extensions were not yet available for PostgreSQL 13. The installer for PostgreSQL allowed me to download the spatial extension (PostGIS) directly, but you may have to do so separately.
  You should also add Postgres to your PATH so that the Git Bash shell can run psql commands. The scripts as written reference the default user, postgres. You may have to store your password in the PGPASSWORD environment variable if Git Bash keeps demanding a password. To make sure everything is working, run the following in Git Bash:
  psql -U postgres -c 'create database test;'
- Analysis and Census data ingest are done using R scripts and RMarkdown.
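Before kicking off the ingest, it can help to confirm that the tools listed above are actually visible from Git Bash. The following is a minimal sketch, not part of the repository: missing_tools is a hypothetical helper name, and the tool list (git, wget, psql) is simply what this setup section mentions.

```shell
# Hypothetical helper (not part of this repo): print each named tool
# that cannot be found on the PATH, one per line.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# Tools mentioned in the setup steps.
missing=$(missing_tools git wget psql)

if [ -n "$missing" ]; then
  echo "Not found on PATH:"
  echo "$missing"
else
  echo "git, wget, and psql are all on the PATH."
fi

# If psql keeps prompting for a password, exporting PGPASSWORD for the
# current session is one option (placeholder value shown):
# export PGPASSWORD='YOUR PASSWORD HERE'
```

If anything is reported as missing, fix the PATH (or install the tool) before running the ETL script, since the ingest takes hours and is painful to restart.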
Within the shell_scripts/ subfolder, open Git Bash and run the following to grab TNP trips, drivers, and vehicles data:
./01_etl_tnp_script.sh
This process will take several hours. If it completes correctly, PgAdmin should show a database called chicago_tnp_data with populated trips, drivers, and vehicles tables. To free up space, you can then remove the temporary CSVs created in the data/ subfolder during the download phase by running:
./03_delete_csvs.sh
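If you would rather check the ingest result from Git Bash than from PgAdmin, a sketch like the one below asks Postgres whether each table contains at least one row. It assumes the postgres user and the chicago_tnp_data database described above, and that the tables live in the default schema; exists_query is a hypothetical helper name. It uses EXISTS rather than COUNT(*) because counting every row in a table as large as trips can be slow.

```shell
# Hypothetical helper: build a query that returns t if the table has any rows.
exists_query() {
  echo "SELECT EXISTS (SELECT 1 FROM $1);"
}

DB_NAME="chicago_tnp_data"

if command -v psql >/dev/null 2>&1; then
  for table in trips drivers vehicles; do
    echo "Checking $table:"
    # -t prints tuples only (no headers); -A disables aligned output;
    # -w never prompts for a password (relies on PGPASSWORD or similar).
    psql -U postgres -d "$DB_NAME" -w -tA -c "$(exists_query "$table")" \
      || echo "Could not query $table"
  done
else
  echo "psql not found on PATH; skipping table check."
fi
```

A populated table should print t. It is probably safest to run this check before deleting the CSVs.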
Within the analysis/ subfolder, open rideshare_analysis.Rmd. To connect to the data stored in Postgres, you must have a config.yml file saved within analysis/ that defines the following variables:
default:
  host: "localhost"
  dbname: "chicago_tnp_data"
  port: 5432
  user: "postgres"
  password: "YOUR PASSWORD HERE"
  api_key_census: "YOUR KEY HERE"
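A missing or misspelled key in config.yml tends to surface only as a connection error inside R, so a quick check from the shell can save some time. The sketch below is illustrative only: missing_config_keys is a hypothetical helper, it looks for key names with grep rather than parsing the YAML, and it assumes it is run from the analysis/ subfolder.

```shell
# Hypothetical helper: print each required key that does not appear
# as "key:" at the start of a line (ignoring indentation) in the file.
missing_config_keys() {
  file="$1"
  shift
  for key in "$@"; do
    grep -q "^[[:space:]]*$key:" "$file" || echo "$key"
  done
}

CONFIG_FILE="config.yml"

if [ -f "$CONFIG_FILE" ]; then
  missing=$(missing_config_keys "$CONFIG_FILE" \
    host dbname port user password api_key_census)
  if [ -n "$missing" ]; then
    echo "Keys missing from $CONFIG_FILE:"
    echo "$missing"
  else
    echo "All expected keys are present in $CONFIG_FILE."
  fi
else
  echo "$CONFIG_FILE not found in $(pwd)."
fi
```

This only confirms that the keys exist; it does not validate the values or test the database connection.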
Within the analysis/ subfolder, open and run spatial_import.R.
The city releases new TNP data every quarter. To update the data, navigate to the shell_scripts/ subfolder and run:
./02_update_tnp_data.sh
This script only grabs the latest TNP trip data, rather than rebuilding the entire database. TNP driver and vehicle data, however, are dumped and rebuilt from scratch. Compared to trips, these are relatively small (they only take a couple of minutes), but the process could be optimized in the future.
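Since the update runs roughly once a quarter, it can be handy to keep a dated log of each run. The wrapper below is a sketch under assumptions: log_name and the update_YYYY-MM-DD.log naming are hypothetical, and it expects to be run from shell_scripts/.

```shell
# Hypothetical helper: build a log file name from a date string.
log_name() {
  echo "update_$1.log"
}

UPDATE_SCRIPT="./02_update_tnp_data.sh"
LOG_FILE=$(log_name "$(date +%Y-%m-%d)")

if [ -x "$UPDATE_SCRIPT" ]; then
  # Show output in the terminal and also save it to the log.
  "$UPDATE_SCRIPT" 2>&1 | tee "$LOG_FILE"
else
  echo "$UPDATE_SCRIPT not found or not executable; run this from shell_scripts/."
fi
```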