This is an end-to-end data engineering project built on Citi Bike's NYC bikeshare dataset. Citi Bike is the nation's largest bikeshare program, with 25,000 bikes and over 1,500 stations across Manhattan, Brooklyn, Queens, the Bronx, Jersey City, and Hoboken.
The purpose of this project is to build an end-to-end data pipeline that extracts the Citi Bike data from the web (https://s3.amazonaws.com/tripdata/index.html), loads it into Google Cloud Storage and BigQuery, applies Kimball dimensional modeling (fact and dimension tables) using dbt, and builds a Looker Studio dashboard to visualize:
- Daily rider activity
- Trips per start station
- Average trip duration
- Bike type distribution and Membership status distribution
- Rides per month and year
The following technologies are used to build this project:
- Terraform - as the Infrastructure-as-Code (IaC) tool
- Google Cloud Storage (GCS) - as the Data Lake
- Prefect - for orchestration
- dbt - for transformation and data modeling
- Google BigQuery - as the Data Warehouse
- Google Looker Studio - for visualizations
Column | Description |
---|---|
r_id | Surrogate key built from ride_id and started_at |
ride_id | Unique ID assigned to each Citi Bike trip |
rideable_type | Type of bike the user takes out (docked, classic, electric) |
started_at | Trip start date and time |
ended_at | Trip end date and time |
start_station_name | Name of the station the trip started at |
start_station_id | Unique ID of the station the trip started at |
end_station_name | Name of the station the trip ended at |
end_station_id | Unique ID of the station the trip ended at |
start_lat | Latitude of the start station |
start_lng | Longitude of the start station |
end_lat | Latitude of the end station |
end_lng | Longitude of the end station |
member_casual | Whether the rider has a Citi Bike membership (member) or paid per ride with a credit card (casual) |
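For orientation, here is a minimal sketch (not part of this repo) of loading one monthly trip file with pandas using the columns above. The file name and zip layout are assumptions and vary by month; r_id is only derived later in dbt.

```python
# Hedged example: inspect one monthly Citi Bike trip file.
# The exact file name and zip layout on the tripdata bucket may differ.
import pandas as pd

url = "https://s3.amazonaws.com/tripdata/202301-citibike-tripdata.csv.zip"  # hypothetical month

df = pd.read_csv(
    url,
    compression="zip",                        # monthly files are zipped CSVs
    parse_dates=["started_at", "ended_at"],   # trip start/end timestamps
    dtype={
        "start_station_id": "string",         # station IDs are not purely numeric
        "end_station_id": "string",
    },
)

# r_id is a surrogate key derived later in dbt from ride_id and started_at
print(df[["ride_id", "rideable_type", "member_casual"]].head())
```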
Check out the interactive dashboard here
- First, fork this repo and clone it to your local environment.
git clone https://github.com/mojafa/citi-bike.git
- Set up your Google Cloud environment
- Create a Google Cloud Platform project
- Configure Identity and Access Management (IAM) for the service account, granting it the following roles:
- Viewer
- Storage Admin
- Storage Object Admin
- BigQuery Admin
- Download the JSON credentials and save it, e.g. to
~/.gc/<credentials>
- Install the Google Cloud SDK
- Set the environment variable to point to your GCP key, authenticate the service account, and refresh the session token
export GOOGLE_APPLICATION_CREDENTIALS=<path_to_your_credentials>.json
gcloud auth activate-service-account --key-file $GOOGLE_APPLICATION_CREDENTIALS
gcloud auth application-default login
Check out this link for a video walkthrough.
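Optionally, you can confirm the credentials are picked up correctly before moving on. This small check (assuming the google-cloud-storage package is installed) just lists the buckets visible to the service account:

```python
# Optional sanity check (not part of the repo): verify the service-account
# credentials work by listing the buckets in your project.
from google.cloud import storage  # pip install google-cloud-storage

# Picks up GOOGLE_APPLICATION_CREDENTIALS from the environment
client = storage.Client()

for bucket in client.list_buckets():
    print(bucket.name)
```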
- Install all required dependencies into your environment
conda create -n citibike python=3.10
conda activate citibike
pip install -r requirements.txt
- Set up your infrastructure
- Run the following commands to install Terraform. If you are using a different OS, choose the correct version here and swap in the corresponding download link and zip file name.
sudo apt-get install unzip
cd ~/bin
wget https://releases.hashicorp.com/terraform/1.4.1/terraform_1.4.1_linux_amd64.zip
unzip terraform_1.4.1_linux_amd64.zip
rm terraform_1.4.1_linux_amd64.zip
- Edit the variables.tf file with your corresponding variables. I recommend leaving the names of the dataset, table, and bucket as they are; otherwise you will also need to change them in the Prefect flows and dbt.
- To initialize, plan, and apply the infrastructure, adjust and run the following Terraform commands
cd terraform/
terraform init
terraform plan -var="project=<your-gcp-project-id>"
terraform apply -var="project=<your-gcp-project-id>"
- Type 'yes' when prompted.
- Set up your orchestration
- If you do not have a Prefect workspace, sign up for Prefect Cloud and create a workspace here
- Create the Prefect blocks via the Cloud UI, or adjust the variables in
/prefect/prefect_blocks.py
and run
python prefect/prefect_blocks.py
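For reference, the blocks the script registers typically look like the sketch below; the block names, bucket name, and credentials path are placeholders, not necessarily the repo's exact values.

```python
# Rough sketch of the Prefect GCP blocks; adjust names and paths to your setup.
from prefect_gcp import GcpCredentials
from prefect_gcp.cloud_storage import GcsBucket

# Service-account credentials block (path from the setup step above)
GcpCredentials(
    service_account_file="~/.gc/<credentials>.json"
).save("citibike-gcp-creds", overwrite=True)

# GCS bucket block pointing at the data-lake bucket created by Terraform
GcsBucket(
    bucket="<your-gcs-bucket>",
    gcp_credentials=GcpCredentials.load("citibike-gcp-creds"),
).save("citibike-gcs-bucket", overwrite=True)
```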
- To execute the flow, run the following commands in two different terminals
prefect agent start -q 'default'
python prefect/web_to_gcs_to_bq.py
- After running the web_to_gcs_to_bq.py flow you will find the data in Citibike_data_raw.Citibike_default_data; the flow takes around 45 minutes to complete. A condensed sketch of the pattern the flow follows is shown below.
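The sketch below illustrates the web → GCS → BigQuery pattern in heavily condensed form; the real prefect/web_to_gcs_to_bq.py handles multiple months and cleaning, and the month, block name, and project ID used here are placeholders.

```python
# Condensed sketch of a web -> GCS -> BigQuery flow (not the repo's actual code).
from pathlib import Path

import pandas as pd
from prefect import flow, task
from prefect_gcp.cloud_storage import GcsBucket


@task(retries=3)
def fetch(url: str) -> pd.DataFrame:
    """Download one monthly trip file from the tripdata bucket."""
    return pd.read_csv(url, parse_dates=["started_at", "ended_at"])


@task()
def write_to_gcs(df: pd.DataFrame, path: Path) -> None:
    """Stage the month locally as parquet and upload it to the data-lake bucket."""
    df.to_parquet(path, compression="gzip")
    gcs = GcsBucket.load("citibike-gcs-bucket")  # placeholder block name
    gcs.upload_from_path(from_path=str(path), to_path=str(path))


@task()
def write_to_bq(df: pd.DataFrame) -> None:
    """Append the month to the raw BigQuery table."""
    df.to_gbq(
        destination_table="Citibike_data_raw.Citibike_default_data",
        project_id="<your-gcp-project-id>",
        if_exists="append",
    )


@flow()
def web_to_gcs_to_bq(month: str = "202301") -> None:
    url = f"https://s3.amazonaws.com/tripdata/{month}-citibike-tripdata.csv.zip"
    df = fetch(url)
    write_to_gcs(df, Path(f"citibike_{month}.parquet"))
    write_to_bq(df)


if __name__ == "__main__":
    web_to_gcs_to_bq()
```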
- Data transformation and modeling using dbt
cd into the dbt/Citibike_project directory and run the following command
dbt build --var 'is_test_run: false'
You will get four tables in the Citibike_data_dbt dataset
- dim_neighbourhoods
- Citibike_stations_lookup
- facts_citibike
- stg_citibike_data
- Data Visualization and Dashboarding
- You can now query the data and connect it to Looker Studio to visualize it. When connecting a data source, use the facts_citibike table to build the data source in Looker; do not use the partitioned table, as you will not get any data in your report.
- Go to Looker Studio → Create → BigQuery → choose your project, dataset, and transformed table.
- Future work
- Add tests.
- Add CI/CD pipeline.
- GitHub Actions workflow to run Terraform steps (like here).
- Make the model incremental, to append only the latest data from each load.
- Transformations, documentation and data quality tests in dbt.
- Dashboard status tiles (like here).
- Include r/dataengineering subreddit posts.
- More text analysis with NLP.
- Use GPU with my VM instance.
- Streaming process with Kafka and Spark.