Motor-Vehicles-Collision-Data-Analysis-NewYork

Identifying High-Risk Areas based on accident data.

Workflow/Architecture

(workflow/architecture diagram)

About the project

This is a course project for the data-engineering-zoomcamp by DataTalksClub.

Problem: Identifying High-Risk Areas

For this project, I built a batch pipeline to process Motor Vehicle Collisions data for New York City from https://catalog.data.gov/ (an official website of the GSA's Technology Transformation Services). The Motor Vehicle Collisions crash table contains details on each crash event; every row represents one crash. The data tables cover all police-reported motor vehicle collisions in NYC. Accidents occur more frequently in certain neighborhoods and ZIP codes, and identifying these high-risk areas is crucial for improving safety measures, allocating resources effectively, and preventing accidents. The goal of this project is to pinpoint the locations where accidents are most likely to happen.

Dataset

Motor Vehicle Collisions crash dataset website

Motor Vehicle Collisions crash dataset direct link

Technologies

  • Google Cloud Platform (GCP):
    • VM instance to run the project on.
    • Cloud Storage to store the processed data.
    • BigQuery as the data source for the dashboard.
  • Terraform to create the cloud infrastructure.
  • Docker for containerization (docker-compose).
  • Python as the main programming language.
  • Airflow to run the data pipelines as DAGs.
  • Spark to pre-process the raw data.
  • dbt to perform transformations.
  • Google Data Studio to visualize the data.

Reproducing from scratch

1. To reproduce this code entirely from scratch, you will need to create a GCP account:

Set up your free GCP account! You'll get $300 of free credit or 90 days of free usage. The project was built on a GCP Debian VM instance, so you can find code snippets for this particular case here.

  • Set up your very own service account
  • Create a key in JSON format
  • Save it to your working directory
  • Download and install the Google Cloud CLI
  • Run `export GOOGLE_APPLICATION_CREDENTIALS=<path/to/service/key>.json`
  • Run `gcloud auth application-default login`
  • A new browser window will pop up asking you to authenticate the gcloud CLI. Make sure it says "You are now authenticated with the gcloud CLI!" (a quick Python sanity check is sketched after this list)
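
As a quick sanity check that the credentials work, here is a minimal Python sketch. It is not part of the original repo: the script name is illustrative, and it assumes the `google-cloud-storage` package is installed and `GOOGLE_APPLICATION_CREDENTIALS` points at the key you just created.

```python
# check_gcp_auth.py -- hypothetical helper, not part of the original repo.
# Assumes `pip install google-cloud-storage` and that
# GOOGLE_APPLICATION_CREDENTIALS points to the JSON key created above.
from google.cloud import storage

def main() -> None:
    # The client picks up GOOGLE_APPLICATION_CREDENTIALS automatically.
    client = storage.Client()
    buckets = [bucket.name for bucket in client.list_buckets()]
    print(f"Authenticated against project: {client.project}")
    print(f"Buckets visible to this service account: {buckets}")

if __name__ == "__main__":
    main()
```

If this prints your project ID without raising an authentication error, the service account key is wired up correctly.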

Next for GCP: Add permissions to your Service Account!

2. You'll need IaC to build your infrastructure; in this project, Terraform is used.

Download Terraform!

Initializing Terraform

  • Create a new directory with `main.tf`, and initialize your config file. How to Start
    • OPTIONAL: Create a `variables.tf` file to store your variables
  • `terraform init`
  • `terraform plan`
  • `terraform apply` (an optional Python check of the created resources is sketched below)
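
After `terraform apply` finishes, you can optionally verify that the infrastructure it creates (a Cloud Storage bucket and a BigQuery dataset, as used later in the pipeline) actually exists. The sketch below is an assumption, not code from this repo; the resource names are placeholders you should replace with the values from your own `variables.tf`.

```python
# verify_infra.py -- hypothetical sanity check, not part of the original repo.
# Assumes `pip install google-cloud-storage google-cloud-bigquery`.
from google.api_core.exceptions import NotFound
from google.cloud import bigquery, storage

BUCKET_NAME = "your-data-lake-bucket"   # placeholder: match your Terraform config
DATASET_ID = "your_bigquery_dataset"    # placeholder: match your Terraform config

def main() -> None:
    gcs = storage.Client()
    bq = bigquery.Client()

    try:
        gcs.get_bucket(BUCKET_NAME)
        print(f"OK: bucket gs://{BUCKET_NAME} exists")
    except NotFound:
        print(f"Missing: bucket gs://{BUCKET_NAME} -- check your Terraform config")

    try:
        bq.get_dataset(DATASET_ID)
        print(f"OK: BigQuery dataset {DATASET_ID} exists")
    except NotFound:
        print(f"Missing: BigQuery dataset {DATASET_ID} -- check your Terraform config")

if __name__ == "__main__":
    main()
```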

3. Set up Docker, Dockerfile, and docker-compose to run Airflow

To set up Airflow, you can refer to this link.

Data pipelines

The dataset download, pre-processing, upload to Cloud Storage, and transfer to the data warehouse are done via these Airflow DAGs:

Local to GCS DAG

  • Runs once, since there is a single dataset; the schedule can be changed as needed.
  • Downloads the dataset file in CSV format via a bash script.
  • Next, the data is pre-processed with PySpark (renaming columns, fixing data types, etc.) and saved locally as a Parquet file.
  • This file is then uploaded to the project's Cloud Storage bucket (the data lake).
  • The last task triggers the gcs_to_bq_dag so that it runs right after the data has been loaded to Cloud Storage. (A sketch of the local_to_gcs_dag follows after this list.)
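
The sketch below shows roughly how such a DAG can be wired together. It is an illustration, not the exact code in this repository: the task names, local paths, bucket name, column handling, and the download URL placeholder are all assumptions.

```python
# local_to_gcs_dag.py -- illustrative sketch only; paths, names and the
# BUCKET value are assumptions, not the exact code in this repository.
import os
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator
from airflow.operators.trigger_dagrun import TriggerDagRunOperator

BUCKET = "your-data-lake-bucket"                         # placeholder
DATASET_URL = "<direct CSV link from the dataset page>"  # placeholder
CSV_PATH = "/opt/airflow/data/collisions.csv"
PARQUET_DIR = "/opt/airflow/data/collisions_parquet"

def preprocess_with_spark() -> None:
    """Rename columns, cast types, and write Parquet with PySpark."""
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("preprocess-collisions").getOrCreate()
    df = spark.read.option("header", True).csv(CSV_PATH)
    for col in df.columns:  # e.g. "CRASH DATE" -> "crash_date"
        df = df.withColumnRenamed(col, col.strip().lower().replace(" ", "_"))
    # Assumes the raw CSV has a "CRASH DATE" column in MM/dd/yyyy format.
    df = df.withColumn("crash_date", F.to_timestamp("crash_date", "MM/dd/yyyy"))
    df.write.mode("overwrite").parquet(PARQUET_DIR)
    spark.stop()

def upload_parquet_to_gcs() -> None:
    """Upload every part file Spark wrote to the data-lake bucket."""
    from google.cloud import storage

    client = storage.Client()
    bucket = client.bucket(BUCKET)
    for root, _, files in os.walk(PARQUET_DIR):
        for name in files:
            local_path = os.path.join(root, name)
            blob_name = f"raw/collisions/{os.path.relpath(local_path, PARQUET_DIR)}"
            bucket.blob(blob_name).upload_from_filename(local_path)

with DAG(
    dag_id="local_to_gcs_dag",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@once",  # single dataset, so the DAG runs once
    catchup=False,
) as dag:
    download = BashOperator(
        task_id="download_dataset",
        bash_command=f"curl -sSL '{DATASET_URL}' -o {CSV_PATH}",
    )
    preprocess = PythonOperator(
        task_id="preprocess_with_pyspark",
        python_callable=preprocess_with_spark,
    )
    upload = PythonOperator(
        task_id="upload_to_gcs",
        python_callable=upload_parquet_to_gcs,
    )
    trigger_next = TriggerDagRunOperator(
        task_id="trigger_gcs_to_bq_dag",
        trigger_dag_id="gcs_to_bq_dag",
    )
    download >> preprocess >> upload >> trigger_next
```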

GCS to BQ DAG

  • The DAG transfers the Parquet files from the project's Cloud Storage bucket into the BigQuery dataset created earlier with Terraform.
  • It then creates a partitioned and clustered table in that BigQuery dataset.
  • Lastly, local clean-up erases the data from the local system. (A sketch of the gcs_to_bq_dag follows after this list.)
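
Again as an illustration only (project, dataset, table, and column names such as `crash_date` and `borough` are assumptions, not the exact code in this repository), the gcs_to_bq_dag could look roughly like this:

```python
# gcs_to_bq_dag.py -- illustrative sketch only; project, dataset, table and
# column names are assumptions, not the exact code in this repository.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.providers.google.cloud.operators.bigquery import (
    BigQueryCreateExternalTableOperator,
    BigQueryInsertJobOperator,
)

PROJECT_ID = "your-gcp-project"     # placeholder
DATASET = "your_bigquery_dataset"   # placeholder
BUCKET = "your-data-lake-bucket"    # placeholder

# Assumes crash_date was cast to a TIMESTAMP during pre-processing.
CREATE_TABLE_SQL = f"""
CREATE OR REPLACE TABLE `{PROJECT_ID}.{DATASET}.collisions_partitioned`
PARTITION BY DATE(crash_date)
CLUSTER BY borough AS
SELECT * FROM `{PROJECT_ID}.{DATASET}.collisions_external`;
"""

with DAG(
    dag_id="gcs_to_bq_dag",
    start_date=datetime(2024, 1, 1),
    schedule_interval=None,  # triggered by local_to_gcs_dag
    catchup=False,
) as dag:
    create_external_table = BigQueryCreateExternalTableOperator(
        task_id="create_external_table",
        table_resource={
            "tableReference": {
                "projectId": PROJECT_ID,
                "datasetId": DATASET,
                "tableId": "collisions_external",
            },
            "externalDataConfiguration": {
                "sourceFormat": "PARQUET",
                "sourceUris": [f"gs://{BUCKET}/raw/collisions/*"],
            },
        },
    )
    create_partitioned_table = BigQueryInsertJobOperator(
        task_id="create_partitioned_clustered_table",
        configuration={"query": {"query": CREATE_TABLE_SQL, "useLegacySql": False}},
    )
    cleanup_local_files = BashOperator(
        task_id="cleanup_local_files",
        bash_command="rm -rf /opt/airflow/data/collisions*",
    )
    create_external_table >> create_partitioned_table >> cleanup_local_files
```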

4. Run the DAGs

  • Run the local_to_gcs_dag first and wait for it to complete.
  • The last task in the local_to_gcs_dag will trigger the gcs_to_bq_dag, which will then run shortly after.

5. dbt Cloud setup and environment

For setting up the dbt Cloud environment, you can refer to the guide here.

Some screenshots from the dbt Cloud setup for the project:

Production environment:

(screenshot)

Scheduled pipeline to fetch fresh data from the sources every day:

(screenshot)

Continuous Integration pipeline:

(screenshot)

dbt models lineage graph:

(lineage graph screenshot)

6. Create your dashboard

  • Go to Google Data Studio
  • Click Create > Data Source
  • Select BigQuery > Your Project ID > Dataset > Table
  • Click Connect in the top-right and your data should now be imported and ready to use for your dashboard!

Dashboard

A simple dashboard in Google Data Studio with a few charts:

  • Accident Heatmap by Borough.
  • Time-of-Day Accident Distribution.
  • Contributing Factors Pie Chart.
  • Top 10 Riskiest Neighborhoods (an example query for this chart is sketched below).
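
As an example of the kind of query behind these charts, the sketch below computes the top 10 riskiest neighborhoods (approximated here by ZIP code) from the partitioned table. The table and column names (`collisions_partitioned`, `zip_code`, `borough`) are assumptions; adjust them to your own schema.

```python
# top_riskiest_zip_codes.py -- hypothetical example query; table and column
# names are assumptions, adjust them to your own BigQuery schema.
from google.cloud import bigquery

PROJECT_ID = "your-gcp-project"     # placeholder
DATASET = "your_bigquery_dataset"   # placeholder

QUERY = f"""
SELECT zip_code, borough, COUNT(*) AS crash_count
FROM `{PROJECT_ID}.{DATASET}.collisions_partitioned`
WHERE zip_code IS NOT NULL
GROUP BY zip_code, borough
ORDER BY crash_count DESC
LIMIT 10
"""

def main() -> None:
    client = bigquery.Client(project=PROJECT_ID)
    for row in client.query(QUERY).result():
        print(f"{row.zip_code} ({row.borough}): {row.crash_count} crashes")

if __name__ == "__main__":
    main()
```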

Below is a screenshot of my dashboard.

(dashboard screenshot)

Thanks to DataTalksClub for the opportunity! If you have any questions, please feel free to open a PR or send me an email.