This repository contains the course project for the Data Engineering Zoomcamp (Cohort 2023) organized by the DataTalks.Club community. The project covers the main data engineering skills taught in the course:
- Workflow Orchestration: Data Lake, Prefect tool, ETL with GCP & Prefect
- Data Warehouse: BigQuery
- Analytics engineering: dbt (data build tool)
- Data Analysis: Looker Studio
The project uses a US car crash dataset covering 49 states. Crash data was collected from February 2016 to December 2021 using several APIs that provide streaming traffic incident (or event) data. These APIs transmit traffic data captured by a variety of entities, such as US and state departments of transportation, law enforcement agencies, traffic cameras, and traffic sensors on road networks. The dataset currently contains around 2.8 million crash records.
The dataset has 47 columns, but for this project I selected only the columns relevant to my analysis. The following columns will be used:
# | Attribute | Description |
---|---|---|
1 | ID | This is a unique identifier of the accident record. |
2 | Severity | Shows the severity of the accident, a number between 1 and 4. 1 indicates the least impact on traffic (i.e., short delay as a result of the accident) and 4 indicates a significant impact on traffic (i.e., long delay). |
3 | Start_Time | Shows the start time of the accident in local time zone. |
4 | End_Time | Shows the end time of the accident in local time zone. End time here refers to when the impact of accident on traffic flow was dismissed. |
5 | Description | Shows the natural language description of the accident. |
6 | Street | Shows the street name in address field. |
7 | City | Shows the city in address field. |
8 | State | Shows the state in address field. |
9 | Country | Shows the country in address field. |
10 | Weather_Condition | Shows the weather condition (rain, snow, thunderstorm, fog, etc.). |
11 | Sunrise_Sunset | Shows the period of day (i.e. day or night) based on sunrise/sunset. |
More information about this dataset: Author blog and Kaggle
- Moosavi, Sobhan, Mohammad Hossein Samavatian, Srinivasan Parthasarathy, and Rajiv Ramnath. "A Countrywide Traffic Accident Dataset." 2019.
- Moosavi, Sobhan, Mohammad Hossein Samavatian, Srinivasan Parthasarathy, Radu Teodorescu, and Rajiv Ramnath. "Accident Risk Prediction based on Heterogeneous Sparse Data: New Dataset and Insights." In proceedings of the 27th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, ACM, 2019.
- Google Cloud Platform (GCP):
- Google Cloud Storage (GCS): Data Lake
- BigQuery: Data Warehouse
- Terraform: Infrastructure as code (IaC)
- dbt: Data Transformation
- Pandas: Data Analysis & Exploration
- Prefect: Workflow Orchestration
- Looker Studio: Visualize Data
# | Attribute | Description |
---|---|---|
1 | accident_id | This is a unique identifier of the accident record. |
2 | severity_id | Shows the severity of the accident, a number between 1 and 4. 1 indicates the least impact on traffic (i.e., short delay as a result of the accident) and 4 indicates a significant impact on traffic (i.e., long delay). |
3 | start_date | Shows the start date of the accident in the local time zone. |
4 | end_date | Shows the end date of the accident in the local time zone. |
5 | start_time | Shows the start time of the accident in local time zone. |
6 | end_time | Shows the end time of the accident in local time zone. End time here refers to when the impact of accident on traffic flow was dismissed. |
7 | description | Shows the natural language description of the accident. |
8 | street | Shows the street name in address field. |
9 | city | Shows the city in address field. |
10 | state | Shows the state in address field. |
11 | country | Shows the country in address field. |
12 | weather_condition | Shows the weather condition (rain, snow, thunderstorm, fog, etc.). |
13 | sunrise_sunset | Shows the period of day (i.e. day or night) based on sunrise/sunset. |
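Comparing the two tables above, the original Start_Time/End_Time fields are split into separate date and time columns in the transformed schema. A minimal sketch of that split (the timestamp format is an assumption about the raw data):

```python
from datetime import datetime

def split_timestamp(raw: str) -> tuple[str, str]:
    """Split a raw timestamp like '2016-02-08 05:46:00' into
    (date, time) strings, matching the start_date/start_time and
    end_date/end_time columns of the transformed schema."""
    ts = datetime.strptime(raw, "%Y-%m-%d %H:%M:%S")
    return ts.date().isoformat(), ts.time().isoformat()

start_date, start_time = split_timestamp("2016-02-08 05:46:00")
print(start_date, start_time)  # 2016-02-08 05:46:00
```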
- Partitioning by the start_date column, specifically by year, to obtain annual granularity
- Clustering by the country column to group records that have the same country value
The benefits of this approach are described in the BigQuery documentation: Combining clustered and partitioned tables
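The partitioning and clustering choices above can be expressed in BigQuery DDL. The sketch below only assembles that statement as a string (the dataset and table names are assumptions, and the schema is reduced to the two columns that matter here):

```python
def partitioned_table_ddl(dataset: str, table: str) -> str:
    """Build a BigQuery DDL statement for a table partitioned by
    year of start_date and clustered by country (sketch only;
    dataset/table names and the trimmed schema are assumptions)."""
    return (
        f"CREATE TABLE `{dataset}.{table}` "
        "(start_date DATE, country STRING) "
        "PARTITION BY DATE_TRUNC(start_date, YEAR) "
        "CLUSTER BY country"
    )

ddl = partitioned_table_ddl("us_traffic_accidents_data", "accidents")
print(ddl)
```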
- Which state/city/street in the US reported the most accident cases between 2016 and 2021?
- What were the weather conditions in most of the accident cases in the US?
- Did most accidents occur at night or during the day?
More detailed analysis of the results obtained: Data Analysis
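The first question above boils down to a group-and-count. A dependency-free sketch over a few made-up rows (the sample state values are hypothetical; the real analysis runs in BigQuery and Looker Studio):

```python
from collections import Counter

# Hypothetical sample rows; state values are made up for illustration.
accidents = [
    {"state": "CA"}, {"state": "CA"}, {"state": "FL"},
    {"state": "TX"}, {"state": "CA"}, {"state": "FL"},
]

# Count accident records per state and take the most frequent one.
by_state = Counter(row["state"] for row in accidents)
top_state, top_count = by_state.most_common(1)[0]
print(top_state, top_count)  # CA 3
```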
- Clone the repo onto your local machine:
git clone git@github.com:tmaferreira/DataEngineeringZoomCampProject.git
- Install all required dependencies into your environment
pip3 install -r requirements.txt
- Create a Google Cloud Platform (GCP) free account with your Google e-mail
- Create a new GCP project with the name dezoomcamp-finalproject (Note: save the assigned Project ID; project IDs must be globally unique, so a different ID may be assigned)
- Create a Service Account:
- Go to IAM & Admin > Service accounts > Create service account
- Provide a service account name and grant the roles: Viewer + BigQuery Admin + Storage Admin + Storage Object Admin
- Download the Service Account json file
- Download the Google Cloud SDK for local setup
- Set environment variable to point to your downloaded GCP keys:
export GOOGLE_APPLICATION_CREDENTIALS="<path/to/your/service-account-authkeys>.json"
# Refresh token/session and verify authentication
gcloud auth application-default login
- Enable the following APIs:
- Install Terraform
- Copy files (main.tf and variables.tf) for the infrastructure creation (Use files created in Zoomcamp course: Terraform files)
- In the file variables.tf change variable BQ_DATASET to: us_traffic_accidents_data
- Execute the following commands to plan and create the GCP infrastructure:
# Initialize state file (.tfstate)
terraform init
# Check changes to new infra plan
# -var="project=<your-gcp-project-id>"
terraform plan -var="project=dezoomcamp-finalproject"
# Create new infra
# -var="project=<your-gcp-project-id>"
terraform apply -var="project=dezoomcamp-finalproject"
You can verify in the GCP console that the infrastructure was created correctly.
- Create a Kaggle free account
- Create an API token:
- Click on your avatar
- Go to Account menu
- Click on the option "Create New API Token"
- Download the json file for local setup
- In your local setup, copy the file into the path ~/.kaggle/
- For your security, ensure that other users of your computer do not have read access to your credentials:
chmod 600 ~/.kaggle/kaggle.json
To see all available API options and commands:
kaggle --help
- Set up the Prefect server so that you can access the UI. Run the following command in a terminal:
prefect orion start
- Access the UI in your browser: http://127.0.0.1:4200/
- To connect to GCS buckets, it is necessary to create a block:
- In the side menu, click on the option Blocks
- Click on the '+' button and select the GCS Bucket option
- In the Gcp Credentials field, click on the Add button
- Using the service account json file downloaded in step 2, copy its content and paste it into the Service Account Info field
- Click on the Create button and you will be redirected to the previous GCS Bucket block creation page
- In the Gcp Credentials field, select the Gcp credential created previously
- Click on the Create button to create the block
- To execute the flow, run the following command in a different terminal from the one used in step 1:
python prefect/flows/api_to_gcs_to_bq.py
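The flow chains three stages: extract from the Kaggle API, land the data in the GCS data lake, then load it into BigQuery. A dependency-free sketch of that shape (function names and the `gs://<bucket>` path are placeholders, not the repo's actual code; in the real api_to_gcs_to_bq.py each stage would be a Prefect @task chained inside a @flow):

```python
# Sketch of the flow's structure only; the real stages call the
# Kaggle API, GCS, and BigQuery via the Prefect GCS Bucket block.

def extract_from_api() -> list[dict]:
    """Stand-in for downloading the Kaggle dataset."""
    return [{"ID": "A-1", "Severity": 2}]

def load_to_gcs(rows: list[dict]) -> str:
    """Stand-in for writing the data to the GCS bucket (data lake)."""
    return f"gs://<bucket>/us_accidents/{len(rows)}_rows.parquet"

def load_to_bq(gcs_path: str) -> str:
    """Stand-in for loading the GCS file into BigQuery."""
    return f"loaded {gcs_path} into BigQuery"

def api_to_gcs_to_bq() -> str:
    """Chain the three stages, mirroring the flow's name and order."""
    rows = extract_from_api()
    path = load_to_gcs(rows)
    return load_to_bq(path)

result = api_to_gcs_to_bq()
print(result)
```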
- Create a dbt cloud free account
- Clone this repo
- In the dbt command line, run the following command:
dbt run
Generated dbt lineage:
Check Data in BigQuery:
- The data will be available at dezoomcamp-finalproject.dbt_us_traffic_accidents
- The production version will be available at dezoomcamp-finalproject.production.dim_us_traffic_accidents (dimension table) and dezoomcamp-finalproject.production.stg_us_traffic_accidents (staging table)
- Add unit tests
- Add CI/CD pipeline
- Containerize the project
- Perform deeper data analysis