This repository contains the course project for the Data Engineering Zoomcamp (Cohort 2023) organized by the DataTalks.Club community. The project covers the main data engineering skills taught in the course:
- Workflow Orchestration: Data Lake, Prefect tool, ETL with GCP & Prefect
- Data Warehouse: BigQuery
- Analytics engineering: dbt (data build tool)
- Data Analysis: Looker Studio
The project uses a US car crash dataset covering 49 states. Crash data was collected from February 2016 to December 2021 using several APIs that provide streaming traffic incident (or event) data. These APIs transmit traffic data captured by a variety of entities, such as US and state departments of transportation, law enforcement agencies, traffic cameras, and traffic sensors on road networks. The dataset currently contains around 2.8 million crash records.
The dataset has 47 columns, but for this project I selected only the columns relevant to my analysis. The following columns will be used:
# | Attribute | Description |
---|---|---|
1 | ID | This is a unique identifier of the accident record. |
2 | Severity | Shows the severity of the accident, a number between 1 and 4. 1 indicates the least impact on traffic (i.e., short delay as a result of the accident) and 4 indicates a significant impact on traffic (i.e., long delay). |
3 | Start_Time | Shows the start time of the accident in local time zone. |
4 | End_Time | Shows the end time of the accident in local time zone. End time here refers to when the impact of accident on traffic flow was dismissed. |
5 | Description | Shows the natural language description of the accident. |
6 | Street | Shows the street name in address field. |
7 | City | Shows the city in address field. |
8 | State | Shows the state in address field. |
9 | Country | Shows the country in address field. |
10 | Weather_Condition | Shows the weather condition (rain, snow, thunderstorm, fog, etc.). |
11 | Sunrise_Sunset | Shows the period of day (i.e. day or night) based on sunrise/sunset. |
More information about this dataset: Author blog and Kaggle
- Moosavi, Sobhan, Mohammad Hossein Samavatian, Srinivasan Parthasarathy, and Rajiv Ramnath. "A Countrywide Traffic Accident Dataset." 2019.
- Moosavi, Sobhan, Mohammad Hossein Samavatian, Srinivasan Parthasarathy, Radu Teodorescu, and Rajiv Ramnath. "Accident Risk Prediction based on Heterogeneous Sparse Data: New Dataset and Insights." In proceedings of the 27th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, ACM, 2019.
- Google Cloud Platform (GCP):
- Google Cloud Storage (GCS): Data Lake
- BigQuery: Data Warehouse
- Terraform: Infrastructure as code (IaC)
- dbt: Data Transformation
- Pandas: Data Analysis & Exploration
- Prefect: Workflow Orchestration
- Looker Studio: Visualize Data
# | Attribute | Description |
---|---|---|
1 | accident_id | This is a unique identifier of the accident record. |
2 | severity_id | Shows the severity of the accident, a number between 1 and 4. 1 indicates the least impact on traffic (i.e., short delay as a result of the accident) and 4 indicates a significant impact on traffic (i.e., long delay). |
3 | start_date | Shows the start date of the accident in the local time zone. |
4 | end_date | Shows the end date of the accident in the local time zone. |
5 | start_time | Shows the start time of the accident in local time zone. |
6 | end_time | Shows the end time of the accident in local time zone. End time here refers to when the impact of accident on traffic flow was dismissed. |
7 | description | Shows the natural language description of the accident. |
8 | street | Shows the street name in address field. |
9 | city | Shows the city in address field. |
10 | state | Shows the state in address field. |
11 | country | Shows the country in address field. |
12 | weather_condition | Shows the weather condition (rain, snow, thunderstorm, fog, etc.). |
13 | sunrise_sunset | Shows the period of day (i.e. day or night) based on sunrise/sunset. |
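Comparing the two tables above, the original Start_Time/End_Time fields are split into separate date and time columns in the transformed schema. A minimal sketch of that split (the timestamp format is an assumption about the raw data):

```python
from datetime import datetime

def split_timestamp(raw: str) -> tuple[str, str]:
    """Split a raw timestamp like '2016-02-08 05:46:00' into
    (date, time) strings, matching the start_date/start_time and
    end_date/end_time columns of the transformed schema."""
    ts = datetime.strptime(raw, "%Y-%m-%d %H:%M:%S")
    return ts.date().isoformat(), ts.time().isoformat()

start_date, start_time = split_timestamp("2016-02-08 05:46:00")
print(start_date, start_time)  # 2016-02-08 05:46:00
```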
- Partitioning by the start_date column, specifically by year, to obtain annual granularity
- Clustering by the country column to group records that have the same country value
The benefits of this approach are described in the BigQuery documentation: Combining clustered and partitioned tables
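The partitioning and clustering choices above can be expressed in BigQuery DDL. The sketch below only assembles that statement as a string (the dataset and table names are assumptions, and the schema is reduced to the two columns that matter here):

```python
def partitioned_table_ddl(dataset: str, table: str) -> str:
    """Build a BigQuery DDL statement for a table partitioned by
    year of start_date and clustered by country (sketch only;
    dataset/table names and the trimmed schema are assumptions)."""
    return (
        f"CREATE TABLE `{dataset}.{table}` "
        "(start_date DATE, country STRING) "
        "PARTITION BY DATE_TRUNC(start_date, YEAR) "
        "CLUSTER BY country"
    )

ddl = partitioned_table_ddl("us_traffic_accidents_data", "accidents")
print(ddl)
```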
- Which state/city/street in the US reported the most accident cases between 2016 and 2021?
- What were the weather conditions in most of the accident cases in the US?
- Did most accidents occur at night or during the day?
More detailed analysis of the results obtained: Data Analysis
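The first question above boils down to a group-and-count. A dependency-free sketch over a few made-up rows (the sample state values are hypothetical; the real analysis runs in BigQuery and Looker Studio):

```python
from collections import Counter

# Hypothetical sample rows; state values are made up for illustration.
accidents = [
    {"state": "CA"}, {"state": "CA"}, {"state": "FL"},
    {"state": "TX"}, {"state": "CA"}, {"state": "FL"},
]

# Count accident records per state and take the most frequent one.
by_state = Counter(row["state"] for row in accidents)
top_state, top_count = by_state.most_common(1)[0]
print(top_state, top_count)  # CA 3
```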
- Clone the repo onto your local machine:
git clone git@github.com:tmaferreira/DataEngineeringZoomCampProject.git
- Install all required dependencies into your environment
pip3 install -r requirements.txt
- Create a Google Cloud Platform (GCP) free account with your Google e-mail
- Create a new GCP project with the name dezoomcamp-finalproject (Note: save the assigned Project ID; project IDs must be globally unique, so a different ID may be assigned)
- Create a Service Account:
- Go to IAM & Admin > Service accounts > Create service account
- Provide a service account name and grant the roles: Viewer + BigQuery Admin + Storage Admin + Storage Object Admin
- Download the Service Account json file
- Download the Google Cloud SDK for local setup
- Set environment variable to point to your downloaded GCP keys:
export GOOGLE_APPLICATION_CREDENTIALS="<path/to/your/service-account-authkeys>.json"
# Refresh token/session and verify authentication
gcloud auth application-default login
- Enable the following APIs:
- Install Terraform
- Copy files (main.tf and variables.tf) for the infrastructure creation (Use files created in Zoomcamp course: Terraform files)
- In the file variables.tf change variable BQ_DATASET to: us_traffic_accidents_data
- Execute the following commands to plan and create the GCP infrastructure:
# Initialize state file (.tfstate)
terraform init
# Check changes to new infra plan
# -var="project=<your-gcp-project-id>"
terraform plan -var="project=dezoomcamp-finalproject"
# Create new infra
# -var="project=<your-gcp-project-id>"
terraform apply -var="project=dezoomcamp-finalproject"
You can verify in the GCP console that the infrastructure was created correctly.
- Create a Kaggle free account
- Create an API token:
- Click on your avatar
- Go to Account menu
- Click on the option "Create New API Token"
- Download the json file for local setup
- In your local setup, copy the file into the path ~/.kaggle/
- For your security, ensure that other users of your computer do not have read access to your credentials:
chmod 600 ~/.kaggle/kaggle.json
To see all available API options and commands:
kaggle --help
- Set up the Prefect server so that you can access the UI. Run the following command in a terminal:
prefect orion start
- Access the UI in your browser: http://127.0.0.1:4200/
- To connect to GCS buckets, it is necessary to create a block:
- In the side menu, click on the option Blocks
- Click on the '+' button and select the GCS Bucket option
- In the Gcp Credentials field, click on the Add button
- Using the service account json file downloaded in step 2, copy its content and paste it into the Service Account Info field
- Click on the Create button and you will be redirected to the previous GCS Bucket block creation page
- In the Gcp Credentials field, select the Gcp credential created previously
- Click on the Create button to create the block
- To execute the flow, run the following command in a different terminal from the one used in step 1:
python prefect/flows/api_to_gcs_to_bq.py
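The flow chains three stages: extract from the Kaggle API, land the data in the GCS data lake, then load it into BigQuery. A dependency-free sketch of that shape (function names and the `gs://<bucket>` path are placeholders, not the repo's actual code; in the real api_to_gcs_to_bq.py each stage would be a Prefect @task chained inside a @flow):

```python
# Sketch of the flow's structure only; the real stages call the
# Kaggle API, GCS, and BigQuery via the Prefect GCS Bucket block.

def extract_from_api() -> list[dict]:
    """Stand-in for downloading the Kaggle dataset."""
    return [{"ID": "A-1", "Severity": 2}]

def load_to_gcs(rows: list[dict]) -> str:
    """Stand-in for writing the data to the GCS bucket (data lake)."""
    return f"gs://<bucket>/us_accidents/{len(rows)}_rows.parquet"

def load_to_bq(gcs_path: str) -> str:
    """Stand-in for loading the GCS file into BigQuery."""
    return f"loaded {gcs_path} into BigQuery"

def api_to_gcs_to_bq() -> str:
    """Chain the three stages, mirroring the flow's name and order."""
    rows = extract_from_api()
    path = load_to_gcs(rows)
    return load_to_bq(path)

result = api_to_gcs_to_bq()
print(result)
```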
- Create a dbt cloud free account
- Clone this repo
- In the dbt command line, run the following command:
dbt run
Generated dbt lineage:
Check Data in BigQuery:
- The data will be available at dezoomcamp-finalproject.dbt_us_traffic_accidents
- The production version will be available at dezoomcamp-finalproject.production.dim_us_traffic_accidents (dimension table) and dezoomcamp-finalproject.production.stg_us_traffic_accidents (staging table)
- Add unit tests
- Add CI/CD pipeline
- Containerize the project
- Perform deeper data analysis