This data engineering project is part of the Data Engineering Zoomcamp and focuses on analyzing Uber taxi data. The objective is to apply the knowledge gained during the course to build a comprehensive data pipeline.
The goal of this project is to perform data analytics on Uber data using various tools and technologies, including GCP Storage, Python, Compute Instance, Mage Data Pipeline Tool, BigQuery, and Looker Studio.
Questions to answer:
- Find the top ten pick up locations, based on the number of trips
- Find the total number of trips by passenger count
- Find the average fair amount by hour of the day
TLC Trip Record Data Yellow and green taxi trip records include fields capturing pick-up and drop-off dates/times, pick-up and drop-off locations, trip distances, itemized fares, rate types, payment types, and driver-reported passenger counts.
More info about dataset can be found here:
- Website - https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page
- Data Dictionary - https://www.nyc.gov/assets/tlc/downloads/pdf/data_dictionary_trip_records_yellow.pdf
The PySpark transformation script includes the necessary PySpark functions and logic to perform the required transformations. However, it assumes that the input DataFrame df
contains the required columns (index
, tpep_pickup_datetime
, tpep_dropoff_datetime
, passenger_count
, trip_distance
, RatecodeID
, pickup_longitude
, pickup_latitude
, dropoff_longitude
, dropoff_latitude
).
- Programming Language - Python
Google Cloud Platform
- Terraform Infrastructure
- Google Storage
- Compute Instance
- Mage
- PySpark
- BigQuery
- Looker Studio
Modern Data Pipeine Tool - https://www.mage.ai/
Contibute to this open source project - https://github.com/mage-ai/mage-ai
The project architecture consists of a set of interconnected components designed to handle various aspects of data ingestion, processing, storage, and analysis. The key components are as follows:
- Data Ingestion: Mage is used to get the data, ensuring that the data is collected.
- Data Storage: The data is stored in Google Cloud Storage (GCS) as csv file. GCS provides a scalable and reliable storage solution for the project.
- Data Processing: Mage is utilized to process the data using python and Apache Spark. Enabling data processing tasks, transforming and aggregating the raw data into the uber_dataset tables.
- Data Warehouse: Google BigQuery serves as the data warehouse for the project, storing the processed and aggregated data in the uber_dataset tables. BigQuery allows for efficient querying and analysis of the data, making it an ideal choice for this project.
- Data Visualization: Looker is used for data visualization, enabling users to explore the data and uncover insights about uber trips and patterns. By leveraging Looker's powerful visualization capabilities, stakeholders can easily understand the results of the data analysis and make informed decisions.
- Infrastructure as Code (IaC): Terraform is used to manage the infrastructure components of the project. This IaC approach ensures that the infrastructure is easily reproducible and can be managed in a version-controlled manner.
The project architecture is designed to provide a scalable, reliable, and efficient solution for ingesting, processing, storing, and analyzing Uber Trip data, ultimately enabling stakeholders to make data-driven decisions and improve the system's performance.
-
Clone this repository
-
Refresh service-account's auth-token for this session
gcloud auth application-default login
-
Terraform Change to terraform directory
cd terraform_infra
Initialize terrform
terraform init
Preview the changes to be applied:
terraform plan
It will ask for your GCP Project ID and a name for your DataProc ClusterApply the changes:
terraform apply
-
Install the required Python packages:
open linux.md file and follow the instructions
open pyspark.md file and follow the instructions
open the setup.txt file and run the commands
-
Add Google Cloud Service Account Key to Mage io_config.yaml file
GOOGLE_SERVICE_ACC_KEY_FILEPATH: "/path/to/your/service/account/key.json"
- Mage Docs: a good place to understand Mage functionality or concepts.
- Mage Slack: a good place to ask questions or get help from the Mage team.
- DTC Zoomcamp: a good place to get help from the community on course-specific inquireies.
- Mage GitHub: a good place to open issues or feature requests.