/Udacity-Data-Engineering

My solutions to the projects in the Udacity Data Engineering Nanodegree.


Data Engineering Nanodegree Program

This nanodegree program teaches data model architecture, data lakes and warehouses, data pipeline automation, and working with massive datasets.

SQL and Python programming skills are used to build the project solutions.

Nanodegree program details: Udacity's Data Engineering Nanodegree.


Project 1 - Data Modeling with Postgres

The purpose of the project is to understand what songs users are listening to and to analyze the data based on songs and user activity in the app.

Project 2 - Data Modeling with Apache Cassandra

In this project, an ETL pipeline reads the provided CSV data and loads it into Apache Cassandra. The task was completed in the steps below.

  • Merge all provided csv files into one file
  • Design the queries to be implemented
  • Create the tables based on the queries
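The first step above can be sketched with the standard library alone. This is a minimal sketch, not the code from the repository: the function name `merge_csv_files` is hypothetical, and it assumes every input file shares the same header row.

```python
import csv
from pathlib import Path


def merge_csv_files(input_dir, output_file):
    """Concatenate every CSV in input_dir into output_file, keeping one header.

    Assumption for this sketch: all input files share the same header row.
    Returns the number of data rows written.
    """
    rows_written = 0
    header_written = False
    output_path = Path(output_file).resolve()
    with open(output_file, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        for path in sorted(Path(input_dir).glob("*.csv")):
            if path.resolve() == output_path:
                continue  # never read the file we are writing
            with open(path, newline="", encoding="utf-8") as src:
                reader = csv.reader(src)
                header = next(reader, None)
                if header is None:
                    continue  # skip empty files
                if not header_written:
                    writer.writerow(header)
                    header_written = True
                for row in reader:
                    if any(cell.strip() for cell in row):  # drop blank rows
                        writer.writerow(row)
                        rows_written += 1
    return rows_written
```

The merged file can then be read once to populate each Cassandra table, with one table per query, since Cassandra tables are modeled around the queries they serve.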

Project 3 - Cloud Data Warehouses

Sparkify, a music streaming startup, would like to move its processes to cloud services. The data resides in S3 in JSON format and is transferred into Amazon Redshift using an ETL pipeline. This helps the analytics team collaborate better and continue finding insights into user activity.
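Loading JSON from S3 into Redshift is typically done with the Redshift `COPY` command. The helper below is a minimal sketch that only builds the SQL text. The function name, table name, bucket path, and IAM role are all hypothetical placeholders, and the default region is an assumption.

```python
def build_copy_statement(table, s3_path, iam_role_arn,
                         json_option="auto", region="us-west-2"):
    """Build a Redshift COPY statement that loads JSON data from S3.

    All arguments are interpolated directly into the SQL text, so they
    should come from trusted configuration only, never from user input.
    """
    return (
        f"COPY {table}\n"
        f"FROM '{s3_path}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        f"FORMAT AS JSON '{json_option}'\n"
        f"REGION '{region}';"
    )


if __name__ == "__main__":
    # Hypothetical values for illustration only.
    print(build_copy_statement(
        table="staging_events",
        s3_path="s3://example-bucket/log_data",
        iam_role_arn="arn:aws:iam::123456789012:role/exampleRedshiftRole",
    ))
```

The resulting string would be executed against the cluster through a database connection; `json_option` can also be the S3 path of a JSONPaths file when the JSON keys do not match the column names.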

Project 4 - Data Lakes with Spark

In this project, data is extracted from an AWS S3 bucket and processed into fact and dimension tables. The final output is loaded back into S3. The process runs in a Spark session.
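To illustrate what splitting raw records into fact and dimension tables means, here is a small sketch in plain Python rather than Spark. The record keys (`user_id`, `first_name`, `song`, `ts`) and the table names are assumptions made for the example, not the project's actual schema.

```python
def split_into_star_schema(events):
    """Split flat listening events into a dimension table and a fact table.

    `events` is a list of dicts with the hypothetical keys user_id,
    first_name, song, and ts. Returns (users, songplays).
    """
    users = {}      # dimension: one row per user_id (last occurrence wins)
    songplays = []  # fact: one row per listening event
    for event in events:
        users[event["user_id"]] = {
            "user_id": event["user_id"],
            "first_name": event["first_name"],
        }
        songplays.append({
            "songplay_id": len(songplays) + 1,
            "user_id": event["user_id"],
            "song": event["song"],
            "start_time": event["ts"],
        })
    return list(users.values()), songplays
```

In Spark the same idea is usually expressed with DataFrame column selection and de-duplication for the dimension tables, with the results written back to S3.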

Project 5 - Data Pipelines with Airflow

In this project, an ETL pipeline is built in the cloud on AWS Redshift and populated via Apache Airflow. A star schema is used for the dimensional model. The data consists of listening event logs from the music app Sparkify and data about songs, artists, and users. The process looks like the following:

  1. Stage the logs from S3 into staging tables in Redshift using a custom Airflow operator
  2. Move data from the staging tables into the star schema tables using PostgresOperator
  3. Check data quality on the tables using a custom Airflow operator
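The data quality step can be sketched as a simple row-count check. This is not the repository's Airflow operator: the function name `check_tables_not_empty` is hypothetical, and the demo uses an in-memory SQLite database as a stand-in for Redshift so that it runs anywhere.

```python
import sqlite3


def check_tables_not_empty(connection, tables):
    """Return the names of tables that contain no rows.

    `connection` is any DB-API connection. Table names are interpolated
    into the SQL text, so they must come from trusted configuration.
    """
    failures = []
    cursor = connection.cursor()
    for table in tables:
        cursor.execute(f"SELECT COUNT(*) FROM {table}")
        row_count = cursor.fetchone()[0]
        if row_count < 1:
            failures.append(table)
    return failures


if __name__ == "__main__":
    # SQLite stands in for Redshift here; table names are illustrative.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (user_id INTEGER)")
    conn.execute("CREATE TABLE songs (song_id INTEGER)")
    conn.execute("INSERT INTO users VALUES (1)")
    print(check_tables_not_empty(conn, ["users", "songs"]))  # ['songs']
```

Inside a custom operator, the same loop would typically run in the operator's `execute` method and raise an error when the failure list is not empty, so that the task is marked as failed.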

Project 6 - Capstone Project

This project aims to analyze immigration events using I94 immigration data and city temperature data. Joining these two datasets provides a broader basis for the analysis.
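The join between the two datasets can be illustrated with a small inner join in plain Python. The field names (`arrival_city`, `city`, `avg_temp_c`) are assumptions for the example and are not taken from the actual I94 or temperature files.

```python
def join_on_city(immigration_rows, temperature_rows):
    """Inner-join immigration rows with temperature rows on city name.

    City names are compared case-insensitively after trimming whitespace.
    Immigration rows without a matching city are dropped.
    """
    temp_by_city = {
        row["city"].strip().lower(): row["avg_temp_c"]
        for row in temperature_rows
    }
    joined = []
    for row in immigration_rows:
        key = row["arrival_city"].strip().lower()
        if key in temp_by_city:
            joined.append({**row, "avg_temp_c": temp_by_city[key]})
    return joined
```

At the scale of the real datasets, a join like this would more likely be done with a DataFrame library or a SQL engine; the sketch only shows the matching logic.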
