/Data-Engineering-Projects

Personal Data Engineering Projects

Primary Language: Jupyter Notebook

Description


  • This repo contains projects that apply data engineering principles.
  • Notes taken during the course can be found in the folder "0. Back to Basics".

Projects


  1. Postgres ETL ✔️
  • This project models data for Sparkify, a fictitious music startup. It applies a star schema when ingesting data, so that queries answering the product owner's business questions stay simple.
  2. Cassandra ETL ✔️
  • Moving into big data, Cassandra ingests large amounts of data in a NoSQL context. This project takes a query-centric approach: the tables in Cassandra are designed around the business questions to be answered about a music app.
  3. Web Scraping using Scrapy, MongoDB ETL ✔️
  • Semi-structured data can be stored as documents. MongoDB supports this by grouping related documents into a collection. Each document contains fields of data which can be queried.
  • In this project, data is scraped from a book listing website using Scrapy. The fields of each book, such as its price, rating, and availability, are stored in a document in the books collection in MongoDB.
  4. Data Warehousing with AWS Redshift ✔️
  • This project creates a data warehouse in AWS Redshift. A data warehouse provides a reliable and consistent foundation for users to query and answer business questions.
  5. Data Lake with Spark & AWS S3 ✔️
  • This project creates a data lake in AWS S3 using Spark.
  • Why create a data lake? A data lake provides a reliable store for large amounts of data, whether unstructured, semi-structured, or structured. In this project, we ingest JSON files, denormalize them into fact and dimension tables, and upload them to an AWS S3 data lake as Parquet files.
  6. Data Pipelining with Airflow ✔️
  • This project schedules data pipelines that perform ETL from JSON files in S3 to Redshift using Airflow.
  • Why use Airflow? Airflow allows workflows to be defined as code, which makes them more maintainable, versionable, testable, and collaborative.
  7. Capstone Project ✔️
  • This project is the finale to Udacity's data engineering nanodegree. Udacity provides a default dataset; however, I chose to embark on my own project.
  • My project builds a movies data warehouse, which can be used to build a movie recommendation system and to predict box-office earnings. View the project here: Movies Data Warehouse
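
Several of the projects above organise data into fact and dimension tables (a star schema). As a rough illustration of the idea, here is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in for Postgres or Redshift. The table names, columns, and sample rows are hypothetical and are not taken from the projects themselves.

```python
import sqlite3


def build_star_schema(conn):
    """Create one fact table and two dimension tables (hypothetical names)."""
    conn.executescript(
        """
        CREATE TABLE users (
            user_id INTEGER PRIMARY KEY,
            name    TEXT,
            level   TEXT
        );
        CREATE TABLE songs (
            song_id TEXT PRIMARY KEY,
            title   TEXT,
            artist  TEXT
        );
        -- Fact table: one row per song play, pointing at the dimensions.
        CREATE TABLE songplays (
            songplay_id INTEGER PRIMARY KEY,
            user_id     INTEGER REFERENCES users(user_id),
            song_id     TEXT REFERENCES songs(song_id),
            start_time  TEXT
        );
        """
    )


def load_sample(conn):
    """Insert a few made-up rows so the query below has something to count."""
    conn.executemany(
        "INSERT INTO users VALUES (?, ?, ?)",
        [(1, "Ada", "free"), (2, "Lin", "paid")],
    )
    conn.executemany(
        "INSERT INTO songs VALUES (?, ?, ?)",
        [("S1", "Blue Sky", "The Examples"), ("S2", "Night Drive", "The Examples")],
    )
    conn.executemany(
        "INSERT INTO songplays VALUES (?, ?, ?, ?)",
        [
            (1, 1, "S1", "2024-01-01T10:00:00"),
            (2, 2, "S1", "2024-01-01T10:05:00"),
            (3, 2, "S2", "2024-01-01T10:09:00"),
        ],
    )


def plays_per_song(conn):
    """Answer a business question with a single join from fact to dimension."""
    return conn.execute(
        """
        SELECT s.title, COUNT(*) AS plays
        FROM songplays AS sp
        JOIN songs AS s ON s.song_id = sp.song_id
        GROUP BY s.title
        ORDER BY plays DESC, s.title
        """
    ).fetchall()


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    build_star_schema(connection)
    load_sample(connection)
    for title, plays in plays_per_song(connection):
        print(title, plays)
```

The point of the layout is that a question such as "which songs are played most?" needs only one join from the fact table to a dimension table. The actual projects may use different tables and a different database driver.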