Data Engineering Nanodegree

This repository contains all the projects developed during Udacity's Data Engineering Nanodegree.

List of projects

The Nanodegree consists of five projects, one at the end of each course, plus a final capstone project:

P1.A: Data Modeling with Postgres

This project comprises the scripts required to build a database (using a star schema) and the accompanying ETL processes for a fictitious startup called Sparkify. This company had been collecting data on user activity from its music streaming application and storing it as JSON files. However, this rudimentary way of storing data made it difficult to query the data and extract insights from it.

During this project, a Postgres database and the relevant tables are set up to allow the Sparkify analytics team to access, aggregate, and generate insights from their users’ data.
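
To give a flavour of the star schema, here is a minimal sketch using psycopg2. The table and column names follow the usual Sparkify layout (a songplays fact table plus dimension tables) but are assumptions, not the project's exact DDL; see the project itself for the full schema.

```python
import psycopg2

# Connection parameters are placeholders; adjust for your local setup.
conn = psycopg2.connect("host=127.0.0.1 dbname=sparkifydb user=student password=student")
cur = conn.cursor()

# Fact table: one row per song-play event (assumed columns).
cur.execute("""
    CREATE TABLE IF NOT EXISTS songplays (
        songplay_id SERIAL PRIMARY KEY,
        start_time  TIMESTAMP NOT NULL,
        user_id     INT NOT NULL,
        song_id     VARCHAR,
        artist_id   VARCHAR,
        level       VARCHAR,
        session_id  INT,
        location    VARCHAR,
        user_agent  VARCHAR
    );
""")

# One of the dimension tables referenced by the fact table.
cur.execute("""
    CREATE TABLE IF NOT EXISTS users (
        user_id    INT PRIMARY KEY,
        first_name VARCHAR,
        last_name  VARCHAR,
        gender     VARCHAR,
        level      VARCHAR
    );
""")

conn.commit()
conn.close()
```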

Check the project here.

P1.B: Data Modeling with Apache Cassandra

This project consists of a notebook that processes data and builds a NoSQL database using Apache Cassandra for the fictitious startup Sparkify. In this case study, the company had been collecting data on user activity from its music streaming application and storing it as CSV files, but could not query the data or generate insights from it.
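
In Cassandra, tables are modeled around the queries they serve rather than around a normalized schema. Below is a minimal sketch using the DataStax cassandra-driver; the keyspace, table, and column names are illustrative assumptions, not the notebook's exact definitions.

```python
from cassandra.cluster import Cluster

# Connect to a local Cassandra instance (address is a placeholder).
cluster = Cluster(["127.0.0.1"])
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS sparkify
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")
session.set_keyspace("sparkify")

# The table answers one query: "which song was played during a given
# session and item-in-session?" -- hence the composite primary key.
session.execute("""
    CREATE TABLE IF NOT EXISTS song_by_session (
        session_id INT,
        item_in_session INT,
        artist TEXT,
        song TEXT,
        length FLOAT,
        PRIMARY KEY (session_id, item_in_session)
    )
""")

# Insert a sample row (values are made up for illustration).
session.execute(
    "INSERT INTO song_by_session (session_id, item_in_session, artist, song, length) "
    "VALUES (%s, %s, %s, %s, %s)",
    (338, 4, "Faithless", "Music Matters", 495.3),
)
cluster.shutdown()
```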

Check the project here.

P2: Data Warehouse (AWS)

This project comprises the scripts required to set up a Data Warehouse on a Redshift cluster for the fictitious company Sparkify. This company had been collecting data on user activity from its music streaming application and storing it as JSON files. However, this rudimentary way of storing data made it difficult to extract insights from the data.

During this project, a Redshift cluster, a database (following a star schema), and its relevant tables were set up. Using this database, the Sparkify analytics team can access, aggregate, and generate insights from their users’ data.
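
The core of this kind of pipeline is Redshift's COPY command, which bulk-loads the raw JSON from S3 into staging tables. The sketch below, using psycopg2, is a hedged illustration: the endpoint, credentials, bucket paths, table name, and IAM role ARN are all placeholders.

```python
import psycopg2

# All connection values are placeholders for your own cluster.
conn = psycopg2.connect(
    host="<cluster-endpoint>.redshift.amazonaws.com",
    port=5439,
    dbname="sparkify",
    user="awsuser",
    password="<password>",
)
cur = conn.cursor()

# Bulk-load raw JSON logs from S3 into a staging table.
# Bucket path and role ARN are hypothetical.
cur.execute("""
    COPY staging_events
    FROM 's3://<log-bucket>/log_data'
    IAM_ROLE 'arn:aws:iam::<account-id>:role/<redshift-role>'
    FORMAT AS JSON 'auto';
""")
conn.commit()
conn.close()
```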

Check the project here.

P3: Data Lake

This project comprises the scripts required to set up a Data Lake using Spark and an S3 bucket for Sparkify. This company had been collecting data on user activity from its music streaming application and storing it as JSON files. However, this rudimentary way of storing data made it difficult to extract insights from the data.

This directory contains the ETL process, which produces Parquet tables following a star schema. Using these tables, the Sparkify analytics team can access, aggregate, and generate insights from their users’ data.
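
As a rough sketch of the ETL shape, the PySpark snippet below reads raw JSON from S3, derives one dimension table, and writes it back as partitioned Parquet. Bucket paths and column names are illustrative assumptions rather than the project's exact code.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sparkify-data-lake").getOrCreate()

# Read the raw song metadata from S3 (path is a placeholder).
songs = spark.read.json("s3a://<input-bucket>/song_data/*/*/*/*.json")

# Select the columns for a 'songs' dimension table of the star schema.
songs_table = (
    songs.select("song_id", "title", "artist_id", "year", "duration")
         .dropDuplicates(["song_id"])
)

# Write Parquet output, partitioned for efficient analytical queries.
(songs_table.write
    .mode("overwrite")
    .partitionBy("year", "artist_id")
    .parquet("s3a://<output-bucket>/songs/"))
```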

Check the project here.

P4: Data Pipelines

This project comprises the scripts required to set up a Data Pipeline using Apache Airflow for Sparkify. This company had been collecting data on user activity from its music streaming application and storing it as JSON files. However, this rudimentary way of storing data made it difficult to extract insights from the data.

This directory contains the DAGs, helpers, and operators that process data coming from S3 and load it into a Redshift database. Using these tables, the Sparkify analytics team can access, aggregate, and generate insights from their users’ data.
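
For orientation, here is a minimal Airflow DAG skeleton showing how such a task would be wired up. The project uses its own custom operators; the PythonOperator below is just a stand-in, and the DAG id, schedule, and callable are assumptions.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def load_from_s3_to_redshift():
    # Placeholder for the COPY logic that lives in the project's
    # custom operators.
    pass


with DAG(
    dag_id="sparkify_etl",          # hypothetical name
    start_date=datetime(2019, 1, 1),
    schedule_interval="@hourly",
    catchup=False,
) as dag:
    stage_events = PythonOperator(
        task_id="stage_events",
        python_callable=load_from_s3_to_redshift,
    )
```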

Check the project here.