Udacity Data Engineering

License: GPL v3

Repository for projects developed in Udacity's Data Engineering Nanodegree.

Project 1: Data Modeling with PostgreSQL

Short description: Relational database modeling with PostgreSQL for user activity data from a music streaming app.

Tools and technologies: Python, PostgreSQL, Star Schema, ETL pipelines, Normalization.
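As a rough illustration of the star schema idea, the sketch below loads a few made-up play events into one fact table and two dimension tables. It uses Python's built-in `sqlite3` as a stand-in so it runs without a database server; the project itself targets PostgreSQL. The table names (`users`, `songs`, `songplays`), the column names, and the sample rows are all assumptions made for this example, not the project's actual schema.

```python
import sqlite3

# Hypothetical sample events (assumption: real input would be log files).
events = [
    {"user_id": 1, "first_name": "Ana", "song": "Song A", "artist": "Artist X", "ts": 1000},
    {"user_id": 2, "first_name": "Ben", "song": "Song B", "artist": "Artist Y", "ts": 1001},
    {"user_id": 1, "first_name": "Ana", "song": "Song B", "artist": "Artist Y", "ts": 1002},
]


def build_star_schema(rows):
    """Create dimension and fact tables in memory and load the rows into them."""
    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()
    # Dimension tables: one row per user and one row per song.
    cur.execute("CREATE TABLE users (user_id INTEGER PRIMARY KEY, first_name TEXT)")
    cur.execute(
        "CREATE TABLE songs (song_id INTEGER PRIMARY KEY, title TEXT UNIQUE, artist TEXT)"
    )
    # Fact table: one row per play event, pointing at the dimensions.
    cur.execute(
        """CREATE TABLE songplays (
               songplay_id INTEGER PRIMARY KEY,
               start_time INTEGER NOT NULL,
               user_id INTEGER NOT NULL REFERENCES users(user_id),
               song_id INTEGER NOT NULL REFERENCES songs(song_id))"""
    )
    for r in rows:
        # Duplicate users and songs are skipped, so each dimension row is stored once.
        cur.execute(
            "INSERT OR IGNORE INTO users VALUES (?, ?)", (r["user_id"], r["first_name"])
        )
        cur.execute(
            "INSERT OR IGNORE INTO songs (title, artist) VALUES (?, ?)",
            (r["song"], r["artist"]),
        )
        song_id = cur.execute(
            "SELECT song_id FROM songs WHERE title = ?", (r["song"],)
        ).fetchone()[0]
        cur.execute(
            "INSERT INTO songplays (start_time, user_id, song_id) VALUES (?, ?, ?)",
            (r["ts"], r["user_id"], song_id),
        )
    conn.commit()
    return conn


conn = build_star_schema(events)
# Analytical query: join the fact table to a dimension and aggregate.
plays_per_user = conn.execute(
    """SELECT u.first_name, COUNT(*)
       FROM songplays sp JOIN users u ON u.user_id = sp.user_id
       GROUP BY u.user_id ORDER BY u.user_id"""
).fetchall()
print(plays_per_user)
```

Keeping user and song attributes in their own tables means each is stored once, while the fact table stays narrow and is joined to the dimensions only when a query needs them.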

Project 2: Data Modeling with Apache Cassandra

Short description: NoSQL database design using Apache Cassandra.

Tools and technologies: Python, Apache Cassandra, Denormalization.
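Denormalization in a Cassandra-style design usually means building one table per query, keyed by the value the query filters on. The sketch below mimics that idea with plain Python dictionaries so it runs without a cluster; it does not use the Cassandra driver. The query shapes, field names, and sample rows are assumptions made for this example.

```python
from collections import defaultdict

# Hypothetical sample events (assumption: not the project's real data).
events = [
    {"session_id": 10, "item_in_session": 1, "user_id": 1, "song": "Song A"},
    {"session_id": 10, "item_in_session": 2, "user_id": 1, "song": "Song B"},
    {"session_id": 11, "item_in_session": 1, "user_id": 2, "song": "Song A"},
]


def build_query_tables(rows):
    """Write every event into one structure per query, duplicating data on purpose."""
    # Query 1: "which songs were played in session X?" -> keyed by session_id.
    songs_by_session = defaultdict(list)
    # Query 2: "which users listened to song Y?" -> keyed by song title.
    users_by_song = defaultdict(set)
    for r in rows:
        songs_by_session[r["session_id"]].append((r["item_in_session"], r["song"]))
        users_by_song[r["song"]].add(r["user_id"])
    # Sort within each key, similar in spirit to a clustering column.
    for plays in songs_by_session.values():
        plays.sort()
    return songs_by_session, users_by_song


songs_by_session, users_by_song = build_query_tables(events)
print(songs_by_session[10])
print(users_by_song["Song A"])
```

Each lookup touches a single key and needs no join, at the cost of storing the same event in more than one place.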

Project 3: Data Warehouse - Amazon Redshift

Short description: Data warehouse design on Amazon Redshift.

Tools and technologies: Python, Amazon Redshift, AWS CLI, AWS SDK, SQL, PostgreSQL.

Project 4: Data Lake - Spark

Short description: Scaled up ETL pipelines by moving the data warehouse to a data lake.

Tools and technologies: Spark, S3, EMR, Athena, AWS Glue, Parquet.

Project 5: Data Pipelines - Airflow

Short description: Automation of the ETL pipeline and creation of the data warehouse using Apache Airflow.

Tools and technologies: Apache Airflow, S3, Amazon Redshift, Python.
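A scheduler such as Airflow runs tasks in an order that respects their dependencies. The sketch below shows only that ordering idea, using the standard library's `graphlib` (Python 3.9+) rather than the Airflow API. The task names and the dependency graph are hypothetical and do not come from the project.

```python
from graphlib import TopologicalSorter

# Hypothetical task graph: each task maps to the set of tasks it waits for.
dependencies = {
    "stage_events": {"begin"},
    "stage_songs": {"begin"},
    "load_fact_table": {"stage_events", "stage_songs"},
    "load_dimension_tables": {"load_fact_table"},
    "run_quality_checks": {"load_dimension_tables"},
}


def run_pipeline(deps, tasks):
    """Run every task after all of its upstream tasks; return the execution order."""
    order = []
    for name in TopologicalSorter(deps).static_order():
        tasks[name]()  # call the task's function
        order.append(name)
    return order


# Placeholder task bodies that only record that they ran.
log = []
tasks = {name: (lambda n=name: log.append(n)) for name in ["begin", *dependencies]}

order = run_pipeline(dependencies, tasks)
print(order)
```

The two staging tasks have no dependency on each other, so their relative order is not fixed; a real scheduler could run them in parallel.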

Contributing

I use this repository for my own projects, and I know the approach might not suit every project out there. If you have any ideas, just [open an issue][issues] and tell me what you think.

If you'd like to contribute, please fork the repository and make changes as you'd like. Pull requests are warmly welcome.

License

Distributed under the GPL v3 License. See LICENSE for more information.