Data Lake with Apache Spark on AWS.

Summary

Introduction
Project Description
Getting Started
Database Schema
Prerequisites
Installing
Running the tests
Built With
Contributing
Authors
License
Acknowledgments

Introduction

Sparkify is a music streaming startup that offers free and paid cloud music streaming plans and is trying to acquire more users on paid plans. It wanted to move from a local PostgreSQL analytics data warehouse to a cloud-based analytics process using Amazon Redshift, so that it could analyze user behavior more flexibly and work out how to convert users into paying customers. Now that the amount of data has grown, it needs more powerful data processing tooling at big data scale.

Project Description

Sparkify currently has an on-premises PostgreSQL data warehouse and wants to move its analytics process to the cloud. To do so, it needs to move its song and log data, stored as JSON files, to Amazon S3 as a staging area and then load the data into a Redshift database for further analysis. Because of the volume of data and the need for more powerful data processing tooling, it has to use Apache Spark to move data back and forth between S3 and Redshift in Parquet format, which is better suited to big data analysis. This calls for something more robust than the existing data warehouse, so we propose a data lake on AWS as the solution for this new need.

Getting Started

Run the Python script below:

etl.py: Reads data from S3, processes that data using Spark, and writes it back to S3.

To run it in a Jupyter Notebook powered by an EMR cluster, import the notebook found in this project.
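As a rough illustration of the read, process, and write flow described above, here is a minimal PySpark sketch. It is not the project's actual etl.py: the folder names `song_data`, `log_data`, `songs`, and `logs`, the function names, and the de-duplication step are all assumptions made for the example. It also assumes the Spark environment (for example an EMR cluster) is already able to read `s3a://` paths.

```python
def build_paths(input_root, output_root):
    """Build input and output locations from two root URIs.

    The sub-folder names are hypothetical, chosen only for this sketch.
    """
    src = input_root.rstrip("/")
    dst = output_root.rstrip("/")
    return {
        "song_in": f"{src}/song_data",
        "log_in": f"{src}/log_data",
        "song_out": f"{dst}/songs",
        "log_out": f"{dst}/logs",
    }


def create_spark_session(app_name="sparkify-data-lake"):
    """Create (or reuse) a Spark session; pyspark is imported lazily."""
    from pyspark.sql import SparkSession

    return SparkSession.builder.appName(app_name).getOrCreate()


def run_etl(spark, input_root, output_root):
    """Read JSON from S3, apply a placeholder transformation, write Parquet."""
    paths = build_paths(input_root, output_root)

    # recursiveFileLookup (Spark 3.0+) picks up JSON files in nested folders.
    song_df = spark.read.option("recursiveFileLookup", "true").json(paths["song_in"])
    log_df = spark.read.option("recursiveFileLookup", "true").json(paths["log_in"])

    # Placeholder processing step: drop exact duplicate records.
    song_df.dropDuplicates().write.mode("overwrite").parquet(paths["song_out"])
    log_df.dropDuplicates().write.mode("overwrite").parquet(paths["log_out"])
```

A run would then look something like `run_etl(create_spark_session(), "s3a://<input-bucket>", "s3a://<output-bucket>/lake")`, with the bucket names replaced by your own.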

Database Schema

Prerequisites

An AWS account provisioned with an S3 bucket and an IAM role with admin-level access to connect to S3 and perform the operations listed above, Apache Spark as the data processing tool, and Python 3.x (local or cloud-based) to run the scripts.
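When the scripts run locally rather than on a cluster that already carries the IAM role, the AWS credentials have to reach Spark somehow. One common approach, sketched below, is to keep them in a small config file and export them as environment variables. The `[AWS]` section name, the key names in the file, and both function names are assumptions for this example, not something this project prescribes.

```python
import configparser
import os


def parse_aws_credentials(config_text, section="AWS"):
    """Extract the two AWS keys from INI-style text (hypothetical layout)."""
    # interpolation=None keeps any '%' characters in a key from being expanded.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)
    return {
        "AWS_ACCESS_KEY_ID": parser[section]["AWS_ACCESS_KEY_ID"],
        "AWS_SECRET_ACCESS_KEY": parser[section]["AWS_SECRET_ACCESS_KEY"],
    }


def export_aws_credentials(creds, env=None):
    """Copy the credentials into os.environ, or into a supplied mapping."""
    target = os.environ if env is None else env
    target.update(creds)
    return target
```

Keep such a config file out of version control. On EMR, attaching the IAM role to the cluster usually makes this step unnecessary.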

Installing

Use the etl.ipynb notebook to develop the ETL process for each of the tables before running the complete etl.py file, which reads data from S3, processes that data using Spark, and writes it back to S3.

Running the tests

Test by running the scripts provided by the analytics team and checking that the results match what was expected.
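The analytics team's scripts are not included here. As a complementary sanity check, one could read the Parquet output back and compare row counts against known expectations. The sketch below assumes a hypothetical mapping of table names to output paths and expected counts; both helper names are made up for the example.

```python
def count_rows(spark, table_paths):
    """Return {table_name: row_count} for the Parquet datasets at the given paths."""
    return {
        name: spark.read.parquet(path).count()
        for name, path in table_paths.items()
    }


def find_mismatches(actual, expected):
    """Return {table: (actual_count, expected_count)} for every table that differs.

    A table missing from `actual` is reported with an actual count of None.
    """
    return {
        name: (actual.get(name), want)
        for name, want in expected.items()
        if actual.get(name) != want
    }
```

An empty result from `find_mismatches` means every expected table was found with the expected number of rows.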

Built With

Amazon Redshift - Amazon's cloud-based data warehouse service
Amazon IAM - Amazon Identity and Access Management service
Amazon S3 - Amazon object storage service
Python - Scripting language
Apache Spark - Lightning-fast unified analytics engine

Contributing

Teofilo Carlos Chichume

Authors

Teofilo Carlos Chichume - Initial work - nhatofo

License

This project is licensed under the MIT License - see the LICENSE.md file for details

Acknowledgments

Inspiration: AWS Documentation, PurpleBooth

nhatofo/udacity_dl