Putting into practice the following concepts:
- Data modeling (applying conceptual modeling, then constructing the Fact and Dimension tables).
- Database schema (applying a specific schema to the Fact and Dimension tables that suits our data size and structure => the Star Schema).
- Getting the best of both worlds by using a `Data Lake`: applying Dimensional Modeling to data with high/known value, while storing low/unknown-value data that was previously not available for analytics.
- ETL pipeline (constructing an ETL pipeline to extract data from log files on an `S3` bucket that acts as a Data Lake, apply the needed transformations on the data using `Apache Spark`, insert the data into the Fact and Dimension tables, then write (load) the data back to the `AWS S3` Data Lake as `Parquet` files; a minimal skeleton is sketched after this list).
- Using `AWS S3` as our Data Lake, due to its virtually unlimited scalability and the ability to seamlessly increase storage from gigabytes to petabytes of content, paying only for what we use.
- Using `Apache Spark` to apply various transformations on Big Data, as it is a distributed computing framework that processes data in memory and limits the writing of intermediate results to disk as much as possible, which makes `Apache Spark` very fast and efficient for processing Big Data.
- Using `Apache Parquet` (a columnar storage format) to compress the data while maintaining its structure (schema), which allows for lower data storage costs and maximizes the effectiveness of querying the data with serverless technologies like `Amazon Athena`.
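For illustration, here is a minimal sketch of the extract/load skeleton described above, assuming `pyspark` with the `hadoop-aws` (s3a) connector is available; the output bucket path and the connector version are placeholders, not project settings:

```python
from pyspark.sql import SparkSession

# Create a Spark session able to talk to S3 via the s3a connector
# (the hadoop-aws version is illustrative; match it to your Hadoop build).
spark = (
    SparkSession.builder
    .appName("sparkify-data-lake")
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:2.7.0")
    .getOrCreate()
)

# Extract: read the raw JSON song files straight from the S3 data lake.
song_df = spark.read.json("s3a://udacity-dend/song_data/*/*/*/*.json")

# Load: write the data back to S3 as compressed, partitioned Parquet files
# ("s3a://<output-bucket>/" is a placeholder for your own bucket).
song_df.write.mode("overwrite").partitionBy("year", "artist_id").parquet(
    "s3a://<output-bucket>/songs/"
)
```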
A music streaming startup, Sparkify, has grown their user base and song database even more and wants to move their data warehouse to a data lake. Their data resides in S3, in a directory of JSON logs on user activity on the app, as well as a directory with JSON metadata on the songs in their app.
As the data engineer assigned to the project, my role is to build an ETL pipeline that extracts their data from S3, processes it using Spark, and loads the data back into S3 as a set of dimensional tables. This will allow their analytics team to continue finding insights in what songs their users are listening to.
- Song dataset: all JSON files can be found under `s3a://udacity-dend/song_data/*/*/*/*.json`. A sample of these files is:
{"num_songs": 1, "artist_id": "ARD7TVE1187B99BFB1", "artist_latitude": null, "artist_longitude": null, "artist_location": "California - LA", "artist_name": "Casual", "song_id": "SOMZWCG12A8C13C480", "title": "I Didn't Mean To", "duration": 218.93179, "year": 0}
- Log dataset: all JSON files can be found under `s3a://udacity-dend/log_data/*/*/*.json`. A sample of these files is:
{"artist":null,"auth":"Logged In","firstName":"Kaylee","gender":"F","itemInSession":0,"lastName":"Summers","length":null,"level":"free","location":"Phoenix-Mesa-Scottsdale, AZ","method":"GET","page":"Home","registration":1540344794796.0,"sessionId":139,"song":null,"status":200,"ts":1541106106796,"userAgent":"\"Mozilla\/5.0 (Windows NT 6.1; WOW64) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/35.0.1916.153 Safari\/537.36\"","userId":"8"}
Data is extracted from two types of JSON source files:
- The songs data is a subset of the Million Song Dataset.
- The user activity (log) data is generated using an event simulator based on the songs dataset.
The schema used for this project is the Star Schema: there is one main fact table, `songplays`, containing all the measures associated with each event, and four dimension tables, `songs`, `artists`, `users` and `time`. These tables contain clean data that is suitable for OLAP (Online Analytical Processing) operations.
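As an illustration of how the fact table relates the two sources, below is a hypothetical sketch of assembling `songplays` by joining the log events with the song metadata. The column names come from the JSON samples above; the join keys (title, artist name and duration) and the left join are assumptions of this sketch, not a specification of `etl.py`:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

log_df = spark.read.json("s3a://udacity-dend/log_data/*/*/*.json")
song_df = spark.read.json("s3a://udacity-dend/song_data/*/*/*/*.json")

# Keep only the events that are actual song plays.
events = log_df.filter(F.col("page") == "NextSong")

# Fact table: one row per songplay event, enriched with song/artist keys.
songplays = (
    events.join(
        song_df,
        (events.song == song_df.title)
        & (events.artist == song_df.artist_name)
        & (events.length == song_df.duration),
        how="left",
    )
    .select(
        F.monotonically_increasing_id().alias("songplay_id"),
        F.col("ts").alias("start_time"),
        F.col("userId").alias("user_id"),
        "level",
        "song_id",
        "artist_id",
        F.col("sessionId").alias("session_id"),
        "location",
        F.col("userAgent").alias("user_agent"),
    )
)
```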
We used a Data Lake due to:
- The wide variety of data formats and structures.
- The impossibility of conforming to a single rigid representation of the data, given the emergence of new roles such as the Data Scientist.
- The `agile` and `ad-hoc` nature of the data exploration activities needed by Data Scientists.
- The wide spectrum of data transformations needed by advanced analytics such as `Machine Learning`, `Graph Analytics` and `Recommender Systems`.
- The massive parallelism and scalability that come out of the box when using `Apache Spark` and `AWS S3`, for example.
- The use of columnar storage such as `Parquet`, without expensive `MPP` databases such as `Amazon Redshift`.
- Filter the log data on `page=NextSong`, to capture which song each user chose to listen to next (helping the analytics team continue finding insights into which songs their users love to listen to).
- Convert `Unix time` (Unix epoch format, i.e. the time elapsed since January 1, 1970; in these logs the `ts` field is in milliseconds) to a `Timestamp` using the Spark DataFrame API.
- Extract hour, day, dayOfMonth, weekOfYear, month, and year from the Timestamp into the `time` table (all three steps are sketched below).
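A minimal sketch of the three steps above, assuming the `ts` field is in milliseconds (as the sample record suggests); interpreting `day` as day of the week is also an assumption of this sketch:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
log_df = spark.read.json("s3a://udacity-dend/log_data/*/*/*.json")

# 1. Keep only the "NextSong" events.
events = log_df.filter(F.col("page") == "NextSong")

# 2. Convert the Unix epoch in milliseconds to a proper timestamp column.
events = events.withColumn("start_time", (F.col("ts") / 1000).cast("timestamp"))

# 3. Extract the calendar fields for the time dimension table.
time_table = events.select("start_time").dropDuplicates().select(
    "start_time",
    F.hour("start_time").alias("hour"),
    F.dayofweek("start_time").alias("day"),  # assumed to mean day of the week
    F.dayofmonth("start_time").alias("dayOfMonth"),
    F.weekofyear("start_time").alias("weekOfYear"),
    F.month("start_time").alias("month"),
    F.year("start_time").alias("year"),
)
```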
- `etl.py` -> this script loads the data from the `S3 Bucket`, applies the transformations on it using `Apache Spark`, then writes it back to the `S3 Bucket` in a columnar format (`Parquet`).
- `dl.cfg` -> configuration file that contains the `AWS Credentials` needed to access the `S3 Bucket` and load the transformed data onto it (a hypothetical example is sketched below).
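Below is a hypothetical sketch of how `etl.py` might pick up the credentials from `dl.cfg`; the section and key names shown are assumptions, not the project's actual layout:

```python
import configparser
import os

# Assumed dl.cfg layout (keep this file out of version control):
#   [AWS]
#   AWS_ACCESS_KEY_ID=<your access key>
#   AWS_SECRET_ACCESS_KEY=<your secret key>
config = configparser.ConfigParser()
config.read("dl.cfg")

# Expose the credentials to the s3a connector through the environment.
os.environ["AWS_ACCESS_KEY_ID"] = config["AWS"]["AWS_ACCESS_KEY_ID"]
os.environ["AWS_SECRET_ACCESS_KEY"] = config["AWS"]["AWS_SECRET_ACCESS_KEY"]
```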
- Using the `anaconda` package manager, install:
  - `Python` 2.7 or above
  - `Java` JDK 8
  - the `pyspark` package
- Run the `etl.py` script from the terminal as `python etl.py`.