/DSTI_DataPipeline_Spark_Scala

Repository for DSTI Data Pipeline 2 Project


DSTI Data Pipeline 2 Project, based on source code from Professor Jean-Luc Canela

Scala CI

Professor Scala Tutorial

1. Project Overview

This assignment uses Spark with Scala to query important information from a dataset file and present the results as JSON reports. It also aims to practice writing unit tests, delivering the project with an sbt or Maven configuration, and packaging the project automatically as a jar file that can be run with spark-submit. The project also includes integration with the AWS Glue and AWS S3 services as an ETL tool.

2. Deliverables

For this assignment, I used the source code template from the Professor's GitHub repository.

  • This documentation describing the work done for the project (this document)
  • My Scala functions are added in my Scala project called CreateJSONreport, under src/main/scala/CreateJSONreport.scala
  • The dataset file (a Web server log file) comes from Prof. Jean-Luc's spark-hands-on wiki. In CreateJSONreport.scala, the location of the dataset file is hardcoded in val path and needs to be modified to the correct location:

DatasetPath

  • After running the Scala code CreateJSONreport.scala, it creates the JSON reports in the project root directory as follows (a rough sketch of this logic is shown after this list):
  1. myjsonreport_reportByDate: this JSON report contains the number of accesses by URI for the dates with more than 20,000 connections in the Web server log dataset
  2. myjsonreport_reportByIp: this JSON report contains the number of accesses per IP address
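The core of the report logic can be sketched roughly as follows. This is a minimal outline only, assuming the log lines follow the common Apache access-log layout; the path, regex, and column names here are illustrative, and the actual implementation lives in CreateJSONreport.scala.

```scala
import org.apache.spark.sql.SparkSession

object CreateJSONreportSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("CreateJSONreport")
      .master("local[*]")                    // local run; adjust for a cluster
      .getOrCreate()
    import spark.implicits._

    // Hypothetical hardcoded path, as in the real code's `val path`
    val path = "data/access.log"

    // Parse each log line into (ip, date, uri) with a simplified regex
    val logLine = """^(\S+) \S+ \S+ \[([^:]+):[^\]]+\] "\S+ (\S+) [^"]*".*$""".r
    val logs = spark.read.textFile(path).flatMap {
      case logLine(ip, date, uri) => Seq((ip, date, uri))
      case _                      => Seq.empty[(String, String, String)]
    }.toDF("ip", "date", "uri")

    // Report 1: accesses by URI, restricted to dates with more than 20,000 connections
    val busyDates = logs.groupBy("date").count().filter($"count" > 20000).select("date")
    logs.join(busyDates, "date")
      .groupBy("date", "uri").count()
      .write.json("myjsonreport_reportByDate")

    // Report 2: accesses per IP address
    logs.groupBy("ip").count()
      .write.json("myjsonreport_reportByIp")

    spark.stop()
  }
}
```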

3. Limitations / Possible Improvements

I was not able to create a jar file using the command "sbt assembly" as learned in class because I encountered the following error:

error

Because of this issue, I could not create a new *.jar file from my completed Scala code in order to do the spark-submit part.
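For reference, a typical sbt-assembly setup looks roughly like the sketch below. The plugin version, Scala/Spark versions, merge strategy, and jar name are illustrative, not the exact configuration of this project.

```scala
// project/plugins.sbt
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.15.0")

// build.sbt (excerpt)
name := "CreateJSONreport"
scalaVersion := "2.12.12"

// Spark is provided by spark-submit at run time
libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.0.0" % "provided"

// Discarding conflicting META-INF entries is a common fix for "sbt assembly" errors
assemblyMergeStrategy in assembly := {
  case PathList("META-INF", xs @ _*) => MergeStrategy.discard
  case _                             => MergeStrategy.first
}

// Once "sbt assembly" succeeds, the jar could be run with something like:
//   spark-submit --class CreateJSONreport target/scala-2.12/CreateJSONreport-assembly-0.1.jar
```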

4. Getting Started

In order to run this project on Windows, you need to have the applications listed in the next section installed.

5. Operating System

This project was implemented on the Windows 10 Professional operating system using the following applications:

  • IntelliJ IDEA
  • Spark shell version 3.0.0

6. Integration with AWS Glue and AWS S3 Services as ETL Tool (Extract Transform and Load)

This is a step-by-step tutorial describing how to run this Scala program on AWS. All the source files for this AWS part are from Prof. Jean-Luc's aws-emr-template, and they are also available in my GitHub project DSTI_DataPipeline_AWS_ETL.

  • Log into your AWS account
  • Click on Services to go to the all-services page.

aws1

  • To prepare the S3 bucket for the database, if one does not already exist, look for the S3 service under Storage.

aws4

  • Click on Create Bucket

S3_1

  • Enter the name of the Bucket

S3_2

  • Define the appropriate configuration options.

S3_3

  • Choose Block all public access, which is the default option.

S3_4

  • Click Create Bucket; the newly created bucket will appear in the S3 list.

S3_5

  • Select the newly created S3 bucket; a pop-up window with the details of the bucket will appear on the left. Click on Copy Bucket ARN to copy the bucket ARN to the clipboard; you will need this ARN later to configure AWS Glue.

S3_6
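These bucket steps can also be scripted. Below is a minimal sketch using the AWS SDK for Java (v1) from Scala, assuming the SDK is on the classpath and credentials/region are already configured; the bucket name is only an example.

```scala
import com.amazonaws.services.s3.AmazonS3ClientBuilder

object CreateBucketSketch extends App {
  val s3 = AmazonS3ClientBuilder.defaultClient()

  // Example bucket name; S3 bucket names must be globally unique
  val bucketName = "dsti-a19-datapipeline"
  s3.createBucket(bucketName)

  // The ARN copied in the console has the form: arn:aws:s3:::dsti-a19-datapipeline
  println(s"Created bucket $bucketName")
}
```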

  • Back on the AWS Services home page, look for AWS Glue, which is in the Analytics category. This is the managed AWS ETL service. Click to go to this service.

aws2

  • From the main menu on the left, click on Databases under Data Catalog

aws3

  • Click on Add Database, then enter any name for Database name. For Location, paste the ARN that was copied from the new S3 bucket, remove the beginning characters arn:aws:, and end the location with /database. Then click Create.

aws_glue1
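The same database could also be created through the API; a hedged sketch with the AWS SDK for Java (v1) from Scala, where the database name and location are only examples.

```scala
import com.amazonaws.services.glue.AWSGlueClientBuilder
import com.amazonaws.services.glue.model.{CreateDatabaseRequest, DatabaseInput}

object CreateGlueDatabaseSketch extends App {
  val glue = AWSGlueClientBuilder.defaultClient()

  // Example location pointing at the bucket prepared above, ending with /database
  glue.createDatabase(new CreateDatabaseRequest()
    .withDatabaseInput(new DatabaseInput()
      .withName("dsti-database")                              // any name
      .withLocationUri("s3://dsti-a19-datapipeline/database")))
}
```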

  • Click on the newly created database in AWS Glue to access it.

aws_glue2

  • Click the link Tables in the database to access the table creation page

aws_glue3

  • Choose Add tables > Add tables using a crawler

aws_glue4

  • Here, please go back to the S3 bucket homepage. For this tutorial, using a crawler in AWS Glue, we have created another S3 bucket called dsti-a19-incoming1. This S3 bucket will keep the web server access log from Prof. Jean-Luc's spark-hands-on wiki, access.log.gz.

aws_glue5

  • The access log should be downloaded first if not already done. Click on the S3 bucket dsti-a19-incoming1 to go to its homepage, then drag and drop access.log.gz to upload it to the bucket. Then click on Upload.

S3_incoming1

  • After the upload finishes, you will see the access log stored in this S3 bucket.

S3_incoming2
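The upload can equally be done from code; a short sketch with the AWS SDK for Java (v1) from Scala, assuming access.log.gz has been downloaded to the working directory.

```scala
import java.io.File
import com.amazonaws.services.s3.AmazonS3ClientBuilder

object UploadAccessLogSketch extends App {
  val s3 = AmazonS3ClientBuilder.defaultClient()

  // Upload the downloaded access log into the incoming bucket used in this tutorial
  s3.putObject("dsti-a19-incoming1", "access.log.gz", new File("access.log.gz"))
}
```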

  • Go back to the add-tables-using-a-crawler configuration page and add the details of the new crawler.

awsglueCrawl1

  • Add the path to the S3 bucket for the access log.

awsglueCrawl2

  • Add details for IAM role.

awsglueCrawl3

  • Configure the crawler output.

awsglueCrawl4

  • Review everything, then click Finish. Now we have created a new crawler for adding tables.

awsglueCrawl5

  • Click on the new crawler and click on Run crawler.

awsglueCrawl6

awsglueCrawl7
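Creating and running the crawler can also be scripted; a sketch with the AWS SDK for Java (v1) from Scala, where the IAM role and database name are placeholders.

```scala
import com.amazonaws.services.glue.AWSGlueClientBuilder
import com.amazonaws.services.glue.model.{CrawlerTargets, CreateCrawlerRequest, S3Target, StartCrawlerRequest}

object CrawlerSketch extends App {
  val glue = AWSGlueClientBuilder.defaultClient()

  // Create a crawler pointing at the incoming bucket (role and database name are hypothetical)
  glue.createCrawler(new CreateCrawlerRequest()
    .withName("dsti-incoming-crawler")
    .withRole("AWSGlueServiceRole-dsti")
    .withDatabaseName("dsti-database")
    .withTargets(new CrawlerTargets()
      .withS3Targets(new S3Target().withPath("s3://dsti-a19-incoming1/"))))

  // Equivalent to clicking "Run crawler" in the console
  glue.startCrawler(new StartCrawlerRequest().withName("dsti-incoming-crawler"))
}
```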

  • Go to the Classifiers menu under Crawlers in the left AWS Glue menu.

awsglueCrawl8

  • Click Add classifier, then enter the details for the classifier that is related to the access log.

awsglueCrawl9
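For reference, a CSV classifier can also be added via the API; a hedged sketch (AWS SDK for Java v1, from Scala) with an example name and settings.

```scala
import com.amazonaws.services.glue.AWSGlueClientBuilder
import com.amazonaws.services.glue.model.{CreateClassifierRequest, CreateCsvClassifierRequest}

object CsvClassifierSketch extends App {
  val glue = AWSGlueClientBuilder.defaultClient()

  // A simple comma-delimited classifier expecting a header row (illustrative settings)
  glue.createClassifier(new CreateClassifierRequest()
    .withCsvClassifier(new CreateCsvClassifierRequest()
      .withName("dsti-csv-classifier")
      .withDelimiter(",")
      .withContainsHeader("PRESENT")))
}
```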

  • Create a simple data.csv file to test this.

sampleCsv

  • Go to the location of data.csv, then upload it to the S3 bucket dsti-a19-incoming1, where the access log is stored.

sampleCsv2
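As an illustration, the test file could look like the hypothetical contents below and be uploaded next to the access log in the same way.

```scala
import java.io.File
import java.nio.file.{Files, Paths}
import com.amazonaws.services.s3.AmazonS3ClientBuilder

object UploadSampleCsvSketch extends App {
  // Hypothetical sample contents; any small CSV with a header row works
  val csv =
    """id,name,value
      |1,alpha,10
      |2,beta,20
      |""".stripMargin
  Files.write(Paths.get("data.csv"), csv.getBytes("UTF-8"))

  // Upload it to the incoming bucket where the access log is stored
  val s3 = AmazonS3ClientBuilder.defaultClient()
  s3.putObject("dsti-a19-incoming1", "data.csv", new File("data.csv"))
}
```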

  • Go back to the crawler that we created and ran, then click on Logs. Here we can see the crawler log.

IncomingCrawler1

IncomingCrawler2

  • Now go back to AWS Glue > Databases > Tables. The crawler automatically created the table incoming_dsti_a19_incoming1. We can click inside the table to see more details.

IncomingCrawler3

IncomingCrawler4

  • However, it cannot identify the type, as the classification is unknown. We can go back to AWS Glue > Crawlers, select the crawler we created, dsti-incoming-crawler, and edit its configuration. At the bottom of the edit page, we can click Add to add the CSV classifier that we created. Then keep all the other settings the same and click Finish to update the crawler.

IncomingCrawler5

  • Re-run this crawler. Now, the crawler will run through both files, the access log and the new simple CSV file that we created, as they are both in the same incoming S3 bucket.

IncomingCrawler6

  • Now, when we go back to AWS Glue > Databases > Tables, there are two more new tables created, one for the access log and one for the simple data.csv file.

IncomingCrawler7

  • When clicking into the table incoming_data_csv, we can see that it reads the data inside the CSV file correctly.

IncomingCrawler8

  • From the AWS Services homepage, search for the AWS Athena service. Here, you will see the tables from AWS Glue that were created by the crawler. Click on Preview Table.

Athena1

  • For now, you will get an error saying that no output location was provided.

Athena2

  • Click on the Settings menu at the top right.

Athena3

  • Here, you will need to go back to the S3 service and create a new bucket for Athena.

Athena4

Athena5

  • Select the new bucket created for Athena and click Copy Bucket ARN.

Athena6

  • Click to go inside the new bucket for Athena; here you need to create a new folder named output.

Athena7

  • Go back to the Athena service and paste the S3 bucket ARN for the Athena bucket. Also add the additional information; normally we would use CSE-KMS or SSE-KMS as the encryption type, but here, for the sake of the tutorial, we will choose the default SSE-S3.

Athena8

  • In Athena, select the other table, incoming_data_csv, preview this table, and then run the query.

Athena9
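Behind the console, Preview Table corresponds roughly to a StartQueryExecution call that writes its results to the output folder of the Athena bucket. The sketch below uses the AWS SDK for Java (v1) from Scala; the database and bucket names are examples, and Preview Table typically issues a SELECT ... LIMIT 10 query like this one.

```scala
import com.amazonaws.services.athena.AmazonAthenaClientBuilder
import com.amazonaws.services.athena.model.{QueryExecutionContext, ResultConfiguration, StartQueryExecutionRequest}

object AthenaPreviewSketch extends App {
  val athena = AmazonAthenaClientBuilder.defaultClient()

  val result = athena.startQueryExecution(new StartQueryExecutionRequest()
    .withQueryString("SELECT * FROM incoming_data_csv LIMIT 10")
    .withQueryExecutionContext(new QueryExecutionContext().withDatabase("dsti-database"))   // example database name
    .withResultConfiguration(new ResultConfiguration()
      .withOutputLocation("s3://dsti-a19-athena/output/")))                                 // example Athena bucket

  println(s"Query execution id: ${result.getQueryExecutionId}")
}
```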

  • Then go back to the Athena bucket in the S3 service, in the output folder; here you can see the results of every action you performed in Athena. Athena extracts the data according to the query that was run.

Athena10

Athena11

  • The Athena service integrates with the S3 buckets that we created and also with the AWS Glue service. You use the Athena service to view the results of the table queries.