
Big Data Analysis with Scala and Spark

Assignment code for the Big Data Analysis with Scala and Spark course (Coursera / EPFL)

Assignments

Final grade: 100%

  • Week 1: Wikipedia

Your overall score for this assignment is 10.00 out of 10.00

  • Week 2-3: StackOverflow

Your overall score for this assignment is 10.00 out of 10.00

  • Week 4: Time usage

Your overall score for this assignment is 10.00 out of 10.00

Details

Week 2-3: StackOverflow

Using the Spark web UI, we visualize the event timeline and the DAGs of the submitted jobs (the stages listed below are sketched in code after the screenshot).

Extracting vectors

  • Stages 1 and 2: load questions and answers.

  • Stage 3: groupedPostings, scoredPostings, vectorPostings

  • Stage 4: sampleVectors

[Screenshot: vector-extraction stages in the Spark web UI]
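For context, here is a minimal sketch of what these stages compute. The Posting case class, the langs list, and the langSpread value are illustrative stand-ins for the assignment's actual definitions, not the graded code:

import org.apache.spark.rdd.RDD

// Illustrative stand-in for the assignment's Posting model (assumption)
case class Posting(postingType: Int, id: Int, parentId: Option[Int],
                   score: Int, tags: Option[String])

val langs = List("Scala", "Java", "Python") // shortened list (assumption)
val langSpread = 50000 // distance separating language clusters on the x axis

// Stage 3: groupedPostings / scoredPostings / vectorPostings
def vectorPostings(postings: RDD[Posting]): RDD[(Int, Int)] = {
  val questions = postings.filter(_.postingType == 1).map(q => (q.id, q))
  val answers = postings.filter(_.postingType == 2)
    .flatMap(a => a.parentId.map(qid => (qid, a)))

  questions.join(answers) // groupedPostings: pair each question with its answers
    .mapValues { case (q, a) => (q, a.score) }
    .reduceByKey { case ((q, s1), (_, s2)) => (q, s1 max s2) } // scoredPostings: highest answer score
    .values
    .flatMap { case (q, maxScore) => // vectorPostings: (language index * spread, score)
      q.tags.flatMap(t => Some(langs.indexOf(t)).filter(_ >= 0))
        .map(idx => (idx * langSpread, maxScore))
    }
}

// Stage 4: sample and cache the vectors for the iterative k-means phase
// val sampleVectors = vectorPostings(postings).sample(false, 0.1).cache()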

K-Means clustering

  • Jobs 2 to 46 apply the k-means algorithm to the sampleVectors cached in the previous step.

At each iteration, the centroids are updated and collected to the driver to evaluate convergence; the loop stops once convergence is reached (see the sketch after the screenshot).

[Screenshot: k-means jobs in the Spark web UI]
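A minimal sketch of that loop, assuming plain (Int, Int) vectors; the eta threshold, the iteration cap, and the helper functions are illustrative assumptions, not the assignment's exact definitions:

import org.apache.spark.rdd.RDD
import scala.annotation.tailrec

val eta = 20.0 // convergence threshold on total squared centroid movement (assumption)

def squaredDist(a: (Int, Int), b: (Int, Int)): Double = {
  val dx = (a._1 - b._1).toDouble
  val dy = (a._2 - b._2).toDouble
  dx * dx + dy * dy
}

def findClosest(p: (Int, Int), means: Array[(Int, Int)]): Int =
  means.indices.minBy(i => squaredDist(p, means(i)))

def average(points: Iterable[(Int, Int)]): (Int, Int) =
  (points.map(_._1).sum / points.size, points.map(_._2).sum / points.size)

@tailrec
def kmeans(means: Array[(Int, Int)], vectors: RDD[(Int, Int)], iter: Int = 1): Array[(Int, Int)] = {
  // one Spark job per iteration: assign each vector to its closest centroid,
  // recompute the centroids, and collect them back to the driver
  val newMeans = means.clone()
  vectors.map(p => (findClosest(p, means), p))
    .groupByKey()
    .mapValues(average)
    .collect()
    .foreach { case (i, m) => newMeans(i) = m }

  val movement = means.zip(newMeans).map { case (a, b) => squaredDist(a, b) }.sum
  if (movement < eta || iter >= 120) newMeans // converged, or iteration cap reached
  else kmeans(newMeans, vectors, iter + 1)
}

Each iteration triggers one Spark job (hence the long run of jobs visible in the UI), while the convergence test itself runs on the driver.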

Week 4: Time usage

The data set analyzed originates from the American Time Use Survey (ATUS), 2003-2015, obtained via Kaggle. It measures how people divide their time among various daily activities.

Displaying data with Zeppelin

We load the resulting dataset into Apache Zeppelin.

Install
  • Download the archive from the Zeppelin website (wget)

  • Extract it (untar)

  • Run: SPARK_LOCAL_IP=127.0.0.1 zeppelin-0.7.1/bin/zeppelin-daemon.sh start

  • Stop: zeppelin-0.7.1/bin/zeppelin-daemon.sh stop

nb: SPARK_LOCAL_IP is set to work around a "port unable to bind" exception in 0.7.1

Prepare data export

Export the resulting week 4 dataset as JSON.

1) From the Spark environment, export data to disk:

finalDf.coalesce(1) // (1)
  .write.json("dataset-week4.json")

  1. Coalesce into a single partition so that only one output file is written (otherwise, one file per partition)
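nb: coalesce(1) funnels the whole write through a single task, which is fine for this small aggregated dataset; for larger outputs, keeping one file per partition is usually preferable.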

2) Upload the exported file to the host running Zeppelin, or fetch it from there (%sh paragraph, then wget …)

Zeppelin

Connect to the Zeppelin web UI at http://localhost:8080 and create a new notebook with the following content.

// First paragraph (Scala interpreter): load the exported JSON and register it as a temporary table
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val sqlData = sqlContext.jsonFile("dataset-week4.json")
sqlData.registerTempTable("data")

Then, in a second paragraph, query the table with the SQL interpreter:

%sql SELECT * FROM data ORDER BY work DESC
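nb: jsonFile and registerTempTable are deprecated in Spark 2.x; if the Zeppelin interpreter runs Spark 2.x (which exposes a spark session), an equivalent sketch would be:

val df = spark.read.json("dataset-week4.json")
df.createOrReplaceTempView("data")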

Display the result as a bar graph:

[Screenshot: bar graph in Zeppelin]

nb: the sort order seems not to be respected in the chart, as per open issue ZEPPELIN-87