
Big Data Analysis with Scala and Spark

Assignment code for the Big Data Analysis with Scala and Spark course (Coursera / EPFL)

Assignments

Final grade: 100%

  • Week 1: Wikipedia

Your overall score for this assignment is 10.00 out of 10.00

  • Week 2-3: StackOverflow

Your overall score for this assignment is 10.00 out of 10.00

  • Week 4: Time usage

Your overall score for this assignment is 10.00 out of 10.00

Details

Week 2-3: StackOverflow

Using the Spark web UI, we visualize the event timeline and the DAGs of the submitted jobs (the stages listed below are sketched in code after the screenshot).

Extracting vectors

  • Stages 1 and 2: load questions and answers.

  • Stage 3: groupedPostings, scoredPostings, vectorPostings

  • Stage 4: sampleVectors

[Screenshot: vector-extraction stages in the Spark web UI]
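For context, here is a minimal sketch of what these stages compute. The Posting case class, the langs list, and the langSpread value are illustrative stand-ins for the assignment's actual definitions, not the graded code:

import org.apache.spark.rdd.RDD

// Illustrative stand-in for the assignment's Posting model (assumption)
case class Posting(postingType: Int, id: Int, parentId: Option[Int],
                   score: Int, tags: Option[String])

val langs = List("Scala", "Java", "Python") // shortened list (assumption)
val langSpread = 50000 // distance separating language clusters on the x axis

// Stage 3: groupedPostings / scoredPostings / vectorPostings
def vectorPostings(postings: RDD[Posting]): RDD[(Int, Int)] = {
  val questions = postings.filter(_.postingType == 1).map(q => (q.id, q))
  val answers = postings.filter(_.postingType == 2)
    .flatMap(a => a.parentId.map(qid => (qid, a)))

  questions.join(answers) // groupedPostings: pair each question with its answers
    .mapValues { case (q, a) => (q, a.score) }
    .reduceByKey { case ((q, s1), (_, s2)) => (q, s1 max s2) } // scoredPostings: highest answer score
    .values
    .flatMap { case (q, maxScore) => // vectorPostings: (language index * spread, score)
      q.tags.flatMap(t => Some(langs.indexOf(t)).filter(_ >= 0))
        .map(idx => (idx * langSpread, maxScore))
    }
}

// Stage 4: sample and cache the vectors for the iterative k-means phase
// val sampleVectors = vectorPostings(postings).sample(false, 0.1).cache()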

K-Means clustering

  • Jobs 2 to 46 apply the k-means algorithm to the sampleVectors cached in the previous step.

At each iteration, the centroids are updated and collected to the driver to evaluate convergence; the loop stops once convergence is reached (see the sketch after the screenshot).

[Screenshot: k-means jobs in the Spark web UI]
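A minimal sketch of that loop, assuming plain (Int, Int) vectors; the eta threshold, the iteration cap, and the helper functions are illustrative assumptions, not the assignment's exact definitions:

import org.apache.spark.rdd.RDD
import scala.annotation.tailrec

val eta = 20.0 // convergence threshold on total squared centroid movement (assumption)

def squaredDist(a: (Int, Int), b: (Int, Int)): Double = {
  val dx = (a._1 - b._1).toDouble
  val dy = (a._2 - b._2).toDouble
  dx * dx + dy * dy
}

def findClosest(p: (Int, Int), means: Array[(Int, Int)]): Int =
  means.indices.minBy(i => squaredDist(p, means(i)))

def average(points: Iterable[(Int, Int)]): (Int, Int) =
  (points.map(_._1).sum / points.size, points.map(_._2).sum / points.size)

@tailrec
def kmeans(means: Array[(Int, Int)], vectors: RDD[(Int, Int)], iter: Int = 1): Array[(Int, Int)] = {
  // one Spark job per iteration: assign each vector to its closest centroid,
  // recompute the centroids, and collect them back to the driver
  val newMeans = means.clone()
  vectors.map(p => (findClosest(p, means), p))
    .groupByKey()
    .mapValues(average)
    .collect()
    .foreach { case (i, m) => newMeans(i) = m }

  val movement = means.zip(newMeans).map { case (a, b) => squaredDist(a, b) }.sum
  if (movement < eta || iter >= 120) newMeans // converged, or iteration cap reached
  else kmeans(newMeans, vectors, iter + 1)
}

Each iteration triggers one Spark job (hence the long run of jobs visible in the UI), while the convergence test itself runs on the driver.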

Week 4: Time usage

The data set analyzed originates from the American Time Use Survey (ATUS), 2003-2015, obtained via Kaggle. It measures how people divide their time among various daily activities.

Displaying data with Zeppelin

We load the resulting dataset into Apache Zeppelin.

Install
  • Download the archive from the Zeppelin website (wget)

  • Extract it (untar)

  • Run: SPARK_LOCAL_IP=127.0.0.1 zeppelin-0.7.1/bin/zeppelin-daemon.sh start

  • Stop: zeppelin-0.7.1/bin/zeppelin-daemon.sh stop

nb: SPARK_LOCAL_IP is set to work around a "port unable to bind" exception in 0.7.1

Prepare data export

Export the resulting week 4 dataset as JSON.

1) From the Spark environment, export data to disk:

finalDf.coalesce(1) // (1)
  .write.json("dataset-week4.json")

  1. Coalesce into a single partition so that only one output file is written (otherwise, one file per partition)
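nb: coalesce(1) funnels the whole write through a single task, which is fine for this small aggregated dataset; for larger outputs, keeping one file per partition is usually preferable.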

2) Upload the exported file to the host running Zeppelin, or fetch it from there (%sh paragraph, then wget …)

Zeppelin

Connect to the Zeppelin web UI at http://localhost:8080 and create a new notebook with the following content.

// First paragraph (Scala interpreter): load the exported JSON and register it as a temporary table
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val sqlData = sqlContext.jsonFile("dataset-week4.json")
sqlData.registerTempTable("data")

Then, in a second paragraph, query the table with the SQL interpreter:

%sql SELECT * FROM data ORDER BY work DESC
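nb: jsonFile and registerTempTable are deprecated in Spark 2.x; if the Zeppelin interpreter runs Spark 2.x (which exposes a spark session), an equivalent sketch would be:

val df = spark.read.json("dataset-week4.json")
df.createOrReplaceTempView("data")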

Display the result as a bar graph:

[Screenshot: bar graph in Zeppelin]

nb: the sort order seems not to be respected in the chart, as per open issue ZEPPELIN-87