Twitter News Clustering

The goal of this project is to create a continuous feed of relevant world news, based on tweets retrieved from the Twitter API. Because of the potentially large amount of data, the news feed is generated by a distributed system using Apache Spark (1.6.1), Scala (2.10.5), and Java (1.8). The project was developed in the context of the seminar "Mining Massive Datasets" by Master's students (Daniel Neuschäfer-Rube, Jaqueline Pollak, Benjamin Reißaus) at Hasso Plattner Institute (Potsdam, Germany). The following sections outline how to run the project; details on the algorithms can be found on the wiki pages.
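
The actual pipeline is documented on the wiki pages; to make the processing model concrete, here is a minimal, self-contained sketch of streaming text clustering with Spark 1.6's MLlib. It is illustrative only, not the project's implementation: the socket source, feature dimension, and parameter values (k, decay factor) are assumptions made for the example.

  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}
  import org.apache.spark.mllib.clustering.StreamingKMeans
  import org.apache.spark.mllib.feature.HashingTF

  // Illustrative sketch only, not this project's pipeline: cluster a text
  // stream with MLlib's StreamingKMeans on hashed term-frequency vectors.
  object StreamingClusteringSketch {
    def main(args: Array[String]): Unit = {
      val conf = new SparkConf().setAppName("TweetClusteringSketch").setMaster("local[2]")
      val ssc = new StreamingContext(conf, Seconds(10)) // 10-second micro-batches

      // Assumed source for the example: one tweet text per line on a local socket.
      val tweets = ssc.socketTextStream("localhost", 9999)

      // Hash each tweet's tokens into a fixed-size term-frequency vector.
      val hashingTF = new HashingTF(1 << 16)
      val vectors = tweets.map(text => hashingTF.transform(text.toLowerCase.split("\\s+").toSeq))

      val model = new StreamingKMeans()
        .setK(20)            // number of clusters to maintain (assumed value)
        .setDecayFactor(0.8) // older tweets gradually lose influence
        .setRandomCenters(1 << 16, 0.0)

      model.trainOn(vectors)           // update cluster centers with every batch
      model.predictOn(vectors).print() // emit a cluster id per tweet

      ssc.start()
      ssc.awaitTermination()
    }
  }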

How To Run

  1. Set Up Amazon Cluster (Optional)

    • Upload java8_bootstrap.sh to S3

    • Create EMR Cluster

      • click 'Go to advanced options'
      • EMR 4.7.1 with Spark, Hive, Hadoop, and Ganglia
      • set Java 8 as the default by adding a configuration under 'Edit software settings' (a sample configuration is shown after this list)
      • machine type: m3.xlarge
      • number of worker instances: 10
      • choose java8_bootstrap.sh as the script in 'Bootstrap actions'
    • The steps below describe how to run the application locally; the automation scripts (see the next section) can be used to run everything on the Amazon cluster instead.

  2. Build Jar

    • mvn clean package
  3. Run Clustering With One Of Two Modes

    • Cluster tweets from a file (generate it yourself, or use this sample):

      • spark-submit --class de.hpi.isg.mmds.sparkstreaming.Main target/SparkTwitterClustering-jar-with-dependencies.jar -input [path to twitter.dat]
    • Cluster tweets from the Twitter API (requires a config.txt in the resources folder):

      • spark-submit --class de.hpi.isg.mmds.sparkstreaming.Main target/SparkTwitterClustering-jar-with-dependencies.jar -source api
  4. Merge Clustering Results

    • spark-submit --class de.hpi.isg.mmds.sparkstreaming.ClusterInfoAggregation target/SparkTwitterClustering-jar-with-dependencies.jar
  5. Visualize Clustering Results (Details)

    • install Node.js
    • install the node packages in the webapp folder: npm install
    • run the web server: node server.js
    • open in a browser: http://localhost:3000/index
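
Regarding step 1: on EMR 4.x, Java 8 can be made the default JVM by pasting a JSON configuration into 'Edit software settings'. The snippet below is a commonly used configuration of this kind, reconstructed here rather than copied from the project; the exact JAVA_HOME path is an assumption and may vary by AMI version.

  [
    {
      "Classification": "hadoop-env",
      "Configurations": [
        {
          "Classification": "export",
          "Properties": { "JAVA_HOME": "/usr/lib/jvm/java-1.8.0" }
        }
      ],
      "Properties": {}
    },
    {
      "Classification": "spark-env",
      "Configurations": [
        {
          "Classification": "export",
          "Properties": { "JAVA_HOME": "/usr/lib/jvm/java-1.8.0" }
        }
      ],
      "Properties": {}
    }
  ]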

Automation Scripts

We have provided the following automation scripts in the folder SparkTwitterClustering/utils/:

  1. buildAndCopyToRemote.sh

    • actions:
      • build fat jar SparkTwitterClustering-jar-with-dependencies.jar on local machine
      • copy SparkTwitterClustering-jar-with-dependencies.jar, runOnCluster.sh, driver_bootstrap.sh, and twitter.dat to the remote server referenced by MASTER_PUBLIC_DNS
    • command: ./buildAndCopyToRemote.sh [MASTER_PUBLIC_DNS]
    • execution machine: local
  2. driver_bootstrap.sh

    • actions:
      • install git, tmux, zsh, oh-my-zsh
      • configure tmux
      • copy twitter.dat to HDFS
    • command: ./driver_bootstrap.sh
    • execution machine: remote cluster driver
  3. java8_bootstrap.sh

    • actions:
      • install Java 8 and set JAVA_HOME accordingly
    • command: --- supplied during cluster creation
    • execution machine: all cluster nodes
  4. runOnCluster.sh

    • actions:
      • run the Spark job with varying batch sizes, numbers of executors, and numbers of cores
      • save results in runtime.csv
    • command: ./runOnCluster.sh
    • execution machine: remote cluster driver

Contributors

Daniel Neuschäfer-Rube

Jaqueline Pollak

Benjamin Reißaus