data-cleansing: A Scala repository from DBDA15

DATA CLEANSING

This application uses the Apache Spark/Flink Frameworks to find similar users in the Netflix user data set. Users are compared based on the movies they rated, the rating value itself gets ignored.

The algorithm as well as results are discussed in Chapter 8 (p.20–24) of our technical report Distributed Data Analysis with Apache Spark and Apache Flink.

BUILD

Use Java 7 and maven package

APACHE SPARK

Sample execution:

spark-submit --class de.hpi.fgis.willidennis.Main --master spark://172.16.21.111:7077  --conf spark.cores.max=10 scala/target/findSimilarNetflix-0.0.1-SNAPSHOT.jar --TRAINING_PATH training_set/

Program arguments:

SIM_THRESHOLD: Threshold for Jaccard comparison; default: 0.9
SIGNATURE_SIZE: number of movies used for signature; default: 1
TRAINING_PATH: path to dataset
OUTPUT_FILE: target path for output data
EXECUTION_NAME: name this execution (Spark specific). Appears in the dashboard
MEMORY: Memory to be used
MASTER: Address of spark master node

APACHE FLINK