This Spark job is a tool for comparing large numbers of texts against each other. The implementation is inspired by Dick Grune's program SIM (http://dickgrune.com/Programs/similarity_tester/).
All you have to do is clone the repository, adjust the config to your needs and run it on your cluster. The job is written in Scala. Windows users need to install sbt to package the jar file.
Unix users can use the provided sbt script for packaging.
Unix
git clone https://github.com/MeiSign/simtext4S.git
cd simtext4S
mv resources/application.conf.template resources/application.conf
vim resources/application.conf
./sbt package
Windows
git clone https://github.com/MeiSign/simtext4S.git
cd simtext4S
// copy resources/application.conf.template to resources/application.conf and edit it
sbt package
Once you have packaged the jar, you can run it with the spark-submit script. The right settings depend on the size of the data you want to compare.
Example
spark-submit \
  --class de.simtext.SimText \
  --executor-memory 2G \
  --driver-memory 2G \
  simtext4s-with-spark_2.10-1.0.jar
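
Since the memory settings depend on the size of the data, it can be handy to keep them adjustable instead of retyping the whole command. The following is only a minimal sketch, not part of the project: the variable names EXECUTOR_MEMORY, DRIVER_MEMORY and JAR are hypothetical, and the defaults simply repeat the values from the example above. The function only prints the command so it can be reviewed before it is run.

```shell
#!/bin/sh
# Sketch: assemble the spark-submit call from overridable settings.
# EXECUTOR_MEMORY, DRIVER_MEMORY and JAR are hypothetical variable names
# chosen for this example; they are not options defined by the project.
build_submit_cmd() {
  # Fall back to the values from the example when a variable is unset.
  executor_memory="${EXECUTOR_MEMORY:-2G}"
  driver_memory="${DRIVER_MEMORY:-2G}"
  jar="${JAR:-simtext4s-with-spark_2.10-1.0.jar}"

  printf '%s\n' "spark-submit --class de.simtext.SimText --executor-memory $executor_memory --driver-memory $driver_memory $jar"
}

# Print the assembled command for review.
build_submit_cmd
```

For a larger data set, one might set, for instance, EXECUTOR_MEMORY=8G before calling the function. The value is illustrative only; suitable sizes depend on your cluster and your data.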