Spark20NewsGroup: A Scala repository from r-wheeler

An implementation of TF-IDF feature extraction plus a Naive Bayes classifier, using Apache Spark and Stanford NLP utilities.
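To make the TF-IDF step concrete, here is a minimal plain-Scala sketch of the weighting, assuming raw term counts and the unsmoothed formula tf * ln(N / df). It is not the repository's code and does not use Spark or Stanford NLP: the object name TfIdfSketch, the regex tokenizer, and the function names are hypothetical and chosen only for illustration.

```scala
object TfIdfSketch {

  /** Splits text into lowercase alphanumeric tokens.
    * Assumption: a simple regex stands in for a real NLP tokenizer. */
  def tokenize(text: String): Seq[String] =
    text.toLowerCase.split("[^a-z0-9]+").filter(_.nonEmpty).toSeq

  /** Returns one (term -> weight) map per document, using tf * ln(N / df). */
  def tfIdf(docs: Seq[Seq[String]]): Seq[Map[String, Double]] = {
    val n = docs.size.toDouble

    // Document frequency: the number of documents that contain each term.
    val df: Map[String, Int] =
      docs.flatMap(_.distinct).groupBy(identity).map { case (t, occ) => (t, occ.size) }

    docs.map { doc =>
      // Raw term frequency: how often each term occurs in this document.
      val tf = doc.groupBy(identity).map { case (t, occ) => (t, occ.size) }
      tf.map { case (t, count) => (t, count * math.log(n / df(t))) }
    }
  }
}
```

A term that appears in every document gets weight 0 under this formula, which is why TF-IDF pushes down words that do not help tell classes apart. The resulting weight vectors are the kind of features a Naive Bayes classifier would then be trained on.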

Clone the repo and cd into it.
Run sbt assembly to build the uber jar.
From the repo's root, submit the job by running:

spark-submit --class com.brokendata.NaiveBayesSpark target/scala-2.10/spark20newsgroup-assembly-1.0.jar

Make sure Apache Spark is installed and on your $PATH. You will most likely also need to create a $SPARK_HOME/conf/spark-defaults.conf file and add the following:

spark.executor.memory 3g
spark.driver.memory 4g
