/Spark20NewsGroup

Naive Bayes + TFIDF in Spark on 20 News Group Dataset

Primary LanguageScala

An implementation of TF-IDF + a Naive Bayes Classifier using Apache Spark and Stanford NLP utils.

  • Clone the repo and cd into it
  • Run sbt assembly to build uber jar
  • Submit by running spark-submit --class com.brokendata.NaiveBayesSpark target/scala-2.10/spark20newsgroup-assembly-1.0.jar from the repo's root.

Make sure you have apache spark installed and in your $PATH, you will most likely need create a $SPARK_HOME/conf/spark-defaults.conf file and the following:

spark.executor.memory 3g
spark.driver.memory 4g