Bare-bones application that reads a Wikipedia XML dump, converts the articles into TF-IDF vectors, and uses them in a Naive Bayes classifier to assign topics.
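To make the TF-IDF step concrete, here is a minimal sketch in plain Scala. It does not reproduce this project's code: the object name TfIdfSketch, the tokenizer, and the exact weighting (term count divided by document length, times log(N / document frequency)) are assumptions chosen for illustration only.

```scala
// Illustrative sketch only; all names here are hypothetical.
object TfIdfSketch {
  // Assumed tokenizer: lowercase, split on anything that is not a-z.
  def tokenize(text: String): Seq[String] =
    text.toLowerCase.split("[^a-z]+").filter(_.nonEmpty).toSeq

  // Term frequency: occurrences of a term divided by the document length.
  def termFrequencies(tokens: Seq[String]): Map[String, Double] = {
    val n = tokens.size.toDouble
    tokens.groupBy(identity).map { case (term, occ) => term -> occ.size / n }
  }

  // Inverse document frequency: log(N / number of documents containing the term).
  def inverseDocFrequencies(docs: Seq[Seq[String]]): Map[String, Double] = {
    val n = docs.size.toDouble
    docs.flatMap(_.distinct)
      .groupBy(identity)
      .map { case (term, occ) => term -> math.log(n / occ.size) }
  }

  // One sparse TF-IDF vector (term -> weight) per input document.
  def tfIdf(docs: Seq[String]): Seq[Map[String, Double]] = {
    val tokenized = docs.map(tokenize)
    val idf = inverseDocFrequencies(tokenized)
    tokenized.map { tokens =>
      termFrequencies(tokens).map { case (term, tf) => term -> tf * idf(term) }
    }
  }
}
```

With this weighting, a term that appears in every document gets a weight of 0, so only terms that help tell documents apart contribute to the vectors passed to the classifier.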
To run the application:
sbt run
To run the tests:
sbt test
This is a mash-up of two blog posts, with some adjustments. It is
based on Chimpler's blog-spark-naive-bayes-reuters, published at:
https://github.com/chimpler/blog-spark-naive-bayes-reuters
and borrows some XML parsing methodology from:
http://tuxdna.wordpress.com/2014/02/03/a-simple-scala-parser-to-parse-44gb-wikipedia-xml-dump/