zigeuner/wikipmodel

Parse and load a Wikipedia dump into a Spark RDD, then use it in a Naive Bayes model

Scala

A bare-bones application that reads a Wikipedia XML dump, parses it into TF-IDF vectors, and uses them in a Naive Bayes classifier to assign topics.
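The TF-IDF and Naive Bayes steps can be sketched with Spark MLlib as below. This is a minimal illustration under stated assumptions, not the project's actual code: the object name `TfIdfNaiveBayesSketch`, the `tokenize` helper, the toy documents, and the feature count are all made up for the example, and it assumes Spark 1.3 or later (for `IDFModel.transform` on a single vector) with `spark-mllib` on the classpath.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.classification.NaiveBayes
import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.mllib.regression.LabeledPoint

object TfIdfNaiveBayesSketch {

  // Hypothetical tokenizer: lowercase the text and split on anything that is not a-z.
  def tokenize(text: String): Seq[String] =
    text.toLowerCase.split("[^a-z]+").filter(_.nonEmpty).toSeq

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("tfidf-nb-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      // Toy (topic label, text) pairs standing in for parsed Wikipedia pages.
      val docs = sc.parallelize(Seq(
        (0.0, "the striker scored a late goal in the football match"),
        (0.0, "the goalkeeper saved a penalty in the cup final"),
        (1.0, "the compiler translates source code into machine code"),
        (1.0, "the program crashed because of a bug in the parser code")
      ))

      // Term frequencies via feature hashing (the feature count is an arbitrary choice).
      val hashingTF = new HashingTF(1 << 16)
      val tf = docs.map { case (_, text) => hashingTF.transform(tokenize(text)) }.cache()

      // Inverse document frequencies are fitted on the whole corpus.
      val idfModel = new IDF().fit(tf)

      // Attach each topic label to its TF-IDF vector.
      val training = docs.map { case (label, text) =>
        LabeledPoint(label, idfModel.transform(hashingTF.transform(tokenize(text))))
      }

      // Multinomial Naive Bayes with additive smoothing lambda = 1.0.
      val model = NaiveBayes.train(training, lambda = 1.0)

      // Classify an unseen snippet using the same hashing and IDF weights.
      val query = idfModel.transform(hashingTF.transform(tokenize("a bug in the compiler code")))
      println(s"predicted topic label: ${model.predict(query)}")
    } finally {
      sc.stop()
    }
  }
}
```

Feature hashing avoids building an explicit vocabulary, which matters at Wikipedia scale, at the cost of occasional collisions between terms.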

Usage

sbt run

Testing

sbt test

History

This is a mash-up of two blog posts, with some adjustments.

It is based on Chimpler's blog-spark-naive-bayes-reuters, published at:

http://chimpler.wordpress.com/2014/06/11/classifiying-documents-using-naive-bayes-on-apache-spark-mllib/

https://github.com/chimpler/blog-spark-naive-bayes-reuters

and borrows some XML parsing methodology from:

http://tuxdna.wordpress.com/2014/02/03/a-simple-scala-parser-to-parse-44gb-wikipedia-xml-dump/
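A dump of that size cannot be loaded as a single in-memory XML tree, so it has to be read as a stream of events. The sketch below shows the general idea using the JDK's StAX API. It is an illustration only and not necessarily the approach taken in the post above or in this project; the names `WikiDumpStreamSketch`, `Page`, and `parsePages` are hypothetical, and it assumes the usual dump layout of `<page>` elements containing a `<title>` and a `<revision>` with a `<text>` child.

```scala
import java.io.{Reader, StringReader}
import javax.xml.stream.{XMLInputFactory, XMLStreamConstants}

object WikiDumpStreamSketch {

  // Hypothetical record holding the two fields this sketch extracts.
  final case class Page(title: String, text: String)

  // Streams through the XML and calls onPage once per closed <page> element,
  // so only one page is held in memory at a time.
  def parsePages(reader: Reader)(onPage: Page => Unit): Unit = {
    val factory = XMLInputFactory.newInstance()
    // Dumps do not need DTD processing, so turn it off.
    factory.setProperty(XMLInputFactory.SUPPORT_DTD, java.lang.Boolean.FALSE)
    val xml = factory.createXMLStreamReader(reader)
    val buf = new StringBuilder
    var title = ""
    var text = ""
    try {
      while (xml.hasNext) {
        xml.next() match {
          case XMLStreamConstants.START_ELEMENT =>
            // Start collecting character data afresh for each element.
            buf.clear()
            if (xml.getLocalName == "page") { title = ""; text = "" }
          case XMLStreamConstants.CHARACTERS | XMLStreamConstants.CDATA =>
            buf.append(xml.getText)
          case XMLStreamConstants.END_ELEMENT =>
            xml.getLocalName match {
              case "title" => title = buf.toString
              case "text"  => text = buf.toString
              case "page"  => onPage(Page(title, text))
              case _       => // ignore all other elements
            }
          case _ => // ignore comments, processing instructions, etc.
        }
      }
    } finally {
      xml.close()
    }
  }

  def main(args: Array[String]): Unit = {
    // Tiny in-memory stand-in for a real dump file.
    val sample =
      "<mediawiki><page><title>Apache Spark</title><revision>" +
      "<text>Spark is a cluster computing engine.</text></revision></page></mediawiki>"
    parsePages(new StringReader(sample)) { page =>
      println(s"${page.title}: ${page.text.length} characters")
    }
  }
}
```

For a real dump, the `StringReader` would be replaced by a buffered reader over the (possibly decompressed) dump file, and the callback would write each page somewhere Spark can load it from.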