/spark-playground

Spark playground for ActiveWizards test task

Primary language: Scala

Spark-playground

Spark Scala code that processes text files for the ActiveWizards test task

What it does:

  • Loads files from /opt/activeWizards/input/*.txt
  • Removes punctuation, numbers, line breaks, and whitespace sequences
  • Detects language
  • Tokenizes formatted text into words
  • Removes stopwords for detected languages
  • Lemmatizes English words, leaves words of other languages as they are
  • Saves formatted and lemmatized text into /opt/activeWizards/output/*filename*.txt
  • Counts word occurrences with CountVectorizer
  • Takes the top 30 keywords for each file
  • Saves keywords into /opt/activeWizards/output/keywords/*filename*.txt
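
The cleaning, tokenizing, stopword-removal, and counting steps above can be sketched with Spark ML feature transformers. This is a minimal illustration and not the repository's actual code: the object name `KeywordSketch`, the helper `topTerms`, the column names, the `local[*]` master, and the cleaning regex are all assumptions. Language detection, lemmatization, and writing the output files are left out, and only Spark's default English stopword list is used.

```scala
import org.apache.spark.ml.feature.{CountVectorizer, RegexTokenizer, StopWordsRemover}
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{lower, regexp_replace, trim}

object KeywordSketch {
  // Hypothetical helper: pairs each non-zero count with its vocabulary term
  // and keeps the n terms with the highest counts.
  def topTerms(vocabulary: Array[String], counts: Vector, n: Int): Seq[(String, Double)] = {
    val sparse = counts.toSparse
    sparse.indices.zip(sparse.values)
      .sortBy { case (_, count) => -count }
      .take(n)
      .map { case (index, count) => (vocabulary(index), count) }
      .toSeq
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("keyword-sketch")
      .master("local[*]") // assumption: local run for experimentation
      .getOrCreate()
    import spark.implicits._

    // One row per input file: (path, full text).
    val docs = spark.sparkContext
      .wholeTextFiles("/opt/activeWizards/input/*.txt")
      .toDF("path", "raw")

    // Lowercase, replace everything that is not a letter or whitespace,
    // then collapse whitespace runs (including line breaks) into one space.
    val cleaned = docs.withColumn(
      "text",
      trim(regexp_replace(regexp_replace(lower($"raw"), "[^\\p{L}\\s]+", " "), "\\s+", " ")))

    val tokens = new RegexTokenizer()
      .setInputCol("text").setOutputCol("tokens").setPattern("\\s+")
      .transform(cleaned)

    // Default stopword list is English; other languages are not handled here.
    val filtered = new StopWordsRemover()
      .setInputCol("tokens").setOutputCol("filtered")
      .transform(tokens)

    val model = new CountVectorizer()
      .setInputCol("filtered").setOutputCol("counts")
      .fit(filtered)

    // Print the top 30 keywords for each file.
    model.transform(filtered).select("path", "counts").collect().foreach { row =>
      val keywords = topTerms(model.vocabulary, row.getAs[Vector]("counts"), 30)
      println(s"${row.getString(0)}: ${keywords.map(_._1).mkString(", ")}")
    }

    spark.stop()
  }
}
```

Collecting the count vectors to the driver keeps the sketch short; for a large number of files, the top-N selection would more reasonably run on the executors.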

Used libraries

  • Spark Core
  • Spark MLlib
  • Spark SQL
  • Optimaize/Language-detector
  • Stanford CoreNLP with default model
  • Kryo for serialization in Spark
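
For reference, Kryo serialization is typically switched on through `SparkConf`. The snippet below is a sketch under assumptions: the object name `KryoConfSketch` and the `ProcessedDocument` case class are hypothetical placeholders, and the classes this project actually registers may differ.

```scala
import org.apache.spark.SparkConf

object KryoConfSketch {
  // Hypothetical record type standing in for whatever classes the job serializes.
  final case class ProcessedDocument(path: String, language: String, tokens: Seq[String])

  def buildConf(): SparkConf =
    new SparkConf()
      .setAppName("spark-playground")
      // Use Kryo instead of the default Java serialization.
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      // Registering classes lets Kryo avoid writing full class names with each object.
      .registerKryoClasses(Array(classOf[ProcessedDocument]))
}
```

The resulting configuration can be passed to `SparkSession.builder().config(KryoConfSketch.buildConf())`.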

Possible improvements