Spark/Scala code that processes text files, written for the Active Wizards test task.
What it does (minimal sketches of the main steps follow the list):
- Loads files from /opt/activeWizards/input/*.txt
- Removes punctuation, numbers, and line breaks, and collapses whitespace sequences
- Detects language
- Tokenizes formatted text into words
- Removes stopwords for the detected language
- Lemmatizes English words; words in other languages are left as they are
- Saves formatted and lemmatized text into /opt/activeWizards/output/*filename*.txt
- Counts word occurrences with CountVectorizer
- Takes the top 30 keywords for each file
- Saves keywords into /opt/activeWizards/output/keywords/*filename*.txt
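A minimal sketch of the loading and cleaning step, assuming a `SparkSession` named `spark`; the regexes are my own illustration, not necessarily the repo's exact patterns:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("textProcessing").getOrCreate()

// wholeTextFiles keeps each file's path alongside its content, which is
// needed later to derive per-file output names under /opt/activeWizards/output.
val raw = spark.sparkContext.wholeTextFiles("/opt/activeWizards/input/*.txt")

val cleaned = raw.mapValues { text =>
  text
    .replaceAll("""\p{Punct}+""", " ") // punctuation
    .replaceAll("""\d+""", " ")        // numbers
    .replaceAll("""[\r\n]+""", " ")    // line breaks
    .replaceAll("""\s{2,}""", " ")     // whitespace sequences
    .trim
}
```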
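Language detection with optimaize/language-detector might look like the following; the `LangDetect` object name and the English fallback are assumptions:

```scala
import com.optimaize.langdetect.LanguageDetectorBuilder
import com.optimaize.langdetect.ngram.NgramExtractors
import com.optimaize.langdetect.profiles.LanguageProfileReader
import com.optimaize.langdetect.text.CommonTextObjectFactories

// Built once per JVM: the detector is expensive to create and, notably,
// not serializable (see the TODO at the end of this README).
object LangDetect {
  private lazy val detector = LanguageDetectorBuilder
    .create(NgramExtractors.standard())
    .withProfiles(new LanguageProfileReader().readAllBuiltIn())
    .build()
  private lazy val textFactory = CommonTextObjectFactories.forDetectingOnLargeText()

  // Returns an ISO 639-1 code such as "en"; falls back to "en" when unsure.
  def detect(text: String): String = {
    val locale = detector.detect(textFactory.forText(text))
    if (locale.isPresent) locale.get().getLanguage else "en"
  }
}
```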
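Tokenization and stopword removal can be done with Spark ML feature transformers. This sketch reuses `cleaned` from the loading sketch; `langName` stands in for a mapping from the detector's ISO code to the language names Spark's default stopword lists use:

```scala
import org.apache.spark.ml.feature.{RegexTokenizer, StopWordsRemover}
import spark.implicits._

// Spark ships default stopword lists keyed by full language name
// ("english", "russian", ...), so the ISO code has to be mapped first.
val langName = "english"

val tokenizer = new RegexTokenizer()
  .setInputCol("text")
  .setOutputCol("words")
  .setPattern("\\s+") // cleaning already stripped punctuation and digits

val remover = new StopWordsRemover()
  .setInputCol("words")
  .setOutputCol("filtered")
  .setStopWords(StopWordsRemover.loadDefaultStopWords(langName))

val df = cleaned.toDF("path", "text") // from the loading sketch above
val tokens = remover.transform(tokenizer.transform(df))
```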
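The English lemmatization step with Stanford CoreNLP could be sketched like this; the `Lemmatizer` object is my naming, and the annotator chain is the standard minimal one for lemmas:

```scala
import java.util.Properties
import scala.collection.JavaConverters._
import edu.stanford.nlp.ling.CoreAnnotations
import edu.stanford.nlp.pipeline.{Annotation, StanfordCoreNLP}

object Lemmatizer {
  // Lazily created once per executor JVM; StanfordCoreNLP is heavyweight
  // and not serializable, so it must never be shipped from the driver.
  private lazy val pipeline: StanfordCoreNLP = {
    val props = new Properties()
    props.setProperty("annotators", "tokenize, ssplit, pos, lemma")
    new StanfordCoreNLP(props)
  }

  def lemmatize(text: String): Seq[String] = {
    val doc = new Annotation(text)
    pipeline.annotate(doc)
    doc.get(classOf[CoreAnnotations.TokensAnnotation]).asScala
      .map(_.get(classOf[CoreAnnotations.LemmaAnnotation]))
  }
}
```

Non-English words bypass this step unchanged, per the list above.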
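Counting occurrences and taking the 30 most frequent words per file could look like the following, assuming the `tokens` DataFrame and its `filtered` column from the tokenization sketch:

```scala
import org.apache.spark.ml.feature.{CountVectorizer, CountVectorizerModel}
import org.apache.spark.ml.linalg.Vector

val cvModel: CountVectorizerModel = new CountVectorizer()
  .setInputCol("filtered")
  .setOutputCol("counts")
  .fit(tokens)

val vocab = cvModel.vocabulary

// For every file, pair each vocabulary index with its count, sort by count
// descending, and keep the 30 most frequent words.
val keywords = cvModel.transform(tokens)
  .select("path", "counts")
  .collect()
  .map { row =>
    val vec = row.getAs[Vector]("counts").toSparse
    val top30 = vec.indices.zip(vec.values)
      .sortBy { case (_, count) => -count }
      .take(30)
      .map { case (idx, _) => vocab(idx) }
    row.getString(0) -> top30
  }
```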
Built with:
- Spark Core
- Spark MLlib
- Spark SQL
- optimaize/language-detector
- Stanford CoreNLP with its default model
- Kryo for serialization in Spark (see the sketch after this list)
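Enabling Kryo is a `SparkConf` setting; the registered classes below are illustrative, not the repo's actual list:

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("textProcessing")
  // Switch Spark from Java serialization to Kryo.
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // Registering classes up front avoids writing full class names per object.
  .registerKryoClasses(Array(classOf[Array[String]], classOf[String]))
```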
TODO:
- Move to https://github.com/clulab/processors or another Scala wrapper for CoreNLP
- Add a lemmatizer for non-English languages; candidate tools are still to be evaluated
- Get rid of the NotSerializableException thrown by Pipeline and LanguageDetector (one possible approach is sketched below)
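One common remedy, shown as a sketch rather than this project's actual fix: keep the non-serializable objects out of the serialized closure entirely, either as an executor-side singleton (as in the sketches above) or behind a `@transient lazy val` that is rebuilt after deserialization:

```scala
import com.optimaize.langdetect.{LanguageDetector, LanguageDetectorBuilder}
import com.optimaize.langdetect.ngram.NgramExtractors
import com.optimaize.langdetect.profiles.LanguageProfileReader

class DetectorHolder extends Serializable {
  // @transient excludes the field from serialization; lazy rebuilds it on
  // first use inside each executor, so nothing unserializable crosses the wire.
  @transient lazy val detector: LanguageDetector = LanguageDetectorBuilder
    .create(NgramExtractors.standard())
    .withProfiles(new LanguageProfileReader().readAllBuiltIn())
    .build()
}
```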