Spark/Scala code that processes text files, written for the Active Wizards test task.
What it does (minimal sketches of the main steps follow the list):
- Loads files from /opt/activeWizards/input/*.txt
- Removes punctuation, numbers, and line breaks, and collapses whitespace sequences
- Detects language
- Tokenizes formatted text into words
- Removes stopwords for the detected language
- Lemmatizes English words; words in other languages are left as they are
- Saves formatted and lemmatized text into /opt/activeWizards/output/*filename*.txt
- Counts word occurrences with CountVectorizer
- Takes the top 30 keywords for each file
- Saves keywords into /opt/activeWizards/output/keywords/*filename*.txt
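A minimal sketch of the loading and cleaning step, assuming a `SparkSession` named `spark`; the regexes are my own illustration, not necessarily the repo's exact patterns:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("textProcessing").getOrCreate()

// wholeTextFiles keeps each file's path alongside its content, which is
// needed later to derive per-file output names under /opt/activeWizards/output.
val raw = spark.sparkContext.wholeTextFiles("/opt/activeWizards/input/*.txt")

val cleaned = raw.mapValues { text =>
  text
    .replaceAll("""\p{Punct}+""", " ") // punctuation
    .replaceAll("""\d+""", " ")        // numbers
    .replaceAll("""[\r\n]+""", " ")    // line breaks
    .replaceAll("""\s{2,}""", " ")     // whitespace sequences
    .trim
}
```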
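Language detection with optimaize/language-detector might look like the following; the `LangDetect` object name and the English fallback are assumptions:

```scala
import com.optimaize.langdetect.LanguageDetectorBuilder
import com.optimaize.langdetect.ngram.NgramExtractors
import com.optimaize.langdetect.profiles.LanguageProfileReader
import com.optimaize.langdetect.text.CommonTextObjectFactories

// Built once per JVM: the detector is expensive to create and, notably,
// not serializable (see the TODO at the end of this README).
object LangDetect {
  private lazy val detector = LanguageDetectorBuilder
    .create(NgramExtractors.standard())
    .withProfiles(new LanguageProfileReader().readAllBuiltIn())
    .build()
  private lazy val textFactory = CommonTextObjectFactories.forDetectingOnLargeText()

  // Returns an ISO 639-1 code such as "en"; falls back to "en" when unsure.
  def detect(text: String): String = {
    val locale = detector.detect(textFactory.forText(text))
    if (locale.isPresent) locale.get().getLanguage else "en"
  }
}
```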
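Tokenization and stopword removal can be done with Spark ML feature transformers. This sketch reuses `cleaned` from the loading sketch; `langName` stands in for a mapping from the detector's ISO code to the language names Spark's default stopword lists use:

```scala
import org.apache.spark.ml.feature.{RegexTokenizer, StopWordsRemover}
import spark.implicits._

// Spark ships default stopword lists keyed by full language name
// ("english", "russian", ...), so the ISO code has to be mapped first.
val langName = "english"

val tokenizer = new RegexTokenizer()
  .setInputCol("text")
  .setOutputCol("words")
  .setPattern("\\s+") // cleaning already stripped punctuation and digits

val remover = new StopWordsRemover()
  .setInputCol("words")
  .setOutputCol("filtered")
  .setStopWords(StopWordsRemover.loadDefaultStopWords(langName))

val df = cleaned.toDF("path", "text") // from the loading sketch above
val tokens = remover.transform(tokenizer.transform(df))
```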
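The English lemmatization step with Stanford CoreNLP could be sketched like this; the `Lemmatizer` object is my naming, and the annotator chain is the standard minimal one for lemmas:

```scala
import java.util.Properties
import scala.collection.JavaConverters._
import edu.stanford.nlp.ling.CoreAnnotations
import edu.stanford.nlp.pipeline.{Annotation, StanfordCoreNLP}

object Lemmatizer {
  // Lazily created once per executor JVM; StanfordCoreNLP is heavyweight
  // and not serializable, so it must never be shipped from the driver.
  private lazy val pipeline: StanfordCoreNLP = {
    val props = new Properties()
    props.setProperty("annotators", "tokenize, ssplit, pos, lemma")
    new StanfordCoreNLP(props)
  }

  def lemmatize(text: String): Seq[String] = {
    val doc = new Annotation(text)
    pipeline.annotate(doc)
    doc.get(classOf[CoreAnnotations.TokensAnnotation]).asScala
      .map(_.get(classOf[CoreAnnotations.LemmaAnnotation]))
  }
}
```

Non-English words bypass this step unchanged, per the list above.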
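Counting occurrences and taking the 30 most frequent words per file could look like the following, assuming the `tokens` DataFrame and its `filtered` column from the tokenization sketch:

```scala
import org.apache.spark.ml.feature.{CountVectorizer, CountVectorizerModel}
import org.apache.spark.ml.linalg.Vector

val cvModel: CountVectorizerModel = new CountVectorizer()
  .setInputCol("filtered")
  .setOutputCol("counts")
  .fit(tokens)

val vocab = cvModel.vocabulary

// For every file, pair each vocabulary index with its count, sort by count
// descending, and keep the 30 most frequent words.
val keywords = cvModel.transform(tokens)
  .select("path", "counts")
  .collect()
  .map { row =>
    val vec = row.getAs[Vector]("counts").toSparse
    val top30 = vec.indices.zip(vec.values)
      .sortBy { case (_, count) => -count }
      .take(30)
      .map { case (idx, _) => vocab(idx) }
    row.getString(0) -> top30
  }
```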
Built with:
- Spark Core
- Spark MLlib
- Spark SQL
- optimaize/language-detector
- Stanford CoreNLP with its default model
- Kryo for serialization in Spark (see the sketch after this list)
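Enabling Kryo is a `SparkConf` setting; the registered classes below are illustrative, not the repo's actual list:

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("textProcessing")
  // Switch Spark from Java serialization to Kryo.
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // Registering classes up front avoids writing full class names per object.
  .registerKryoClasses(Array(classOf[Array[String]], classOf[String]))
```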
TODO:
- Move to https://github.com/clulab/processors or another Scala wrapper for CoreNLP
- Add a lemmatizer for non-English languages; candidate tools are still to be evaluated
- Get rid of the NotSerializableException thrown by Pipeline and LanguageDetector (one possible approach is sketched below)
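One common remedy, shown as a sketch rather than this project's actual fix: keep the non-serializable objects out of the serialized closure entirely, either as an executor-side singleton (as in the sketches above) or behind a `@transient lazy val` that is rebuilt after deserialization:

```scala
import com.optimaize.langdetect.{LanguageDetector, LanguageDetectorBuilder}
import com.optimaize.langdetect.ngram.NgramExtractors
import com.optimaize.langdetect.profiles.LanguageProfileReader

class DetectorHolder extends Serializable {
  // @transient excludes the field from serialization; lazy rebuilds it on
  // first use inside each executor, so nothing unserializable crosses the wire.
  @transient lazy val detector: LanguageDetector = LanguageDetectorBuilder
    .create(NgramExtractors.standard())
    .withProfiles(new LanguageProfileReader().readAllBuiltIn())
    .build()
}
```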