Natural Language Processing with Spark's MLlib

# Natural Language Processing with Spark's ML

## Requirements

  • Anaconda Python 3.4
    • NLTK
    • langid
    • findspark (for a local Spark install only)
  • Spark 1.6
    • Local install OK

# Example Description

  • How do I build a data science vs. spam classifier for Twitter?
  • How do I choose the right algorithm?
  • What do I need to get started?

## Use PySpark to preprocess text data

  • Language Classification
  • Stop Word Removal
  • Custom Twitter-Specific Cleanup
  • Part-of-Speech Tagging
  • Lemmatization/Stemming of Text
  • General Cleanup
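
As a rough illustration of the Twitter-specific cleanup, stop word removal, and general cleanup steps, here is a minimal plain-Python sketch. It is not the notebook's actual code: the function name `clean_tweet`, the regular expressions, and the tiny `STOP_WORDS` set are all assumptions made for this example (NLTK's English stop word list is considerably longer).

```python
import re

# Hypothetical, deliberately tiny stop word set used only for illustration.
STOP_WORDS = {"a", "an", "and", "the", "is", "to", "of", "in", "for", "on", "rt"}

URL_RE = re.compile(r"https?://\S+")   # links
MENTION_RE = re.compile(r"@\w+")       # @user mentions
NON_ALPHA_RE = re.compile(r"[^a-z\s]")  # digits, punctuation, leftovers


def clean_tweet(text):
    """Lowercase a tweet, strip URLs, mentions and punctuation, drop stop words."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = text.replace("#", " ")       # keep the hashtag word, drop the marker
    text = NON_ALPHA_RE.sub(" ", text)  # general cleanup
    tokens = [tok for tok in text.split() if tok not in STOP_WORDS]
    return " ".join(tokens)
```

In PySpark, a function like this would typically be wrapped with `pyspark.sql.functions.udf` and applied to the tweet text column of a DataFrame.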

## Converting text to numerical data with ML Pipelines

  • Tokenization
  • Term Frequency Hashing
  • Inverse Document Frequency
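
To make the three steps concrete, here is a small plain-Python sketch of what tokenization, term frequency hashing, and inverse document frequency compute. It is not Spark code: in the ML Pipeline these roles are played by the `Tokenizer`, `HashingTF`, and `IDF` stages. The helper names, the toy `NUM_FEATURES` value, and the use of `zlib.crc32` as the hash function are assumptions for illustration; the IDF formula `log((N + 1) / (df + 1))` follows the one documented for Spark MLlib.

```python
import math
import zlib
from collections import Counter

NUM_FEATURES = 16  # toy vector size; real pipelines use far more buckets


def tokenize(text):
    """Split on whitespace after lowercasing."""
    return text.lower().split()


def hashed_tf(tokens, num_features=NUM_FEATURES):
    """Map each token to a bucket index via a hash and count occurrences."""
    counts = Counter()
    for tok in tokens:
        # crc32 is a stand-in hash chosen for this sketch, not Spark's.
        idx = zlib.crc32(tok.encode("utf-8")) % num_features
        counts[idx] += 1
    return dict(counts)


def idf_weights(tf_vectors):
    """Compute log((N + 1) / (df + 1)) for every bucket seen in the corpus."""
    n_docs = len(tf_vectors)
    doc_freq = Counter()
    for vec in tf_vectors:
        doc_freq.update(vec.keys())  # each document counts once per bucket
    return {idx: math.log((n_docs + 1) / (df + 1)) for idx, df in doc_freq.items()}


def tf_idf(tf_vector, idf):
    """Scale each term frequency by its inverse document frequency."""
    return {idx: tf * idf[idx] for idx, tf in tf_vector.items()}
```

Buckets that appear in every document get a weight of zero, so terms common to all tweets contribute nothing to the final feature vector. Distinct tokens can collide in the same bucket; a larger `NUM_FEATURES` makes that less likely.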

## Training & Testing a Model

  • Cross-validation with the ML Pipeline CrossValidator
  • Evaluation with an ML Pipeline Evaluator
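
The idea behind cross-validation and evaluation can be sketched in plain Python as follows. This is only an illustration of the technique, not Spark's `CrossValidator` API: the function names are hypothetical, the folds are assigned deterministically by row index (Spark splits the data randomly), and accuracy stands in for whichever metric the evaluator is configured with.

```python
def k_fold_indices(n_rows, k):
    """Yield (train_idx, test_idx) pairs; row i belongs to fold i % k."""
    for fold in range(k):
        test = [i for i in range(n_rows) if i % k == fold]
        train = [i for i in range(n_rows) if i % k != fold]
        yield train, test


def accuracy(predictions, labels):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)
    return correct / len(labels)


def cross_validate(rows, labels, k, fit, predict):
    """Train on k-1 folds, score on the held-out fold, and average the scores."""
    scores = []
    for train, test in k_fold_indices(len(rows), k):
        model = fit([rows[i] for i in train], [labels[i] for i in train])
        preds = [predict(model, rows[i]) for i in test]
        scores.append(accuracy(preds, [labels[i] for i in test]))
    return sum(scores) / len(scores)
```

Here `fit` and `predict` are caller-supplied functions, which keeps the sketch independent of any particular algorithm. Running `cross_validate` once per candidate parameter setting and keeping the best average score mirrors how a parameter grid is searched.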

## Watch the Talk