wangruinju/nlp_spark

Natural Language Processing with Spark's MLlib

#Natural Language Processing with Spark's ML

##Requires

Anaconda Python 3.4
- NLTK
- langid
- findspark (for local Spark install only)

Spark 1.6
- Local install OK
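
For a local install, the notebook has to locate Spark before `pyspark` can be imported. The following is a minimal sketch of one way to do that with findspark. The helper name `start_local_spark`, the app name, and the `local[*]` master are assumptions for illustration and are not taken from the notebook.

```python
def start_local_spark(app_name="nlp_spark", spark_home=None):
    """Start a local SparkContext and SQLContext (hypothetical helper).

    Imports are deferred so the function can be defined even on a
    machine where findspark and pyspark are not installed yet.
    """
    import findspark

    # With spark_home=None, findspark falls back to the SPARK_HOME
    # environment variable and a few common install locations.
    findspark.init(spark_home)

    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext(master="local[*]", appName=app_name)
    return sc, SQLContext(sc)
```

Calling `sc, sqlContext = start_local_spark()` once at the top of the notebook should be enough. On a cluster where `pyspark` is already on the path, the findspark step is not needed.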

#Example Description

- How do you build a Data Science vs. Spam classifier for Twitter?
- How do you choose the right algorithm?
- What do you need to get started?

##Use PySpark to preprocess text data

- Language Classification
- Stop Word Removal
- Custom Twitter-Specific Cleanup
- Part-of-Speech Tagging
- Lemmatization/Stemming of Text
- General Cleanup
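
As a rough illustration of the Twitter-specific cleanup and stop-word removal steps, here is a minimal sketch. The regular expressions, the tiny `STOP_WORDS` set, and the function and column names are all assumptions for illustration. In practice the stop words would more likely come from NLTK's stopwords corpus, and the language filter, tagging, and lemmatization steps would rely on langid and NLTK, which are left out here.

```python
import re

# Tiny illustrative stop-word set (an assumption, not the notebook's list).
STOP_WORDS = {"a", "an", "the", "is", "to", "of", "and", "in", "for", "on"}

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
NON_ALPHA_RE = re.compile(r"[^a-z\s]")


def clean_tweet(text):
    """Lowercase a tweet, strip URLs, @mentions, the RT marker and
    non-letter characters, then drop stop words.
    Hashtag words are kept, without the leading '#'."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = NON_ALPHA_RE.sub(" ", text)  # removes '#', digits, punctuation
    tokens = [t for t in text.split() if t != "rt" and t not in STOP_WORDS]
    return " ".join(tokens)


def add_clean_column(df, input_col="text", output_col="clean_text"):
    """Apply clean_tweet to a DataFrame column through a PySpark UDF.
    Column names are hypothetical."""
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    clean_udf = udf(clean_tweet, StringType())
    return df.withColumn(output_col, clean_udf(df[input_col]))
```

Keeping the cleanup logic in a plain Python function makes it easy to check on a few sample tweets before wrapping it in a UDF and running it on the full DataFrame.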

##Converting text to numerical data with ML Pipelines

- Tokenization
- Term Frequency Hashing
- Inverse Document Frequency
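
To make these three steps concrete, the sketch below first shows in plain Python what term-frequency hashing and IDF weighting compute, then how the same steps might be expressed as `pyspark.ml` stages. The plain-Python part is only an illustration: it uses `zlib.crc32` as a stand-in hash, which is not the hash Spark uses, while the IDF formula `log((m + 1) / (df + 1))` is the smoothed form documented for Spark MLlib. Column names and the `num_features` default are assumptions.

```python
import math
import zlib


def hashing_tf(tokens, num_features=16):
    """Map tokens to a fixed-size term-frequency vector (hashing trick)."""
    vec = [0.0] * num_features
    for tok in tokens:
        # Stand-in hash for illustration; Spark uses its own hash function.
        idx = zlib.crc32(tok.encode("utf-8")) % num_features
        vec[idx] += 1.0
    return vec


def idf_weights(tf_vectors):
    """Per-bucket IDF: log((m + 1) / (df + 1)), with m documents and
    df the number of documents with a non-zero count in that bucket."""
    m = len(tf_vectors)
    num_features = len(tf_vectors[0])
    df = [sum(1 for v in tf_vectors if v[i] > 0) for i in range(num_features)]
    return [math.log((m + 1.0) / (d + 1.0)) for d in df]


def build_feature_stages(input_col="clean_text", num_features=1 << 16):
    """The same three steps as pyspark.ml stages (hypothetical helper)."""
    from pyspark.ml.feature import HashingTF, IDF, Tokenizer

    tokenizer = Tokenizer(inputCol=input_col, outputCol="words")
    tf = HashingTF(inputCol="words", outputCol="raw_features",
                   numFeatures=num_features)
    idf = IDF(inputCol="raw_features", outputCol="features")
    return [tokenizer, tf, idf]
```

A term that appears in every document gets an IDF of zero, so it contributes nothing to the final feature vector. The list returned by `build_feature_stages` can be passed to `pyspark.ml.Pipeline(stages=...)`.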

##Training & Testing a Model

- Cross-validation with ML Pipeline CrossValidator
- Evaluation with ML Pipeline Evaluator
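
One way these two pieces might fit together is sketched below. The choice of `LogisticRegression`, the `regParam` grid, the area-under-ROC metric, and the `label`/`features` column names are assumptions for illustration. The talk may use a different classifier and settings.

```python
def build_cross_validator(feature_stages, num_folds=3):
    """Wrap feature stages plus a classifier in a CrossValidator
    (hypothetical helper; classifier and grid are assumptions)."""
    from pyspark.ml import Pipeline
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.evaluation import BinaryClassificationEvaluator
    from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

    lr = LogisticRegression(featuresCol="features", labelCol="label",
                            maxIter=20)
    pipeline = Pipeline(stages=list(feature_stages) + [lr])

    # Each regParam value is scored on every fold; the best one is refit
    # on the full training set.
    grid = (ParamGridBuilder()
            .addGrid(lr.regParam, [0.01, 0.1, 1.0])
            .build())

    evaluator = BinaryClassificationEvaluator(labelCol="label",
                                              metricName="areaUnderROC")
    return CrossValidator(estimator=pipeline, estimatorParamMaps=grid,
                          evaluator=evaluator, numFolds=num_folds)
```

With a training and a held-out test DataFrame, `cv_model = build_cross_validator(stages).fit(train_df)` selects the best parameters, and a `BinaryClassificationEvaluator` can then score `cv_model.transform(test_df)`.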

##Watch the Talk

https://www.youtube.com/watch?v=AsW0QzbYVow