lda_ex


LDA clustering with PySpark MLlib

pyLDAvis visualization (k=6)

See the detailed report in the notebook here.

How to run this PySpark job on EMR?

Submitting the job to EMR is easy.

RUN: bash emr.sh

Under the hood, the following things happen (see the sketch after this list):

  1. Python code, including settings, is packaged into the local ./dist folder as EMR job artifacts (see package.sh for details)
  2. The local ./dist folder is then uploaded to S3 under the artifact prefix (later referenced by the EMR step)
  3. A new EMR cluster is created; it reads the raw data from source_json and runs the ETL job
  4. Finally, it saves the outputs (transformed data, the model, and visualization data files) under the output prefix in S3
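
A rough boto3 sketch of the flow that emr.sh drives; the bucket name, prefixes, file names, instance types, and EMR release below are illustrative assumptions, not the script's actual values:

    import boto3

    s3 = boto3.client("s3")
    emr = boto3.client("emr")

    # Step 2: upload the packaged ./dist artifacts under the artifact prefix
    s3.upload_file("dist/job.zip", "my-bucket", "artifacts/job.zip")

    # Step 3: create a new EMR cluster with a Spark step that runs the ETL job
    emr.run_job_flow(
        Name="lda-ex",
        ReleaseLabel="emr-6.5.0",
        Applications=[{"Name": "Spark"}],
        LogUri="s3://my-bucket/logs/",
        Instances={
            "InstanceGroups": [
                {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
                {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
            ],
            # terminate the cluster once the step finishes
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        Steps=[{
            "Name": "lda-etl",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": [
                    "spark-submit",
                    "--py-files", "s3://my-bucket/artifacts/job.zip",
                    "s3://my-bucket/artifacts/main.py",
                ],
            },
        }],
        JobFlowRole="EMR_EC2_DefaultRole",
        ServiceRole="EMR_DefaultRole",
    )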

See the S3 folder structure below (logs is the EMR log folder).
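
The bucket name and prefix names in this layout are assumptions based on the steps above:

    s3://my-bucket/
    ├── source_json/   # raw input data
    ├── artifacts/     # packaged ./dist job artifacts
    ├── output/        # transformed data, model, and visualization files
    └── logs/          # EMR log folder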

What happens during the ETL process?

  1. Read the raw JSON file from the source folder source_json in S3
  2. Clean the data and engineer features from the product description:
    • remove HTML markup
    • remove punctuation and special characters
    • remove numbers
    • remove common stop words
    • then tokenize and vectorize the product description corpus
  3. Fit the feature data with an LDA model (see the sketch after this list)
  4. Save the transformed data to the S3 output folder
  5. Save the model artifacts
  6. Save reports for data visualization
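
A minimal sketch of these steps using the DataFrame-based pyspark.ml API; column names, paths, vocabSize, and maxIter are assumptions, while k=6 matches the pyLDAvis report:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, lower, regexp_replace
    from pyspark.ml.feature import RegexTokenizer, StopWordsRemover, CountVectorizer
    from pyspark.ml.clustering import LDA

    spark = SparkSession.builder.appName("lda_ex").getOrCreate()

    # 1. Read the raw JSON from the source folder
    df = spark.read.json("s3://my-bucket/source_json/")

    # 2. Clean the description: strip HTML markup, then drop everything
    #    that is not a letter (punctuation, special characters, numbers)
    cleaned = (
        df.withColumn("desc", regexp_replace(col("description"), r"<[^>]+>", " "))
          .withColumn("desc", regexp_replace(col("desc"), r"[^a-zA-Z\s]", " "))
          .withColumn("desc", lower(col("desc")))
    )

    # Tokenize, remove common stop words, and vectorize the corpus
    tokens = RegexTokenizer(inputCol="desc", outputCol="tokens",
                            pattern=r"\s+").transform(cleaned)
    filtered = StopWordsRemover(inputCol="tokens", outputCol="filtered").transform(tokens)
    cv_model = CountVectorizer(inputCol="filtered", outputCol="features",
                               vocabSize=5000).fit(filtered)
    features = cv_model.transform(filtered)

    # 3. Fit the LDA model (k=6 topics)
    model = LDA(k=6, maxIter=20, featuresCol="features").fit(features)

    # 4. Save the transformed data to the S3 output folder
    model.transform(features).write.mode("overwrite") \
        .parquet("s3://my-bucket/output/transformed/")

    # 5. Save the model artifacts
    model.write().overwrite().save("s3://my-bucket/output/model/")

    # 6. Visualization data for pyLDAvis can be derived from
    #    model.describeTopics() and the vocabulary (cv_model.vocabulary)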

How to start developing?

Check the prerequisites:

  • an AWS account
  • credentials with EMR/S3/Airflow policies attached
  • a PySpark dev environment
  • Python, tox

Setting up the dev environment

Create a virtual dev environment via tox (see the tox.ini sketch after these commands):

RUN: tox -e dev

ACTIVATE: {your project root}$ source .tox/dev/bin/activate

OPEN NOTEBOOK: jupyter notebook
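
For reference, a hypothetical tox.ini consistent with the commands above and with the test command further below; the actual environment names and dependencies live in the repo's tox.ini:

    [tox]
    envlist = test

    # virtual dev environment created by `tox -e dev`
    [testenv:dev]
    usedevelop = true
    deps =
        pyspark
        jupyter

    # test environment run by `tox -e test`
    [testenv:test]
    deps =
        pyspark
        pytest
    commands = pytest tests/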

Happy coding!

How to run tests?

RUN: tox -e test