hxb087/NLTK-pyspark

Example repository for NLTK execution on PySpark cluster with Cloudera Data Science Workbench

NLTK-example

This example shows how to distribute Python packages to a PySpark cluster. It is based on this blog post.

How to use

Open a workbench session with Python and run setup.sh.
Set the environment variable PYSPARK_PYTHON to ./NLTK/nltk_env/bin/python.
Reopen the workbench and run pyspark_nltk.py.

Key points for distributing Python packages with conda

Create a conda environment and zip it.
Set spark.yarn.appMasterEnv.PYSPARK_PYTHON in spark-defaults.conf to the Python interpreter inside your conda environment.
- e.g. spark.yarn.appMasterEnv.PYSPARK_PYTHON=./NLTK/nltk_env/bin/python
Set the environment variable PYSPARK_PYTHON=./NLTK/nltk_env/bin/python.
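
The key points above can be sketched in Python. This is a minimal illustration, not the contents of setup.sh or pyspark_nltk.py: the helper names `zip_conda_env`, `interpreter_path`, and `build_spark_conf` are hypothetical. It also assumes the zipped environment is shipped to YARN through `spark.yarn.dist.archives` with the alias `NLTK`, which is one way to end up with the `./NLTK/nltk_env/bin/python` path used in this README.

```python
import os
import shutil


def zip_conda_env(env_dir, out_base):
    """Zip a conda environment directory, keeping its top-level folder name.

    Hypothetical helper: returns the path of the created .zip file.
    """
    env_dir = os.path.abspath(env_dir)
    parent, name = os.path.split(env_dir)
    # base_dir=name keeps e.g. "nltk_env/" as the first path component in
    # the archive, so the interpreter ends up at <alias>/nltk_env/bin/python.
    return shutil.make_archive(out_base, "zip", root_dir=parent, base_dir=name)


def interpreter_path(alias="NLTK", env_name="nltk_env"):
    """Relative path of the Python interpreter inside the unpacked archive."""
    return "./{}/{}/bin/python".format(alias, env_name)


def build_spark_conf(archive_path, alias="NLTK", env_name="nltk_env"):
    """Build the Spark settings as a plain dict (assumed layout, see above)."""
    return {
        # Assumption: "<archive>#<alias>" makes YARN unpack the archive
        # into a directory named <alias> in each container.
        "spark.yarn.dist.archives": "{}#{}".format(archive_path, alias),
        # Setting named in this README.
        "spark.yarn.appMasterEnv.PYSPARK_PYTHON": interpreter_path(alias, env_name),
    }
```

The resulting dict can be written to spark-defaults.conf as `key=value` lines, or applied one entry at a time with `SparkSession.builder.config(key, value)`. The value of `interpreter_path()` is also what PYSPARK_PYTHON should be set to before the session starts.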