This example shows how to distribute Python packages to PySpark executors by shipping a zipped conda environment. It is based on this blog post.
- Open a workbench with Python and run `setup.sh`.
- Set the environment variable `PYSPARK_PYTHON` to `./NLTK/nltk_env/bin/python`.
- Reopen the workbench and run `pyspark_nltk.py` (a sketch of such a job follows this list).
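
For reference, here is a minimal sketch of what a PySpark-plus-NLTK job along the lines of `pyspark_nltk.py` could look like. It is hypothetical (the actual script in this repository may differ); the key idea is importing `nltk` inside the mapped function, so the import happens on the executors where the shipped conda environment provides it.

```python
# Hypothetical sketch of a PySpark job that uses NLTK on the executors;
# the real pyspark_nltk.py in this repository may differ.
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("nltk_example")
sc = SparkContext(conf=conf)

def tokenize(text):
    # Import on the executor, where ./NLTK/nltk_env provides nltk.
    import nltk
    # WordPunctTokenizer is purely regex-based, so it works even if no
    # NLTK corpora were downloaded into the shipped environment.
    return nltk.tokenize.WordPunctTokenizer().tokenize(text)

lines = sc.parallelize(["Hello distributed world.",
                        "Each executor imports nltk from the conda env."])
print(lines.flatMap(tokenize).collect())

sc.stop()
```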
In general, the steps for shipping a conda environment are:

- Create a conda environment and zip it (for example, `conda create -p nltk_env python nltk`, then `zip -r nltk_env.zip nltk_env`).
- Set `spark.yarn.appMasterEnv.PYSPARK_PYTHON` to the environment's interpreter in `spark-defaults.conf`.
  - e.g.) `spark.yarn.appMasterEnv.PYSPARK_PYTHON=./NLTK/nltk_env/bin/python`
- Set the environment variable `PYSPARK_PYTHON`.
  - e.g.) `PYSPARK_PYTHON=./NLTK/nltk_env/bin/python`
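
Instead of (or in addition to) editing `spark-defaults.conf`, the same settings can be supplied programmatically when the SparkContext is created. Below is a hedged sketch assuming the zipped environment sits at `nltk_env.zip` in the submission directory; the `#NLTK` suffix is the alias YARN unpacks the archive under inside each container, which is what makes the `./NLTK/nltk_env/bin/python` path resolve.

```python
from pyspark import SparkConf, SparkContext

# Sketch only: assumes nltk_env.zip is in the directory the job is
# launched from and that the job runs on YARN.
conf = (SparkConf()
        .setAppName("nltk_conda_env")
        # Ship the zipped conda environment to every container; "#NLTK"
        # is the directory alias the archive is unpacked under.
        .set("spark.yarn.dist.archives", "nltk_env.zip#NLTK")
        # Same effect as the spark-defaults.conf line above.
        .set("spark.yarn.appMasterEnv.PYSPARK_PYTHON",
             "./NLTK/nltk_env/bin/python"))
sc = SparkContext(conf=conf)

# Sanity check: each executor should report the shipped interpreter.
print(sc.parallelize([0]).map(lambda _: __import__("sys").executable).collect())
```

The `PYSPARK_PYTHON` variable from the last step still needs to be set in the environment before launch, since PySpark reads it at startup to pick the worker interpreter.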