Boilerplate code to start your Python Spark project
This repo contains boilerplate code for kickstarting a Python-based Spark project with unit tests already in place. It also includes a setup of local Docker containers for testing integration with AWS cloud resources, such as AWS S3 buckets, locally. Below you can find instructions on how to set up your local Python environment to develop for this repository:
- Set up Anaconda
- Set up the conda environment
- Set up the local env
To download the Anaconda package manager, go to: https://www.continuum.io/downloads.
After installing conda locally, proceed to set up this project's environment.
For dependency management we are using conda_requirements.txt and pip_requirements.txt. Feel free to add your own packages to those files.
Please "cd" into the repository root and build your conda environment based on those requirements files:
conda create -n pyspark python=3.6
source activate pyspark
conda install --file conda_requirements.txt
pip install -r pip_requirements.txt
Please note that we intentionally did not add the PySpark pip package, since it will not be a requirement in your production Spark cluster environment.
However, in tests/pip_requirements.txt you can find the required PySpark package to install locally, in case you have not manually configured PySpark (old-school style).
To deactivate this specific virtual environment:
source deactivate
If you need to completely remove this conda env, you can use the following command:
conda env remove --name pyspark
Some of the existing unit tests require specific Python packages that are not necessarily the same as in production; for example, some tests need a local PySpark installation. Thus, it is highly advisable to install those packages into the existing local conda env:
cd tests
source activate pyspark
pip install -r pip_requirements.txt
We can finally run our tests:
source activate pyspark
python -m pytest tests/
Note: if you run into issues regarding python path, you may find the following command useful:
export PYTHONPATH=$PYTHONPATH:.
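For reference, unit tests in this setup typically spin up a local SparkSession. The snippet below is only an illustrative sketch (the file name tests/test_job_example.py and its contents are hypothetical, not files shipped with this repository):

```python
# tests/test_job_example.py -- hypothetical example, not part of this repo.
# Requires the PySpark package from tests/pip_requirements.txt (or a manually
# configured PySpark installation).
import pytest
from pyspark.sql import SparkSession


@pytest.fixture(scope="session")
def spark():
    # Small local SparkSession shared by all tests in this session.
    session = (
        SparkSession.builder
        .master("local[2]")
        .appName("unit-tests")
        .getOrCreate()
    )
    yield session
    session.stop()


def test_word_count(spark):
    df = spark.createDataFrame([("a",), ("b",), ("a",)], ["word"])
    counts = {row["word"]: row["count"] for row in df.groupBy("word").count().collect()}
    assert counts == {"a": 2, "b": 1}
```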
If you would like to run Spark locally, note that we also provide instructions to simulate interaction with AWS cloud resources, such as AWS S3 buckets and databases, using Docker. Here is a list of steps to get you started.
This step is completely optional and only needed if you want to test against an S3 bucket.
make start
This uses the provided Makefile to configure a local mock AWS profile (with fake credentials) and start a mock S3 server in a Docker container.
You should also create the appropriate bucket for this job, for example:
aws s3 mb s3://my-bucket --profile spark_mock_env --endpoint http://127.0.0.1:8000
Then copy your sample data into it:
aws s3 cp mydata.json s3://my-bucket/raw-data/ --profile spark_mock_env --endpoint http://127.0.0.1:8000
To confirm that data is there:
aws s3 ls s3://my-bucket/raw-data/ --profile spark_mock_env --endpoint http://127.0.0.1:8000
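For reference, here is a minimal sketch of how a Spark job could be pointed at this local mock S3 endpoint through the s3a connector. The credential values and the exact set of options are assumptions (they must match whatever the Makefile configured, and may vary with your hadoop-aws version):

```python
# Hypothetical sketch: configure Spark's s3a connector to talk to the local mock S3 server.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("local-s3-example")
    .config("spark.hadoop.fs.s3a.endpoint", "http://127.0.0.1:8000")    # same endpoint as the aws CLI calls above
    .config("spark.hadoop.fs.s3a.access.key", "fake-access-key")        # must match the mock profile's fake credentials
    .config("spark.hadoop.fs.s3a.secret.key", "fake-secret-key")
    .config("spark.hadoop.fs.s3a.connection.ssl.enabled", "false")      # the mock endpoint is plain HTTP
    .getOrCreate()
)

# Read back the sample data copied into the mock bucket above.
df = spark.read.json("s3a://my-bucket/raw-data/mydata.json")
df.show()
```

Note that s3a paths only resolve if the aws-java-sdk and hadoop-aws JARs mentioned below are on the Spark classpath.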
If your Spark application interacts with a database and AWS S3 buckets, you should download the respective JARs, such as:
- for postgres (please adjust to your required version): postgresql-42.1.1.jar
- for AWS (please adjust to your required version, but know that it might be tricky): aws-java-sdk-1.7.4.jar and hadoop-aws-2.7.3.jar
You might want to set environment variables so that Spark knows how to interact with different resources. If your Spark application interacts with a database and AWS S3 buckets, here is a suggested way to configure them:
export PYSPARK_SUBMIT_ARGS="--jars ~/.ivy2/jars/postgresql-42.1.1.jar,~/.ivy2/jars/aws-java-sdk-1.7.4.jar,~/.ivy2/jars/hadoop-aws-2.7.3.jar pyspark-shell"
export PYSPARK_PYTHON=~/anaconda/envs/pyspark/bin/python
Two important notes here: 1) replace "~/.ivy2/jars" with the path where you actually downloaded the JARs, and 2) adjust the JAR versions to the ones you downloaded.
Last but not least, if you are using a local database such as Postgres, you might want to export its password locally:
export DB_PASSWORD=<pwd>
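As an illustration of how these environment variables might be consumed inside a job, here is a sketch of a JDBC read from Postgres; the connection URL, user, and table name are hypothetical placeholders, not values taken from src/job.py:

```python
# Hypothetical sketch: read the database password from the environment and use it for a JDBC read.
import os

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("jdbc-example").getOrCreate()

db_password = os.environ["DB_PASSWORD"]  # exported in the step above

df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://localhost:5432/mydb")  # placeholder local Postgres instance
    .option("dbtable", "public.my_table")                    # placeholder table
    .option("user", "postgres")
    .option("password", db_password)
    .option("driver", "org.postgresql.Driver")
    .load()
)
df.show()
```

This only works if the postgresql JAR listed above is passed to Spark, for example via PYSPARK_SUBMIT_ARGS.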
The sample application is extremely simple, so you can run it from the root of the repository with the following commands:
export PYTHONPATH=$PYTHONPATH:.
source activate pyspark
python src/job.py --config=$(pwd)/conf/config_local.txt
The sample application uses a configuration file located at conf/config_local.txt, which you can use to customize application parameters.
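If you are curious how such a --config flag might be consumed, here is a sketch assuming a simple key=value file format; the actual format of conf/config_local.txt and the parsing logic in src/job.py may differ:

```python
# Hypothetical sketch of parsing a --config argument; the key=value format is an assumption.
import argparse


def load_config(path):
    # Parse "key=value" lines, skipping blanks and "#" comments.
    config = {}
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            key, _, value = line.partition("=")
            config[key.strip()] = value.strip()
    return config


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--config", required=True)
    args = parser.parse_args()
    print(load_config(args.config))
```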
Pro tip: when running locally with Docker, the job sometimes hangs when loading from S3. Just restart the S3 Docker container and you should be fine.