Running a production-ready PySpark app can be difficult for many reasons: packaging, handling extra jars, easy local testing. This boilerplate solves those problems by providing:
- A proper folder structure for ETL applications
- Logging, configuration, and Spark session helpers
- Test examples (with data!)
- Helper functions for packaging your application, plus spark-submit command examples
- A dev Docker image (to be used with VS Code or through the `make` docker commands) for smooth local development
This project was initially forked from https://github.com/AlexIoannides/pyspark-example-project.
Requirements:
- Docker
- make
- Any cloud service that can run PySpark (AWS Glue, AWS EMR, GCP Dataproc, ...)
Build, test, and run with Docker:

```bash
make build-docker
make test-docker
make run-docker
```
```text
├── LICENCE
├── Makefile
├── README.md
├── datajob
│   ├── cli.py      // entry point of the Spark job
│   ├── configs     // holds static config of the ETL
│   ├── helpers
│   └── jobs        // main logic
├── docker          // for the dev container
├── jars            // extra jars (for reading from Excel files)
├── poetry.lock
├── pyproject.toml
├── setup.cfg
├── setup.py
└── tests           // tests with fixture data
```
Static configuration lives in `./datajob/configs/etl_config.py`. The job name is provided at run time through `--job-name` (here `demo_job` as the value).
For instance:

```bash
spark-submit \
  --jars jars/spark-excel_2.11-0.9.9.jar \
  --py-files datajob.zip \
  datajob/cli.py \
  --job-name demo_job
```
will run the job `datajob/jobs/demo_job.py` with the associated config values from `./datajob/configs/etl_config.py`.
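The exact contents of `etl_config.py` depend on your pipelines; as a sketch (the `ETL_CONFIG` name and every key below are illustrative assumptions, not necessarily what this boilerplate ships), it could simply be a dictionary keyed by job name:

```python
# datajob/configs/etl_config.py -- illustrative sketch only; the real file in
# this boilerplate may use different names, keys, and structure.
ETL_CONFIG = {
    "demo_job": {
        "input_path": "data/input/demo.xlsx",   # hypothetical path
        "output_path": "data/output/demo/",     # hypothetical path
        "excel_sheet": "Sheet1",                 # hypothetical sheet name
    },
}
```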
You need the following files:
- your Python dependencies, including your Spark code, as a .zip
- your external jars, if your cluster is not connected to the internet (see the spark-excel lib in this boilerplate, used to read Excel files with Spark); otherwise you can pull them with `--packages`
- the entry-point file: `cli.py` (located at `datajob/cli.py`)
To generate a ready-to-ship dependencies archive (.zip), run:

```bash
make package
```

This will generate `datajob.zip`.
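Under the hood this amounts to zipping the `datajob` package so spark-submit can ship it to the executors via `--py-files`. A rough Python equivalent (an assumption about what the Makefile target does, not its actual recipe):

```python
# Rough equivalent of `make package` -- an assumption, not the actual Makefile recipe.
import shutil

# Produces datajob.zip with the datajob/ package at the archive root,
# which is what --py-files expects.
shutil.make_archive("datajob", "zip", root_dir=".", base_dir="datajob")
```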
The different jobs (data pipelines) live under `datajob/jobs/my_job.py`. All jobs are treated as modules, so you can launch a specific job directly from the spark-submit command with the `--job-name` argument.

E.g., we have a `demo_job` module in `datajob/jobs/demo_job.py`:

```bash
spark-submit [...] datajob/cli.py --job-name demo_job
```
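A minimal job module might look like the sketch below, assuming each job exposes a `run(spark, config)` function and reads Excel through the spark-excel data source; the function name, signature, and option names are assumptions for illustration, not necessarily the boilerplate's actual contract.

```python
# datajob/jobs/demo_job.py -- minimal sketch; the `run` name and signature
# are assumptions for illustration.
from pyspark.sql import SparkSession


def run(spark: SparkSession, config: dict) -> None:
    """Read an Excel file via the spark-excel jar and write it out as Parquet."""
    df = (
        spark.read.format("com.crealytics.spark.excel")
        # Option names vary across spark-excel versions; these match older 0.9.x releases.
        .option("sheetName", config["excel_sheet"])
        .option("useHeader", "true")
        .load(config["input_path"])
    )
    df.write.mode("overwrite").parquet(config["output_path"])
```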
A full spark-submit command looks like:

```bash
spark-submit \
  --jars jars/spark-excel_2.11-0.9.9.jar \
  --py-files datajob.zip \
  datajob/cli.py \
  --job-name demo_job
```
- `--jars`: your local jar dependencies. If you are connected to the internet, you can instead use `--packages com.maven.path` to pull them directly from Maven. In this boilerplate the jar demonstrates a lib for reading Excel files with Spark; see `demo_job.py`.
- `--py-files`: Python libs and your Python source code
- `datajob/cli.py`: the entry-point file
- `--job-name`: a custom job parameter, the job name, which is a module
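Since `--job-name` maps to a module under `datajob/jobs/`, the entry point presumably resolves it with a dynamic import. A minimal sketch of that dispatch, reusing the hypothetical `ETL_CONFIG` and `run(spark, config)` convention from above (the real `cli.py` may differ):

```python
# datajob/cli.py -- dispatch sketch only; the real entry point in this
# boilerplate may differ in argument names and job interface.
import argparse
import importlib

from pyspark.sql import SparkSession

from datajob.configs.etl_config import ETL_CONFIG  # hypothetical config dict


def main() -> None:
    parser = argparse.ArgumentParser(description="Run a datajob pipeline")
    parser.add_argument("--job-name", required=True,
                        help="Name of a module under datajob/jobs/")
    args = parser.parse_args()

    spark = SparkSession.builder.appName(args.job_name).getOrCreate()
    # Resolve the job module by name and call its entry function.
    job_module = importlib.import_module(f"datajob.jobs.{args.job_name}")
    job_module.run(spark, ETL_CONFIG.get(args.job_name, {}))  # assumes a run(spark, config) contract
    spark.stop()


if __name__ == "__main__":
    main()
```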
- https://github.com/AlexIoannides/pyspark-example-project
- https://developerzen.com/best-practices-writing-production-grade-pyspark-jobs-cb688ac4d20f
- https://stackoverflow.com/questions/47227406/pipenv-vs-setup-py
- https://pipenv.readthedocs.io/en/latest/advanced/#pipfile-vs-setup-py
- https://realpython.com/pipenv-guide/