PySpark boilerplate for running production-ready data pipelines

PySpark Boilerplate Mehdio 🔥

Introduction

Running a production-ready PySpark application can be difficult for several reasons: packaging, handling extra jars, and setting up easy local testing.

This boilerplate addresses those problems by providing:

  • A proper folder structure for ETL applications
  • Logging, configuration, and Spark session helpers
  • A test example (with data!)
  • Helper functions for packaging your application, plus spark-submit command examples
  • A dev Docker image (to be used with VS Code or through the make docker commands) for smooth local development

This project started as a fork of https://github.com/AlexIoannides/pyspark-example-project

Requirements:

  • Docker
  • make
  • Any cloud service that can run PySpark (AWS Glue, AWS EMR, GCP Dataproc...)

Development

Build the dev image

make build-docker

Run the tests

make test-docker

Run the Spark job demo_job

make run-docker

Folder Structure

├── LICENCE
├── Makefile
├── README.md
├── datajob
│   ├── cli.py // Entry point of the spark job
│   ├── configs // hold static config of the etl
│   ├── helpers 
│   └── jobs // main logic
├── docker // for dev container
├── jars // extra jars (for reading from excel file)
├── poetry.lock
├── pyproject.toml
├── setup.cfg
├── setup.py
└── tests // test with fixtures data

Configuration

Static configuration lives in ./datajob/configs/etl_config.py. The job name is provided at run time through --job-name (here with the value demo_job). For instance:

	spark-submit \
	--jars jars/spark-excel_2.11-0.9.9.jar \
	--py-files datajob.zip \
	datajob/cli.py \
	--job-name demo_job

This runs the job datajob/jobs/demo_job.py with the associated config values from ./datajob/configs/etl_config.py.
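
The contents of etl_config.py are not shown in this README. As a minimal sketch, assuming the file holds one dictionary of settings per job name, the lookup could work as follows. The names ETL_CONFIG and get_job_config, as well as the keys and paths inside the dictionary, are hypothetical and only serve as illustration.

```python
# Hypothetical layout for datajob/configs/etl_config.py:
# one dictionary of static settings per job name.
ETL_CONFIG = {
    "demo_job": {
        "input_path": "tests/fixtures/input.xlsx",  # assumed key and path
        "output_path": "/tmp/demo_job_output",      # assumed key and path
    },
}


def get_job_config(job_name: str) -> dict:
    """Return the static config for job_name, failing loudly on unknown names."""
    try:
        return ETL_CONFIG[job_name]
    except KeyError:
        known = ", ".join(sorted(ETL_CONFIG))
        raise ValueError(
            f"No config for job '{job_name}'. Known jobs: {known}"
        ) from None
```

Keying the config by job name keeps --job-name as the single switch that selects both the code to run and its settings.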

How to package my application and run a spark-submit job

Packaging your source code and python dependencies

You need the following files:

  • Your Python dependencies, including your Spark code (as a .zip)
  • Your external jars, if your cluster is not connected to the internet (this boilerplate ships the spark-excel library so that Spark can read Excel files)
  • The entry point file: cli.py (located at datajob/cli.py)

To build a ready-to-use dependencies folder as a .zip, run:

make package

This generates datajob.zip.
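
What the make package target does internally is not shown here. As a rough sketch of the idea, assuming only the project's own source files need to be archived, a package folder can be zipped so that its top-level folder name is preserved and the archive stays importable through --py-files. The function name zip_package is hypothetical, and this sketch does not bundle third-party dependencies.

```python
import zipfile
from pathlib import Path


def zip_package(package_dir, zip_path):
    """Zip the .py files of a package, keeping the package folder as archive root."""
    package_dir = Path(package_dir).resolve()
    root = package_dir.parent  # archive names are relative to the parent folder
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(package_dir.rglob("*.py")):
            # e.g. datajob/jobs/demo_job.py
            archive.write(path, path.relative_to(root).as_posix())
    return Path(zip_path)
```

For instance, zip_package("datajob", "datajob.zip") would store entries such as datajob/cli.py, so that import datajob still resolves once the zip is on the Python path.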

Writing your first data pipeline

The different jobs (data pipelines) go under datajob/jobs/, for example datajob/jobs/my_job.py. Each job is a module, so you can launch a specific job directly from the spark-submit command with the --job-name argument.

For example, the demo_job module lives in datajob/jobs/demo_job.py:

spark-submit [...] datajob/cli.py --job-name demo_job
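
The actual cli.py is not reproduced in this README. One plausible way to map --job-name to a module, sketched under the assumption that every job module exposes a run() function, is a dynamic import with importlib. The function names parse_args, load_job and main, and the run() convention, are hypothetical.

```python
import argparse
import importlib


def parse_args(argv=None):
    """Parse the custom --job-name argument passed after the entry point."""
    parser = argparse.ArgumentParser(description="Run a datajob pipeline")
    parser.add_argument(
        "--job-name", required=True, help="module name under datajob/jobs"
    )
    return parser.parse_args(argv)


def load_job(job_name, package="datajob.jobs"):
    """Import <package>.<job_name> and return the module object."""
    return importlib.import_module(f"{package}.{job_name}")


def main(argv=None, package="datajob.jobs"):
    args = parse_args(argv)
    job_module = load_job(args.job_name, package=package)
    # Assumption: each job module defines a run() function as its entry point.
    return job_module.run()
```

With this layout, adding a new pipeline only requires dropping a new module into datajob/jobs/; the entry point itself does not need to change.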

Launching your spark job

	spark-submit \
	--jars jars/spark-excel_2.11-0.9.9.jar \
	--py-files datajob.zip \
	datajob/cli.py \
	--job-name demo_job

--jars: your local jar dependencies. If you are connected to the internet, you can use --packages with Maven coordinates to pull them directly from Maven instead. In this boilerplate, the jar demonstrates the use of a library that reads Excel files in Spark; see demo_job.py.

--py-files: Python libraries and Python source code.

datajob/cli.py: the entry point file.

--job-name: custom job parameter giving the job name, which is a module.
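
If the job is launched from a Python script (for example a scheduler task), the same command can be assembled as an argument list. This is only a sketch: the helper name build_spark_submit and its defaults are hypothetical, while the flags are the ones described above.

```python
def build_spark_submit(
    job_name,
    jars=(),
    py_files=("datajob.zip",),
    entry_point="datajob/cli.py",
):
    """Build the spark-submit argument list for one job."""
    cmd = ["spark-submit"]
    if jars:
        # spark-submit expects a comma-separated list of jars
        cmd += ["--jars", ",".join(jars)]
    if py_files:
        # likewise a comma-separated list of .zip, .egg or .py files
        cmd += ["--py-files", ",".join(py_files)]
    # Everything after the entry point is passed to the application itself
    cmd += [entry_point, "--job-name", job_name]
    return cmd
```

The resulting list can be passed to subprocess.run(cmd, check=True), which avoids shell quoting issues.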

Extra resources

  • https://github.com/AlexIoannides/pyspark-example-project
  • https://developerzen.com/best-practices-writing-production-grade-pyspark-jobs-cb688ac4d20f
  • https://stackoverflow.com/questions/47227406/pipenv-vs-setup-py
  • https://pipenv.readthedocs.io/en/latest/advanced/#pipfile-vs-setup-py
  • https://realpython.com/pipenv-guide/