A cookiecutter template for working with PySpark on AWS EMR

cookiecutter-pyspark-cloud

Run PySpark code in the cloud on the Amazon Web Services (AWS) Elastic MapReduce (EMR) service in a few simple steps with this cookiecutter project template!

Quickstart

pip install -U "cookiecutter>=1.7"
cookiecutter --no-input https://github.com/daniel-cortez-stevenson/cookiecutter-pyspark-cloud.git
cd pyspark-cloud
make install
pyspark_cloud

Your console will look something like:

[Image: pyspark_cloud command-line banner]

Features

  • AWS ☁️ CloudFormation Template for EMR: Simple Spark cluster deployment with infrastructure as code

  • A Command-Line Interface for Running PySpark 'Jobs': For production 🚀 runs via EMR Step API

  • Log Like a Pro: Save time debugging in style 💃

  • Wrap Scala with Python 🐍: Use libraries that haven't been included in the PySpark API!

    • An example of wrapping Scala Spark API code with PySpark API code is provided with SnowballStemmer
    • Could be extended to other Scala MLlib classes (and other Scala classes that implement the UDF interface)
  • Simplify Workflows with Make ✅: A Makefile with commands for installation, development, and deployment.

    • Use with make [COMMAND]
    • For example, distribute an executable .egg 🥚 distribution of your PySpark code to AWS S3 with make s3dist
  • Organize Your Code: Package code shared between 'jobs' in a Python module of your package called common

  • Extend the PySpark API: An example of extending the PySpark SQL DataFrame class, which allows chaining custom transformations with dot . notation

  • Development Framework: All the tools you need
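
The "Extend the PySpark API" idea above (attaching custom transformations to a DataFrame class so they chain with dot notation) can be illustrated without a Spark installation. This is a minimal sketch, not the template's actual implementation: `register_method` and `FakeDataFrame` are hypothetical names, and `FakeDataFrame` is only a stand-in for `pyspark.sql.DataFrame` so that the example runs with the standard library alone.

```python
def register_method(cls):
    """Return a decorator that attaches a function to `cls` as a method."""
    def decorator(func):
        setattr(cls, func.__name__, func)
        return func
    return decorator


class FakeDataFrame:
    """Hypothetical stand-in for pyspark.sql.DataFrame (assumption for this sketch)."""

    def __init__(self, rows):
        self.rows = list(rows)


@register_method(FakeDataFrame)
def drop_nulls(self, key):
    # Keep only rows where `key` is present and not None.
    return FakeDataFrame(r for r in self.rows if r.get(key) is not None)


@register_method(FakeDataFrame)
def with_upper(self, key):
    # Return a new frame with the string under `key` upper-cased.
    return FakeDataFrame({**r, key: r[key].upper()} for r in self.rows)


# Each custom transformation returns a new frame, so calls chain with dots.
result = (
    FakeDataFrame([{"name": "ada"}, {"name": None}])
    .drop_nulls("name")
    .with_upper("name")
)
print(result.rows)
```

In a real project the same decorator could presumably be pointed at `pyspark.sql.DataFrame` instead of the stand-in class. PySpark 3.0 and later also provide a built-in `DataFrame.transform` method for chaining plain functions.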

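A command-line interface for running 'jobs', as listed in the features above, is often built as one subcommand per job. The sketch below shows that pattern with `argparse`. The job names (`word_count`, `stem_text`), the `--input` option, and the `JOBS` registry are assumptions made for illustration; the CLI generated by this template may be organized differently.

```python
import argparse


# Hypothetical jobs: a real job would build a SparkSession and run transformations.
def word_count(args):
    return f"word_count on {args.input}"


def stem_text(args):
    return f"stem_text on {args.input}"


# Registry mapping a subcommand name to the function that runs the job.
JOBS = {"word_count": word_count, "stem_text": stem_text}


def build_parser():
    parser = argparse.ArgumentParser(prog="pyspark_cloud")
    subparsers = parser.add_subparsers(dest="job", required=True)
    for name, func in JOBS.items():
        sub = subparsers.add_parser(name)
        sub.add_argument("--input", required=True, help="input data location")
        sub.set_defaults(func=func)
    return parser


def main(argv=None):
    # Parse the arguments and dispatch to the selected job function.
    args = build_parser().parse_args(argv)
    return args.func(args)


message = main(["word_count", "--input", "s3://example-bucket/data"])
print(message)
```

Keeping the dispatch in a `main(argv)` function means the same argument list can be passed from a local shell or from a cluster step, and the parser can be exercised in tests without starting Spark.
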
Infrastructure Overview

As defined in the CloudFormation template:

[Image: AWS CloudFormation template InfViz.io diagram]

Usage

  1. Clone this repo:
git clone https://github.com/daniel-cortez-stevenson/cookiecutter-pyspark-cloud.git
cd cookiecutter-pyspark-cloud
  2. Create a Python environment, activate it, and install the dependencies:
conda create -n cookiecutter -y "python=3.7"
conda activate cookiecutter
pip install -r requirements.txt
  3. Make any changes to the template, as you wish.

  4. Create your project from the template:

cd ..
cookiecutter ./cookiecutter-pyspark-cloud
  5. Initialize git:
cd *your-repo_name*
git init
git add .
git commit -m "Initial Commit"
  6. Create a new Conda environment for your new project and install the project's development dependencies:
conda deactivate
conda create -n *your-repo_name* -y "python=3.6"
conda activate *your-repo_name*
make install-dev

Contribute

Contributions are welcome! Thanks!

Submit a Bug or Feature Request

Submit a Pull Request

Acknowledgements

Most of the ideas expressed in this repo are not new, but rather expressed in a new way. Thanks, folks! 🙌