Cost Effective Data Pipelines example code repository

This repo contains the code examples for Cost Effective Data Pipelines, available for purchase wherever books are sold.

Please send any comments, concerns, or problems to sev@thedatascout.com

Environment setup

  1. Install pyenv
  2. Install pyenv-virtualenv
  3. Install Python 3.8.5
    pyenv install 3.8.5
  4. Create virtualenv
    pyenv virtualenv 3.8.5 oreilly-book
  5. Activate the virtual environment
    pyenv activate oreilly-book
  6. Clone this repo
    git clone git@github.com:gizm00/oreilly_dataeng_book.git
  7. cd oreilly_dataeng_book
  8. pip install wheel
  9. Install dependencies
    python -m pip install -r requirements.txt
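
Assuming pyenv and pyenv-virtualenv are already installed (steps 1 and 2), steps 3 through 9 can be run as a single shell session:

```shell
# Install Python 3.8.5 and create an isolated virtualenv for the book code
pyenv install 3.8.5
pyenv virtualenv 3.8.5 oreilly-book
pyenv activate oreilly-book

# Clone the repo and install its dependencies into the virtualenv
git clone git@github.com:gizm00/oreilly_dataeng_book.git
cd oreilly_dataeng_book
pip install wheel
python -m pip install -r requirements.txt
```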

Running Spark locally

(based on these instructions)
Within the virtualenv created above run the following:

  1. Download Apache Spark. This material was developed using Spark 3.2.1 with Hadoop 3.2
  2. Move the tgz file to a place you will refer to it from, e.g. ~/Development/
  3. tar -xvf ~/Development/spark-3.2.1-bin-hadoop3.2.tgz
  4. Add the following to your shell startup file, for example ~/.bash_profile:
export SPARK_HOME="/Users/sev/Development/spark-3.2.1-bin-hadoop3.2"
export PATH="$SPARK_HOME/bin:$PATH"
  5. source ~/.bash_profile
  6. pyspark
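
The Spark steps above can be sketched as one shell session; this assumes the tgz was saved to ~/Development/ and that your shell startup file is ~/.bash_profile (adjust both paths for your machine):

```shell
# Unpack Spark next to the downloaded archive
tar -xvf ~/Development/spark-3.2.1-bin-hadoop3.2.tgz -C ~/Development/

# Point SPARK_HOME at the unpacked directory and put its bin/ on PATH
echo 'export SPARK_HOME="$HOME/Development/spark-3.2.1-bin-hadoop3.2"' >> ~/.bash_profile
echo 'export PATH="$SPARK_HOME/bin:$PATH"' >> ~/.bash_profile
source ~/.bash_profile

# Sanity check: this should print the Spark version (3.2.1)
pyspark --version
```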

If you use the VS Code IDE on macOS, you can run PySpark notebooks with these instructions

  • When you start the notebook in VS Code, choose the oreilly-book virtualenv as the Python interpreter