Cost Effective Data Pipelines example code repository

This repo contains the code examples for Cost Effective Data Pipelines, available for purchase wherever books are sold.

Please send any comments, concerns, or problems to sev@thedatascout.com

Environment setup

  1. Install pyenv
  2. Install pyenv-virtualenv
  3. Install Python 3.8.5
    pyenv install 3.8.5
  4. Create virtualenv
    pyenv virtualenv 3.8.5 oreilly-book
  5. Activate the virtual environment
    pyenv activate oreilly-book
  6. Clone this repo
    git clone git@github.com:gizm00/oreilly_dataeng_book.git
  7. cd oreilly_dataeng_book
  8. pip install wheel
  9. Install dependencies
    python -m pip install -r requirements.txt
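
Assuming pyenv and pyenv-virtualenv are already installed (steps 1 and 2), steps 3 through 9 can be run as a single shell session:

```shell
# Install Python 3.8.5 and create an isolated virtualenv for the book code
pyenv install 3.8.5
pyenv virtualenv 3.8.5 oreilly-book
pyenv activate oreilly-book

# Clone the repo and install its dependencies into the virtualenv
git clone git@github.com:gizm00/oreilly_dataeng_book.git
cd oreilly_dataeng_book
pip install wheel
python -m pip install -r requirements.txt
```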

Running Spark locally

(based on these instructions)
Within the virtualenv created above run the following:

  1. Download Apache Spark. This material was developed using Spark 3.2.1 with Hadoop 3.2
  2. Move the tgz file to a place you will refer to it from, e.g. ~/Development/
  3. tar -xvf ~/Development/spark-3.2.1-bin-hadoop3.2.tgz
  4. Add the following to your shell startup file, for example ~/.bash_profile:
export SPARK_HOME="/Users/sev/Development/spark-3.2.1-bin-hadoop3.2"
export PATH="$SPARK_HOME/bin:$PATH"
  5. source ~/.bash_profile
  6. pyspark
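
The Spark steps above can be sketched as one shell session; this assumes the tgz was saved to ~/Development/ and that your shell startup file is ~/.bash_profile (adjust both paths for your machine):

```shell
# Unpack Spark next to the downloaded archive
tar -xvf ~/Development/spark-3.2.1-bin-hadoop3.2.tgz -C ~/Development/

# Point SPARK_HOME at the unpacked directory and put its bin/ on PATH
echo 'export SPARK_HOME="$HOME/Development/spark-3.2.1-bin-hadoop3.2"' >> ~/.bash_profile
echo 'export PATH="$SPARK_HOME/bin:$PATH"' >> ~/.bash_profile
source ~/.bash_profile

# Sanity check: this should print the Spark version (3.2.1)
pyspark --version
```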

If you use the VS Code IDE on macOS, you can run PySpark notebooks with these instructions

  • When you start the notebook in VS Code, choose the oreilly-book virtualenv as the Python interpreter