This is the supporting material for my talk "Data Engineering on GCP", presented at Oxford University as part of the course Artificial Intelligence: Cloud and Edge Implementations.
Available here.
This repo has been tested on Ubuntu Linux and macOS.
If you are using Windows 10, you can run Ubuntu through the Windows Subsystem for Linux.
The following steps assume you have Python 3 installed.
Follow the instructions at https://cloud.google.com/sdk/docs/downloads-interactive.
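Once the SDK is installed, a quick sanity check that the gcloud CLI is on your PATH looks like this (a hedged sketch — the guard only matters if the install did not finish or used a custom location):

```shell
# Check that the Google Cloud SDK install put gcloud on the PATH
if command -v gcloud >/dev/null 2>&1; then
  gcloud --version   # lists the installed SDK components and their versions
  # gcloud init      # run once, interactively, to log in and pick a default project
else
  echo "gcloud not found -- complete the SDK install first"
fi
```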
Install Java 8 (required for running PySpark locally). On Ubuntu:
sudo apt install openjdk-8-jdk
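You can confirm the JDK is visible before trying PySpark. This is a hedged check — the guard is only there in case the package did not install cleanly:

```shell
# Verify Java is on the PATH; OpenJDK 8 reports a version string starting with 1.8
if command -v java >/dev/null 2>&1; then
  java -version 2>&1
else
  echo "java not found -- install openjdk-8-jdk first"
fi
```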
Create a Python 3 virtual environment before running any of the sample code:
python3 -m venv venv
If the module python3-venv is not available, you may need to install it:
sudo apt-get install python3-venv
TBD
To activate the environment, use:
source venv/bin/activate
With the virtual env activated, install the requirements file:
pip install -r requirements.txt
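Putting the environment steps together, a typical first-time setup from the repo root looks like this (a sketch assuming requirements.txt sits in the directory where you run it):

```shell
# One-time setup for the sample code in this repo
python3 -m venv venv              # create the virtual environment
source venv/bin/activate          # activate it; the prompt gains a "(venv)" prefix
pip install -r requirements.txt   # install the dependencies into the venv
# ...run the samples...
deactivate                        # leave the environment when done
```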
To deactivate the environment after finishing your work, run deactivate.