In this repo, we see
- how to create a Google Cloud Platform (GCP) Spark cluster with JupyterLab installed
- how to submit a PySpark Job to this cluster
- in notebooks/jupyter:
  - spark-intro: a hands-on introduction to key concepts of Spark
  - a notebook with lessons learned after my first Spark application, building a recommender that involves a cross join of customers and products
Please make sure that your Google Cloud SDK is at version 243.0.0 or later.
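You can check the installed version and, if necessary, update it with:
gcloud version            # prints the installed SDK components, e.g. "Google Cloud SDK 243.0.0"
gcloud components update  # upgrades the SDK (if your installation method supports it)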
First, create a bucket on Google Cloud Storage (GCS), for example gs://spark-intro.
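With the Cloud SDK, creating the bucket looks like this (bucket name and location are placeholders, pick your own):
gsutil mb -l europe-west1 gs://spark-intro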
Be sure to add your bucket under the YAML key configBucket in your cluster configuration YAML (dataproc-jupyter-cluster.yaml). There you can also see the optional components Anaconda and Jupyter listed under optionalComponents.
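A minimal excerpt of such a configuration might look like the following sketch (the values are illustrative placeholders; the actual dataproc-jupyter-cluster.yaml in this repo is the source of truth):
config:
  configBucket: spark-intro        # the GCS bucket created above
  softwareConfig:
    optionalComponents:
    - ANACONDA
    - JUPYTER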
Start a Dataproc cluster with Jupyter installed using
gcloud beta dataproc clusters import INSERT_CLUSTER_NAME \
--source dataproc-jupyter-cluster.yaml \
--region=europe-west1 \
--project=YOUR_PROJECT
You can also choose other regions.
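Once the command has finished, you can check that the cluster is up and running:
gcloud dataproc clusters list --region=europe-west1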
You need to upload the Jupyter notebooks to your bucket after the cluster initialization. Use
gsutil -m cp -r . gs://YOUR_BUCKET
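If you want to verify the upload, list the notebook directory in the bucket (the path mirrors this repo's layout):
gsutil ls gs://YOUR_BUCKET/notebooks/jupyter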
Go to the Web Interfaces tab of your cluster in the Cloud Console and open JupyterLab. The working directory is in your config bucket at notebooks/jupyter.
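Once JupyterLab is open, you can run a quick sanity check in a notebook cell. This assumes the PySpark kernel, where a SparkSession is typically already available as spark; the snippet below creates one explicitly in case it is not:
# quick sanity check inside a notebook cell
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # reuses the running session if one exists
spark.range(5).show()                       # shows a small DataFrame with ids 0..4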
Please follow the instructions in the Google Cloud SDK documentation to submit PySpark jobs via the SDK.
For example, to create the transaction data, you can do:
gcloud dataproc jobs submit pyspark datagen/create_transactions.py \
--cluster=YOUR_CLUSTER_NAME \
--region=YOUR_REGION
As described in the Spark docs, we can provide dependencies as .py, .zip, or .egg files. From my experience, it is very convenient to build an egg using setup.py and use this as the dependency.
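To make that concrete, here is a rough sketch of the workflow; the package name, entry-point file, and egg filename are placeholders, not files from this repo. A minimal setup.py could look like:
# setup.py -- minimal sketch, names are placeholders
from setuptools import setup, find_packages

setup(
    name="recommender",
    version="0.1",
    packages=find_packages(),  # collects all packages (directories with __init__.py)
)
Building the egg and attaching it to the job via --py-files then looks roughly like this (the exact egg filename in dist/ depends on your Python version and package metadata):
python setup.py bdist_egg
gcloud dataproc jobs submit pyspark your_job.py \
--cluster=YOUR_CLUSTER_NAME \
--region=YOUR_REGION \
--py-files=dist/recommender-0.1-py3.7.egg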