This is a handy collection of bash scripts aimed at simplifying the process of connecting to a Spark cluster with Jupyter. It is tested with Spark clusters on AWS EC2 running the Amazon Linux 2 AMI, launched using Flintrock. There are three scripts, each of which accomplishes a specific task.
The first script, setup-sparknb.sh, installs jupyterlab, pyspark, pandas, and all of their dependencies (including Python 3 and its dependencies). It uses pyenv to make things easier to manage and to keep the system version of Python clean. It also installs nodejs and npm in case you would like to install jupyterlab extensions.
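The exact steps live in setup-sparknb.sh itself; as a rough, hypothetical sketch of this kind of setup on Amazon Linux 2 (the Python version, build packages, and nodejs install method below are assumptions, not a copy of the actual script):

```bash
# Hypothetical sketch of this kind of setup (not the actual setup-sparknb.sh).

# Build dependencies so pyenv can compile Python on Amazon Linux 2 (assumed list)
sudo yum install -y git gcc make zlib-devel bzip2 bzip2-devel readline-devel \
    sqlite sqlite-devel openssl-devel tk-devel libffi-devel xz-devel

# Install pyenv and a Python version under it (the version is a placeholder)
curl -fsSL https://pyenv.run | bash
export PATH="$HOME/.pyenv/bin:$PATH"
eval "$(pyenv init -)"
pyenv install 3.7.9
pyenv global 3.7.9

# The Python packages mentioned above
pip install --upgrade pip
pip install jupyterlab pyspark pandas

# nodejs and npm for jupyterlab extensions (install method is an assumption)
curl -sL https://rpm.nodesource.com/setup_14.x | sudo bash -
sudo yum install -y nodejs
```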
The best practice is to create a clean EC2 instance, clone this repository onto the instance, run setup-sparknb.sh, and then create an image from the instance. You only have to do this once; going forward you can use the AMI ID of that image in your Flintrock config file.
Warning: at the time of writing you need to delete the file ~/.ssh/authorized_keys on the instance before creating the image, or else Flintrock will error out when you try to use the image to launch a Spark cluster.
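The image can be created from the EC2 console or with the AWS CLI; a minimal sketch of the latter (the instance ID, image name, and description are placeholders):

```bash
# On the instance: remove authorized_keys so Flintrock can manage SSH access on launch
rm ~/.ssh/authorized_keys

# From your local machine: create an AMI from the instance (ID and name are placeholders)
aws ec2 create-image \
    --instance-id i-0123456789abcdef0 \
    --name spark-jupyter-base \
    --description "Amazon Linux 2 with pyenv, jupyterlab, pyspark, pandas"
```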
The strategy for connecting to your Spark cluster via Jupyter is to use SSH port forwarding to connect a port on your local machine to a port on the Spark master, and then run a Jupyter server on the master node over the chosen port.
The command flintrock login doesn't support port forwarding at the moment, so the second script, flintrock-sparknb.sh, constructs and runs the port forwarding command for you.
Example syntax:
./flintrock-sparknb.sh my-flintrock-cluster-name 8888
This command assumes that you have already launched a Spark cluster using Flintrock called my-flintrock-cluster-name and that you wish to run and connect to the Jupyter server on port 8888.
This script is meant to be used on your local machine; you may find it convenient to create a symlink to this script somewhere on your PATH.
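For reference, what the script sets up is roughly equivalent to running something like the following yourself; the key path, the ec2-user username, and the placeholder hostname are assumptions, and flintrock describe is one way to look up the master's address:

```bash
# Look up the master node's public address for your cluster
flintrock describe my-flintrock-cluster-name

# Forward local port 8888 to port 8888 on the master and log in
# (key path and username are placeholders; Amazon Linux 2 AMIs typically use ec2-user)
ssh -i ~/.ssh/my-key.pem -L 8888:localhost:8888 ec2-user@<master-public-dns>
```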
Once you have launched a Spark cluster using an AMI that has jupyterlab installed (using setup-sparknb.sh or otherwise), and once you have logged into the master node using SSH port forwarding (using flintrock-sparknb.sh or otherwise), run this script from the master node using the syntax:
./run-sparknb.sh 8888
(Replace 8888 with whatever remote port you forwarded to.)
You should see output which looks like:
To access the notebook, open this file in a browser:
file:///run/user/1000/jupyter/nbserver-<some-number>-open.html
Or copy and paste one of these URLs:
http://localhost:8888/?token=<some-token>
Copy and paste the localhost URL into a browser, and if all goes well you will be able to access the Jupyter server running on the master node.
In any notebook that you create on this server you will have access to two privileged variables: a SparkContext called sc and a SparkSession called spark. Use these variables as your entry point for all Spark computations, and those computations will run on the cluster.
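The sc and spark variables are available because the notebook server is launched through pyspark rather than on its own. A plausible sketch of what a script like run-sparknb.sh does (the master URL, memory setting, and JupyterLab options are assumptions, not the actual script):

```bash
#!/usr/bin/env bash
# Hypothetical sketch of launching JupyterLab through pyspark (not the actual run-sparknb.sh).
PORT="${1:-8888}"   # the remote port you forwarded to

# Use JupyterLab as pyspark's driver front end; pyspark then creates
# sc and spark before handing control to the notebook server.
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS="lab --no-browser --port=${PORT}"

# The master URL and executor memory are placeholders; adjust them to your cluster.
pyspark \
    --master spark://<master-private-dns>:7077 \
    --executor-memory 4g
```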
Some tips:
- Check the arguments being passed to pyspark in this script before running it. For instance, you will probably need to adjust the value of executor-memory depending on the specifications of the EC2 instance types that your cluster is running on.
- You might want to run this script in a screen session in case your SSH connection dies (see the example after this list). Be warned: if your connection dies, or if you close the Jupyter tab in your browser, then any jobs you are running will complete but you will lose their output, even after you reconnect to the server and reopen your notebook. For long-running jobs it is best to write the output of the computation to disk or to S3 in the same cell that initiates the computation, so that you can reload the output when you reconnect.
- If you would like to use an RDD transformation or UDF that requires a special Python dependency, you should be able to install it on every node using flintrock run-command my-flintrock-cluster-name 'pip install my-package'. This of course assumes that Python and pip have been installed on all of the workers as well as on the master.
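For the screen tip above, one way to do it (the session name is just a placeholder):

```bash
# Start the notebook server inside a named screen session so it survives a dropped SSH connection
screen -S sparknb ./run-sparknb.sh 8888

# Detach with Ctrl-a d; reattach later with
screen -r sparknb
```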