DataprocSpawner

DataprocSpawner enables JupyterHub to spawn single-user Jupyter notebooks that run on Dataproc clusters. This provides users with ephemeral clusters for data science without the pain of managing them.

  • Product Documentation
  • DISCLAIMER: DataprocSpawner only supports zonal DNS names. If your project uses global DNS names, migrate it to zonal DNS names before using DataprocSpawner.

Supported Python Versions: Python >= 3.6

Before you begin

In order to use this library, you first need to go through the following steps:

  1. Select or create a Cloud Platform project
  2. Enable billing for your project
  3. Enable the Google Cloud Dataproc API
  4. Set up authentication

Installation example

Locally

To try it locally for development purposes, run the following from the root folder:

chmod +x deploy_local_example.sh
./deploy_local_example.sh <YOUR_PROJECT_ID> <YOUR_GCS_CONFIG_LOCATIONS> <YOUR_AUTHENTICATED_EMAIL>

The script will start a local container image and authenticate it using your local credentials.

Note: Although you can try the Dataproc Spawner image locally, you might run into networking communication problems.

Google Compute Engine

To try it out in the Cloud, the quickest way is to use a test Compute Engine instance. The following steps take you through the process.

  1. Set your working project

    PROJECT_ID=<YOUR_PROJECT_ID>
    VM_NAME=vm-spawner
  2. Run the example script, which:

    a. Creates a Dockerfile.
    b. Creates a jupyter_config.py example file that uses a dummy authenticator.
    c. Deploys a Docker image of the JupyterHub spawner to Google Container Registry.
    d. Creates a container-based Compute Engine instance.
    e. Returns the IP of the instance that runs JupyterHub.

    bash deploy_gce_example.sh ${PROJECT_ID} ${VM_NAME}
  3. After the script finishes, you should see an IP displayed. You can use that IP to access your setup at <IP>:8000. You might have to wait for a few minutes until the container is deployed on the instance.
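Because the container can take a few minutes to come up, it can be convenient to poll the address instead of refreshing the browser. The following is a minimal sketch that uses only the Python standard library. The helper names (hub_is_up, wait_for_hub) and the retry values are hypothetical and not part of DataprocSpawner.

```python
import time
import urllib.error
import urllib.request


def hub_is_up(url, timeout=5.0):
    """Return True if an HTTP request to `url` receives any HTTP response."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server answered, even if with an error status code.
        return True
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, and similar.
        return False


def wait_for_hub(url, attempts=30, delay=10.0, probe=hub_is_up, sleep=time.sleep):
    """Poll `url` until `probe` succeeds or `attempts` probes have failed."""
    for attempt in range(attempts):
        if probe(url):
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False


# Example (replace <IP> with the address printed by the script):
# wait_for_hub("http://<IP>:8000")
```

The probe and sleep parameters exist only so the loop can be exercised without a network connection. In normal use the defaults are enough.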

Troubleshooting

To troubleshoot the Compute Engine deployment:

  1. ssh into the VM:

    gcloud compute ssh ${VM_NAME}
  2. From the VM console, install some useful tools:

    apt-get update
    apt-get install vim
  3. From the VM console, you can:

    • List the running containers: docker ps
    • Display container logs: docker logs -f <CONTAINER_ID>
    • Execute code in the container: docker exec -it <CONTAINER_ID> /bin/bash
    • Restart the container for changes to take effect: docker restart <CONTAINER_ID>

Notes

  • DataprocSpawner defaults to port 12345. The port can be set in jupyterhub_config.py. See the JupyterHub documentation for more information.

    c.Spawner.port = {port number}

  • The default region for Dataproc clusters is us-central1 and the default zone is us-central1-a. Using global is currently unsupported. To change the region, pick a supported region and a zone within that region, then include the following lines in jupyterhub_config.py:

    c.DataprocSpawner.region = '{region}'
    c.DataprocSpawner.zone = '{zone that is within the chosen region}'
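Since the zone must belong to the chosen region, a small check before setting the values can catch a mismatch early. The sketch below assumes the usual Compute Engine naming scheme, in which a zone name is the region name followed by a single-letter suffix (for example us-central1-a). The helper zone_matches_region and the REGION and ZONE values are hypothetical and not part of DataprocSpawner.

```python
def zone_matches_region(region, zone):
    """Return True if `zone` looks like it belongs to `region`.

    Assumes zone names have the form "<region>-<letter>".
    """
    prefix, _, suffix = zone.rpartition("-")
    return prefix == region and len(suffix) == 1 and suffix.isalpha()


# Hypothetical values; replace them with the region and zone you picked.
REGION = "us-west1"
ZONE = "us-west1-b"

if not zone_matches_region(REGION, ZONE):
    raise ValueError("zone %r is not in region %r" % (ZONE, REGION))

# In jupyterhub_config.py, where JupyterHub provides the `c` object:
# c.DataprocSpawner.region = REGION
# c.DataprocSpawner.zone = ZONE
```

The check only compares names. It does not confirm that the zone exists or that Dataproc is available there.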

Next

Disclaimer

This is not an official Google product.