Jupyter to Azure Databricks connector

Connect to Azure Databricks clusters from Jupyter notebooks.

Installation

1. Clone this repo

$ git clone https://github.com/lorenzo-romanelli/jupyter-db-connect.git

2. Create and activate a virtual environment

The virtual environment must use the same Python version as the cluster: either Python 2.7 or Python 3.5.

$ cd jupyter-db-connect
$ virtualenv -p /usr/bin/python3.5 env
$ source env/bin/activate
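
As a quick sanity check, the interpreter inside the activated environment can be compared with the cluster's Python version. The sketch below is only illustrative: matches_cluster_python is a hypothetical helper that is not part of this repo, and the cluster version string is assumed to be looked up by hand.

```python
import sys


def matches_cluster_python(cluster_version, local_version=None):
    """Return True if the local major.minor version equals the cluster's.

    cluster_version: a string such as "3.5" or "2.7", read manually from
        the cluster (assumption: this helper does not query Databricks).
    local_version: a (major, minor) tuple; defaults to the running interpreter.
    """
    if local_version is None:
        local_version = sys.version_info[:2]
    # Only major and minor are compared; a patch component is ignored.
    major, minor = (int(part) for part in cluster_version.split(".")[:2])
    return (major, minor) == tuple(local_version)[:2]
```

Calling matches_cluster_python("3.5") from inside the virtual environment returns False when the environment was created with a different interpreter than the one expected.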

3. Install requirements

$ pip install -r requirements.txt

Configure your Databricks connection

Run the following command:

$ databricks-connect configure

Follow the on-screen instructions. You will be asked for the following configuration values:

Databricks Host, e.g. https://westeurope.azuredatabricks.net
Databricks Token (see here)
Cluster ID (see here)
Org ID (see ?o=orgId in your Databricks URL)
Port: use 8787, since the cluster runs on Azure
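
The values listed above can be sanity-checked before running the connection test. This is a minimal sketch, assuming the values have been collected in a plain dictionary; the key names (host, token, cluster_id, org_id, port) and the check_config helper are hypothetical and chosen only to mirror the prompts.

```python
# Hypothetical key names mirroring the prompts of `databricks-connect configure`.
REQUIRED_KEYS = ("host", "token", "cluster_id", "org_id", "port")
AZURE_PORT = 8787  # the port this README recommends for Azure


def check_config(config):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    for key in REQUIRED_KEYS:
        value = config.get(key)
        if value is None or not str(value).strip():
            problems.append("missing value: " + key)
    host = str(config.get("host") or "")
    if host and not host.startswith("https://"):
        problems.append("host should start with https://")
    port = str(config.get("port") or "").strip()
    if port and port != str(AZURE_PORT):
        problems.append("port is %s, expected %d on Azure" % (port, AZURE_PORT))
    return problems
```

The function only checks the shape of the values. Whether the token and cluster ID are actually valid is established by the connection test.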

Run tests

Run the following command to check that your setup works:

$ databricks-connect test

If the remote Databricks cluster is not running, it will start automatically (it might take some time).

Enjoy!

Run the following command:

$ jupyter notebook

Then open localhost:8888 in your browser.

Example notebook

The notebook test notebook.ipynb contains example commands that set up the Databricks environment from within Jupyter by defining the SparkContext, dbutils and sqlContext.

On Databricks these objects are defined automatically, but in a local Jupyter notebook you have to define them yourself.
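
For orientation, defining these objects by hand might look roughly like the sketch below. It is not copied from the example notebook. It assumes that the pyspark package installed by databricks-connect provides pyspark.dbutils.DBUtils and that its constructor accepts a SparkContext, which has varied between releases. The function name init_databricks_globals is hypothetical.

```python
def init_databricks_globals():
    """Create the objects that a Databricks notebook normally predefines.

    Imports are done inside the function so that it can be defined even
    where pyspark is not installed; calling it needs a configured
    databricks-connect environment.
    """
    from pyspark.sql import SparkSession, SQLContext
    # Assumption: this module ships with the databricks-connect pyspark build.
    from pyspark.dbutils import DBUtils

    spark = SparkSession.builder.getOrCreate()
    sc = spark.sparkContext
    sql_context = SQLContext(sc)
    # Assumption: some releases expect the SparkSession here instead.
    dbutils = DBUtils(sc)
    return {"spark": spark, "sc": sc, "sqlContext": sql_context, "dbutils": dbutils}
```

In a notebook cell, globals().update(init_databricks_globals()) would expose the same names that Databricks provides out of the box.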