A repository demonstrating how to use Databricks-managed Spark clusters from ArcGIS Insights through a Jupyter kernel gateway.
Scaling business intelligence is a challenge. Combining Spark managed by Databricks with ArcGIS Insights can make it easier. You can read more about the process and analysis here.
Want the same thing for ArcGIS Pro Notebooks? Check out @mraad's repository here.
- conda installed and available on your `PATH` environment variable (docs)
- Java 8 installed and selected as the default (installation and setup vary by platform)
- The ArcGIS Insights Desktop client (download here)
- A Databricks subscription (free trial)
```sh
conda env create -f insights-dbc.yml
conda activate insights-dbc
```
When that's done, verify that databricks-connect installed successfully and resolves the correct pyspark path:
```sh
databricks-connect get-spark-home
```
If that fails with `CommandNotFound`, run:
```sh
pip install databricks-connect==6.5  # replace 6.5 with your cluster's Databricks Runtime version
```
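As an extra sanity check (an optional sketch, not part of the repo), you can confirm from Python that the pyspark you import is the copy bundled with databricks-connect:

```python
# Optional sanity check: the importable pyspark should be the copy that
# ships with databricks-connect, not a separately installed distribution.
import pyspark

print(pyspark.__file__)     # expect a path inside the insights-dbc environment
print(pyspark.__version__)  # expect a version matching your cluster runtime
```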
Make sure you have the following available from your Databricks environment before moving on:
- A Spark cluster with at least the following configuration parameters:

  ```
  spark.databricks.service.server.enabled true
  spark.databricks.service.port 8787  # 8787 is required on Azure; other ports work on AWS
  ```
- Workspace URL
- Access Token
- Cluster ID
- Port
You can find details on these configs in the Databricks Connect docs.
Next, interactively configure your Databricks connection:
```sh
databricks-connect configure
```
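If you prefer not to use the interactive prompt, Databricks Connect can also read its connection settings from environment variables. A minimal sketch with placeholder values:

```python
import os

# Databricks Connect falls back to these environment variables when no
# ~/.databricks-connect file is present. All values below are placeholders.
os.environ["DATABRICKS_ADDRESS"] = "https://<your-workspace-url>"
os.environ["DATABRICKS_API_TOKEN"] = "<your-access-token>"
os.environ["DATABRICKS_CLUSTER_ID"] = "<your-cluster-id>"
os.environ["DATABRICKS_ORG_ID"] = "<your-org-id>"  # Azure workspaces only
os.environ["DATABRICKS_PORT"] = "8787"
```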
Then test that configuration:
```sh
databricks-connect test  # warning: this will start your Databricks cluster if it isn't already up
```
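You can also verify connectivity from Python before wiring up Insights; with databricks-connect installed, a plain SparkSession attaches to your remote cluster (a minimal sketch):

```python
from pyspark.sql import SparkSession

# With databricks-connect on the path, getOrCreate() returns a session
# backed by the remote Databricks cluster, not a local JVM.
spark = SparkSession.builder.getOrCreate()

# Run a trivial job on the cluster as a smoke test.
print(spark.range(100).count())  # expect 100
```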
If all the tests pass, you can spin up the kernel gateway for Insights with:
```sh
chmod +x start_kernel.sh && ./start_kernel.sh  # mac/linux/wsl only
```
Or run the command yourself:
```sh
jupyter kernelgateway --KernelGatewayApp.ip=0.0.0.0 \
                      --KernelGatewayApp.port=9999 \
                      --KernelGatewayApp.allow_origin='*' \
                      --KernelGatewayApp.allow_credentials='*' \
                      --KernelGatewayApp.allow_headers='*' \
                      --KernelGatewayApp.allow_methods='*' \
                      --JupyterWebsocketPersonality.list_kernels=True
```
Your terminal should now report an open kernel gateway at `0.0.0.0:9999`.
From ArcGIS Insights, open the scripting console from the top-right corner, enter `0.0.0.0:9999` as the URL, and click Connect.
Congrats! You now have an ArcGIS Insights kernel that will remotely execute Spark jobs on your Databricks cluster.
The insights-spark-well-clusters notebook is an export from Insights that demonstrates my first test of this environment. To reproduce it, add this data to your DBFS. The data is HIFLD's North American Oil and Gas Wells dataset; the notebook trains a very simple k-means model to cluster the wells.
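If you'd rather not import the notebook, the core of that analysis looks roughly like the sketch below (reusing the `spark` session from earlier). The DBFS path and column names are assumptions for illustration; adjust them to match your upload:

```python
from pyspark.ml.clustering import KMeans
from pyspark.ml.feature import VectorAssembler

# Hypothetical DBFS path and column names; match these to your upload.
wells = (
    spark.read.option("header", True)
    .option("inferSchema", True)
    .csv("dbfs:/FileStore/tables/oil_gas_wells.csv")
)

# Assemble well coordinates into the feature vector k-means expects.
assembler = VectorAssembler(inputCols=["LONGITUDE", "LATITUDE"], outputCol="features")
features = assembler.transform(wells.dropna(subset=["LONGITUDE", "LATITUDE"]))

# Train a very simple k-means model and inspect the cluster sizes.
model = KMeans(k=8, seed=42, featuresCol="features").fit(features)
model.transform(features).groupBy("prediction").count().show()
```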
Apache BSD 2.0 © Samuel Cook, 2020
Feel free to open an issue or PR; we welcome contributions of all types. :)
- [ ] Docker image and compose.yml
- [ ] New visualization examples