databricks/databricks-vscode

[Feature] Add support for cluster execution of arbitrary notebook code


Loving the extension! It's a huge improvement for applying engineering best practices and integrating Databricks compute with the larger ecosystem of locally executed tools.

I'd like to see support for executing arbitrary notebook code (not just Spark calls) on remote Databricks clusters. This would allow local developers to seamlessly take advantage of Databricks compute for heavy, non-Spark workflows (model training for example).

Two approaches come to mind:

  1. Pipe commands to the Command Execution API, possibly using a local Jupyter kernel to interop between the notebook environment and Databricks.
  2. Connect to the driver node's Jupyter kernel over SSH.

Command Execution API

The Databricks Power Tools extension solves this by using the Command Execution API.

I don't know Rust, but as far as I can tell, the article "Connecting Jupyter with Databricks" aims to wrap the API in a local Jupyter kernel (which would allow connections from any Jupyter client).
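
For a rough sense of what option 1 involves, here is a minimal sketch (not the extension's actual implementation) that runs arbitrary Python on a cluster through the legacy Command Execution API (api/1.2) with plain REST calls. The host/token/cluster-ID environment variables are assumptions; a local Jupyter kernel wrapper would essentially do this once per cell.

```python
# Sketch: run arbitrary Python on a cluster via the legacy Command Execution
# API (api/1.2). Host, token, and cluster ID are assumed to come from the
# environment; error handling and output formatting are omitted.
import os
import time

import requests

HOST = os.environ["DATABRICKS_HOST"]        # e.g. https://<workspace>.cloud.databricks.com
TOKEN = os.environ["DATABRICKS_TOKEN"]
CLUSTER_ID = os.environ["DATABRICKS_CLUSTER_ID"]
HEADERS = {"Authorization": f"Bearer {TOKEN}"}


def create_context() -> str:
    """Create a persistent Python execution context on the cluster."""
    r = requests.post(
        f"{HOST}/api/1.2/contexts/create",
        headers=HEADERS,
        json={"clusterId": CLUSTER_ID, "language": "python"},
    )
    r.raise_for_status()
    return r.json()["id"]


def run(context_id: str, code: str) -> dict:
    """Submit a code snippet and poll until it finishes, returning the results."""
    r = requests.post(
        f"{HOST}/api/1.2/commands/execute",
        headers=HEADERS,
        json={
            "clusterId": CLUSTER_ID,
            "contextId": context_id,
            "language": "python",
            "command": code,
        },
    )
    r.raise_for_status()
    command_id = r.json()["id"]

    # Poll the command status until it reaches a terminal state.
    while True:
        s = requests.get(
            f"{HOST}/api/1.2/commands/status",
            headers=HEADERS,
            params={
                "clusterId": CLUSTER_ID,
                "contextId": context_id,
                "commandId": command_id,
            },
        )
        s.raise_for_status()
        state = s.json()
        if state["status"] in ("Finished", "Error", "Cancelled"):
            return state["results"]
        time.sleep(1)


if __name__ == "__main__":
    ctx = create_context()
    # Arbitrary (non-Spark) code runs on the driver, e.g. model training.
    print(run(ctx, "import platform; print(platform.python_version())"))
```
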

SSH

This seems the most straightforward approach in terms of net new code required. It also appears essentially identical to the deprecated (for security reasons?) jupyterlab-integration.
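
For comparison, a rough sketch of what option 2 could look like from the client side: open an SSH tunnel to the driver and let a local Jupyter client connect through it. This assumes cluster SSH access is configured (Databricks exposes driver SSH on port 2200 as user ubuntu) and that a Jupyter kernel/server is actually listening on the driver; the driver IP, key path, and ports below are placeholders.

```python
# Sketch: forward a local port to a Jupyter server on the cluster driver.
# Assumes driver SSH access is enabled (port 2200, user "ubuntu") and that
# something Jupyter-compatible is listening on the driver.
import os
import subprocess

DRIVER_IP = "10.0.0.42"                                  # hypothetical driver node address
KEY_PATH = os.path.expanduser("~/.ssh/databricks_rsa")   # key registered on the cluster
LOCAL_PORT, REMOTE_PORT = 8888, 8888

# -N: no remote command, just keep the tunnel open.
tunnel = subprocess.Popen([
    "ssh",
    "-i", KEY_PATH,
    "-p", "2200",
    "-N",
    "-L", f"{LOCAL_PORT}:localhost:{REMOTE_PORT}",
    f"ubuntu@{DRIVER_IP}",
])

# A local Jupyter client (VS Code, JupyterLab, ...) could then connect to
# localhost:8888, much like the old jupyterlab-integration did.
```
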

+1 on this

@kartikgupta-db - If you have a rough understanding of what would need to change for this to be implemented and would accept a PR, I'd be willing to have a go. I just need some guidance on getting started.