PowerShell Tools for Deploying & Managing Databricks Solutions in Azure. These cmdlets help you build continuous delivery pipelines and keep your scripts under better source control.
Supports Windows PowerShell 5 and PowerShell Core 6.1+. We generally recommend PowerShell Core where possible: modules load faster, and downloads of large DBFS files may fail in older versions.
See the Wiki for command help.
More detail on use cases for these tools: https://datathirst.net/blog/2019/1/18/powershell-for-azure-databricks
https://www.powershellgallery.com/packages/azure.databricks.cicd.tools
Install-Module -Name azure.databricks.cicd.tools -Scope CurrentUser
Followed by:
Import-Module -Name azure.databricks.cicd.tools
To upgrade from a previous version:
Update-Module -Name azure.databricks.cicd.tools
Please note the use of AAD Authentication and Service Principals with Databricks is in Preview only. These commands are liable to change and/or break at any time.
You must install version 2 or higher of azure.databricks.cicd.tools
Install-Module -Name azure.databricks.cicd.tools -MinimumVersion 2.0.39 -Force
Import-Module -Name azure.databricks.cicd.tools -MinimumVersion 2.0.39 -Force
If you receive a message that -AllowPrerelease is an unknown parameter, run Install-Module PowerShellGet -Force as Administrator and restart your PowerShell session.
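To confirm which version is installed before using the connection commands, a minimal sketch with built-in PowerShell cmdlets might look like the following. The 2.0.39 threshold comes from the install command above; the variable names are illustrative only.

```powershell
# Module name and the minimum version used in the install command above.
$moduleName     = 'azure.databricks.cicd.tools'
$minimumVersion = [version]'2.0.39'

# Pick the newest installed copy of the module, if there is one.
$installed = Get-Module -ListAvailable -Name $moduleName |
    Sort-Object -Property Version -Descending |
    Select-Object -First 1

if ($null -eq $installed) {
    Write-Output "$moduleName is not installed."
}
elseif ($installed.Version -lt $minimumVersion) {
    Write-Output "$moduleName $($installed.Version) is older than $minimumVersion."
}
else {
    Write-Output "$moduleName $($installed.Version) meets the minimum version."
}
```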
Create a new Service Principal. You will need the following:
- ApplicationId (also known as ClientId)
- Secret Key
- TenantId
Make the Service Principal a Contributor on your Databricks Workspace using the Access Control (IAM) blade in the portal.
You must first create a connection to Databricks. Three methods are currently supported:
- DIRECT: provide the ApplicationId/Secret and the Databricks OrganisationId for your workspace
  - The OrganisationId is the o=1234567890 number in the URL when you use your workspace
- MANAGEMENT: provide the ApplicationId/Secret and the SubscriptionId, Resource Group Name & Workspace Name
- BEARER: provide a Bearer token to connect as your own user account
  - This is the classic method and is not recommended for automated processes
  - It is still useful for running ad hoc commands from your desktop
NOTE: The first time a service principal connects it must use the MANAGEMENT method, as this provisions the service principal in the workspace. After that you can use the DIRECT method. If you skip this step you will receive a 403 Unauthorized response on all commands.
DIRECT:
Connect-Databricks -Region "westeurope" -ApplicationId "8a686772-0e5b-4cdb-ad19-bf1d1e7f89f3" -Secret "myPrivateSecret" `
-DatabricksOrgId 1234567 `
-TenantId "8a686772-0e5b-4cdb-ad19-bf1d1e7f89f3"
MANAGEMENT:
Connect-Databricks -Region "westeurope" -ApplicationId "8a686772-0e5b-4cdb-ad19-bf1d1e7f89f3" -Secret "myPrivateSecret" `
-ResourceGroupName "MyResourceGroup" `
-SubscriptionId "9a686882-0e5b-4edb-cd49-cf1f1e7f34d9" `
-WorkspaceName "workspaceName" `
-TenantId "8a686772-0e5b-4cdb-ad19-bf1d1e7f89f3"
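The provisioning order described in the note above (MANAGEMENT first, DIRECT afterwards) can be sketched with splatting so the shared values are written once. This is only a sketch: the parameter names are those shown in the examples above, while the `DATABRICKS_SP_SECRET` environment variable is a hypothetical name used here to avoid hard-coding the secret.

```powershell
# Values shared by both connection methods. The secret is read from a
# hypothetical environment variable rather than written into the script.
$common = @{
    Region        = 'westeurope'
    ApplicationId = '8a686772-0e5b-4cdb-ad19-bf1d1e7f89f3'
    Secret        = $env:DATABRICKS_SP_SECRET
    TenantId      = '8a686772-0e5b-4cdb-ad19-bf1d1e7f89f3'
}

# First connection: MANAGEMENT, which provisions the service principal.
$management = @{
    ResourceGroupName = 'MyResourceGroup'
    SubscriptionId    = '9a686882-0e5b-4edb-cd49-cf1f1e7f34d9'
    WorkspaceName     = 'workspaceName'
}
Connect-Databricks @common @management

# Later sessions: DIRECT, using the organisation id from the workspace URL.
Connect-Databricks @common -DatabricksOrgId 1234567
```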
You can also use this command to connect with a Bearer token so that you do not have to provide it on every command (as you did prior to version 2).
BEARER:
Connect-Databricks -BearerToken "dapi1234567890" -Region "westeurope"
You can now execute commands as required without providing further authentication in this PowerShell session:
Get-DatabricksClusters
You can continue to pass the bearer token on every request; this overrides the session connection, if any:
Get-DatabricksClusters -BearerToken "dapi1234567890" -Region "westeurope"
This is supported only for backwards compatibility with version 1.
For a full list of commands with help please see the Wiki.
- Set-DatabricksSecret
- Add-DatabricksSecretScope
Deploys a Secret value to Databricks. This can be a key to a storage account, a password, and so on. The secret must be created within a scope, which will be created for you if it does not exist.
Please note that the Databricks REST API currently does not support the adding of Key Vault backed scopes so these commands cannot either.
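As a rough illustration of deploying a secret, the sketch below splats the values into Set-DatabricksSecret. The parameter names `-ScopeName`, `-SecretName` and `-SecretValue` are assumptions not stated in this section, so check the Wiki before relying on them. The scope name, secret name and environment variable are hypothetical.

```powershell
# Assumes Connect-Databricks has already been called in this session.
# ScopeName, SecretName and SecretValue are assumed parameter names.
$secret = @{
    ScopeName   = 'MyScope'                 # hypothetical scope name
    SecretName  = 'StorageAccountKey'       # hypothetical secret name
    SecretValue = $env:STORAGE_ACCOUNT_KEY  # hypothetical environment variable
}

# Per the description above, the scope is created if it does not exist.
Set-DatabricksSecret @secret
```

Reading the value from an environment variable keeps the secret out of source control, which matters when the script runs in a pipeline.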
The following commands exist:
- Get-DatabricksClusters - Returns a list of all clusters in your workspace
- New-DatabricksCluster - Creates/Updates a cluster
- Start-DatabricksCluster
- Stop-DatabricksCluster
- Update-DatabricksClusterResize - Modify the number of scale workers
- Remove-DatabricksCluster - Deletes your cluster
- Get-DatabricksNodeTypes - Returns a list of valid node types (such as DS3v2)
- Get-DatabricksSparkVersions - Returns a list of valid Spark versions
Please see the scripts for the parameters. Examples are available in the Tests folder.
These have been designed with CI/CD in mind, i.e. they should all be idempotent.
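A small sketch of inspecting clusters from a script follows. It assumes the objects returned by Get-DatabricksClusters carry the same field names as the Databricks REST API (`cluster_name`, `cluster_id`, `state`); that mapping is an assumption, not something this section confirms.

```powershell
# Filter that keeps only clusters whose state is RUNNING.
# The 'state' property name is assumed to mirror the REST API field.
$isRunning = { $_.state -eq 'RUNNING' }

# Assumes an existing session from Connect-Databricks.
Get-DatabricksClusters |
    Where-Object $isRunning |
    Select-Object -Property cluster_name, cluster_id
```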
- Add-DatabricksDBFSFile - Upload a file or folder to DBFS
- Remove-DatabricksDBFSItem - Delete a file or folder
- Get-DatabricksDBFSFolder - List folder contents
Add-DatabricksDBFSFile can be used as part of a CI/CD pipeline to upload your source code or dependent libraries to DBFS. You can also use it to deploy initialisation scripts for your clusters.
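For example, a pipeline step that uploads built libraries might look like the sketch below. The parameter names `-LocalRootFolder`, `-FilePattern` and `-TargetLocation` are assumptions not listed in this section, and the folder, pattern and DBFS path are hypothetical. Check the Wiki for the actual signature.

```powershell
# Assumes an existing session from Connect-Databricks.
# LocalRootFolder, FilePattern and TargetLocation are assumed parameter names.
$upload = @{
    LocalRootFolder = './build'   # hypothetical local folder with build output
    FilePattern     = '*.jar'     # hypothetical pattern selecting files to upload
    TargetLocation  = '/libs'     # hypothetical DBFS destination folder
}

Add-DatabricksDBFSFile @upload
```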
Pull down a folder of scripts from your Databricks workspace so that you can commit the files to your Git repo. It is recommended that you set LocalOutputPath to a folder inside your Git repo.
Parameters
-ExportPath: The folder inside Databricks you would like to clone, e.g. /Shared/MyETL. Must start with /
-LocalOutputPath: The local folder to clone the files to. Ideally inside a repo. Can be fully qualified or relative.
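A minimal sketch of an export call follows. The two parameters are the ones listed above. The cmdlet name is not stated in this section, so `Export-DatabricksFolder` is an assumption, and `./notebooks` is a hypothetical folder inside your repo.

```powershell
# Parameters as documented above; the local folder is hypothetical.
$export = @{
    ExportPath      = '/Shared/MyETL'
    LocalOutputPath = './notebooks'
}

# The workspace path must start with '/', so fail early if it does not.
if (-not $export.ExportPath.StartsWith('/')) {
    throw "ExportPath must start with '/'"
}

# Assumed cmdlet name; requires an existing session from Connect-Databricks.
Export-DatabricksFolder @export
```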
Deploy a folder of scripts from a local folder (Git repo) to a specific folder in your Databricks workspace.
Parameters
-LocalPath: The local folder containing the scripts to deploy. Subfolders will also be deployed.
-DatabricksPath: The folder inside Databricks you would like to deploy into, e.g. /Shared/MyETL. Must start with /
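The deployment direction can be sketched the same way. The parameters are the two listed above. The cmdlet name is not stated in this section, so `Import-DatabricksFolder` is an assumption, and `./notebooks` is a hypothetical local folder.

```powershell
# Parameters as documented above; the local folder is hypothetical.
$deploy = @{
    LocalPath      = './notebooks'
    DatabricksPath = '/Shared/MyETL'
}

# The workspace path must start with '/', so fail early if it does not.
if (-not $deploy.DatabricksPath.StartsWith('/')) {
    throw "DatabricksPath must start with '/'"
}

# Assumed cmdlet name; requires an existing session from Connect-Databricks.
Import-DatabricksFolder @deploy
```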
- Add-DatabricksNotebookJob - Schedule a job based on a Notebook.
- Add-DatabricksPythonJob - Schedule a job based on a Python script (stored in DBFS).
- Add-DatabricksJarJob - Schedule a job based on a Jar (stored in DBFS).
- Add-DatabricksSparkSubmitJob - Schedule a job based on a spark-submit command.
- Remove-DatabricksJob
- Add-DatabricksLibrary
- Get-DatabricksLibraries
The following command calls the API directly. Look up the syntax in the Databricks REST API documentation (https://docs.databricks.com/dev-tools/api/latest/index.html):
- Invoke-DatabricksAPI
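As an illustration, listing clusters through the raw API might look like the sketch below. The endpoint `api/2.0/clusters/list` is part of the Databricks REST API, but the `-Method` and `-API` parameter names are assumptions not confirmed in this section, so check the Wiki first.

```powershell
# Assumes an existing session from Connect-Databricks.
# Method and API are assumed parameter names for Invoke-DatabricksAPI.
$request = @{
    Method = 'GET'
    API    = 'api/2.0/clusters/list'
}

$response = Invoke-DatabricksAPI @request
$response
```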
See the Wiki for help on the commands. You can also see more examples in the tests folder.
Deployment tasks exist here: https://marketplace.visualstudio.com/items?itemName=DataThirstLtd.databricksDeployScriptsTasks
Note that not all cmdlets are available as tasks. Where a task is missing, you can import the module and write PowerShell scripts that call the cmdlets directly.
Contributions are welcomed! Please create a pull request with changes/additions.
Before requesting a new feature, please check the Databricks REST API documentation to see whether it is supported.