Tools for Deploying Databricks Solutions in Azure

Primary language: PowerShell. License: GNU General Public License v3.0 (GPL-3.0).

azure.databricks.cicd.tools

PowerShell Tools for Deploying & Managing Databricks Solutions in Azure. These cmdlets help you build continuous delivery pipelines and keep your scripts under better source control.

Overview

Supports Windows PowerShell 5 and PowerShell Core 6.1+. We generally recommend PowerShell Core where possible: it loads modules faster, and downloading large DBFS files may fail in older versions.

See the Wiki for command help.

More detail on use cases for these tools is available at https://datathirst.net/blog/2019/1/18/powershell-for-azure-databricks

Install-Module

https://www.powershellgallery.com/packages/azure.databricks.cicd.tools

Install-Module -Name azure.databricks.cicd.tools -Scope CurrentUser

Followed by:

Import-Module -Name azure.databricks.cicd.tools

To upgrade from a previous version:

Update-Module -Name azure.databricks.cicd.tools

Connecting

Using AAD Service Principals (PREVIEW)

Please note the use of AAD Authentication and Service Principals with Databricks is in Preview only. These commands are liable to change and/or break at any time.

Install

You must install version 2 or higher of azure.databricks.cicd.tools

Install-Module -Name azure.databricks.cicd.tools -MinimumVersion 2.0.39 -Force
Import-Module -Name azure.databricks.cicd.tools -MinimumVersion 2.0.39 -Force

If you receive a message that -AllowPrerelease is an unknown parameter, please run Install-Module PowerShellGet -Force as Administrator and restart your PowerShell session.

Create Service Principal

Create a new Service Principal. You will need the following:

  • ApplicationId (also known as ClientId)
  • Secret Key
  • TenantId

Make the Service Principal a Contributor on your Databricks Workspace using the Access Control (IAM) blade in the portal.

Connect-Databricks

You must first create a connection to Databricks. Currently there are three methods supported:

  • Provide the ApplicationId/Secret and the Databricks OrganisationId for your workspace - known as DIRECT
    • This is the o=1234567890 number in the URL when you use your workspace
  • Provide the ApplicationId/Secret and the SubscriptionID, Resource Group Name & Workspace Name - known as MANAGEMENT
  • Provide a Bearer token to connect as your own user account - known as BEARER
    • This is the classic method and not recommended for automated processes
    • It is however still useful for running adhoc commands from your desktop

NOTE: The first time a service principal connects it must use the MANAGEMENT method, as this provisions the service principal in the workspace. Thereafter you can use the DIRECT method. If you skip this first step you will receive a 403 Unauthorized response on all commands.

Examples

DIRECT:

Connect-Databricks -Region "westeurope" -ApplicationId "8a686772-0e5b-4cdb-ad19-bf1d1e7f89f3" -Secret "myPrivateSecret" `
            -DatabricksOrgId 1234567 `
            -TenantId "8a686772-0e5b-4cdb-ad19-bf1d1e7f89f3"

MANAGEMENT:

Connect-Databricks -Region "westeurope" -ApplicationId "8a686772-0e5b-4cdb-ad19-bf1d1e7f89f3" -Secret "myPrivateSecret" `
            -ResourceGroupName "MyResourceGroup" `
            -SubscriptionId "9a686882-0e5b-4edb-cd49-cf1f1e7f34d9" `
            -WorkspaceName "workspaceName" `
            -TenantId "8a686772-0e5b-4cdb-ad19-bf1d1e7f89f3"

You can also use this command to connect using a Bearer token, so that you do not have to provide the token on every command (as you did prior to version 2).

BEARER:

Connect-Databricks -BearerToken "dapi1234567890" -Region "westeurope"

You can now execute commands as required without providing further authentication in this PowerShell session:

Get-DatabricksClusters
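
In a pipeline it can help to assemble the connection arguments once and splat them into Connect-Databricks. The following is a minimal sketch rather than part of the module: the helper name Get-DatabricksConnectArgs is hypothetical, and it only uses the parameter names shown in the DIRECT and MANAGEMENT examples above.

```powershell
# Hypothetical helper (not part of azure.databricks.cicd.tools): returns a
# hashtable of arguments for the DIRECT or MANAGEMENT method.
function Get-DatabricksConnectArgs {
    [CmdletBinding(DefaultParameterSetName = 'Direct')]
    param(
        [Parameter(Mandatory)] [string] $Region,
        [Parameter(Mandatory)] [string] $ApplicationId,
        [Parameter(Mandatory)] [string] $Secret,
        [Parameter(Mandatory)] [string] $TenantId,

        # DIRECT: the o=... number from the workspace URL
        [Parameter(Mandatory, ParameterSetName = 'Direct')]
        [string] $DatabricksOrgId,

        # MANAGEMENT: locates the workspace through the Azure subscription
        [Parameter(Mandatory, ParameterSetName = 'Management')]
        [string] $SubscriptionId,
        [Parameter(Mandatory, ParameterSetName = 'Management')]
        [string] $ResourceGroupName,
        [Parameter(Mandatory, ParameterSetName = 'Management')]
        [string] $WorkspaceName
    )

    # Arguments common to both methods
    $connectArgs = @{
        Region        = $Region
        ApplicationId = $ApplicationId
        Secret        = $Secret
        TenantId      = $TenantId
    }

    if ($PSCmdlet.ParameterSetName -eq 'Management') {
        $connectArgs.SubscriptionId    = $SubscriptionId
        $connectArgs.ResourceGroupName = $ResourceGroupName
        $connectArgs.WorkspaceName     = $WorkspaceName
    }
    else {
        $connectArgs.DatabricksOrgId = $DatabricksOrgId
    }

    return $connectArgs
}

# Usage (needs the module and a real workspace, so shown as comments):
#   $connectArgs = Get-DatabricksConnectArgs -Region 'westeurope' `
#       -ApplicationId $appId -Secret $secret -TenantId $tenantId `
#       -DatabricksOrgId '1234567'
#   Connect-Databricks @connectArgs
```

Reading the secret from a secure pipeline variable instead of hard-coding it keeps it out of source control.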

Legacy Bearer Token Method

You can continue to execute commands using the bearer token in every request; this overrides the session connection, if any:

 Get-DatabricksClusters -BearerToken "dapi1234567890" -Region "westeurope"

This exists only to provide backwards compatibility with version 1.

Commands

For a full list of commands with help please see the Wiki.

Secrets

  • Set-DatabricksSecret
  • Add-DatabricksSecretScope

Deploys a secret value to Databricks; this can be a key to a storage account, a password, etc. The secret must be created within a scope, which will be created for you if it does not exist.

Please note that the Databricks REST API currently does not support the adding of Key Vault backed scopes so these commands cannot either.

Cluster Management

The following commands exist:

  • Get-DatabricksClusters - Returns a list of all clusters in your workspace
  • New-DatabricksCluster - Creates/Updates a cluster
  • Start-DatabricksCluster
  • Stop-DatabricksCluster
  • Update-DatabricksClusterResize - Modify the number of scale workers
  • Remove-DatabricksCluster - Deletes your cluster
  • Get-DatabricksNodeTypes - Returns a list of valid node types (such as DS3v2)
  • Get-DatabricksSparkVersions - Returns a list of valid versions

Please see the scripts for the parameters. Examples are available in the Tests folder.

These have been designed with CI/CD in mind, i.e. they should all be idempotent.
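
A pipeline step often needs to find one cluster in the list returned by Get-DatabricksClusters. The sketch below is illustrative only: the helper name Select-ClusterByName is hypothetical, and it assumes the returned objects expose the cluster_name field used by the Databricks REST API, which this README does not confirm.

```powershell
# Hypothetical helper: filters a list of cluster objects by name.
# Assumes each object has a cluster_name property (as in the REST API).
function Select-ClusterByName {
    param(
        [Parameter(Mandatory)] [AllowNull()] [AllowEmptyCollection()]
        [object[]] $Clusters,

        [Parameter(Mandatory)] [string] $Name
    )

    # Returns nothing when the list is empty or no cluster matches
    $Clusters | Where-Object { $_.cluster_name -eq $Name }
}

# Usage (needs a connected session, so shown as comments):
#   $cluster = Select-ClusterByName -Clusters (Get-DatabricksClusters) -Name 'etl-cluster'
#   if ($null -eq $cluster) { Write-Host 'Cluster not found' }
```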

DBFS

  • Add-DatabricksDBFSFile - Upload a file or folder to DBFS
  • Remove-DatabricksDBFSItem - Delete a file or folder
  • Get-DatabricksDBFSFolder - List folder contents

Add-DatabricksDBFSFile can be used as part of a CI/CD pipeline to upload your source code or dependent libraries to DBFS. You can also use it to deploy initialisation scripts for your clusters.

Notebooks

Export-DatabricksFolder

Pull down a folder of scripts from your Databricks workspace so that you can commit the files to your Git repo. It is recommended that you set LocalOutputPath to a folder inside your Git repo.

Parameters

-ExportPath: The folder inside Databricks you would like to clone, e.g. /Shared/MyETL. Must start with /
-LocalOutputPath: The local folder to clone the files to. Ideally inside a repo. Can be fully qualified or relative.
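
As an illustration of the two parameters above, the following sketch wraps Export-DatabricksFolder with a check that the workspace path starts with / and creates the local folder if it is missing. The wrapper name Export-WorkspaceFolderToRepo and the example paths are hypothetical, and a session opened with Connect-Databricks is assumed.

```powershell
# Hypothetical wrapper around Export-DatabricksFolder (not part of the module)
function Export-WorkspaceFolderToRepo {
    param(
        [Parameter(Mandatory)] [string] $ExportPath,
        [Parameter(Mandatory)] [string] $LocalOutputPath
    )

    # The workspace path must be absolute
    if (-not $ExportPath.StartsWith('/')) {
        throw "ExportPath must start with '/': $ExportPath"
    }

    # Make sure the local target folder exists before exporting
    if (-not (Test-Path -Path $LocalOutputPath)) {
        New-Item -ItemType Directory -Path $LocalOutputPath -Force | Out-Null
    }

    Export-DatabricksFolder -ExportPath $ExportPath -LocalOutputPath $LocalOutputPath
}

# Usage (needs a connected session, so shown as comments):
#   Export-WorkspaceFolderToRepo -ExportPath '/Shared/MyETL' -LocalOutputPath './notebooks'
```

After the export, the files under the local folder can be committed to Git in the usual way.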

Import-DatabricksFolder

Deploy a folder of scripts from a local folder (Git repo) to a specific folder in your Databricks workspace.

Parameters

-LocalPath: The local folder containing the scripts to deploy. Subfolders will also be deployed.
-DatabricksPath: The folder inside Databricks you would like to deploy into, e.g. /Shared/MyETL. Must start with /
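
A release pipeline typically deploys the same local folder to a different workspace folder per environment. The sketch below shows one way to do that with Import-DatabricksFolder. The wrapper name Publish-NotebookFolder, the environment names and the target folders are all hypothetical, and a session opened with Connect-Databricks is assumed.

```powershell
# Hypothetical mapping from pipeline environment to workspace folder
$targetFolders = @{
    dev  = '/Shared/MyETL-dev'
    prod = '/Shared/MyETL'
}

# Hypothetical wrapper around Import-DatabricksFolder (not part of the module)
function Publish-NotebookFolder {
    param(
        [Parameter(Mandatory)] [string] $LocalPath,
        [Parameter(Mandatory)] [string] $Environment
    )

    if (-not $targetFolders.ContainsKey($Environment)) {
        throw "Unknown environment '$Environment'"
    }
    if (-not (Test-Path -Path $LocalPath -PathType Container)) {
        throw "Local folder not found: $LocalPath"
    }

    # Subfolders of $LocalPath are deployed as well
    Import-DatabricksFolder -LocalPath $LocalPath -DatabricksPath $targetFolders[$Environment]
}

# Usage (needs a connected session, so shown as comments):
#   Publish-NotebookFolder -LocalPath './notebooks' -Environment 'dev'
```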

Jobs

  • Add-DatabricksNotebookJob - Schedule a job based on a Notebook.
  • Add-DatabricksPythonJob - Schedule a job based on a Python script (stored in DBFS).
  • Add-DatabricksJarJob - Schedule a job based on a Jar (stored in DBFS).
  • Add-DatabricksSparkSubmitJob - Schedule a job based on a spark-submit command.
  • Remove-DatabricksJob

Libraries

  • Add-DatabricksLibrary
  • Get-DatabricksLibraries

Missing Commands/Bugs

The following command can be used to call the API directly; just look up the syntax in the REST API documentation (https://docs.databricks.com/dev-tools/api/latest/index.html):

  • Invoke-DatabricksAPI

Examples

See the Wiki for help on the commands. You can also see more examples in the tests folder.

Misc

VSTS/Azure DevOps

Deployment tasks exist here: https://marketplace.visualstudio.com/items?itemName=DataThirstLtd.databricksDeployScriptsTasks

Note that not all cmdlets are available as tasks. Instead you may want to import the module and create PowerShell scripts that use them.

Contribute

Contributions are welcomed! Please create a pull request with changes/additions.

Requests

Before requesting a new feature, please check the Databricks REST API documentation to see whether it is supported.