This is a sample project for Databricks, generated via cookiecutter.
To work with this project you need Python 3.X and either `pip` or `conda` for package management.
- Instantiate a local Python environment via a tool of your choice. This example is based on `conda`, but you can use any environment management tool:

```bash
conda create -n ptbwa_dbx python=3.9
conda activate ptbwa_dbx
```
- If you don't have JDK installed on your local machine, install it (in this example we use a `conda`-based installation):

```bash
conda install -c conda-forge openjdk=11.0.15
```
- Install the project locally (this will also install dev requirements):

```bash
pip install -e ".[local,test]"
```
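The `local` and `test` extras referenced above are defined in the project's `setup.py`. As a rough sketch of how such extras are typically declared (the package lists below are assumptions; consult the actual `setup.py` for the authoritative dependencies):

```python
# Illustrative sketch of a setup.py with "local" and "test" extras;
# the real package lists live in this project's setup.py.
from setuptools import find_packages, setup

setup(
    name="ptbwa_dbx",
    packages=find_packages(exclude=["tests", "tests.*"]),
    extras_require={
        "local": ["pyspark", "delta-spark", "mlflow"],  # assumed: local Spark/Delta/MLflow tooling
        "test": ["pytest", "pytest-cov"],               # assumed: test runner and coverage plugin
    },
)
```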
For unit testing, please use `pytest`:

```bash
pytest tests/unit --cov
```
Please check the `tests/unit` directory for more details on how to use unit tests.

In `tests/unit/conftest.py` you'll also find useful testing primitives, such as a local Spark instance with Delta support, a local MLflow instance, and a DBUtils fixture.
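As a minimal sketch of how such a fixture can be used (the fixture name `spark` is an assumption; check `tests/unit/conftest.py` for the names actually provided):

```python
# tests/unit/test_sample.py: a minimal sketch of a unit test that relies on
# a `spark` fixture from conftest.py (assumed name) providing a local
# SparkSession with Delta support.
from pyspark.sql import SparkSession


def test_row_count(spark: SparkSession):
    # Build a small in-memory DataFrame and assert on its contents.
    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
    assert df.count() == 2
```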
There are two options for running integration tests:

- On an all-purpose cluster via `dbx execute`
- On a job cluster via `dbx launch`
For quicker startup of the job clusters we recommend using instance pools (AWS, Azure, GCP).
For an integration test on an all-purpose cluster, use the following command:

```bash
dbx execute <workflow-name> --cluster-name=<name of all-purpose cluster>
```
To execute a task inside a multitask job, use the following command:

```bash
dbx execute <workflow-name> \
    --cluster-name=<name of all-purpose cluster> \
    --job=<name of the job to test> \
    --task=<task-key-from-job-definition>
```
For a test on a job cluster, deploy the job assets and then launch a run from them:

```bash
dbx deploy <workflow-name> --assets-only
dbx launch <workflow-name> --from-assets --trace
```
- `dbx` expects that the cluster used for interactive execution supports `%pip` and `%conda` magic commands.
- Please configure your workflow (and tasks inside it) in the `conf/deployment.yml` file; a sketch of this file is shown at the end of this section.
- To execute the code interactively, provide either `--cluster-id` or `--cluster-name`:

```bash
dbx execute <workflow-name> \
    --cluster-name="<some-cluster-name>"
```
Multiple users can use the same cluster for development; libraries are isolated per user execution context.
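For reference, a workflow definition in `conf/deployment.yml` typically follows the shape below. This is an illustrative sketch only: the workflow name, cluster settings, and entry point are assumptions, not this project's actual configuration.

```yaml
# conf/deployment.yml: illustrative sketch of a dbx deployment file.
# Workflow name, cluster settings, and entry point below are assumptions.
environments:
  default:
    workflows:
      - name: "ptbwa-dbx-sample"
        job_clusters:
          - job_cluster_key: "default"
            new_cluster:
              spark_version: "11.3.x-scala2.12"
              num_workers: 1
              node_type_id: "i3.xlarge"  # assumed AWS node type; adjust per cloud
        tasks:
          - task_key: "main"
            job_cluster_key: "default"
            python_wheel_task:
              package_name: "ptbwa_dbx"
              entry_point: "main"  # assumed entry point name
```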
To start working with your notebooks from Repos, follow these steps:

1. Add your git provider token to your user settings in Databricks.
2. Add your repository to Repos. This can be done via the UI, or via the CLI command below:

```bash
databricks repos create --url <your repo URL> --provider <your-provider>
```

This command will create your personal repository under `/Repos/<username>/ptbwa_dbx`.

3. Use `git_source` in your job definition as described in the dbx documentation; an illustrative block is shown below.
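A `git_source` block inside a workflow definition might look as follows; the URL, provider, and branch values are placeholders and assumptions:

```yaml
# Illustrative git_source block for a workflow in conf/deployment.yml;
# all values below are assumptions, adjust to your repository.
git_source:
  git_url: "https://github.com/<your-org>/<your-repo>"
  git_provider: "gitHub"
  git_branch: "main"
```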
Please set the following secrets or environment variables for your CI provider:

- `DATABRICKS_HOST`
- `DATABRICKS_TOKEN`
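For example, with GitHub Actions (an assumption; adjust to your CI provider), the secrets can be exposed to `dbx` as environment variables:

```yaml
# Sketch of a GitHub Actions workflow wiring repository secrets into dbx;
# DATABRICKS_HOST and DATABRICKS_TOKEN must exist as repository secrets.
name: CI
on: push
jobs:
  ci-pipeline:
    runs-on: ubuntu-latest
    env:
      DATABRICKS_HOST: ${{ secrets.DATABRICKS_HOST }}
      DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_TOKEN }}
    steps:
      - uses: actions/checkout@v3
      - run: pip install -e ".[local,test]"
      - run: pytest tests/unit --cov
```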
- To trigger the CI pipeline, simply push your code to the repository. If the CI provider is correctly set up, it will trigger the general testing pipeline.
- To trigger the release pipeline, get the current version from the `ptbwa_dbx/__init__.py` file and tag the current code version:

```bash
git tag -a v<your-project-version> -m "Release tag for version <your-project-version>"
git push origin --tags
```