- Example of running pytest integration tests on Databricks with a wheel, using the runs submit API endpoint.
- Demonstrates the steps that you need to integrate into your CI/CD test build.
Setup
- Create conda environment
- Create JSON spec file for the test run
- Uses runs submit API endpoint
- Uses SparkPythonTask
- Copy run_tests.py to DBFS to serve as the main program for the SparkPythonTask
Sample test project
Steps to run tests
- Build the wheel with your business logic and tests
- Copy the wheel to DBFS
- Launch the test run with runs submit using the JSON spec file
- Poll the run until it is in a TERMINATED or INTERNAL_ERROR state
- Download test results file junit.xml
- Display URI to driver logs
- Manual - execute the above steps by hand
- Automated - Python workflow scripts automate the above steps
Databricks Workspace and Authentication
Your workspace URI and credentials are specified the same way as for the Databricks CLI. See Set up authentication in the Databricks documentation.
Conda environment
Create a conda environment and install the databricks-cli PyPI package.
See conda.yaml.
conda env create --file conda.yaml
source activate databricks-tests
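For reference, a conda.yaml along these lines is enough. This is a minimal sketch, assuming Python 3.7; the conda.yaml in the repository is authoritative.
name: databricks-tests
dependencies:
  - python=3.7
  - pip
  - pip:
    - databricks-cli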
Configure JSON run spec file
Copy run_submit.json.template and change {DBFS_DIR} to point to your DBFS location.
cd example
sed -e "s;{DBFS_DIR};dbfs:/jobs/myapp;" < run_submit.json.template > run_submit.json
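After substitution, run_submit.json is a standard Runs Submit 2.0 request. A rough sketch of its shape follows; the cluster settings and run name here are assumptions, and the template in the repository is authoritative.
{
  "run_name": "pytest integration tests",
  "new_cluster": {
    "spark_version": "7.0.x-scala2.12",
    "node_type_id": "i3.xlarge",
    "num_workers": 1
  },
  "libraries": [
    { "whl": "dbfs:/jobs/myapp/databricks_tests_example-0.0.1-py3-none-any.whl" }
  ],
  "spark_python_task": {
    "python_file": "dbfs:/jobs/myapp/run_tests.py"
  }
}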
Copy test harness main program to DBFS
See run_tests.py.
databricks fs cp ../test-harness/databricks_test_harness/run_tests.py dbfs:/jobs/myapp --overwrite
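For orientation, a run_tests.py harness might boil down to the following sketch. It assumes the tests ship inside the wheel as an importable tests package and that the DBFS output directory is passed as the first program argument; the module under test-harness/ is authoritative.
# Run the tests installed from the wheel and write junit.xml back to DBFS.
import sys
import shutil
import pytest

if __name__ == "__main__":
    # e.g. "dbfs:/jobs/myapp" -> FUSE mount path "/dbfs/jobs/myapp" (assumed default)
    out_dir = sys.argv[1].replace("dbfs:", "/dbfs") if len(sys.argv) > 1 else "/dbfs/jobs/myapp"
    # Collect the tests package installed from the wheel and emit a JUnit XML report
    rc = pytest.main(["--pyargs", "tests", "--junitxml=/tmp/junit.xml"])
    # Copy the report to DBFS so it can be downloaded after the run
    shutil.copy("/tmp/junit.xml", out_dir + "/junit.xml")
    sys.exit(int(rc))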
Build the wheel with tests
python setup.py bdist_wheel
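The wheel needs to include the tests alongside the business logic so pytest can collect them from site-packages on the cluster. A hypothetical setup.py sketch; the project's own setup.py is authoritative.
# Package the business logic together with the tests.
from setuptools import setup, find_packages

setup(
    name="databricks-tests-example",
    version="0.0.1",
    packages=find_packages(),  # picks up the app package and tests/
    install_requires=["pytest"],
)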
Copy the wheel to DBFS
databricks fs cp dist/databricks_tests_example-0.0.1-py3-none-any.whl dbfs:/jobs/myapp
Launch run to execute tests on Databricks cluster
databricks runs submit --json-file run_submit.json
{
"run_id": 2404336
}
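The CLI call is a thin wrapper over the jobs/runs/submit REST endpoint, so the launch can also be scripted directly. A sketch, assuming DATABRICKS_HOST (with scheme) and DATABRICKS_TOKEN are set in the environment:
# POST the JSON spec to the Runs Submit endpoint and print the run_id.
import json
import os
import requests

host = os.environ["DATABRICKS_HOST"]
token = os.environ["DATABRICKS_TOKEN"]
with open("run_submit.json") as f:
    spec = json.load(f)
resp = requests.post(
    f"{host}/api/2.0/jobs/runs/submit",
    headers={"Authorization": f"Bearer {token}"},
    json=spec)
resp.raise_for_status()
print(resp.json()["run_id"])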
Poll the run until it is done
Execute the following command repeatedly until the run is in a TERMINATED or INTERNAL_ERROR state.
databricks runs get --run-id 2404336
"state": {
"life_cycle_state": "PENDING",
"state_message": "Waiting for cluster"
},
"state": {
"life_cycle_state": "RUNNING",
"state_message": "In run"
},
"state": {
"life_cycle_state": "TERMINATED",
"result_state": "SUCCESS",
"state_message": ""
},
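Scripted, the polling loop might look like this sketch against the jobs/runs/get endpoint, with the same environment variables as above and the run_id from the submit step:
# Poll until the run reaches a terminal life cycle state.
import os
import time
import requests

host = os.environ["DATABRICKS_HOST"]
token = os.environ["DATABRICKS_TOKEN"]
run_id = 2404336  # from the runs submit response

while True:
    resp = requests.get(
        f"{host}/api/2.0/jobs/runs/get",
        headers={"Authorization": f"Bearer {token}"},
        params={"run_id": run_id})
    resp.raise_for_status()
    state = resp.json()["state"]
    print(state)
    if state["life_cycle_state"] in ("TERMINATED", "INTERNAL_ERROR"):
        break
    time.sleep(5)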
Check the output logs
Open the run_page_url in your browser to view the logs in the Spark UI.
"run_page_url": "https://demo.cloud.databricks.com#job/35950/run/1",
Download the test results
See junit.xml.
databricks fs cp dbfs:/jobs/myapp/junit.xml .
<?xml version="1.0" ?>
<testsuites>
<testsuite
errors="0"
failures="0"
hostname="0716-190309-boll22-10-0-249-156"
name="pytest"
skipped="0"
tests="2"
time="8.975"
timestamp="2020-07-16T19:06:17.155587">
<testcase classname="test_simple.TestSimple"
file="../python3/lib/python3.7/site-packages/tests/test_simple.py"
line="3"
name="test_simple"
time="0.001"/>
<testcase classname="test_spark.TestSpark"
file="../python3/lib/python3.7/site-packages/tests/test_spark.py"
line="6"
name="test_multiply"
time="8.829"/>
</testsuite>
</testsuites>
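To gate a CI/CD build on the results, check the report programmatically. A minimal sketch using only the standard library:
# Fail the build if junit.xml reports any errors or failures.
import sys
import xml.etree.ElementTree as ET

root = ET.parse("junit.xml").getroot()
bad = sum(
    int(suite.get("errors", 0)) + int(suite.get("failures", 0))
    for suite in root.iter("testsuite"))
print(f"errors+failures: {bad}")
sys.exit(1 if bad else 0)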
Run tests automatically
First install the test harness package. This is a one-time initialization step.
cd example
pip install -e ../test-harness
The configure program does the following (see the sketch after the commands below):
- Creates the DBFS test directory
- Copies the test harness program run_tests.py to DBFS
python -m databricks_test_harness.configure \
--test_dir dbfs:/jobs/myapp
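Under the hood this amounts to two Databricks CLI calls; a sketch of what the configure module might do (the shipped databricks_test_harness.configure is authoritative, and the run_tests.py path here is an assumption):
# Create the DBFS test directory and copy the harness main program to it.
import subprocess

def configure(test_dir):
    subprocess.run(["databricks", "fs", "mkdirs", test_dir], check=True)
    subprocess.run(
        ["databricks", "fs", "cp",
         "../test-harness/databricks_test_harness/run_tests.py",
         test_dir, "--overwrite"],
        check=True)

configure("dbfs:/jobs/myapp")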
Then build the wheel and copy it to DBFS as before.
python setup.py bdist_wheel
databricks fs cp dist/databricks_tests_example-0.0.1-py3-none-any.whl dbfs:/jobs/myapp --overwrite
Finally, execute the run_databricks_tests.py program, which:
- Automates the individual API calls that comprise a test run
- Launches the test run
- Polls until the run is in TERMINATED or INTERNAL_ERROR state
- Downloads the junit.xml test results file
- Displays URI to driver logs
python -u -m databricks_test_harness.run_databricks_tests \
--json_spec_file run_submit.json \
--sleep_time 5
run_id: 2404423 state: {'life_cycle_state': 'PENDING', 'state_message': ''}
run_id: 2404423 state: {'life_cycle_state': 'PENDING', 'state_message': 'Waiting for cluster'}
cluster_id: 0727-174421-hag299
run_id: 2404423 state: {'life_cycle_state': 'PENDING', 'state_message': 'Waiting for cluster'}
. . .
run_id: 2404423 state: {'life_cycle_state': 'PENDING', 'state_message': 'Installing libraries'}
. . .
run_id: 2404423 state: {'life_cycle_state': 'RUNNING', 'state_message': 'In run'}
. . .
run_id: 2404423 state: {'life_cycle_state': 'TERMINATING', 'result_state': 'SUCCESS', 'state_message': 'Terminating the spark cluster'}
run_id: 2404423 state: {'life_cycle_state': 'TERMINATED', 'result_state': 'SUCCESS', 'state_message': ''}
Run results: https://demo.cloud.databricks.com#job/36040/run/1
Check test results.
cat junit.xml
Check driver results.
Open https://demo.cloud.databricks.com#job/36040/run/1 in your browser.