Reconciler error when refreshing a Run that has been deleted in Databricks

Question

magencio opened this issue 5 years ago · 0 comments

I'm getting a reconciler error in the operator when it tries to refresh the status of a Run that has already been deleted in Databricks via the Databricks API.

The following test reproduces the error:

Create a Run in k8s with kubectl and wait for it to appear in Databricks.
Before the Run finishes in Databricks, cancel it via the Databricks API (e.g. https://westeurope.azuredatabricks.net/api/2.0/jobs/runs/cancel). This is needed because you cannot delete an active run.
Immediately delete the Run in Databricks via the Databricks API (e.g. https://westeurope.azuredatabricks.net/api/2.0/jobs/runs/delete).
Wait for the next refresh of the Run in the operator. The operator will fail with this error, as the Run can no longer be found in Databricks:

2020-01-30T18:32:27.256Z        INFO    controllers.Run Refreshing run run-sample
2020-01-30T18:32:27.382Z        ERROR   controller-runtime.controller   Reconciler error        {"controller": "run", "request": "default/run-sample", "error": "error when refreshing run: Response from server (500) {\"error_code\":\"INTERNAL_ERROR\",\"message\":\"\"}"}
github.com/go-logr/zapr.(*zapLogger).Error
        /go/pkg/mod/github.com/go-logr/zapr@v0.1.0/zapr.go:128
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
        /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.4.0/pkg/internal/controller/controller.go:258
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
        /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.4.0/pkg/internal/controller/controller.go:232
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker
        /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.4.0/pkg/internal/controller/controller.go:211
k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1
        /go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190913080033-27d36303b655/pkg/util/wait/wait.go:152
k8s.io/apimachinery/pkg/util/wait.JitterUntil
        /go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190913080033-27d36303b655/pkg/util/wait/wait.go:153
k8s.io/apimachinery/pkg/util/wait.Until
        /go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190913080033-27d36303b655/pkg/util/wait/wait.go:88
2020-01-30T18:32:27.383Z        DEBUG   controller-runtime.manager.events       Warning {"object": {"kind":"Run","namespace":"default","name":"run-sample","uid":"747c08f3-4413-11ea-8b01-66e9e40fe22a","apiVersion":"databricks.microsoft.com/v1alpha1","resourceVersion":"21659807"}, "reason": "Refreshing object", "message": "Failed to refresh object: Response from server (500) {\"error_code\":\"INTERNAL_ERROR\",\"message\":\"\"}"}

This error will be raised in every reconcile loop after that.

Delete the Run in k8s with kubectl. The Run gets deleted just fine, and no more errors are raised.
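For anyone who wants to script the cancel and delete steps above, here is a minimal Go sketch of the two calls. It assumes the Jobs API 2.0 request shape (a POST with a JSON body containing `run_id`) and bearer-token authentication. The helper `newRunRequest`, the token placeholder, and the run ID 42 are made up for illustration and are not taken from the operator. The sketch only builds the requests; sending them is left as a comment.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// runRequest is the JSON body sent to runs/cancel and runs/delete.
type runRequest struct {
	RunID int64 `json:"run_id"`
}

// newRunRequest (hypothetical helper) builds, but does not send, a POST
// to /api/2.0/jobs/runs/<action> for the given run.
func newRunRequest(host, token, action string, runID int64) (*http.Request, error) {
	body, err := json.Marshal(runRequest{RunID: runID})
	if err != nil {
		return nil, err
	}
	url := fmt.Sprintf("%s/api/2.0/jobs/runs/%s", host, action)
	req, err := http.NewRequest(http.MethodPost, url, bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Authorization", "Bearer "+token)
	req.Header.Set("Content-Type", "application/json")
	return req, nil
}

func main() {
	host := "https://westeurope.azuredatabricks.net"
	token := "<personal-access-token>" // placeholder, replace with a real token
	var runID int64 = 42               // hypothetical run ID

	// Cancel first (an active run cannot be deleted), then delete.
	for _, action := range []string{"cancel", "delete"} {
		req, err := newRunRequest(host, token, action, runID)
		if err != nil {
			panic(err)
		}
		fmt.Println(req.Method, req.URL.String())
		// To actually send it: resp, err := http.DefaultClient.Do(req)
	}
}
```

Since cancellation may not complete instantly, a real script would probably need to check the result of the delete call and retry if the run is still active.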

So, summing up: we get this error in every reconcile loop of the operator if we delete the Run via the Databricks API before the operator next refreshes its status. Deleting the Run in k8s after it has been deleted in Databricks works fine, though.