Azure/azure-databricks-operator

Failed to create Dcluster object

Closed this issue · 3 comments

I've just installed v0.30 and attempted to create a Dcluster using the sample YAML from config/samples.

Kubernetes version 1.13.10

I get the following error in the operator logs:

2019-10-14T17:55:03.104Z        INFO    controllers.Dcluster    Starting reconcile loop for kubeflow/dcluster-sample
2019-10-14T17:55:03.104Z        INFO    controllers.Dcluster    AddFinalizer for kubeflow/dcluster-sample
2019-10-14T17:55:03.128Z        INFO    controllers.Dcluster    Finish reconcile loop for kubeflow/dcluster-sample
2019-10-14T17:55:03.128Z        DEBUG   controller-runtime.controller   Successfully Reconciled {"controller": "dcluster", "request": "kubeflow/dcluster-sample"}
2019-10-14T17:55:03.128Z        INFO    controllers.Dcluster    Starting reconcile loop for kubeflow/dcluster-sample
2019-10-14T17:55:03.128Z        INFO    controllers.Dcluster    Submit for kubeflow/dcluster-sample
2019-10-14T17:55:03.128Z        INFO    controllers.Dcluster    Create cluster dcluster-sample
2019-10-14T17:55:03.128Z        DEBUG   controller-runtime.manager.events       Normal  {"object": {"kind":"Dcluster","namespace":"kubeflow","name":"dcluster-sample","uid":"bf7ba102-eeab-11e9-a0ba-1e18e514b3df","apiVersion":"databricks.microsoft.com/v1alpha1","resourceVersion":"8427"}, "reason": "Added", "message": "Object finalizer is added"}
2019-10-14T17:55:10.006Z        INFO    controllers.Dcluster    Finish reconcile loop for kubeflow/dcluster-sample
2019-10-14T17:55:10.006Z        DEBUG   controller-runtime.controller   Successfully Reconciled {"controller": "dcluster", "request": "kubeflow/dcluster-sample"}
2019-10-14T17:55:10.006Z        INFO    controllers.Dcluster    Starting reconcile loop for kubeflow/dcluster-sample
2019-10-14T17:55:10.007Z        INFO    controllers.Dcluster    Refresh for kubeflow/dcluster-sample
2019-10-14T17:55:10.007Z        INFO    controllers.Dcluster    Refresh cluster dcluster-sample
2019-10-14T17:55:10.007Z        DEBUG   controller-runtime.manager.events       Normal  {"object": {"kind":"Dcluster","namespace":"kubeflow","name":"dcluster-sample","uid":"bf7ba102-eeab-11e9-a0ba-1e18e514b3df","apiVersion":"databricks.microsoft.com/v1alpha1","resourceVersion":"8443"}, "reason": "Submitted", "message": "Object is submitted"}
2019-10-14T17:55:10.704Z        INFO    controllers.Dcluster    Finish reconcile loop for kubeflow/dcluster-sample
2019-10-14T17:55:10.704Z        ERROR   controller-runtime.controller   Reconciler error        {"controller": "dcluster", "request": "kubeflow/dcluster-sample", "error": "error when refreshing cluster: unexpected end of JSON input"}
github.com/go-logr/zapr.(*zapLogger).Error
        /go/pkg/mod/github.com/go-logr/zapr@v0.1.0/zapr.go:128
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
        /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.2.0-beta.4/pkg/internal/controller/controller.go:218
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
        /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.2.0-beta.4/pkg/internal/controller/controller.go:192
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker
        /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.2.0-beta.4/pkg/internal/controller/controller.go:171
k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1
        /go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190404173353-6a84e37a896d/pkg/util/wait/wait.go:152
k8s.io/apimachinery/pkg/util/wait.JitterUntil
        /go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190404173353-6a84e37a896d/pkg/util/wait/wait.go:153
k8s.io/apimachinery/pkg/util/wait.Until
        /go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190404173353-6a84e37a896d/pkg/util/wait/wait.go:88

$ k get dclusters.databricks.microsoft.com
NAME              AGE   CLUSTERID              STATE   NUMWORKERS
dcluster-sample   2m    1014-175509-erred163

$ k describe dclusters.databricks.microsoft.com  dcluster-sample
Name:         dcluster-sample
Namespace:    kubeflow
Labels:       <none>
Annotations:  kubectl.kubernetes.io/last-applied-configuration:
                {"apiVersion":"databricks.microsoft.com/v1alpha1","kind":"Dcluster","metadata":{"annotations":{},"name":"dcluster-sample","namespace":"kub...
API Version:  databricks.microsoft.com/v1alpha1
Kind:         Dcluster
Metadata:
  Creation Timestamp:  2019-10-14T17:55:03Z
  Finalizers:
    dcluster.finalizers.databricks.microsoft.com
  Generation:        2
  Resource Version:  8443
  Self Link:         /apis/databricks.microsoft.com/v1alpha1/namespaces/kubeflow/dclusters/dcluster-sample
  UID:               bf7ba102-eeab-11e9-a0ba-1e18e514b3df
Spec:
  Autoscale:
    max_workers:  5
    min_workers:  2
  cluster_name:   dcluster-sample
  node_type_id:   Standard_D3_v2
  spark_version:  5.3.x-scala2.11
Status:
  cluster_info:
    cluster_cores:  0
    cluster_id:     1014-175509-erred163
Events:
  Type    Reason     Age    From                 Message
  ----    ------     ----   ----                 -------
  Normal  Added      3m40s  dcluster-controller  Object finalizer is added
  Normal  Submitted  3m33s  dcluster-controller  Object is submitted

Looking into this a bit deeper, I believe that this is caused by a bug in the SDK: https://github.com/xinsnake/databricks-sdk-golang/issues/2

@Azadehkhojandi This still seems to be happening. The cluster gets created, but the CRD status contains only minimal cluster_info and the error still appears in the operator logs. (Thanks to @storey247 for testing!)

Closing this, as #107 merges the fix in the SDK.