canonical/seldon-core-operator

Error deploying a SeldonDeployment in a namespace other than kubeflow

Closed this issue · 3 comments

Deploying a SeldonDeployment works in the kubeflow namespace but fails in the admin namespace:

$ microk8s kubectl apply -f model-mlflow-local.yaml -n kubeflow
seldondeployment.machinelearning.seldon.io/mlflow created

$ microk8s kubectl apply -f model-mlflow-local.yaml -n admin
Error from server (InternalError): error when creating "model-mlflow-local.yaml": Internal error occurred: failed calling webhook "v1alpha2.vseldondeployment.kb.io": Post "https://seldon-webhook-service.kubeflow.svc:4443/validate-machinelearning-seldon-io-v1alpha2-seldondeployment?timeout=30s": dial tcp 10.152.183.150:4443: connect: connection refused

The same secret exists in both namespaces:

$ microk8s kubectl get secrets -A | grep seldon-init-container-secret
admin              seldon-init-container-secret                          Opaque                      6      114m
kubeflow           seldon-init-container-secret                          Opaque                      6      114m

YAML file (the modelUri needs to be changed to match your environment):

apiVersion: machinelearning.seldon.io/v1alpha2
kind: SeldonDeployment
metadata:
  name: mlflow
spec:
  name: wines
  predictors:
  - componentSpecs:
    - spec:
        containers:
        - name: classifier
          livenessProbe:
            initialDelaySeconds: 80
            failureThreshold: 200
            periodSeconds: 5
            successThreshold: 1
            httpGet:
              path: /health/ping
              port: http
              scheme: HTTP
          readinessProbe:
            initialDelaySeconds: 80
            failureThreshold: 200
            periodSeconds: 5
            successThreshold: 1
            httpGet:
              path: /health/ping
              port: http
              scheme: HTTP
    graph:
      children: []
      implementation: MLFLOW_SERVER
      modelUri: s3://mlflow/71/8476095066fd43af8ae2a6f1511044df/artifacts/model
      envSecretRefName: seldon-init-container-secret
      name: classifier
    name: wine-super-model
    replicas: 1

Deployment was done using kubeflow-lite bundle + mlflow (stable channel). RBAC is enabled on microk8s.

I think this is because we haven't added these permissions to the aggregate roles applied to each Kubeflow user. We had a similar problem for other components (Katib, kfp, ...) when moving to multi-user and solved it with the kubeflow-roles-operator. Likely we can fix this the same way.

Never mind, my first guess was completely wrong.

Instead, I believe the problem might be this: namespaces with the label serving.kubeflow.org/inferenceservice: enabled trigger a webhook that goes through the seldon-webhook-service. One of that service's selectors is control-plane: seldon-controller-manager, but the pod providing the webhook does not have that label, so the service does not match the pod. I'm looking into how we should be labelling things, but as a quick fix either of these works:

  • edit the seldon-webhook-service and delete the control-plane label from its selector
  • remove the serving.kubeflow.org/inferenceservice: enabled label from the admin namespace
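
For anyone wanting to try the two quick fixes from the command line, here is a minimal sketch. It assumes the service lives in the kubeflow namespace (as the error message above suggests) and that the affected user namespace is admin. The KUBECTL and NS variables and the function names are made up for illustration and are not part of the charm. The inspect step is there only to confirm the mismatch before changing anything: if the selector does not match any pod, the endpoints list for the service would be expected to be empty.

```shell
# Sketch only: KUBECTL, NS and the function names below are illustrative assumptions.
KUBECTL="${KUBECTL:-microk8s kubectl}"
NS="${NS:-kubeflow}"

# Diagnose: compare the Service selector with the pod labels,
# and check whether the Service has any endpoints behind it.
inspect_webhook_service() {
  $KUBECTL get service seldon-webhook-service -n "$NS" -o jsonpath='{.spec.selector}'
  $KUBECTL get pods -n "$NS" --show-labels
  $KUBECTL get endpoints seldon-webhook-service -n "$NS"
}

# Quick fix 1: remove the control-plane key from the Service selector.
drop_control_plane_selector() {
  $KUBECTL patch service seldon-webhook-service -n "$NS" --type=json \
    -p '[{"op": "remove", "path": "/spec/selector/control-plane"}]'
}

# Quick fix 2: remove the label from a user namespace
# (a trailing "-" after the key tells kubectl to delete that label).
drop_inferenceservice_label() {
  $KUBECTL label namespace "$1" serving.kubeflow.org/inferenceservice-
}
```

Run inspect_webhook_service first, then either drop_control_plane_selector or drop_inferenceservice_label admin. Only one of the two should be needed, and both are workarounds rather than the permanent fix.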

Looking into what the webhook is actually supposed to do, to see how we should configure things for a permanent fix.

Turns out this is actually fixed in edge by #14, but the fix has not been pushed to stable. I'll close this, but if you hit it again please reopen.