canonical/resource-dispatcher

MLFlow credentials will not update on existing or new Kubeflow Notebooks


Versions

Kubernetes platform: Microk8s (v1.24.16)

Juju Agents: 2.9.43
Juju CLI: 2.9.44-ubuntu-amd64

resource-dispatcher: edge (rev 78)
Kubeflow: 1.7/stable
MLFlow: 2.1/edge

The rest of the charm versions are visible in the juju status output below.

Reproduce

The integration between mlflow and kubeflow notebooks was working for me until I decided to update the mlflow-minio credentials:

juju config mlflow-minio secret-key=miniominio

juju run-action mlflow-server/0 get-minio-credentials --wait
unit-mlflow-server-0:
  UnitId: mlflow-server/0
  id: "4"
  results:
    access-key: minio
    secret-access-key: miniominio
  status: completed
  timing:
    completed: 2023-08-07 05:00:50 +0000 UTC
    enqueued: 2023-08-07 05:00:47 +0000 UTC
    started: 2023-08-07 05:00:48 +0000 UTC
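
To confirm that minio itself accepted the new key, you can query it directly with the updated credentials (a sketch; the mlflow-minio service address is taken from the juju status output below, and port 9000 is assumed to be the minio API port):

AWS_ACCESS_KEY_ID=minio AWS_SECRET_ACCESS_KEY=miniominio \
  aws --endpoint-url http://10.152.183.225:9000 s3 ls
# should succeed if minio picked up the new secret-key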

Logs

Now, testing the S3 credentials on existing as well as new notebooks, I would still get the old secret-key:

# Notebook POV
!aws --endpoint-url $MLFLOW_S3_ENDPOINT_URL s3 mb s3://mlflow
# Output
make_bucket failed: s3://mlflow An error occurred (SignatureDoesNotMatch) when calling the CreateBucket operation: The request signature we calculated does not match the signature you provided. Check your key and signing method.
# Terminal POV inside a Jupyter Notebook
# It is still printing the old secret access key
env | grep AWS
AWS_SECRET_ACCESS_KEY=9TJTFVPTXBZ8QB1YZBP0JUYOIID6XC
AWS_ACCESS_KEY_ID=minio

I tried:
A. Removing and re-adding the secrets relation, but got the same result:

juju remove-relation mlflow-server:secrets resource-dispatcher:secrets
juju add-relation mlflow-server:secrets resource-dispatcher:secrets

B. Removing the application and redeploying it, but got the same result:

juju deploy resource-dispatcher --channel edge --trust
juju relate mlflow-server:secrets resource-dispatcher:secrets
juju relate mlflow-server:pod-defaults resource-dispatcher:pod-defaults

I could not find any workaround for this so far; the only thing I could do was roll back the secret-key change. I am not sure if it is intended, but after resetting the config to no value, I got back the same secret-key I had before changing it:

juju config mlflow-minio --reset secret-key
juju run-action mlflow-server/0 get-minio-credentials --wait
unit-mlflow-server-0:
  UnitId: mlflow-server/0
  id: "6"
  results:
    access-key: minio
    secret-access-key: 9TJTFVPTXBZ8QB1YZBP0JUYOIID6XC
  status: completed
  timing:
    completed: 2023-08-07 05:21:08 +0000 UTC
    enqueued: 2023-08-07 05:21:02 +0000 UTC
    started: 2023-08-07 05:21:07 +0000 UTC
juju status
Model     Controller          Cloud/Region        Version  SLA          Timestamp
kubeflow  microk8s-localhost  microk8s/localhost  2.9.43   unsupported  05:31:10Z

App                        Version                  Status  Scale  Charm                    Channel         Rev  Address         Exposed  Message
admission-webhook          res:oci-image@2d74d1b    active      1  admission-webhook        1.7/stable      205  10.152.183.52   no       
argo-controller            res:oci-image@669ebd5    active      1  argo-controller          3.3/stable      236                  no       
argo-server                res:oci-image@576d038    active      1  argo-server              3.3/stable      185                  no       
dex-auth                                            active      1  dex-auth                 2.31/stable     224  10.152.183.51   no       
istio-ingressgateway                                active      1  istio-gateway            1.16/stable     551  10.152.183.134  no       
istio-pilot                                         active      1  istio-pilot              1.16/stable     551  10.152.183.201  no       
jupyter-controller         res:oci-image@1167186    active      1  jupyter-controller       1.7/stable      607                  no       
jupyter-ui                                          active      1  jupyter-ui               1.7/stable      534  10.152.183.96   no       
katib-controller           res:oci-image@111495a    active      1  katib-controller         0.15/stable     282  10.152.183.197  no       
katib-db                   8.0.32-0ubuntu0.22.04.2  active      1  mysql-k8s                8.0/stable       75  10.152.183.158  no       Primary
katib-db-manager                                    active      1  katib-db-manager         0.15/stable     253  10.152.183.33   no       
katib-ui                                            active      1  katib-ui                 0.15/stable     267  10.152.183.84   no       
kfp-api                                             active      1  kfp-api                  2.0/stable      540  10.152.183.125  no       
kfp-db                     8.0.32-0ubuntu0.22.04.2  active      1  mysql-k8s                8.0/stable       75  10.152.183.56   no       Primary
kfp-persistence            res:oci-image@516e6b8    active      1  kfp-persistence          2.0/stable      500                  no       
kfp-profile-controller     res:oci-image@b26a126    active      1  kfp-profile-controller   2.0/stable      478  10.152.183.81   no       
kfp-schedwf                res:oci-image@68cce0a    active      1  kfp-schedwf              2.0/stable      515                  no       
kfp-ui                     res:oci-image@ae72602    active      1  kfp-ui                   2.0/stable      504  10.152.183.101  no       
kfp-viewer                 res:oci-image@c0f065d    active      1  kfp-viewer               2.0/stable      517                  no       
kfp-viz                    res:oci-image@3de6f3c    active      1  kfp-viz                  2.0/stable      476  10.152.183.64   no       
knative-eventing                                    active      1  knative-eventing         1.8/stable      224  10.152.183.59   no       
knative-operator                                    active      1  knative-operator         1.8/stable      199  10.152.183.30   no       
knative-serving                                     active      1  knative-serving          1.8/stable      224  10.152.183.224  no       
kserve-controller                                   active      1  kserve-controller        0.10/stable     267  10.152.183.143  no       
kubeflow-dashboard                                  active      1  kubeflow-dashboard       1.7/stable      307  10.152.183.178  no       
kubeflow-profiles                                   active      1  kubeflow-profiles        1.7/stable      269  10.152.183.161  no       
kubeflow-roles                                      active      1  kubeflow-roles           1.7/stable      113  10.152.183.43   no       
kubeflow-volumes           res:oci-image@d261609    active      1  kubeflow-volumes         1.7/stable      178  10.152.183.113  no       
metacontroller-operator                             active      1  metacontroller-operator  2.0/stable      117  10.152.183.106  no       
minio                      res:oci-image@1755999    active      1  minio                    ckf-1.7/stable  186  10.152.183.226  no       
mlflow-minio               res:oci-image@1755999    active      1  minio                    ckf-1.7/edge    186  10.152.183.225  no       
mlflow-mysql               8.0.32-0ubuntu0.22.04.2  active      1  mysql-k8s                8.0/stable       75  10.152.183.185  no       Primary
mlflow-server                                       active      1  mlflow-server            latest/edge     346                  no       
oidc-gatekeeper            res:oci-image@6b720b8    active      1  oidc-gatekeeper          ckf-1.7/stable  176  10.152.183.109  no       
resource-dispatcher                                 active      1  resource-dispatcher      edge             78  10.152.183.41   no       
seldon-controller-manager                           active      1  seldon-core              1.15/stable     457  10.152.183.3    no       
tensorboard-controller     res:oci-image@c52f7c2    active      1  tensorboard-controller   1.7/stable      156  10.152.183.45   no       
tensorboards-web-app       res:oci-image@929f55b    active      1  tensorboards-web-app     1.7/stable      158  10.152.183.80   no       
training-operator                                   active      1  training-operator        1.6/stable      215  10.152.183.210  no       

Unit                          Workload  Agent  Address       Ports              Message
admission-webhook/0*          active    idle   10.1.149.233  4443/TCP           
argo-controller/0*            active    idle   10.1.150.30                      
argo-server/0*                active    idle   10.1.149.241  2746/TCP           
dex-auth/0*                   active    idle   10.1.149.215                     
istio-ingressgateway/0*       active    idle   10.1.149.250                     
istio-pilot/0*                active    idle   10.1.149.248                     
jupyter-controller/0*         active    idle   10.1.149.228                     
jupyter-ui/0*                 active    idle   10.1.149.199                     
katib-controller/0*           active    idle   10.1.149.211  443/TCP,8080/TCP   
katib-db-manager/0*           active    idle   10.1.149.238                     
katib-db/0*                   active    idle   10.1.149.207                     Primary
katib-ui/0*                   active    idle   10.1.149.254                     
kfp-api/0*                    active    idle   10.1.149.210                     
kfp-db/0*                     active    idle   10.1.149.232                     Primary
kfp-persistence/0*            active    idle   10.1.150.0                       
kfp-profile-controller/0*     active    idle   10.1.150.31   80/TCP             
kfp-schedwf/0*                active    idle   10.1.149.240                     
kfp-ui/0*                     active    idle   10.1.150.53   3000/TCP           
kfp-viewer/0*                 active    idle   10.1.149.242                     
kfp-viz/0*                    active    idle   10.1.150.39   8888/TCP           
knative-eventing/0*           active    idle   10.1.149.247                     
knative-operator/0*           active    idle   10.1.149.200                     
knative-serving/0*            active    idle   10.1.149.217                     
kserve-controller/0*          active    idle   10.1.149.234                     
kubeflow-dashboard/0*         active    idle   10.1.149.226                     
kubeflow-profiles/0*          active    idle   10.1.149.252                     
kubeflow-roles/0*             active    idle   10.1.149.219                     
kubeflow-volumes/0*           active    idle   10.1.150.51   5000/TCP           
metacontroller-operator/0*    active    idle   10.1.149.235                     
minio/0*                      active    idle   10.1.150.27   9000/TCP,9001/TCP  
mlflow-minio/0*               active    idle   10.1.150.50   9000/TCP,9001/TCP  
mlflow-mysql/0*               active    idle   10.1.149.245                     Primary
mlflow-server/0*              active    idle   10.1.149.213                     
oidc-gatekeeper/1*            active    idle   10.1.150.52   8080/TCP           
resource-dispatcher/0*        active    idle   10.1.149.243                     
seldon-controller-manager/0*  active    idle   10.1.149.225                     
tensorboard-controller/0*     active    idle   10.1.150.22   9443/TCP           
tensorboards-web-app/0*       active    idle   10.1.150.54   5000/TCP           
training-operator/0*          active    idle   10.1.149.224              

I managed to reproduce the issue with one of our notebook tests by following the steps outlined in this issue.

---------------------------------------------------------------------------
SignatureDoesNotMatch                     Traceback (most recent call last)
Cell In[6], line 2
      1 try:
----> 2     mc.make_bucket(MINIO_BUCKET)
      3 except BucketAlreadyOwnedByYou:
      4     print(f"Bucket {MINIO_BUCKET} already exists!")

File /opt/conda/lib/python3.8/site-packages/minio/api.py:336, in Minio.make_bucket(self, bucket_name, location, object_lock)
    332     dump_http(method, url, headers, response,
    333               self._trace_output_stream)
    335 if response.status != 200:
--> 336     raise ResponseError(response, method, bucket_name).get_exception()
    338 self._set_bucket_region(bucket_name, region=location)

SignatureDoesNotMatch: SignatureDoesNotMatch: message: The request signature we calculated does not match the signature you provided.

Minimal instructions to reproduce:

  • Create a notebook with a single cell:
!env | grep AWS
  • Execute the cell/notebook and observe the AWS credentials (example key in output):
AWS_SECRET_ACCESS_KEY=WRR7QTDXG32F11HU1BTBAZS3N8FOE3
AWS_ACCESS_KEY_ID=minio
  • Update and verify minio credentials:
juju config mlflow-minio secret-key=miniominio
juju run-action mlflow-server/0 get-minio-credentials --wait
unit-mlflow-server-0:
  UnitId: mlflow-server/0
  id: "22"
  results:
    access-key: minio
    secret-access-key: miniominio
  status: completed
  timing:
    completed: 2023-08-17 20:47:35 +0000 UTC
    enqueued: 2023-08-17 20:47:32 +0000 UTC
    started: 2023-08-17 20:47:33 +0000 UTC
  • Execute the cell/notebook again; the credentials should be updated, but the old key is still returned:
AWS_SECRET_ACCESS_KEY=WRR7QTDXG32F11HU1BTBAZS3N8FOE3 # <<< should be miniominio
AWS_ACCESS_KEY_ID=minio

Looks like the key is not updated in the user's namespace secret. Should it be?

  • Update and verify the key
  • Check the secret in the user namespace:
juju config mlflow-minio secret-key=miniominio
juju run-action mlflow-server/0 get-minio-credentials --wait
unit-mlflow-server-0:
  UnitId: mlflow-server/0
  id: "28"
  results:
    access-key: minio
    secret-access-key: miniominio
  status: completed
  timing:
    completed: 2023-08-17 20:58:51 +0000 UTC
    enqueued: 2023-08-17 20:58:38 +0000 UTC
    started: 2023-08-17 20:58:49 +0000 UTC
microk8s.kubectl -n admin get secret mlflow-server-minio-artifact -o=yaml
apiVersion: v1
data:
  AWS_ACCESS_KEY_ID: bWluaW8=
  AWS_SECRET_ACCESS_KEY: V1JSN1FURFhHMzJGMTFIVTFCVEJBWlMzTjhGT0Uz
kind: Secret
metadata:
  annotations:
    metacontroller.k8s.io/decorator-controller: kubeflow-resource-dispatcher-controller
    metacontroller.k8s.io/last-applied-configuration: '{"apiVersion":"v1","kind":"Secret","metadata":{"annotations":{"metacontroller.k8s.io/decorator-controller":"kubeflow-resource-dispatcher-controller"},"name":"mlflow-server-minio-artifact","namespace":"admin"},"stringData":{"AWS_ACCESS_KEY_ID":"minio","AWS_SECRET_ACCESS_KEY":"WRR7QTDXG32F11HU1BTBAZS3N8FOE3"}}'
  creationTimestamp: "2023-08-17T18:34:08Z"
  name: mlflow-server-minio-artifact
  namespace: admin
  resourceVersion: "31235"
  uid: 448cc9b7-17fb-4839-9772-c3b9ee2cc5fe
type: Opaque
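
Decoding the stored value confirms the secret still holds the old key:

microk8s.kubectl -n admin get secret mlflow-server-minio-artifact \
  -o jsonpath='{.data.AWS_SECRET_ACCESS_KEY}' | base64 -d
# WRR7QTDXG32F11HU1BTBAZS3N8FOE3   <-- the old key, not miniominio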

After updating the key, the relation mlflow-server:secrets resource-dispatcher:secrets contains the updated key.
From jhack:

"AWS_ACCESS_KEY_…
"minio"
"AWS_SECRET_ACCES…
"miniominio"

Candidate for fix in 1.8

misohu commented

This is expected behavior; let me explain.

Resource dispatcher is responsible for distributing manifests across namespaces that carry a specific label. It is also responsible for updating those manifests when its relation data changes, and that is what is happening here: if you deploy mlflow and kubeflow plus resource dispatcher and then change the minio password in the mlflow bundle, resource dispatcher will update the manifests in the target namespaces. This is also demonstrated in Ivan's comment above, which shows that the relation data changes.
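
For reference, a namespace opts in by carrying that label (a sketch; I am assuming the label key user.kubeflow.org/enabled here, so check the resource-dispatcher docs for the exact key at this revision):

microk8s.kubectl label namespace admin user.kubeflow.org/enabled="true"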

The error you are describing is just related to how mounted secrets work in Kubernetes. Because we mount the secret as environment variables in the user's notebook, the environment will not change automatically when the secret changes (the secret in the namespace changed, but the notebook environment did not). To pick up the new value, you need to restart the pod.
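
For example, since Kubeflow notebooks run as StatefulSets named after the notebook, a restart could look like this (the notebook name and namespace are illustrative):

microk8s.kubectl -n admin rollout restart statefulset my-notebook
# the recreated pod re-reads the secret into its environment on startup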

If we wanted the environment to update automatically, we would need to somehow monitor for secret changes (e.g. with the Vault sidecar injector, https://www.hashicorp.com/blog/injecting-vault-secrets-into-kubernetes-pods-via-a-sidecar); for now that is out of scope. I will close this issue for now.

misohu commented

Reopening: it looks like I had metacontroller v3 locally, which is not in the Kubeflow 1.7 bundle (it ships v2). We downgraded metacontroller to v2 because kubeflow-profile-controller had issues with v3, but we need v3 for resource dispatcher to correctly update the manifests in user namespaces.
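
One way to check which metacontroller version is actually running (a sketch; the object name and namespace are assumptions for this deployment):

microk8s.kubectl -n kubeflow get statefulset metacontroller \
  -o jsonpath='{.spec.template.spec.containers[0].image}'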

misohu commented

Fixed by updating kfp-profile-controller to use a DecoratorController (canonical/kfp-operators#344) and then updating metacontroller to v3 (canonical/metacontroller-operator#94).