MLFlow credentials will not update on existing or new Kubeflow Notebooks
Versions
Kubernetes platform: Microk8s (v1.24.16)
Juju Agents: 2.9.43
Juju CLI: 2.9.44-ubuntu-amd64
resource-dispatcher: edge (rev 78)
Kubeflow: 1.7/stable
MLFlow: 2.1/edge
The rest of the charm versions are visible in the juju status output below.
Reproduce
The integration between MLFlow and Kubeflow Notebooks was working for me until I decided to update the mlflow-minio credentials:
juju config mlflow-minio secret-key=miniominio
juju run-action mlflow-server/0 get-minio-credentials --wait
unit-mlflow-server-0:
UnitId: mlflow-server/0
id: "4"
results:
access-key: minio
secret-access-key: miniominio
status: completed
timing:
completed: 2023-08-07 05:00:50 +0000 UTC
enqueued: 2023-08-07 05:00:47 +0000 UTC
started: 2023-08-07 05:00:48 +0000 UTC
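As a sanity check, the new value can also be read back directly from the charm configuration:
juju config mlflow-minio secret-key
# Output
miniominio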
Logs
Now, testing the S3 credentials on existing as well as on new notebooks, I would still get the old secret-key.
# Notebook POV
!aws --endpoint-url $MLFLOW_S3_ENDPOINT_URL s3 mb s3://mlflow
# Output
make_bucket failed: s3://mlflow An error occurred (SignatureDoesNotMatch) when calling the CreateBucket operation: The request signature we calculated does not match the signature you provided. Check your key and signing method.
# Terminal POV inside a Jupyter Notebook
# It is printing the old secret access key
env | grep AWS
AWS_SECRET_ACCESS_KEY=9TJTFVPTXBZ8QB1YZBP0JUYOIID6XC
AWS_ACCESS_KEY_ID=minio
I tried:
A. Removing and re-adding the secrets relation, but with the same result:
juju remove-relation mlflow-server:secrets resource-dispatcher:secrets
juju add-relation mlflow-server:secrets resource-dispatcher:secrets
B. Removing the resource-dispatcher application and redeploying it, but with the same result:
juju deploy resource-dispatcher --channel edge --trust
juju relate mlflow-server:secrets resource-dispatcher:secrets
juju relate mlflow-server:pod-defaults resource-dispatcher:pod-defaults
I could not find any workaround so far. The only thing I could do was roll back the secret-key change.
I am not sure whether this is intended, but after resetting the option to no value I got the same secret-key I had before changing it:
juju config mlflow-minio --reset secret-key
juju run-action mlflow-server/0 get-minio-credentials --wait
unit-mlflow-server-0:
UnitId: mlflow-server/0
id: "6"
results:
access-key: minio
secret-access-key: 9TJTFVPTXBZ8QB1YZBP0JUYOIID6XC
status: completed
timing:
completed: 2023-08-07 05:21:08 +0000 UTC
enqueued: 2023-08-07 05:21:02 +0000 UTC
started: 2023-08-07 05:21:07 +0000 UTC
juju status
Model Controller Cloud/Region Version SLA Timestamp
kubeflow microk8s-localhost microk8s/localhost 2.9.43 unsupported 05:31:10Z
App Version Status Scale Charm Channel Rev Address Exposed Message
admission-webhook res:oci-image@2d74d1b active 1 admission-webhook 1.7/stable 205 10.152.183.52 no
argo-controller res:oci-image@669ebd5 active 1 argo-controller 3.3/stable 236 no
argo-server res:oci-image@576d038 active 1 argo-server 3.3/stable 185 no
dex-auth active 1 dex-auth 2.31/stable 224 10.152.183.51 no
istio-ingressgateway active 1 istio-gateway 1.16/stable 551 10.152.183.134 no
istio-pilot active 1 istio-pilot 1.16/stable 551 10.152.183.201 no
jupyter-controller res:oci-image@1167186 active 1 jupyter-controller 1.7/stable 607 no
jupyter-ui active 1 jupyter-ui 1.7/stable 534 10.152.183.96 no
katib-controller res:oci-image@111495a active 1 katib-controller 0.15/stable 282 10.152.183.197 no
katib-db 8.0.32-0ubuntu0.22.04.2 active 1 mysql-k8s 8.0/stable 75 10.152.183.158 no Primary
katib-db-manager active 1 katib-db-manager 0.15/stable 253 10.152.183.33 no
katib-ui active 1 katib-ui 0.15/stable 267 10.152.183.84 no
kfp-api active 1 kfp-api 2.0/stable 540 10.152.183.125 no
kfp-db 8.0.32-0ubuntu0.22.04.2 active 1 mysql-k8s 8.0/stable 75 10.152.183.56 no Primary
kfp-persistence res:oci-image@516e6b8 active 1 kfp-persistence 2.0/stable 500 no
kfp-profile-controller res:oci-image@b26a126 active 1 kfp-profile-controller 2.0/stable 478 10.152.183.81 no
kfp-schedwf res:oci-image@68cce0a active 1 kfp-schedwf 2.0/stable 515 no
kfp-ui res:oci-image@ae72602 active 1 kfp-ui 2.0/stable 504 10.152.183.101 no
kfp-viewer res:oci-image@c0f065d active 1 kfp-viewer 2.0/stable 517 no
kfp-viz res:oci-image@3de6f3c active 1 kfp-viz 2.0/stable 476 10.152.183.64 no
knative-eventing active 1 knative-eventing 1.8/stable 224 10.152.183.59 no
knative-operator active 1 knative-operator 1.8/stable 199 10.152.183.30 no
knative-serving active 1 knative-serving 1.8/stable 224 10.152.183.224 no
kserve-controller active 1 kserve-controller 0.10/stable 267 10.152.183.143 no
kubeflow-dashboard active 1 kubeflow-dashboard 1.7/stable 307 10.152.183.178 no
kubeflow-profiles active 1 kubeflow-profiles 1.7/stable 269 10.152.183.161 no
kubeflow-roles active 1 kubeflow-roles 1.7/stable 113 10.152.183.43 no
kubeflow-volumes res:oci-image@d261609 active 1 kubeflow-volumes 1.7/stable 178 10.152.183.113 no
metacontroller-operator active 1 metacontroller-operator 2.0/stable 117 10.152.183.106 no
minio res:oci-image@1755999 active 1 minio ckf-1.7/stable 186 10.152.183.226 no
mlflow-minio res:oci-image@1755999 active 1 minio ckf-1.7/edge 186 10.152.183.225 no
mlflow-mysql 8.0.32-0ubuntu0.22.04.2 active 1 mysql-k8s 8.0/stable 75 10.152.183.185 no Primary
mlflow-server active 1 mlflow-server latest/edge 346 no
oidc-gatekeeper res:oci-image@6b720b8 active 1 oidc-gatekeeper ckf-1.7/stable 176 10.152.183.109 no
resource-dispatcher active 1 resource-dispatcher edge 78 10.152.183.41 no
seldon-controller-manager active 1 seldon-core 1.15/stable 457 10.152.183.3 no
tensorboard-controller res:oci-image@c52f7c2 active 1 tensorboard-controller 1.7/stable 156 10.152.183.45 no
tensorboards-web-app res:oci-image@929f55b active 1 tensorboards-web-app 1.7/stable 158 10.152.183.80 no
training-operator active 1 training-operator 1.6/stable 215 10.152.183.210 no
Unit Workload Agent Address Ports Message
admission-webhook/0* active idle 10.1.149.233 4443/TCP
argo-controller/0* active idle 10.1.150.30
argo-server/0* active idle 10.1.149.241 2746/TCP
dex-auth/0* active idle 10.1.149.215
istio-ingressgateway/0* active idle 10.1.149.250
istio-pilot/0* active idle 10.1.149.248
jupyter-controller/0* active idle 10.1.149.228
jupyter-ui/0* active idle 10.1.149.199
katib-controller/0* active idle 10.1.149.211 443/TCP,8080/TCP
katib-db-manager/0* active idle 10.1.149.238
katib-db/0* active idle 10.1.149.207 Primary
katib-ui/0* active idle 10.1.149.254
kfp-api/0* active idle 10.1.149.210
kfp-db/0* active idle 10.1.149.232 Primary
kfp-persistence/0* active idle 10.1.150.0
kfp-profile-controller/0* active idle 10.1.150.31 80/TCP
kfp-schedwf/0* active idle 10.1.149.240
kfp-ui/0* active idle 10.1.150.53 3000/TCP
kfp-viewer/0* active idle 10.1.149.242
kfp-viz/0* active idle 10.1.150.39 8888/TCP
knative-eventing/0* active idle 10.1.149.247
knative-operator/0* active idle 10.1.149.200
knative-serving/0* active idle 10.1.149.217
kserve-controller/0* active idle 10.1.149.234
kubeflow-dashboard/0* active idle 10.1.149.226
kubeflow-profiles/0* active idle 10.1.149.252
kubeflow-roles/0* active idle 10.1.149.219
kubeflow-volumes/0* active idle 10.1.150.51 5000/TCP
metacontroller-operator/0* active idle 10.1.149.235
minio/0* active idle 10.1.150.27 9000/TCP,9001/TCP
mlflow-minio/0* active idle 10.1.150.50 9000/TCP,9001/TCP
mlflow-mysql/0* active idle 10.1.149.245 Primary
mlflow-server/0* active idle 10.1.149.213
oidc-gatekeeper/1* active idle 10.1.150.52 8080/TCP
resource-dispatcher/0* active idle 10.1.149.243
seldon-controller-manager/0* active idle 10.1.149.225
tensorboard-controller/0* active idle 10.1.150.22 9443/TCP
tensorboards-web-app/0* active idle 10.1.150.54 5000/TCP
training-operator/0* active idle 10.1.149.224
I managed to reproduce the issue with one of our notebook tests by following the steps outlined in this issue.
---------------------------------------------------------------------------
SignatureDoesNotMatch Traceback (most recent call last)
Cell In[6], line 2
1 try:
----> 2 mc.make_bucket(MINIO_BUCKET)
3 except BucketAlreadyOwnedByYou:
4 print(f"Bucket {MINIO_BUCKET} already exists!")
File /opt/conda/lib/python3.8/site-packages/minio/api.py:336, in Minio.make_bucket(self, bucket_name, location, object_lock)
332 dump_http(method, url, headers, response,
333 self._trace_output_stream)
335 if response.status != 200:
--> 336 raise ResponseError(response, method, bucket_name).get_exception()
338 self._set_bucket_region(bucket_name, region=location)
SignatureDoesNotMatch: SignatureDoesNotMatch: message: The request signature we calculated does not match the signature you provided.
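For context, the failing cell is roughly equivalent to the sketch below (MINIO_BUCKET, the endpoint handling and the error handling are assumptions reconstructed from the traceback). The point is that the client is built from the AWS_* variables injected into the pod environment at creation time, so it keeps signing requests with the stale secret key:
import os
from minio import Minio
from minio.error import BucketAlreadyOwnedByYou  # minio-py < 7, as in the traceback

MINIO_BUCKET = "mlflow"  # illustrative bucket name

# Endpoint and credentials come from the environment injected via the PodDefault;
# they are only read once, when the notebook pod starts.
endpoint = os.environ["MLFLOW_S3_ENDPOINT_URL"].replace("http://", "")
mc = Minio(
    endpoint,
    access_key=os.environ["AWS_ACCESS_KEY_ID"],
    secret_key=os.environ["AWS_SECRET_ACCESS_KEY"],  # still the old key
    secure=False,
)

try:
    mc.make_bucket(MINIO_BUCKET)  # fails with SignatureDoesNotMatch after the rotation
except BucketAlreadyOwnedByYou:
    print(f"Bucket {MINIO_BUCKET} already exists!")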
Minimal instructions to reproduce:
- Create a notebook with a single cell:
!env | grep AWS
- Execute the cell/notebook and observe the AWS credentials (example key in output):
AWS_SECRET_ACCESS_KEY=WRR7QTDXG32F11HU1BTBAZS3N8FOE3
AWS_ACCESS_KEY_ID=minio
- Update and verify minio credentials:
juju config mlflow-minio secret-key=miniominio
juju run-action mlflow-server/0 get-minio-credentials --wait
unit-mlflow-server-0:
UnitId: mlflow-server/0
id: "22"
results:
access-key: minio
secret-access-key: miniominio
status: completed
timing:
completed: 2023-08-17 20:47:35 +0000 UTC
enqueued: 2023-08-17 20:47:32 +0000 UTC
started: 2023-08-17 20:47:33 +0000 UTC
- Execute the cell/notebook again; the credentials should be updated, but they are not:
AWS_SECRET_ACCESS_KEY=WRR7QTDXG32F11HU1BTBAZS3N8FOE3 # <<< should be miniominio
AWS_ACCESS_KEY_ID=minio
It looks like the key is not updated in the secret in the user's namespace. Should it be?
- Update and verify the key
- Check the secret in the user namespace
juju config mlflow-minio secret-key=miniominio
juju run-action mlflow-server/0 get-minio-credentials --wait
unit-mlflow-server-0:
UnitId: mlflow-server/0
id: "28"
results:
access-key: minio
secret-access-key: miniominio
status: completed
timing:
completed: 2023-08-17 20:58:51 +0000 UTC
enqueued: 2023-08-17 20:58:38 +0000 UTC
started: 2023-08-17 20:58:49 +0000 UTC
microk8s.kubectl -n admin get secret mlflow-server-minio-artifact -o=yaml
apiVersion: v1
data:
AWS_ACCESS_KEY_ID: bWluaW8=
AWS_SECRET_ACCESS_KEY: V1JSN1FURFhHMzJGMTFIVTFCVEJBWlMzTjhGT0Uz
kind: Secret
metadata:
annotations:
metacontroller.k8s.io/decorator-controller: kubeflow-resource-dispatcher-controller
metacontroller.k8s.io/last-applied-configuration: '{"apiVersion":"v1","kind":"Secret","metadata":{"annotations":{"metacontroller.k8s.io/decorator-controller":"kubeflow-resource-dispatcher-controller"},"name":"mlflow-server-minio-artifact","namespace":"admin"},"stringData":{"AWS_ACCESS_KEY_ID":"minio","AWS_SECRET_ACCESS_KEY":"WRR7QTDXG32F11HU1BTBAZS3N8FOE3"}}'
creationTimestamp: "2023-08-17T18:34:08Z"
name: mlflow-server-minio-artifact
namespace: admin
resourceVersion: "31235"
uid: 448cc9b7-17fb-4839-9772-c3b9ee2cc5fe
type: Opaque
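Decoding the stored value confirms that the secret in the user namespace still holds the old key:
microk8s.kubectl -n admin get secret mlflow-server-minio-artifact \
  -o jsonpath='{.data.AWS_SECRET_ACCESS_KEY}' | base64 -d
# Output
WRR7QTDXG32F11HU1BTBAZS3N8FOE3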
After updating the key, the mlflow-server:secrets resource-dispatcher:secrets relation does contain the updated key.
From jhack:
"AWS_ACCESS_KEY_…
"minio"
"AWS_SECRET_ACCES…
"miniominio"
Candidate for fix in 1.8
This is expected behavior, let me explain:
The resource dispatcher is responsible for distributing manifests across the namespaces that carry a specific label. It is also responsible for updating those manifests when the relation with the resource dispatcher changes, and that is what is happening here. If you deploy MLFlow and Kubeflow plus the resource dispatcher and you change the MinIO password in the MLFlow bundle, the resource dispatcher will update the manifests in the target namespaces. This is also demonstrated in Ivan's comment, where he points out that the relation data changes.
The error you are describing is just related to how mounted secrets work in Kubernetes. Because we mount the secret as environment variables in the user's notebook, the environment will not change automatically when the secret changes (the secret in the namespace changes, but the notebook environment does not). To get an environment with the new value, you need to restart the pod.
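For example (a sketch only; <notebook-name> is a placeholder for the StatefulSet the jupyter-controller created for your notebook), restarting the notebook recreates the pod with whatever the secret currently contains:
# restart the notebook's StatefulSet so the pod is recreated
microk8s.kubectl -n admin rollout restart statefulset <notebook-name>
# or delete the pod and let the StatefulSet bring it back
microk8s.kubectl -n admin delete pod <notebook-name>-0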
If we wanted the environment to pick up the new secret automatically, we would need to monitor for secret changes somehow (e.g. with the Vault sidecar injector, https://www.hashicorp.com/blog/injecting-vault-secrets-into-kubernetes-pods-via-a-sidecar); for now that is out of scope. I will close this issue for now.
Reopening: it looks like locally I had metacontroller v3, which is not in the Kubeflow 1.7 bundle (that ships v2). We downgraded metacontroller to v2 because kfp-profile-controller had issues with v3, but we need v3 for the resource dispatcher to correctly update the manifests in user namespaces.
Fixed by updating kfp-profile-controller to use a DecoratorController (canonical/kfp-operators#344) and then updating metacontroller to v3 (canonical/metacontroller-operator#94).