Quickstart broken: `modelmesh-serving` svc not created in `--namespace-scope-mode`, ISVCs never get to `READY` state
ckadner opened this issue · 7 comments
Describe the bug
The quickstart install instructions no longer work correctly. After deploying a model, the InferenceService does not reach the READY state, and inference requests cannot be performed.
To Reproduce
- Follow the Quickstart guide up to the point of deploying a model and wait for the runtime pods to be Running.
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
etcd-558fb445c-ht596 1/1 Running 0 9h
minio-774d48cf7-nbkd7 1/1 Running 0 9h
modelmesh-controller-86dd54cc54-xhnvv 1/1 Running 0 9h
modelmesh-serving-mlserver-1.x-647d4f8f9c-h5b49 4/4 Running 0 26m
modelmesh-serving-mlserver-1.x-647d4f8f9c-hv6ht 4/4 Running 0 26m
- Wait a few minutes (or hours) and check the status of the newly created InferenceService.
$ kubectl get isvc
NAME URL READY PREV LATEST PREVROLLEDOUTREVISION LATESTREADYREVISION AGE
example-sklearn-isvc 26m
- Describe the ISVC and find that it has no events:
$ kubectl describe isvc example-sklearn-isvc
Name: example-sklearn-isvc
Namespace: modelmesh-serving
Labels: <none>
Annotations: serving.kserve.io/deploymentMode: ModelMesh
serving.kserve.io/secretKey: localMinIO
API Version: serving.kserve.io/v1beta1
Kind: InferenceService
Metadata:
Creation Timestamp: 2023-09-20T03:21:13Z
Generation: 2
Managed Fields:
API Version: serving.kserve.io/v1beta1
Fields Type: FieldsV1
fieldsV1:
f:metadata:
f:annotations:
.:
f:kubectl.kubernetes.io/last-applied-configuration:
f:serving.kserve.io/deploymentMode:
f:serving.kserve.io/secretKey:
f:spec:
.:
f:predictor:
.:
f:model:
.:
f:modelFormat:
.:
f:name:
f:storageUri:
Manager: kubectl-client-side-apply
Operation: Update
Time: 2023-09-20T03:46:54Z
Resource Version: 61292
UID: 2d2acd78-e0ce-4864-bd3f-54944d02ae85
Spec:
Predictor:
Model:
Model Format:
Name: sklearn
Storage Uri: s3://modelmesh-example-models/sklearn/mnist-svm.joblib
Events: <none>
- Ignore all that for a minute and go on to create the port-forward:
$ kubectl port-forward --address 0.0.0.0 service/modelmesh-serving 8033 -n modelmesh-serving
Error from server (NotFound): services "modelmesh-serving" not found
- Check the service list:
$ kubectl get svc
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
etcd ClusterIP 172.21.141.59 <none> 2379/TCP 9h
minio ClusterIP 172.21.11.102 <none> 9000/TCP 9h
modelmesh-webhook-server-service ClusterIP 172.21.232.100 <none> 9443/TCP 9h
- Note that the service "modelmesh-serving" does not exist.
Expected behavior
Up until release v0.11.0, following the instructions in the Quickstart guide created the inference service successfully, the port-forward worked, and inference requests returned the expected response.
Additional context
How do the FVT test cases work? Curiously, when following the FVT install instructions with the --fvt flag (cluster-scoped install), the modelmesh-serving service does get created and the ISVC transitions to the READY state just fine.
$ kubectl get svc
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
etcd ClusterIP 172.21.33.247 <none> 2379/TCP 2m7s
minio ClusterIP 172.21.215.204 <none> 9000/TCP 2m6s
modelmesh-serving ClusterIP None <none> 8033/TCP,8008/TCP,2112/TCP 75s
modelmesh-webhook-server-service ClusterIP 172.21.82.140 <none> 9443/TCP 104s
Environment (please complete the following information):
- OS: macOS client side, Ubuntu 20 server side
- Browser: [n/a]
- Version: MM latest of main, Kubernetes 1.25, 1.26, 1.27
UPDATE 2023-09-21:
FYI @Jooho
I went through some of the recent changes and it looks like PR #397 introduced this change in behavior. Could it be that your changes to how namespaces are reconciled affect how/when/whether the modelmesh-serving service gets created?
W0921 1 reflector.go:424] pkg/mod/k8s.io/client-go@v0.26.4/tools/cache/reflector.go:169:
failed to list *v1.Namespace: namespaces is forbidden:
User "system:serviceaccount:modelmesh-serving:modelmesh-controller" cannot list
resource "namespaces" in API group "" at the cluster scope
E0921 1 reflector.go:140] pkg/mod/k8s.io/client-go@v0.26.4/tools/cache/reflector.go:169:
Failed to watch *v1.Namespace: failed to list *v1.Namespace: namespaces is forbidden:
User "system:serviceaccount:modelmesh-serving:modelmesh-controller" cannot list
resource "namespaces" in API group "" at the cluster scope
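The errors above suggest the controller's service account lacks cluster-scope permission to list namespaces. A minimal sketch of checking a captured log line for that failure pattern — the live-cluster `kubectl logs` invocation is only shown as a comment, since it depends on the install:

```shell
#!/bin/sh
# Sample log line, copied from the controller output above.
log='User "system:serviceaccount:modelmesh-serving:modelmesh-controller" cannot list resource "namespaces" in API group "" at the cluster scope'

# In a live cluster the logs would come from something like:
#   kubectl logs deployment/modelmesh-controller -n modelmesh-serving
# Here we only check the captured line for the RBAC failure pattern.
case "$log" in
  *'cannot list resource "namespaces"'*) result="RBAC gap: controller cannot list namespaces" ;;
  *) result="no RBAC failure found" ;;
esac
echo "$result"
```

In namespace-scope mode the controller should arguably not be watching namespaces at the cluster scope at all, which is why this looks related to the reconciliation change in PR #397.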
I tagged some recent commits on main and triggered builds using GitHub Actions:
git tag -a pr-422 0b39182 -m "PR #422"
git push upstream pr-422
git tag -a pr-397 dd7277a -m "PR #397"
git push upstream pr-397
git tag -a pr-428 f47cd7b -m "PR #428"
git push upstream pr-428
- https://github.com/kserve/modelmesh-serving/tags
- https://github.com/kserve/modelmesh-serving/actions/workflows/build.yml
- https://hub.docker.com/r/kserve/modelmesh-controller/tags
The new pr-xxx images on Docker Hub make it possible to "go back in time" and deploy the version of the modelmesh-controller from right after a given PR was merged.
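The "replace newTag for controller image" step used in the repro runs below can be sketched as a one-line sed edit. This is demonstrated on a stand-in file; the real config/manager/kustomization.yaml has more fields and indentation, so the exact sed expression is an assumption:

```shell
#!/bin/sh
# Stand-in for config/manager/kustomization.yaml (real file layout differs).
mkdir -p /tmp/mm-demo
printf 'newName: kserve/modelmesh-controller\nnewTag: latest\n' > /tmp/mm-demo/kustomization.yaml

# Point the controller image at one of the pr-xxx tags built above.
sed -i.bak 's/^newTag: .*/newTag: pr-428/' /tmp/mm-demo/kustomization.yaml
cat /tmp/mm-demo/kustomization.yaml
```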
Before PR #397
git checkout pr-428
HEAD is now at f47cd7b fix: Support devel version of Kustomize in install script (#428)
# replace newTag for controller image
git diff | grep -E "^\+"
+++ b/config/manager/kustomization.yaml
+ newTag: pr-428
kubectl create namespace modelmesh-serving
./scripts/install.sh --namespace-scope-mode --namespace modelmesh-serving --quickstart --enable-self-signed-ca
Successfully installed ModelMesh Serving!
kubectl get svc
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
etcd ClusterIP 172.21.171.73 <none> 2379/TCP 2m7s
minio ClusterIP 172.21.190.108 <none> 9000/TCP 2m7s
modelmesh-serving ClusterIP None <none> 8033/TCP,8008/TCP,2112/TCP 99s
modelmesh-webhook-server-service ClusterIP 172.21.99.254 <none> 9443/TCP 110s
After PR #397
git checkout pr-397
HEAD is now at dd7277a fix: Goroutine memory leak (#397)
# replace newTag for controller image
git diff | grep -E "^\+"
+++ b/config/manager/kustomization.yaml
+ newTag: pr-397
kubectl create namespace modelmesh-serving
./scripts/install.sh --namespace-scope-mode --namespace modelmesh-serving --quickstart --enable-self-signed-ca
Successfully installed ModelMesh Serving!
kubectl get svc
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
etcd ClusterIP 172.21.219.47 <none> 2379/TCP 6m37s
minio ClusterIP 172.21.177.41 <none> 9000/TCP 6m36s
modelmesh-webhook-server-service ClusterIP 172.21.249.37 <none> 9443/TCP 6m14s
The modelmesh-serving service is missing.
Some differences in the controller logs, before and after PR #397:
- in order of appearance in the logs
- timestamps removed
- D = DEBUG, I = INFO
- indentation changed for readability
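The timestamp stripping mentioned above can be sketched as a small sed pass; the exact timestamp format in the controller logs is an assumption here:

```shell
#!/bin/sh
# Minimal sketch: drop the leading timestamp so before/after controller
# logs can be diffed line-by-line. (Timestamp format is an assumption.)
line='2023-09-21T03:46:54.123Z INFO MMService Established new MM gRPC connection'
normalized=$(echo "$line" | sed -E 's/^[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9:.]+Z +//')
echo "$normalized"
```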
Unique log messages before PR #397 (these no longer appear after PR #397):
D KubeResolver Built new resolver {"target": {"Scheme":"kube","Authority":"","URL":{"Scheme":"kube","Opaque":"","User":null,"Host":"","Path":"/modelmesh-serving.modelmesh-serving:8033","RawPath":"","OmitHost":false,"ForceQuery":false,"RawQuery":"","Fragment":"","RawFragment":""}}, "name": "modelmesh-serving/modelmesh-serving"}
D KubeResolver Failed to update resolver due to bad state, requeuing endpoint reconciliation
D KubeResolver.Reconcile Ignoring event for Endpoints with no resolver {"endpoints": "modelmesh-serving/modelmesh-serving"}
D ModelMeshEventStream Etcd config secret changed. Creating a new etcd client and restarting watchers. {"oldSecretName": "", "newSecretName": "model-serving-etcd"}
D ModelMeshEventStream ModelMesh Model Event {"namespace": "modelmesh-serving", "modelId": "", "event": "INITIALIZED"}
D ModelMeshEventStream ModelMesh VModel Event {"namespace": "modelmesh-serving", "vModelId": "", "event": "INITIALIZED"}
I MMService Established new MM gRPC connection {"namespace": "modelmesh-serving", "endpoint": "kube:///modelmesh-serving.modelmesh-serving:8033", "TLS": false}
I ModelMeshEventStream Initialize Model Event Stream {"namespace": "modelmesh-serving", "servicePrefix": "modelmesh-serving/mm/modelmesh-serving"}
I ModelMeshEventStream EtcdRangeWatcher starting {"namespace": "modelmesh-serving", "WatchPrefix": "modelmesh-serving/mm/modelmesh-serving/registry/"}
I ModelMeshEventStream EtcdRangeWatcher starting {"namespace": "modelmesh-serving", "WatchPrefix": "modelmesh-serving/mm/modelmesh-serving/vmodels/"}
I controllers.Service Updated Kube Service {"namespace": "modelmesh-serving", "name": "modelmesh-serving", }
Unique log messages after PR #397 (these did not appear before PR #397):
E0921 1 reflector.go:140] pkg/mod/k8s.io/client-go@v0.26.4/tools/cache/reflector.go:169:
Failed to watch *v1.Namespace: failed to list *v1.Namespace: namespaces is forbidden:
User "system:serviceaccount:modelmesh-serving:modelmesh-controller" cannot list
resource "namespaces" in API group "" at the cluster scope
W0921 1 reflector.go:424] pkg/mod/k8s.io/client-go@v0.26.4/tools/cache/reflector.go:169:
failed to list *v1.Namespace: namespaces is forbidden:
User "system:serviceaccount:modelmesh-serving:modelmesh-controller" cannot list
resource "namespaces" in API group "" at the cluster scope
Different log messages (changed after PR #397):
- before PR #397: I0921 1 request.go:690] Waited for 1s due to client-side throttling, not priority and fairness, request: GET:https://172.21.0.1:443/apis/coordination.k8s.io/v1?timeout=32s
- after PR #397: I0921 1 request.go:690] Waited for 1s due to client-side throttling, not priority and fairness, request: GET:https://172.21.0.1:443/apis/node.k8s.io/v1?timeout=32s
FYI @Jooho @tjohnson31415
We should create a separate FVT workflow to test namespace-scope mode, to run in parallel with the existing cluster-scope FVT.