kubevirt/hyperconverged-cluster-operator

Service "cdi-api" not found when trying to create a data-volume; operator installation possibly broken

manas-suleman opened this issue · 7 comments

What happened:
Summary: operator installation appears to be incomplete/broken without any obvious errors. Details below.

I have a new installation of the hyperconverged cluster operator 1.11.0. The operator installs successfully and I see all pods in the "Running" state.
Output of oc get pods -n kubevirt-hyperconverged:

aaq-operator-57b7577bd7-8sglq                          1/1     Running   0          4d9h
bridge-marker-ghbns                                    1/1     Running   0          4d9h
bridge-marker-sjnm5                                    1/1     Running   0          4d9h
bridge-marker-xcknp                                    1/1     Running   1          4d9h
cdi-operator-85dd66559c-f7zgk                          1/1     Running   0          4d9h
cluster-network-addons-operator-7444bdfdff-bpdwd       2/2     Running   0          4d9h
hco-operator-b467b7bdb-sfxhk                           1/1     Running   0          4d9h
hco-webhook-858886f5fb-2wpt2                           1/1     Running   0          4d9h
hostpath-provisioner-operator-5795d65b6c-59945         1/1     Running   0          4d9h
hyperconverged-cluster-cli-download-6cc96f65d5-4hr8p   1/1     Running   0          4d9h
kube-cni-linux-bridge-plugin-4z4xb                     1/1     Running   0          4d9h
kube-cni-linux-bridge-plugin-l4lrf                     1/1     Running   1          4d9h
kube-cni-linux-bridge-plugin-lt7zg                     1/1     Running   0          4d9h
kubemacpool-cert-manager-75f9c84d8-5hmgn               1/1     Running   0          4d9h
kubemacpool-mac-controller-manager-87577f75-kfhmm      2/2     Running   0          4d9h
kubevirt-apiserver-proxy-748654ffc7-9l6fj              1/1     Running   0          4d9h
kubevirt-console-plugin-54c65c9d79-hfcc2               1/1     Running   0          4d9h
mtq-operator-9b55bdd8b-7bg9j                           1/1     Running   0          4d9h
ssp-operator-75dc646fd8-n5w7q                          1/1     Running   0          4d9h
virt-api-56bdddd94-p2gps                               1/1     Running   0          4d9h
virt-api-56bdddd94-w7jwl                               1/1     Running   0          4d9h
virt-controller-5956594b98-l2rhn                       1/1     Running   0          4d9h
virt-controller-5956594b98-l2x79                       1/1     Running   0          4d9h
virt-exportproxy-57968cd7bc-67qsq                      1/1     Running   0          4d9h
virt-exportproxy-57968cd7bc-6wsg2                      1/1     Running   0          4d9h
virt-handler-5qx8n                                     1/1     Running   0          4d9h
virt-handler-7pd77                                     1/1     Running   0          4d9h
virt-handler-b7l9b                                     1/1     Running   0          3d14h
virt-operator-7bc55bf444-q88dt                         1/1     Running   0          4d9h
virt-operator-7bc55bf444-xpq2j                         1/1     Running   0          4d9h

I'm trying to follow this blog for a Windows VM installation.

For the first step, I'm using the following command:
virtctl image-upload pvc windows11-iso --image-path=./Win11_23H2_EnglishInternational_x64v2.iso --size=7Gi --namespace kubevirt-os-images --insecure --access-mode=ReadWriteMany
Output:

Using existing PVC kubevirt-os-images/windows11-iso
Waiting for PVC windows11-iso upload pod to be ready...
timed out waiting for the condition

For the above command, a PVC gets created but I don't see any upload pod being created. I also don't see any errors in any of the pods in the namespace related to the above command. Apparently there needs to be an upload server pod as well, but I can't find it either. After a while, the command times out with the last line shown in the output above.
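As a side note, a quick way to check whether CDI ever scheduled the upload machinery is something like the following (the cdi-upload-<pvc> pod name, the cdi.kubevirt.io annotations and the cdi-uploadproxy service name are assumptions based on a default CDI install, not something verified on this cluster):

oc get pvc windows11-iso -n kubevirt-os-images -o yaml | grep cdi.kubevirt.io
oc get pods -n kubevirt-os-images | grep cdi-upload
oc get svc -n kubevirt-hyperconverged | grep cdi-uploadproxy

If none of these return anything, the problem is on the CDI side rather than with the virtctl command itself.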

To get around this issue, I tried following this advice.

I defined the following DataVolume:

apiVersion: cdi.kubevirt.io/v1beta1
kind: DataVolume
metadata:
  name: upload-datavolume
spec:
  source:
    upload: {}
  pvc:
    accessModes:
      - ReadWriteMany
    resources:
      requests:
        storage: 7Gi

I used this command to create it:
oc apply -f upload-dv.yaml

This is the output I get:

Error from server (InternalError): error when creating "upload-dv.yaml": Internal error occurred: failed calling webhook "datavolume-mutate.cdi.kubevirt.io": failed to call webhook: Post "https://cdi-api.kubevirt-hyper.svc:443/datavolume-mutate?timeout=30s": service "cdi-api" not found

The output implies there should be a service named "cdi-api" with the internal hostname cdi-api.kubevirt-hyper.svc, but no such service exists. These are the services I currently have.
oc get svc -n kubevirt-hyperconverged:

cluster-network-addons-operator-prometheus-metrics   ClusterIP   11.22.114.219   <none>        8443/TCP   4d9h
hco-webhook-service                                  ClusterIP   11.22.88.231    <none>        4343/TCP   4d9h
hostpath-provisioner-operator-service                ClusterIP   11.22.14.60     <none>        9443/TCP   4d9h
hyperconverged-cluster-cli-download                  ClusterIP   11.22.10.30     <none>        8080/TCP   4d9h
kubemacpool-service                                  ClusterIP   11.22.75.35     <none>        443/TCP    4d9h
kubevirt-apiserver-proxy-service                     ClusterIP   11.22.215.194   <none>        8080/TCP   4d9h
kubevirt-console-plugin-service                      ClusterIP   11.22.243.97    <none>        9443/TCP   4d9h
kubevirt-hyperconverged-operator-metrics             ClusterIP   11.22.91.100    <none>        8383/TCP   4d9h
kubevirt-operator-webhook                            ClusterIP   11.22.123.154   <none>        443/TCP    4d9h
kubevirt-prometheus-metrics                          ClusterIP   None             <none>        443/TCP    4d9h
ssp-operator-metrics                                 ClusterIP   11.22.141.185   <none>        443/TCP    4d9h
ssp-operator-service                                 ClusterIP   11.22.208.40    <none>        9443/TCP   4d9h
virt-api                                             ClusterIP   11.22.150.166   <none>        443/TCP    4d9h
virt-exportproxy                                     ClusterIP   11.22.250.234   <none>        443/TCP    4d9h

(IPs have been changed in the above output)

At this point my assumption was that CDI hadn't been successfully installed by the operator, so I tried to install it separately using the following guide, but that didn't fix the issue either.
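One way to confirm what the failing webhook actually points at, and whether the CDI custom resource was ever reconciled, is roughly the following (the resource names are assumptions based on a default HCO deployment):

oc get mutatingwebhookconfigurations | grep cdi
oc get cdi -o yaml

If the CDI CR exists, its status conditions should say why cdi-apiserver, cdi-deployment and cdi-uploadproxy never got rolled out.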

The only indication I see that the operator installation may not have been successful, despite it reporting success, is the screenshot below from the OKD "Overview" tab. But even that doesn't detail any alerts related to the "Degraded" status, so I'm not sure whether it's accurate. I'm also not sure how to troubleshoot this.

(screenshot: OKD "Overview" tab showing a "Degraded" status for the operator)
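Commands along these lines should expose what is behind the "Degraded" badge (the HyperConverged CR name used here is the default one created by the operator; adjust if yours differs):

oc get csv -n kubevirt-hyperconverged
oc describe hyperconverged kubevirt-hyperconverged -n kubevirt-hyperconverged

The status conditions in the describe output are where a "Degraded" reason would show up.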

What you expected to happen:
The upload pod to be created and the ISO image to be uploaded to the persistent volume claim; the data volume to be created without any errors.

How to reproduce it (as minimally and precisely as possible):
In a bare-metal installation of OKD 4.15, install the hyperconverged cluster operator version 1.11.0 from the "OperatorHub".

Additional context:
N/A

Environment:

  • KubeVirt version (use virtctl version): v1.1.1
  • Kubernetes version (use kubectl version): v1.28.2-3598+6e2789bbd58938-dirty
  • VM or VMI specifications: N/A
  • Cloud provider or hardware configuration: baremetal
  • OS (e.g. from /etc/os-release): Fedora CoreOS 39.20240210.3.0
  • Kernel (e.g. uname -a): 6.7.4-200.fc39.x86_64
  • Install tools: N/A
  • Others: N/A

@manas-suleman, on OKD 4.15.0-0.okd-2024-03-10-010116, kubevirt-hyperconverged v1.11.0 is working for me out of the box as shipped in the community-operators catalog:

stirabos@tiraboschip1:~$ oc version
Client Version: 4.15.5
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: 4.15.0-0.okd-2024-03-10-010116
Kubernetes Version: v1.28.2-3598+6e2789bbd58938-dirty
stirabos@tiraboschip1:~$ oc get sub -n kubevirt-hyperconverged
NAME                                PACKAGE                             SOURCE                CHANNEL
community-kubevirt-hyperconverged   community-kubevirt-hyperconverged   community-operators   1.11.0
stirabos@tiraboschip1:~$ oc get pods -n kubevirt-hyperconverged
NAME                                                   READY   STATUS    RESTARTS        AGE
aaq-operator-698ff79599-j4pr6                          1/1     Running   0               9m30s
bridge-marker-2fd7s                                    1/1     Running   0               7m32s
bridge-marker-5rz7q                                    1/1     Running   0               7m32s
bridge-marker-cjbq7                                    1/1     Running   0               7m32s
bridge-marker-d5dz5                                    1/1     Running   0               7m32s
bridge-marker-fbqxv                                    1/1     Running   0               7m32s
bridge-marker-pv5k6                                    1/1     Running   0               7m32s
cdi-apiserver-64d85bb898-jpf9t                         1/1     Running   0               7m29s
cdi-deployment-78c94b68dc-b9w4q                        1/1     Running   0               7m29s
cdi-operator-6d76766945-lnz6x                          1/1     Running   0               9m31s
cdi-uploadproxy-7779ddfc6b-4hn98                       1/1     Running   0               7m30s
cluster-network-addons-operator-69f755cfcf-zqh9z       2/2     Running   0               9m54s
hco-operator-66ccf794cd-gl4zf                          1/1     Running   0               9m55s
hco-webhook-56c79c57fc-g52fv                           1/1     Running   0               9m54s
hostpath-provisioner-operator-766c7889d4-zr46g         1/1     Running   0               9m31s
hyperconverged-cluster-cli-download-64fbfd4497-29kst   1/1     Running   0               9m54s
kube-cni-linux-bridge-plugin-bgx5q                     1/1     Running   0               7m32s
kube-cni-linux-bridge-plugin-fsfz6                     1/1     Running   0               7m32s
kube-cni-linux-bridge-plugin-hts97                     1/1     Running   0               7m32s
kube-cni-linux-bridge-plugin-lcd44                     1/1     Running   0               7m32s
kube-cni-linux-bridge-plugin-vszsg                     1/1     Running   0               7m32s
kube-cni-linux-bridge-plugin-xfjqz                     1/1     Running   0               7m32s
kubemacpool-cert-manager-75f9c84d8-brdmd               1/1     Running   0               7m32s
kubemacpool-mac-controller-manager-87577f75-5j8dh      2/2     Running   0               7m31s
kubevirt-apiserver-proxy-748654ffc7-8c79s              1/1     Running   0               7m31s
kubevirt-console-plugin-54c65c9d79-mqnst               1/1     Running   0               7m31s
mtq-operator-6f7d9d96db-xr6rb                          1/1     Running   0               9m30s
ssp-operator-5d4cc47887-rbzsk                          1/1     Running   1 (6m57s ago)   9m53s
virt-api-56bdddd94-p2z6l                               1/1     Running   0               6m49s
virt-api-56bdddd94-s7n24                               1/1     Running   0               6m49s
virt-controller-5956594b98-54rls                       1/1     Running   0               6m24s
virt-controller-5956594b98-xmrss                       1/1     Running   0               6m24s
virt-exportproxy-57968cd7bc-gsfsm                      1/1     Running   0               6m23s
virt-exportproxy-57968cd7bc-zg6mr                      1/1     Running   0               6m24s
virt-handler-2vdxt                                     1/1     Running   0               6m23s
virt-handler-kc84h                                     1/1     Running   0               6m23s
virt-handler-wqxv2                                     1/1     Running   0               6m23s
virt-operator-5fdb4bdc96-78mj5                         1/1     Running   1 (19s ago)     9m32s
virt-operator-5fdb4bdc96-f4ms8                         1/1     Running   0               9m32s
virt-template-validator-777fd88fbb-9s9xw               1/1     Running   0               6m25s
virt-template-validator-777fd88fbb-pknhp               1/1     Running   0               6m25s
stirabos@tiraboschip1:~$ oc get pods -n simone
NAME                                         READY   STATUS    RESTARTS   AGE
virt-launcher-fedora-maroon-macaw-44-z77bg   1/1     Running   0          2m27s
stirabos@tiraboschip1:~$ oc get vm -n simone
NAME                     AGE     STATUS    READY
fedora-maroon-macaw-44   2m35s   Running   True
stirabos@tiraboschip1:~$ virtctl console -n simone fedora-maroon-macaw-44
Successfully connected to fedora-maroon-macaw-44 console. The escape sequence is ^]

fedora-maroon-macaw-44 login: fedora
Password: 
[fedora@fedora-maroon-macaw-44 ~]$ 
[fedora@fedora-maroon-macaw-44 ~]$ 
[fedora@fedora-maroon-macaw-44 ~]$ whoami
fedora
[fedora@fedora-maroon-macaw-44 ~]$ 
[fedora@fedora-maroon-macaw-44 ~]$ exit
logout

Fedora Linux 40 (Cloud Edition)
Kernel 6.8.5-301.fc40.x86_64 on an x86_64 (ttyS0)

eth0: 10.0.2.2 fe80::92:2dff:fe00:0
fedora-maroon-macaw-44 login:
stirabos@tiraboschip1:~$ 

Can you please share the list of deployments in the kubevirt-hyperconverged namespace and the logs of the cdi-operator pod?

$ oc get deployment -n kubevirt-hyperconverged
NAME                                  READY   UP-TO-DATE   AVAILABLE   AGE
aaq-operator                          1/1     1            1           15m
cdi-apiserver                         1/1     1            1           12m
cdi-deployment                        1/1     1            1           12m
cdi-operator                          1/1     1            1           15m
cdi-uploadproxy                       1/1     1            1           12m
cluster-network-addons-operator       1/1     1            1           15m
hco-operator                          1/1     1            1           15m
hco-webhook                           1/1     1            1           15m
hostpath-provisioner-operator         1/1     1            1           15m
hyperconverged-cluster-cli-download   1/1     1            1           15m
kubemacpool-cert-manager              1/1     1            1           12m
kubemacpool-mac-controller-manager    1/1     1            1           12m
kubevirt-apiserver-proxy              1/1     1            1           12m
kubevirt-console-plugin               1/1     1            1           12m
mtq-operator                          1/1     1            1           15m
ssp-operator                          1/1     1            1           15m
virt-api                              2/2     2            2           12m
virt-controller                       2/2     2            2           11m
virt-exportproxy                      2/2     2            2           11m
virt-operator                         2/2     2            2           15m
virt-template-validator               2/2     2            2           11m
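For reference, both can be collected with something like this (oc logs run against a deployment picks one of its pods):

oc get deployment -n kubevirt-hyperconverged
oc logs -n kubevirt-hyperconverged deployment/cdi-operator > cdi-operator.log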

Hi @tiraboschi,

Thanks for your reply. Here are the deployments in the kubevirt-hyperconverged namespace.

oc get deployments -n kubevirt-hyperconverged
NAME                                  READY   UP-TO-DATE   AVAILABLE   AGE
aaq-operator                          1/1     1            1           9d
cdi-operator                          1/1     1            1           9d
cluster-network-addons-operator       1/1     1            1           9d
hco-operator                          1/1     1            1           9d
hco-webhook                           1/1     1            1           9d
hostpath-provisioner-operator         1/1     1            1           9d
hyperconverged-cluster-cli-download   1/1     1            1           9d
kubemacpool-cert-manager              1/1     1            1           9d
kubemacpool-mac-controller-manager    1/1     1            1           9d
kubevirt-apiserver-proxy              1/1     1            1           9d
kubevirt-console-plugin               1/1     1            1           9d
mtq-operator                          1/1     1            1           9d
ssp-operator                          1/1     1            1           9d
virt-api                              2/2     2            2           9d
virt-controller                       2/2     2            2           9d
virt-exportproxy                      2/2     2            2           9d
virt-operator                         2/2     2            2           9d

Looks like I don't have the cdi-apiserver, cdi-deployment, cdi-uploadproxy, and virt-template-validator deployments. I can try reinstalling the operator and get back to you. Please let me know if you have any thoughts on this.

Thanks,

Do you have enough resources on that environment?
We need cdi-operator logs to understand why it failed there.
I just uploaded a quick video walkthrough about deploying on OKD and starting your first VM: #3001
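A quick way to check the actual headroom, assuming the metrics stack is running, is something like:

oc adm top nodes
oc describe nodes | grep -A 8 'Allocated resources'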

I believe there are more than enough compute and memory resources on the servers. As for the logs, sorry, I forgot to attach them in the last comment. Here they are, although the earliest entry in the logs is from today and the operator was installed over a week ago, so I doubt these will be of much help.

cdi-operator-85dd66559c-f7zgk-cdi-operator.log

That being said, thanks a lot for the walkthrough video. I'll do a redeployment following the steps in it and share the updated logs if I still see the issue.

Cheers,

Here the error is clearly:

{"level":"error","ts":"2024-06-14T16:13:30Z","logger":"cdi-operator","msg":"error getting apiserver ca bundle","error":"ConfigMap \"cdi-apiserver-signer-bundle\" not found", ...

in a loop.
That ConfigMap should be created by cdi-operator itself.

We need to understand why it was never created on your cluster.

In the cdi-operator logs on my fresh environment I see:

{"level":"debug","ts":"2024-06-14T15:35:38Z","logger":"events","msg":"Successfully created resource *v1.ConfigMap cdi-apiserver-signer-bundle","type":"Normal","object":{"kind":"CDI","name":"cdi-kubevirt-hyperconverged","uid":"dbdcad93-bf55-46d0-98f1-202ece99cefa","apiVersion":"cdi.kubevirt.io/v1beta1","resourceVersion":"87065"},"reason":"CreateResourceSuccess"}

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

/lifecycle stale

I'm not sure what caused the partial deployment to fail and some objects like ConfigMaps to not be created in the first place, but I was able to fix it by just reinstalling the operator.