Service "cdi-api" not found when trying to create a data-volume; operator installation possibly broken
manas-suleman opened this issue · 7 comments
What happened:
Summary: operator installation appears to be incomplete/broken without any obvious errors. Details below.
I have a new installation of the hyperconverged cluster operator 1.11.0. The operator installs successfully and I see all pods in the "Running" state.
Output of oc get pods -n kubevirt-hyperconverged:
aaq-operator-57b7577bd7-8sglq 1/1 Running 0 4d9h
bridge-marker-ghbns 1/1 Running 0 4d9h
bridge-marker-sjnm5 1/1 Running 0 4d9h
bridge-marker-xcknp 1/1 Running 1 4d9h
cdi-operator-85dd66559c-f7zgk 1/1 Running 0 4d9h
cluster-network-addons-operator-7444bdfdff-bpdwd 2/2 Running 0 4d9h
hco-operator-b467b7bdb-sfxhk 1/1 Running 0 4d9h
hco-webhook-858886f5fb-2wpt2 1/1 Running 0 4d9h
hostpath-provisioner-operator-5795d65b6c-59945 1/1 Running 0 4d9h
hyperconverged-cluster-cli-download-6cc96f65d5-4hr8p 1/1 Running 0 4d9h
kube-cni-linux-bridge-plugin-4z4xb 1/1 Running 0 4d9h
kube-cni-linux-bridge-plugin-l4lrf 1/1 Running 1 4d9h
kube-cni-linux-bridge-plugin-lt7zg 1/1 Running 0 4d9h
kubemacpool-cert-manager-75f9c84d8-5hmgn 1/1 Running 0 4d9h
kubemacpool-mac-controller-manager-87577f75-kfhmm 2/2 Running 0 4d9h
kubevirt-apiserver-proxy-748654ffc7-9l6fj 1/1 Running 0 4d9h
kubevirt-console-plugin-54c65c9d79-hfcc2 1/1 Running 0 4d9h
mtq-operator-9b55bdd8b-7bg9j 1/1 Running 0 4d9h
ssp-operator-75dc646fd8-n5w7q 1/1 Running 0 4d9h
virt-api-56bdddd94-p2gps 1/1 Running 0 4d9h
virt-api-56bdddd94-w7jwl 1/1 Running 0 4d9h
virt-controller-5956594b98-l2rhn 1/1 Running 0 4d9h
virt-controller-5956594b98-l2x79 1/1 Running 0 4d9h
virt-exportproxy-57968cd7bc-67qsq 1/1 Running 0 4d9h
virt-exportproxy-57968cd7bc-6wsg2 1/1 Running 0 4d9h
virt-handler-5qx8n 1/1 Running 0 4d9h
virt-handler-7pd77 1/1 Running 0 4d9h
virt-handler-b7l9b 1/1 Running 0 3d14h
virt-operator-7bc55bf444-q88dt 1/1 Running 0 4d9h
virt-operator-7bc55bf444-xpq2j 1/1 Running 0 4d9h
I'm trying to follow this blog for a Windows VM installation.
For the first step, I'm using the following command:
virtctl image-upload pvc windows11-iso --image-path=./Win11_23H2_EnglishInternational_x64v2.iso --size=7Gi --namespace kubevirt-os-images --insecure --access-mode=ReadWriteMany
output:
Using existing PVC kubevirt-os-images/windows11-iso
Waiting for PVC windows11-iso upload pod to be ready...
timed out waiting for the condition
For the above command, a PVC gets created but I don't see any upload pod being created. Also, I don't see any errors related to the above command in any of the pods in the namespace. Apparently there needs to be an upload server pod as well, but I can't find it either. After a while, the command times out with the last line of the output above.
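A quick way to check whether the CDI control plane ever came up (a generic sketch, not taken from the original report) is to look for the CDI pods and at the cluster-scoped CDI custom resource that cdi-operator reconciles:
# A healthy install should show cdi-apiserver, cdi-deployment and cdi-uploadproxy pods.
oc get pods -n kubevirt-hyperconverged | grep cdi
# The CDI custom resource's status conditions report whether cdi-operator
# finished deploying its components.
oc get cdi -o yaml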
To get around this issue, I tried following this advice.
I defined the following DataVolume:
apiVersion: cdi.kubevirt.io/v1beta1
kind: DataVolume
metadata:
  name: upload-datavolume
spec:
  source:
    upload: {}
  pvc:
    accessModes:
      - ReadWriteMany
    resources:
      requests:
        storage: 7Gi
I create it with this command:
oc apply -f upload-dv.yaml
This is the output I get:
Error from server (InternalError): error when creating "upload-dv.yaml": Internal error occurred: failed calling webhook "datavolume-mutate.cdi.kubevirt.io": failed to call webhook: Post "https://cdi-api.kubevirt-hyper.svc:443/datavolume-mutate?timeout=30s": service "cdi-api" not found
The output implies there should be a service named "cdi-api" with the internal hostname cdi-api.kubevirt-hyper.svc, but no such service exists. These are the services I currently have.
Output of oc get svc -n kubevirt-hyperconverged:
cluster-network-addons-operator-prometheus-metrics ClusterIP 11.22.114.219 <none> 8443/TCP 4d9h
hco-webhook-service ClusterIP 11.22.88.231 <none> 4343/TCP 4d9h
hostpath-provisioner-operator-service ClusterIP 11.22.14.60 <none> 9443/TCP 4d9h
hyperconverged-cluster-cli-download ClusterIP 11.22.10.30 <none> 8080/TCP 4d9h
kubemacpool-service ClusterIP 11.22.75.35 <none> 443/TCP 4d9h
kubevirt-apiserver-proxy-service ClusterIP 11.22.215.194 <none> 8080/TCP 4d9h
kubevirt-console-plugin-service ClusterIP 11.22.243.97 <none> 9443/TCP 4d9h
kubevirt-hyperconverged-operator-metrics ClusterIP 11.22.91.100 <none> 8383/TCP 4d9h
kubevirt-operator-webhook ClusterIP 11.22.123.154 <none> 443/TCP 4d9h
kubevirt-prometheus-metrics ClusterIP None <none> 443/TCP 4d9h
ssp-operator-metrics ClusterIP 11.22.141.185 <none> 443/TCP 4d9h
ssp-operator-service ClusterIP 11.22.208.40 <none> 9443/TCP 4d9h
virt-api ClusterIP 11.22.150.166 <none> 443/TCP 4d9h
virt-exportproxy ClusterIP 11.22.250.234 <none> 443/TCP 4d9h
(IPs have been changed in the above output)
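For reference, the webhook error above comes from a MutatingWebhookConfiguration that still points at the missing cdi-api service; it can be inspected with something like the following sketch (the exact webhook configuration name is an assumption and may vary by CDI version):
# List the CDI admission webhook configurations registered in the cluster.
oc get mutatingwebhookconfiguration,validatingwebhookconfiguration | grep cdi
# Show which service a given webhook points at (name assumed; use whatever the
# grep above returns).
oc get mutatingwebhookconfiguration cdi-api-datavolume-mutate -o yaml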
At this point my assumption was that CDI wasn't successfully installed by the operator, so I tried to install it separately using the following guide, but that didn't fix the issue either.
The only indication I see that the operator installation may not have been successful, despite it reporting success, is the screenshot below from the OKD "Overview" tab. But even that doesn't show any alerts related to the "degraded" status, so I'm not sure if it's accurate. I'm also not sure how to troubleshoot this.
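To get more detail than the console overview, the status conditions on the HyperConverged CR and the OLM objects are usually the next place to look; a sketch, assuming the default CR name that HCO creates:
# Available/Progressing/Degraded conditions plus messages from the component operators.
oc get hyperconverged kubevirt-hyperconverged -n kubevirt-hyperconverged -o yaml
# OLM's view of the operator installation, and recent events in the namespace.
oc get csv -n kubevirt-hyperconverged
oc get events -n kubevirt-hyperconverged --sort-by=.lastTimestamp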
What you expected to happen:
The upload pod to be created and the ISO image to be uploaded to the persistent volume claim; the data volume to be created without any errors.
How to reproduce it (as minimally and precisely as possible):
In a bare-metal installation of OKD 4.15, install the hyperconverged cluster operator version 1.11.0 from the "OperatorHub".
Additional context:
Environment:
- KubeVirt version (use virtctl version): v1.1.1
- Kubernetes version (use kubectl version): v1.28.2-3598+6e2789bbd58938-dirty
- VM or VMI specifications: N/A
- Cloud provider or hardware configuration: baremetal
- OS (e.g. from /etc/os-release): Fedora CoreOS 39.20240210.3.0
- Kernel (e.g. uname -a): 6.7.4-200.fc39.x86_64
- Install tools: N/A
- Others: N/A
@manas-suleman, on OKD 4.15.0-0.okd-2024-03-10-010116, kubevirt-hyperconverged v1.11.0 is working for me out of the box as shipped on the community-operators catalog:
stirabos@tiraboschip1:~$ oc version
Client Version: 4.15.5
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: 4.15.0-0.okd-2024-03-10-010116
Kubernetes Version: v1.28.2-3598+6e2789bbd58938-dirty
stirabos@tiraboschip1:~$ oc get sub -n kubevirt-hyperconverged
NAME PACKAGE SOURCE CHANNEL
community-kubevirt-hyperconverged community-kubevirt-hyperconverged community-operators 1.11.0
stirabos@tiraboschip1:~$ oc get pods -n kubevirt-hyperconverged
NAME READY STATUS RESTARTS AGE
aaq-operator-698ff79599-j4pr6 1/1 Running 0 9m30s
bridge-marker-2fd7s 1/1 Running 0 7m32s
bridge-marker-5rz7q 1/1 Running 0 7m32s
bridge-marker-cjbq7 1/1 Running 0 7m32s
bridge-marker-d5dz5 1/1 Running 0 7m32s
bridge-marker-fbqxv 1/1 Running 0 7m32s
bridge-marker-pv5k6 1/1 Running 0 7m32s
cdi-apiserver-64d85bb898-jpf9t 1/1 Running 0 7m29s
cdi-deployment-78c94b68dc-b9w4q 1/1 Running 0 7m29s
cdi-operator-6d76766945-lnz6x 1/1 Running 0 9m31s
cdi-uploadproxy-7779ddfc6b-4hn98 1/1 Running 0 7m30s
cluster-network-addons-operator-69f755cfcf-zqh9z 2/2 Running 0 9m54s
hco-operator-66ccf794cd-gl4zf 1/1 Running 0 9m55s
hco-webhook-56c79c57fc-g52fv 1/1 Running 0 9m54s
hostpath-provisioner-operator-766c7889d4-zr46g 1/1 Running 0 9m31s
hyperconverged-cluster-cli-download-64fbfd4497-29kst 1/1 Running 0 9m54s
kube-cni-linux-bridge-plugin-bgx5q 1/1 Running 0 7m32s
kube-cni-linux-bridge-plugin-fsfz6 1/1 Running 0 7m32s
kube-cni-linux-bridge-plugin-hts97 1/1 Running 0 7m32s
kube-cni-linux-bridge-plugin-lcd44 1/1 Running 0 7m32s
kube-cni-linux-bridge-plugin-vszsg 1/1 Running 0 7m32s
kube-cni-linux-bridge-plugin-xfjqz 1/1 Running 0 7m32s
kubemacpool-cert-manager-75f9c84d8-brdmd 1/1 Running 0 7m32s
kubemacpool-mac-controller-manager-87577f75-5j8dh 2/2 Running 0 7m31s
kubevirt-apiserver-proxy-748654ffc7-8c79s 1/1 Running 0 7m31s
kubevirt-console-plugin-54c65c9d79-mqnst 1/1 Running 0 7m31s
mtq-operator-6f7d9d96db-xr6rb 1/1 Running 0 9m30s
ssp-operator-5d4cc47887-rbzsk 1/1 Running 1 (6m57s ago) 9m53s
virt-api-56bdddd94-p2z6l 1/1 Running 0 6m49s
virt-api-56bdddd94-s7n24 1/1 Running 0 6m49s
virt-controller-5956594b98-54rls 1/1 Running 0 6m24s
virt-controller-5956594b98-xmrss 1/1 Running 0 6m24s
virt-exportproxy-57968cd7bc-gsfsm 1/1 Running 0 6m23s
virt-exportproxy-57968cd7bc-zg6mr 1/1 Running 0 6m24s
virt-handler-2vdxt 1/1 Running 0 6m23s
virt-handler-kc84h 1/1 Running 0 6m23s
virt-handler-wqxv2 1/1 Running 0 6m23s
virt-operator-5fdb4bdc96-78mj5 1/1 Running 1 (19s ago) 9m32s
virt-operator-5fdb4bdc96-f4ms8 1/1 Running 0 9m32s
virt-template-validator-777fd88fbb-9s9xw 1/1 Running 0 6m25s
virt-template-validator-777fd88fbb-pknhp 1/1 Running 0 6m25s
stirabos@tiraboschip1:~$ oc get pods -n simone
NAME READY STATUS RESTARTS AGE
virt-launcher-fedora-maroon-macaw-44-z77bg 1/1 Running 0 2m27s
stirabos@tiraboschip1:~$ oc get vm -n simone
NAME AGE STATUS READY
fedora-maroon-macaw-44 2m35s Running True
stirabos@tiraboschip1:~$ virtctl console -n simone fedora-maroon-macaw-44
Successfully connected to fedora-maroon-macaw-44 console. The escape sequence is ^]
fedora-maroon-macaw-44 login: fedora
Password:
[fedora@fedora-maroon-macaw-44 ~]$
[fedora@fedora-maroon-macaw-44 ~]$
[fedora@fedora-maroon-macaw-44 ~]$ whoami
fedora
[fedora@fedora-maroon-macaw-44 ~]$
[fedora@fedora-maroon-macaw-44 ~]$ exit
logout
Fedora Linux 40 (Cloud Edition)
Kernel 6.8.5-301.fc40.x86_64 on an x86_64 (ttyS0)
eth0: 10.0.2.2 fe80::92:2dff:fe00:0
fedora-maroon-macaw-44 login:
stirabos@tiraboschip1:~$
Can you please share the list of deployments in the kubevirt-hyperconverged namespace and the logs of the cdi-operator pod?
$ oc get deployment -n kubevirt-hyperconverged
NAME READY UP-TO-DATE AVAILABLE AGE
aaq-operator 1/1 1 1 15m
cdi-apiserver 1/1 1 1 12m
cdi-deployment 1/1 1 1 12m
cdi-operator 1/1 1 1 15m
cdi-uploadproxy 1/1 1 1 12m
cluster-network-addons-operator 1/1 1 1 15m
hco-operator 1/1 1 1 15m
hco-webhook 1/1 1 1 15m
hostpath-provisioner-operator 1/1 1 1 15m
hyperconverged-cluster-cli-download 1/1 1 1 15m
kubemacpool-cert-manager 1/1 1 1 12m
kubemacpool-mac-controller-manager 1/1 1 1 12m
kubevirt-apiserver-proxy 1/1 1 1 12m
kubevirt-console-plugin 1/1 1 1 12m
mtq-operator 1/1 1 1 15m
ssp-operator 1/1 1 1 15m
virt-api 2/2 2 2 12m
virt-controller 2/2 2 2 11m
virt-exportproxy 2/2 2 2 11m
virt-operator 2/2 2 2 15m
virt-template-validator 2/2 2 2 11m
Hi @tiraboschi,
Thanks for your reply. Here are the deployments in the kubevirt-hyperconverged namespace.
oc get deployments -n kubevirt-hyperconverged
NAME READY UP-TO-DATE AVAILABLE AGE
aaq-operator 1/1 1 1 9d
cdi-operator 1/1 1 1 9d
cluster-network-addons-operator 1/1 1 1 9d
hco-operator 1/1 1 1 9d
hco-webhook 1/1 1 1 9d
hostpath-provisioner-operator 1/1 1 1 9d
hyperconverged-cluster-cli-download 1/1 1 1 9d
kubemacpool-cert-manager 1/1 1 1 9d
kubemacpool-mac-controller-manager 1/1 1 1 9d
kubevirt-apiserver-proxy 1/1 1 1 9d
kubevirt-console-plugin 1/1 1 1 9d
mtq-operator 1/1 1 1 9d
ssp-operator 1/1 1 1 9d
virt-api 2/2 2 2 9d
virt-controller 2/2 2 2 9d
virt-exportproxy 2/2 2 2 9d
virt-operator 2/2 2 2 9d
Looks like I don't have the cdi-apiserver, cdi-deployment, cdi-uploadproxy, and virt-template-validator deployments. I can try reinstalling the operator and get back to you. Please let me know if you have any thoughts about this.
Thanks,
Do you have enough resources on that environment?
We need cdi-operator logs to understand why it failed there.
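Something like this should capture them (a sketch, assuming the default HCO namespace; the pod name is the one shown earlier in this issue):
# Dump the cdi-operator logs so they can be attached here.
oc logs -n kubevirt-hyperconverged deployment/cdi-operator > cdi-operator.log
# If the pod has restarted, --previous on the pod shows the earlier run.
oc logs -n kubevirt-hyperconverged cdi-operator-85dd66559c-f7zgk --previous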
I just uploaded a quick video walkthrough about deploying on OKD and starting your first VM: #3001
I believe there is more than enough compute and memory resource on the servers. For the logs, sorry, I forgot to attach them in the last comment. Here they are, although the earliest log entry is from today and the operator was installed over a week ago, so I doubt these will be of much help.
cdi-operator-85dd66559c-f7zgk-cdi-operator.log
That being said, thanks a lot for the walkthrough video. I'll do a redeployment following the steps in it and share the updated logs if I still see the issue.
cheers,
Here the error is clearly:
{"level":"error","ts":"2024-06-14T16:13:30Z","logger":"cdi-operator","msg":"error getting apiserver ca bundle","error":"ConfigMap \"cdi-apiserver-signer-bundle\" not found", ...
in a loop.
That ConfigMap should be created by cdi-operator itself.
We need to understand why it got missed on your cluster.
In the cdi-operator logs on my fresh environment I see:
{"level":"debug","ts":"2024-06-14T15:35:38Z","logger":"events","msg":"Successfully created resource *v1.ConfigMap cdi-apiserver-signer-bundle","type":"Normal","object":{"kind":"CDI","name":"cdi-kubevirt-hyperconverged","uid":"dbdcad93-bf55-46d0-98f1-202ece99cefa","apiVersion":"cdi.kubevirt.io/v1beta1","resourceVersion":"87065"},"reason":"CreateResourceSuccess"}
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
/lifecycle stale
I'm not sure what caused the partial deployment to fail and some objects like the ConfigMap to not be created in the first place, but I was able to fix it by simply reinstalling the operator.
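For anyone hitting the same thing, the reinstall amounts to removing the HyperConverged CR plus the OLM subscription and CSV and installing the operator again from OperatorHub; a rough sketch (the CR and subscription names are taken from earlier in this thread, the CSV name differs per cluster):
# Remove the HyperConverged CR first so the component operators can clean up.
oc delete hyperconverged kubevirt-hyperconverged -n kubevirt-hyperconverged
# Remove the OLM subscription and its ClusterServiceVersion, then reinstall from OperatorHub.
oc delete subscription community-kubevirt-hyperconverged -n kubevirt-hyperconverged
oc get csv -n kubevirt-hyperconverged
oc delete csv <csv-name-from-previous-command> -n kubevirt-hyperconverged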