openshift/cluster-nfd-operator

Could not install NodeFeatureDiscovery on OKD 4.5 and 4.6 clusters via OperatorHub.

Closed this issue · 16 comments

On my OKD cluster, I found the NFD operator on OperatorHub, so I tried to install it from there, but it failed to install.
(Before I found it on OperatorHub, I used to install it manually by cloning the repo from Git and deploying with the make command.)
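For reference, the manual route looked roughly like this (a sketch; the exact Makefile target may differ between branches, so check the repo's Makefile):

# Clone the operator source and deploy it from the working tree.
# The "deploy" target is an assumption; some branches name it differently.
git clone https://github.com/openshift/cluster-nfd-operator.git
cd cluster-nfd-operator
make deploy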

I tried the installation both ways (All Namespaces, the default, and a specific Namespace), but the following error occurred:
[screenshot of the installation error]

{"level":"info","ts":1616045944.9220765,"logger":"cmd","msg":"Go Version: go1.15.5"}
{"level":"info","ts":1616045944.9221418,"logger":"cmd","msg":"Go OS/Arch: linux/amd64"}
{"level":"info","ts":1616045944.9221659,"logger":"cmd","msg":"Version of operator-sdk: v0.4.0+git"}
{"level":"info","ts":1616045944.922823,"logger":"leader","msg":"Trying to become the leader."}
{"level":"info","ts":1616045946.8118556,"logger":"leader","msg":"No pre-existing lock was found."}
{"level":"info","ts":1616045946.8415146,"logger":"leader","msg":"Became the leader."}
{"level":"info","ts":1616045950.8142676,"logger":"controller-runtime.metrics","msg":"metrics server is starting to listen","addr":":8080"}
{"level":"info","ts":1616045950.8169394,"logger":"cmd","msg":"Registering Components."}
{"level":"info","ts":1616045950.817697,"logger":"cmd","msg":"Starting the Cmd."}
{"level":"info","ts":1616045950.818323,"logger":"controller-runtime.controller","msg":"Starting EventSource","controller":"nodefeaturediscovery-controller","source":"kind source: /, Kind="}
{"level":"info","ts":1616045950.8187845,"logger":"controller-runtime.controller","msg":"Starting EventSource","controller":"nodefeaturediscovery-controller","source":"kind source: /, Kind="}
{"level":"info","ts":1616045950.8190293,"logger":"controller-runtime.controller","msg":"Starting EventSource","controller":"nodefeaturediscovery-controller","source":"kind source: /, Kind="}
{"level":"info","ts":1616045950.8192632,"logger":"controller-runtime.controller","msg":"Starting EventSource","controller":"nodefeaturediscovery-controller","source":"kind source: /, Kind="}
{"level":"info","ts":1616045950.8194685,"logger":"controller-runtime.controller","msg":"Starting EventSource","controller":"nodefeaturediscovery-controller","source":"kind source: /, Kind="}
{"level":"info","ts":1616045950.8199208,"logger":"controller-runtime.manager","msg":"starting metrics server","path":"/metrics"}
{"level":"info","ts":1616045950.8208203,"logger":"controller-runtime.controller","msg":"Starting EventSource","controller":"nodefeaturediscovery-controller","source":"kind source: /, Kind="}
{"level":"info","ts":1616045950.8210897,"logger":"controller-runtime.controller","msg":"Starting EventSource","controller":"nodefeaturediscovery-controller","source":"kind source: /, Kind="}
{"level":"info","ts":1616045950.8213153,"logger":"controller-runtime.controller","msg":"Starting EventSource","controller":"nodefeaturediscovery-controller","source":"kind source: /, Kind="}
{"level":"info","ts":1616045950.821549,"logger":"controller-runtime.controller","msg":"Starting EventSource","controller":"nodefeaturediscovery-controller","source":"kind source: /, Kind="}
{"level":"info","ts":1616045950.8217416,"logger":"controller-runtime.controller","msg":"Starting EventSource","controller":"nodefeaturediscovery-controller","source":"kind source: /, Kind="}
{"level":"info","ts":1616045950.8219752,"logger":"controller-runtime.controller","msg":"Starting EventSource","controller":"nodefeaturediscovery-controller","source":"kind source: /, Kind="}
{"level":"info","ts":1616045950.8221507,"logger":"controller-runtime.controller","msg":"Starting Controller","controller":"nodefeaturediscovery-controller"}
{"level":"info","ts":1616045951.0225158,"logger":"controller-runtime.controller","msg":"Starting workers","controller":"nodefeaturediscovery-controller","worker count":1}

How can I solve this? Does anyone have the same error?
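For reference, installing from OperatorHub comes down to creating an OLM Subscription, and creating one by hand can make the failure easier to reproduce and debug. A minimal sketch (the package, channel, and catalog-source names below are assumptions; list the real ones first):

# Find the actual package name and channels before applying:
#   oc get packagemanifests -n openshift-marketplace | grep -i nfd
cat <<EOF | oc apply -f -
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: nfd
  namespace: openshift-operators
spec:
  channel: "4.6"
  name: nfd
  source: community-operators
  sourceNamespace: openshift-marketplace
EOF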

Same error on OCP 4.6.19.

Hi @lgc0313, I am going to fix this today; it looks like the community operator on OperatorHub is outdated.

@ArangoGutierrez Thanks. Waiting for the fix to go live.

operator-framework/community-operators#3402 is merged; give it a day to roll out. All fixes should be in place now.

@ArangoGutierrez I still see the same issue now.

Hi @rupang790, @lgc0313, is this still an issue?
If so, would you let me know your cluster version and the operator version from OperatorHub?
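A quick way to pull those details:

# Cluster version
oc get clusterversion
# Installed operator versions (CSVs) and their phases
oc get csv -n openshift-operators
# Channels the catalog currently serves (assuming the package is published as "nfd")
oc get packagemanifest nfd -n openshift-marketplace -o jsonpath='{.status.channels[*].name}'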

Hi @ArangoGutierrez,
On my OKD cluster (4.6.0-0.okd-2021-02-14-205305) it is now installed successfully through OperatorHub. I used NFD version 4.7 and it works for me. For @lgc0313's sake, I will not close this issue yet.

However, if I want to install the Special Resource Operator (SRO), should I delete the NFD operator first?
(I saw that SRO installs NFD itself again.)

Thank you for fixing the issue.

For SRO, let's ask @dagrayvid.

Hi @rupang790, we shouldn't need to uninstall NFD before installing SRO. I think the reason it was installing NFD again is that SRO's dependency on NFD was out of date and asked for an NFD version older than 4.7, so it installed 4.5 or 4.6 alongside the already-installed NFD 4.7. This was updated last Friday in community-operators, so it should now work with NFD 4.7 already installed.
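If you want to double-check the resolution on your side, something like this should reveal whether two NFD versions ended up installed (a sketch; the namespace depends on your install mode):

# Every NFD CSV that OLM has installed, across all namespaces
oc get csv -A | grep -i nfd
# The Subscription records which channel and version were resolved
oc get subscription -n openshift-operators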

Let me know if you have any questions or are still running into this issue!

@dagrayvid, I tried to install SRO on my cluster, but it does not seem to reach an installed status.
I used OperatorHub to install it and confirmed the NFD version is 4.7.
[screenshots of the operator status and NFD version in OperatorHub]

I can see that it created only a service (no deployments or daemonsets for the operator):

Every 1.0s: oc get all -n openshift-operators                                                                                                                                                          okd-bastion01: Tue Apr 20 08:21:27 2021

NAME                                READY   STATUS    RESTARTS   AGE
pod/nfd-master-2xk2r                1/1     Running   0          31h
pod/nfd-master-qzxs2                1/1     Running   0          31h
pod/nfd-master-w5t5c                1/1     Running   0          31h
pod/nfd-operator-576d77d47f-r9qrf   1/1     Running   0          31h
pod/nfd-worker-2hc2f                1/1     Running   0          31h
pod/nfd-worker-4jrs8                1/1     Running   0          31h
pod/nfd-worker-4q8jt                1/1     Running   0          31h
pod/nfd-worker-k5xmz                1/1     Running   0          31h

NAME                                                          TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)     AGE
service/nfd-master                                            ClusterIP   172.30.101.100   <none>        12000/TCP   31h
service/special-resource-controller-manager-metrics-service   ClusterIP   172.30.200.31    <none>        8443/TCP    10m

NAME                        DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                     AGE
daemonset.apps/nfd-master   3         3         3        3            3           node-role.kubernetes.io/master=   31h
daemonset.apps/nfd-worker   4         4         4        4            4           <none>                            31h

NAME                           READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/nfd-operator   1/1     1            1           31h

NAME                                      DESIRED   CURRENT   READY   AGE
replicaset.apps/nfd-operator-576d77d47f   1         1         1       31h

NAME                                                              AGE
vmimportconfig.v2v.kubevirt.io/vmimport-kubevirt-hyperconverged   39d

Are there any logs about the installation of the operators?
One more thing: can SRO be installed when a GPU device exists on a node?

If I should open a new issue about this on the SRO GitHub, please let me know.
Thank you for the comments.

Hi @rupang790, I haven't seen this error before, so I will have to investigate. Please open an issue on the upstream SRO GitHub to track this.

Is this on OKD? Having GPU devices on the node should not cause any problems.
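On the logging question: OLM records install progress in InstallPlans and in the CSV's status conditions, and the OLM pods keep their own logs. A sketch of where to look (the CSV name below is a placeholder):

# Phase and conditions of the stuck operator
oc get csv,installplan -n openshift-operators
oc describe csv <special-resource-operator-csv-name> -n openshift-operators
# OLM's own logs
oc logs deployment/olm-operator -n openshift-operator-lifecycle-manager
oc logs deployment/catalog-operator -n openshift-operator-lifecycle-manager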

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

I am trying to install NFD 4.7 on OCP 4.6.26, but I am getting this error:

[screenshot of the error]

@rupang790 @dagrayvid were you able to get this working?

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

@openshift-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.