master bootstrapping failing - no admitted ingress for route
Closed this issue · 2 comments
elysweyr commented
Hello,
It seems like the installation process is failing at the master bootstrapping step.
Any idea what could cause a master node to never finish bootstrapping? (I left it running for more than a day and it still wasn't finished.)
I'd appreciate any help!
Container logs of the master node are attached:
master-container-logs.zip
bash-5.1# cat install-config.yaml
apiVersion: v1
baseDomain: k8s.hnbg.elsysweyr.com
compute:
- architecture: amd64
  hyperthreading: Enabled
  name: worker
  platform: {}
  replicas: 3
controlPlane:
  architecture: amd64
  hyperthreading: Enabled
  name: master
  platform: {}
  replicas: 3
metadata:
  creationTimestamp: null
  name: prod-hnbg-public-services
networking:
  clusterNetwork:
  - cidr: 10.128.0.0/14
    hostPrefix: 23
  machineNetwork:
  - cidr: 10.200.0.0/16
  networkType: OpenShiftSDN
  serviceNetwork:
  - 172.30.0.0/16
platform:
  none: {}
publish: External
pullSecret: '*pullsecret removed*'
sshKey: |
  '*sshkeys removed*'
The command below seems to hang without returning:
[core@master01 ~]$ journalctl -b -f -u bootkube.service
This command returns no output:
[core@master01 ~]$ for pod in $(sudo podman ps -a -q); do sudo podman logs $pod; done
[core@master01 ~]$
Installation CLI log:
openshift-install --dir=ignition/ wait-for bootstrap-complete --log-level=debug
DEBUG OpenShift Installer 4.12.0-0.okd-2023-03-05-022504
DEBUG Built from commit 7c2530226516a12c37f10bc14e070f66c0f27930
INFO Waiting up to 20m0s (until 9:28PM) for the Kubernetes API at https://api.prod-hnbg-public-services.k8s.hnbg.elsysweyr.com:6443...
DEBUG Loading Agent Config...
DEBUG Still waiting for the Kubernetes API: Get "https://api.prod-hnbg-public-services.k8s.hnbg.elsysweyr.com:6443/version": EOF
INFO API v1.25.0-2655+18eadcaadf0be7-dirty up
DEBUG Loading Install Config...
DEBUG Loading SSH Key...
DEBUG Loading Base Domain...
DEBUG Loading Platform...
DEBUG Loading Cluster Name...
DEBUG Loading Base Domain...
DEBUG Loading Platform...
DEBUG Loading Networking...
DEBUG Loading Platform...
DEBUG Loading Pull Secret...
DEBUG Loading Platform...
DEBUG Using Install Config loaded from state file
INFO Waiting up to 30m0s (until 9:41PM) for bootstrapping to complete...
E0319 21:22:25.249802 6491 reflector.go:140] k8s.io/client-go/tools/watch/informerwatcher.go:146: Failed to watch *v1.ConfigMap: Get "https://api.prod-hnbg-public-services.k8s.hnbg.elsysweyr.com:6443/api/v1/namespaces/kube-system/configmaps?allowWatchBookmarks=true&fieldSelector=metadata.name%3Dbootstrap&resourceVersion=11315&timeoutSeconds=529&watch=true": http2: client connection lost - error from a previous attempt: unexpected EOF
W0319 21:23:41.366132 6491 reflector.go:347] k8s.io/client-go/tools/watch/informerwatcher.go:146: watch of *v1.ConfigMap ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
W0319 21:34:24.516370 6491 reflector.go:347] k8s.io/client-go/tools/watch/informerwatcher.go:146: watch of *v1.ConfigMap ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
W0319 21:36:11.202971 6491 reflector.go:347] k8s.io/client-go/tools/watch/informerwatcher.go:146: watch of *v1.ConfigMap ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
ERROR Cluster operator authentication Degraded is True with IngressStateEndpoints_MissingSubsets::OAuthClientsController_SyncError::OAuthServerDeployment_PreconditionNotFulfilled::OAuthServerRouteEndpointAccessibleController_SyncError::OAuthServerServiceEndpointAccessibleController_SyncError::OAuthServerServiceEndpointsEndpointAccessibleController_SyncError::WellKnownReadyController_SyncError: IngressStateEndpointsDegraded: No subsets found for the endpoints of oauth-server
ERROR OAuthClientsControllerDegraded: no ingress for host oauth-openshift.apps.prod-hnbg-public-services.k8s.hnbg.elsysweyr.com in route oauth-openshift in namespace openshift-authentication
ERROR OAuthServerDeploymentDegraded: waiting for the oauth-openshift route to contain an admitted ingress: no admitted ingress for route oauth-openshift in namespace openshift-authentication
ERROR OAuthServerDeploymentDegraded:
ERROR OAuthServerRouteEndpointAccessibleControllerDegraded: route "openshift-authentication/oauth-openshift": status does not have a valid host address
ERROR OAuthServerServiceEndpointAccessibleControllerDegraded: Get "https://172.30.103.225:443/healthz": dial tcp 172.30.103.225:443: connect: connection refused
ERROR OAuthServerServiceEndpointsEndpointAccessibleControllerDegraded: oauth service endpoints are not ready
ERROR WellKnownReadyControllerDegraded: failed to get oauth metadata from openshift-config-managed/oauth-openshift ConfigMap: configmap "oauth-openshift" not found (check authentication operator, it is supposed to create this)
ERROR Cluster operator authentication Available is False with OAuthServerDeployment_PreconditionNotFulfilled::OAuthServerServiceEndpointAccessibleController_EndpointUnavailable::OAuthServerServiceEndpointsEndpointAccessibleController_ResourceNotFound::ReadyIngressNodes_NoReadyIngressNodes::WellKnown_NotReady: OAuthServerServiceEndpointAccessibleControllerAvailable: Get "https://172.30.103.225:443/healthz": dial tcp 172.30.103.225:443: connect: connection refused
ERROR OAuthServerServiceEndpointsEndpointAccessibleControllerAvailable: endpoints "oauth-openshift" not found
ERROR ReadyIngressNodesAvailable: Authentication requires functional ingress which requires at least one schedulable and ready node. Got 0 worker nodes, 1 master nodes, 0 custom target nodes (none are schedulable or ready for ingress pods).
ERROR WellKnownAvailable: The well-known endpoint is not yet available: failed to get oauth metadata from openshift-config-managed/oauth-openshift ConfigMap: configmap "oauth-openshift" not found (check authentication operator, it is supposed to create this)
INFO Cluster operator baremetal Disabled is False with :
INFO Cluster operator cloud-controller-manager TrustedCABundleControllerControllerAvailable is True with AsExpected: Trusted CA Bundle Controller works as expected
INFO Cluster operator cloud-controller-manager TrustedCABundleControllerControllerDegraded is False with AsExpected: Trusted CA Bundle Controller works as expected
INFO Cluster operator cloud-controller-manager CloudConfigControllerAvailable is True with AsExpected: Cloud Config Controller works as expected
INFO Cluster operator cloud-controller-manager CloudConfigControllerDegraded is False with AsExpected: Cloud Config Controller works as expected
ERROR Cluster operator console Degraded is True with DefaultRouteSync_FailedAdmitDefaultRoute::RouteHealth_RouteNotAdmitted::SyncLoopRefresh_FailedIngress: DefaultRouteSyncDegraded: no ingress for host console-openshift-console.apps.prod-hnbg-public-services.k8s.hnbg.elsysweyr.com in route console in namespace openshift-console
ERROR RouteHealthDegraded: console route is not admitted
ERROR SyncLoopRefreshDegraded: no ingress for host console-openshift-console.apps.prod-hnbg-public-services.k8s.hnbg.elsysweyr.com in route console in namespace openshift-console
ERROR Cluster operator console Available is False with RouteHealth_RouteNotAdmitted: RouteHealthAvailable: console route is not admitted
INFO Cluster operator etcd RecentBackup is Unknown with ControllerStarted: The etcd backup controller is starting, and will decide if recent backups are available or if a backup is required
ERROR Cluster operator ingress Available is False with IngressUnavailable: The "default" ingress controller reports Available=False: IngressControllerUnavailable: One or more status conditions indicate unavailable: DeploymentAvailable=False (DeploymentUnavailable: The deployment has Available status condition set to False (reason: MinimumReplicasUnavailable) with message: Deployment does not have minimum availability.)
INFO Cluster operator ingress Progressing is True with Reconciling: ingresscontroller "default" is progressing: IngressControllerProgressing: One or more status conditions indicate progressing: DeploymentRollingOut=True (DeploymentRollingOut: Waiting for router deployment rollout to finish: 0 of 2 updated replica(s) are available...
INFO ).
INFO Not all ingress controllers are available.
ERROR Cluster operator ingress Degraded is True with IngressDegraded: The "default" ingress controller reports Degraded=True: DegradedConditions: One or more other status conditions indicate a degraded state: PodsScheduled=False (PodsNotScheduled: Some pods are not scheduled: Pod "router-default-756d8b77f9-qxqhm" cannot be scheduled: 0/1 nodes are available: 1 node(s) didn't match Pod's node affinity/selector, 1 node(s) had untolerated taint {node-role.kubernetes.io/master: }. preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling. Pod "router-default-756d8b77f9-g777c" cannot be scheduled: 0/1 nodes are available: 1 node(s) didn't match Pod's node affinity/selector, 1 node(s) had untolerated taint {node-role.kubernetes.io/master: }. preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling. Make sure you have sufficient worker nodes.), DeploymentAvailable=False (DeploymentUnavailable: The deployment has Available status condition set to False (reason: MinimumReplicasUnavailable) with message: Deployment does not have minimum availability.), DeploymentReplicasMinAvailable=False (DeploymentMinimumReplicasNotMet: 0/2 of replicas are available, max unavailable is 1), CanaryChecksSucceeding=Unknown (CanaryRouteNotAdmitted: Canary route is not admitted by the default ingress controller)
INFO Cluster operator ingress EvaluationConditionsDetected is False with AsExpected:
INFO Cluster operator insights ClusterTransferAvailable is False with NoClusterTransfer: no available cluster transfer
INFO Cluster operator insights Disabled is False with AsExpected:
INFO Cluster operator insights SCAAvailable is Unknown with :
ERROR Cluster operator kube-controller-manager Degraded is True with GarbageCollector_Error: GarbageCollectorDegraded: error fetching rules: Get "https://thanos-querier.openshift-monitoring.svc:9091/api/v1/rules": dial tcp: lookup thanos-querier.openshift-monitoring.svc on 172.30.0.10:53: no such host
ERROR Cluster operator monitoring Available is False with UpdatingPrometheusOperatorFailed: reconciling Prometheus Operator Admission Webhook Deployment failed: updating Deployment object failed: waiting for DeploymentRollout of openshift-monitoring/prometheus-operator-admission-webhook: got 2 unavailable replicas
ERROR Cluster operator monitoring Degraded is True with UpdatingPrometheusOperatorFailed: reconciling Prometheus Operator Admission Webhook Deployment failed: updating Deployment object failed: waiting for DeploymentRollout of openshift-monitoring/prometheus-operator-admission-webhook: got 2 unavailable replicas
INFO Cluster operator monitoring Progressing is True with RollOutInProgress: Rolling out the stack.
INFO Cluster operator network ManagementStateDegraded is False with :
INFO Cluster operator network Progressing is True with Deploying: Deployment "/openshift-network-diagnostics/network-check-source" is waiting for other operators to become ready
ERROR Cluster operator operator-lifecycle-manager-packageserver Available is False with ClusterServiceVersionNotSucceeded: ClusterServiceVersion openshift-operator-lifecycle-manager/packageserver observed in phase Failed with reason: InstallCheckFailed, message: install failed: deployment packageserver not ready before timeout: deployment "packageserver" exceeded its progress deadline
INFO Use the following commands to gather logs from the cluster
INFO openshift-install gather bootstrap --help
ERROR Bootstrap failed to complete: timed out waiting for the condition
ERROR Failed to wait for bootstrapping to complete. This error usually happens when there is a problem with control plane hosts that prevents the control plane operators from creating the control plane.
make: *** [Makefile:99: wait_bootstrap] Error 5
bash-5.1#
bash-5.1# make wait_bootstrap
openshift-install --dir=ignition/ wait-for bootstrap-complete --log-level=debug
DEBUG OpenShift Installer 4.12.0-0.okd-2023-03-05-022504
DEBUG Built from commit 7c2530226516a12c37f10bc14e070f66c0f27930
INFO Waiting up to 20m0s (until 10:02PM) for the Kubernetes API at https://api.prod-hnbg-public-services.k8s.hnbg.elsysweyr.com:6443...
DEBUG Loading Agent Config...
INFO API v1.25.0-2655+18eadcaadf0be7-dirty up
DEBUG Loading Install Config...
DEBUG Loading SSH Key...
DEBUG Loading Base Domain...
DEBUG Loading Platform...
DEBUG Loading Cluster Name...
DEBUG Loading Base Domain...
DEBUG Loading Platform...
DEBUG Loading Networking...
DEBUG Loading Platform...
DEBUG Loading Pull Secret...
DEBUG Loading Platform...
DEBUG Using Install Config loaded from state file
INFO Waiting up to 30m0s (until 10:12PM) for bootstrapping to complete...
^Cmake: *** [Makefile:99: wait_bootstrap] Interrupt
bash-5.1# make wait_bootstrap
openshift-install --dir=ignition/ wait-for bootstrap-complete --log-level=debug
DEBUG OpenShift Installer 4.12.0-0.okd-2023-03-05-022504
DEBUG Built from commit 7c2530226516a12c37f10bc14e070f66c0f27930
INFO Waiting up to 20m0s (until 10:03PM) for the Kubernetes API at https://api.prod-hnbg-public-services.k8s.hnbg.elsysweyr.com:6443...
DEBUG Loading Agent Config...
INFO API v1.25.0-2655+18eadcaadf0be7-dirty up
DEBUG Loading Install Config...
DEBUG Loading SSH Key...
DEBUG Loading Base Domain...
DEBUG Loading Platform...
DEBUG Loading Cluster Name...
DEBUG Loading Base Domain...
DEBUG Loading Platform...
DEBUG Loading Networking...
DEBUG Loading Platform...
DEBUG Loading Pull Secret...
DEBUG Loading Platform...
DEBUG Using Install Config loaded from state file
INFO Waiting up to 30m0s (until 10:13PM) for bootstrapping to complete...
Running this command on the extracted logs should give you a good overview:
[core@master01 ~]$ sudo tail -f /var/log/containers/* | grep -e "\(error\|fail\)"
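For what it's worth, the decisive line in the installer output above is `ReadyIngressNodesAvailable: ... Got 0 worker nodes, 1 master nodes`: the default router deployment only schedules onto worker (or otherwise schedulable) nodes, so with no workers the OAuth and console routes can never be admitted. A minimal sketch of that check, where the here-doc stands in for real `oc get nodes --no-headers` output from the broken state (on a live cluster, pipe the real command instead):

```shell
# Count Ready worker nodes; the here-doc mimics the state seen above:
# one master, zero workers. Columns: NAME STATUS ROLES AGE VERSION.
nodes=$(cat <<'EOF'
master01   Ready   control-plane,master   42m   v1.25.0
EOF
)
workers=$(printf '%s\n' "$nodes" | awk '$2 == "Ready" && $3 ~ /worker/' | wc -l | tr -d ' ')
echo "ready workers: $workers"
if [ "$workers" -eq 0 ]; then
  echo "no schedulable ingress nodes: router pods (and thus oauth/console routes) cannot start"
fi
```

With zero Ready workers reported, every downstream symptom in the log (no admitted ingress for `oauth-openshift`, console route not admitted, monitoring rollout stuck) follows from the router pods being unschedulable.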
elysweyr commented
INFO Waiting up to 40m0s (until 11:29AM) for the cluster at https://api.prod-hnbg-public-services.k8s.hnbg.elsysweyr.com:6443 to initialize...
DEBUG Still waiting for the cluster to initialize: Multiple errors are preventing progress:
DEBUG * Could not update imagestream "openshift/driver-toolkit" (553 of 810): the server is down or not responding
DEBUG * Could not update oauthclient "console" (499 of 810): the server does not recognize this resource, check extension API servers
DEBUG * Could not update role "openshift-console-operator/prometheus-k8s" (730 of 810): resource may have been deleted
DEBUG * Could not update role "openshift-console/prometheus-k8s" (733 of 810): resource may have been deleted
DEBUG Still waiting for the cluster to initialize: Working towards 4.11.0-0.okd-2022-10-15-073651
DEBUG Still waiting for the cluster to initialize: Working towards 4.11.0-0.okd-2022-10-15-073651: 1 of 810 done (0% complete)
DEBUG Still waiting for the cluster to initialize: Working towards 4.11.0-0.okd-2022-10-15-073651: 25 of 810 done (3% complete)
DEBUG Still waiting for the cluster to initialize: Working towards 4.11.0-0.okd-2022-10-15-073651: 53 of 810 done (6% complete)
DEBUG Still waiting for the cluster to initialize: Working towards 4.11.0-0.okd-2022-10-15-073651: 54 of 810 done (6% complete)
DEBUG Still waiting for the cluster to initialize: Working towards 4.11.0-0.okd-2022-10-15-073651: 582 of 810 done (71% complete)
DEBUG Still waiting for the cluster to initialize: Working towards 4.11.0-0.okd-2022-10-15-073651: 583 of 810 done (71% complete)
DEBUG Still waiting for the cluster to initialize: Working towards 4.11.0-0.okd-2022-10-15-073651: 784 of 810 done (96% complete)
DEBUG Still waiting for the cluster to initialize: Some cluster operators are still updating: authentication, console, ingress, insights, monitoring
DEBUG Still waiting for the cluster to initialize: Working towards 4.11.0-0.okd-2022-10-15-073651: 786 of 810 done (97% complete
Once you realize that you need to provision the correct number of nodes with Terraform, it starts to work. :D
Embarrassing, but I'm happy it's working! Thanks for your great work!
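For anyone hitting the same wall: this mismatch is easy to catch before bootstrap by comparing the `replicas` declared in install-config.yaml against the machines Terraform actually creates. A rough sketch of the first half of that check, with a here-doc standing in for the real file and a naive awk parse that assumes the standard install-config layout (not a general YAML parser):

```shell
# Extract declared replica counts from an install-config.yaml.
# The here-doc is a stand-in for the real file on disk.
cat > /tmp/install-config-sample.yaml <<'EOF'
compute:
- name: worker
  replicas: 3
controlPlane:
  name: master
  replicas: 3
EOF
workers=$(awk '/^compute:/{c=1} /^controlPlane:/{c=0} c && /replicas:/{print $2; exit}' /tmp/install-config-sample.yaml)
masters=$(awk '/^controlPlane:/{m=1} m && /replicas:/{print $2; exit}' /tmp/install-config-sample.yaml)
echo "install-config expects: $masters masters, $workers workers"
# Then compare against what Terraform provisioned, e.g.:
#   terraform state list | grep -c <your-node-resource>   (resource names vary per setup)
```

If the provisioned machine count falls short of these numbers (here, zero workers instead of three), bootstrap stalls exactly as shown above.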