slauger/hcloud-okd4

master bootstrapping failing - no admitted ingress for route

Closed this issue · 2 comments

Hello,

it seems like the installation process is failing at the master bootstrapping step.
Any idea what could cause a master node to never complete bootstrapping? (I left it running for more than a day and it still wasn't "finished".)

I'd appreciate any help!

The container logs of the master node are attached:
master-container-logs.zip

bash-5.1# cat install-config.yaml
apiVersion: v1
baseDomain: k8s.hnbg.elsysweyr.com
compute:
- architecture: amd64
  hyperthreading: Enabled
  name: worker
  platform: {}
  replicas: 3
controlPlane:
  architecture: amd64
  hyperthreading: Enabled
  name: master
  platform: {}
  replicas: 3
metadata:
  creationTimestamp: null
  name: prod-hnbg-public-services
networking:
  clusterNetwork:
  - cidr: 10.128.0.0/14
    hostPrefix: 23
  machineNetwork:
  - cidr: 10.200.0.0/16
  networkType: OpenShiftSDN
  serviceNetwork:
  - 172.30.0.0/16
platform:
  none: {}
publish: External
pullSecret: '*pullsecret removed*'
sshKey: |
 '*sshkeys removed*' 
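
As a side note, the installer validates install-config.yaml while rendering assets, so configuration errors can be surfaced early; a minimal sketch (note that this step consumes the install-config.yaml inside the asset directory):

bash-5.1# openshift-install create manifests --dir=ignition/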

The command below seems to hang and never return:

[core@master01 ~]$ journalctl -b -f -u bootkube.service
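
(Note that the -f flag follows the journal, so blocking here is expected. Also, bootkube.service normally runs on the bootstrap host rather than on the masters, so on master01 there may simply be nothing to show. A one-shot dump without following, as a minimal sketch:)

[core@master01 ~]$ journalctl -b -u bootkube.service --no-pager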

This command returns no output:

[core@master01 ~]$ for pod in $(sudo podman ps -a -q); do sudo podman logs $pod; done
[core@master01 ~]$ 
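
(If the loop prints nothing, it may simply be that no containers have been created yet; listing them first makes that visible, as a sketch:)

[core@master01 ~]$ sudo podman ps -a --format '{{.ID}} {{.Names}} {{.Status}}'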

Installation CLI log:

openshift-install --dir=ignition/ wait-for bootstrap-complete --log-level=debug
DEBUG OpenShift Installer 4.12.0-0.okd-2023-03-05-022504 
DEBUG Built from commit 7c2530226516a12c37f10bc14e070f66c0f27930 
INFO Waiting up to 20m0s (until 9:28PM) for the Kubernetes API at https://api.prod-hnbg-public-services.k8s.hnbg.elsysweyr.com:6443... 
DEBUG Loading Agent Config...                      
DEBUG Still waiting for the Kubernetes API: Get "https://api.prod-hnbg-public-services.k8s.hnbg.elsysweyr.com:6443/version": EOF 
INFO API v1.25.0-2655+18eadcaadf0be7-dirty up     
DEBUG Loading Install Config...                    
DEBUG   Loading SSH Key...                         
DEBUG   Loading Base Domain...                     
DEBUG     Loading Platform...                      
DEBUG   Loading Cluster Name...                    
DEBUG     Loading Base Domain...                   
DEBUG     Loading Platform...                      
DEBUG   Loading Networking...                      
DEBUG     Loading Platform...                      
DEBUG   Loading Pull Secret...                     
DEBUG   Loading Platform...                        
DEBUG Using Install Config loaded from state file  
INFO Waiting up to 30m0s (until 9:41PM) for bootstrapping to complete... 

E0319 21:22:25.249802    6491 reflector.go:140] k8s.io/client-go/tools/watch/informerwatcher.go:146: Failed to watch *v1.ConfigMap: Get "https://api.prod-hnbg-public-services.k8s.hnbg.elsysweyr.com:6443/api/v1/namespaces/kube-system/configmaps?allowWatchBookmarks=true&fieldSelector=metadata.name%3Dbootstrap&resourceVersion=11315&timeoutSeconds=529&watch=true": http2: client connection lost - error from a previous attempt: unexpected EOF
W0319 21:23:41.366132    6491 reflector.go:347] k8s.io/client-go/tools/watch/informerwatcher.go:146: watch of *v1.ConfigMap ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding

W0319 21:34:24.516370    6491 reflector.go:347] k8s.io/client-go/tools/watch/informerwatcher.go:146: watch of *v1.ConfigMap ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
W0319 21:36:11.202971    6491 reflector.go:347] k8s.io/client-go/tools/watch/informerwatcher.go:146: watch of *v1.ConfigMap ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
ERROR Cluster operator authentication Degraded is True with IngressStateEndpoints_MissingSubsets::OAuthClientsController_SyncError::OAuthServerDeployment_PreconditionNotFulfilled::OAuthServerRouteEndpointAccessibleController_SyncError::OAuthServerServiceEndpointAccessibleController_SyncError::OAuthServerServiceEndpointsEndpointAccessibleController_SyncError::WellKnownReadyController_SyncError: IngressStateEndpointsDegraded: No subsets found for the endpoints of oauth-server 
ERROR OAuthClientsControllerDegraded: no ingress for host oauth-openshift.apps.prod-hnbg-public-services.k8s.hnbg.elsysweyr.com in route oauth-openshift in namespace openshift-authentication 
ERROR OAuthServerDeploymentDegraded: waiting for the oauth-openshift route to contain an admitted ingress: no admitted ingress for route oauth-openshift in namespace openshift-authentication 
ERROR OAuthServerDeploymentDegraded:               
ERROR OAuthServerRouteEndpointAccessibleControllerDegraded: route "openshift-authentication/oauth-openshift": status does not have a valid host address 
ERROR OAuthServerServiceEndpointAccessibleControllerDegraded: Get "https://172.30.103.225:443/healthz": dial tcp 172.30.103.225:443: connect: connection refused 
ERROR OAuthServerServiceEndpointsEndpointAccessibleControllerDegraded: oauth service endpoints are not ready 
ERROR WellKnownReadyControllerDegraded: failed to get oauth metadata from openshift-config-managed/oauth-openshift ConfigMap: configmap "oauth-openshift" not found (check authentication operator, it is supposed to create this) 
ERROR Cluster operator authentication Available is False with OAuthServerDeployment_PreconditionNotFulfilled::OAuthServerServiceEndpointAccessibleController_EndpointUnavailable::OAuthServerServiceEndpointsEndpointAccessibleController_ResourceNotFound::ReadyIngressNodes_NoReadyIngressNodes::WellKnown_NotReady: OAuthServerServiceEndpointAccessibleControllerAvailable: Get "https://172.30.103.225:443/healthz": dial tcp 172.30.103.225:443: connect: connection refused 
ERROR OAuthServerServiceEndpointsEndpointAccessibleControllerAvailable: endpoints "oauth-openshift" not found 
ERROR ReadyIngressNodesAvailable: Authentication requires functional ingress which requires at least one schedulable and ready node. Got 0 worker nodes, 1 master nodes, 0 custom target nodes (none are schedulable or ready for ingress pods). 
ERROR WellKnownAvailable: The well-known endpoint is not yet available: failed to get oauth metadata from openshift-config-managed/oauth-openshift ConfigMap: configmap "oauth-openshift" not found (check authentication operator, it is supposed to create this) 
INFO Cluster operator baremetal Disabled is False with :  
INFO Cluster operator cloud-controller-manager TrustedCABundleControllerControllerAvailable is True with AsExpected: Trusted CA Bundle Controller works as expected 
INFO Cluster operator cloud-controller-manager TrustedCABundleControllerControllerDegraded is False with AsExpected: Trusted CA Bundle Controller works as expected 
INFO Cluster operator cloud-controller-manager CloudConfigControllerAvailable is True with AsExpected: Cloud Config Controller works as expected 
INFO Cluster operator cloud-controller-manager CloudConfigControllerDegraded is False with AsExpected: Cloud Config Controller works as expected 
ERROR Cluster operator console Degraded is True with DefaultRouteSync_FailedAdmitDefaultRoute::RouteHealth_RouteNotAdmitted::SyncLoopRefresh_FailedIngress: DefaultRouteSyncDegraded: no ingress for host console-openshift-console.apps.prod-hnbg-public-services.k8s.hnbg.elsysweyr.com in route console in namespace openshift-console 
ERROR RouteHealthDegraded: console route is not admitted 
ERROR SyncLoopRefreshDegraded: no ingress for host console-openshift-console.apps.prod-hnbg-public-services.k8s.hnbg.elsysweyr.com in route console in namespace openshift-console 
ERROR Cluster operator console Available is False with RouteHealth_RouteNotAdmitted: RouteHealthAvailable: console route is not admitted 
INFO Cluster operator etcd RecentBackup is Unknown with ControllerStarted: The etcd backup controller is starting, and will decide if recent backups are available or if a backup is required 
ERROR Cluster operator ingress Available is False with IngressUnavailable: The "default" ingress controller reports Available=False: IngressControllerUnavailable: One or more status conditions indicate unavailable: DeploymentAvailable=False (DeploymentUnavailable: The deployment has Available status condition set to False (reason: MinimumReplicasUnavailable) with message: Deployment does not have minimum availability.) 
INFO Cluster operator ingress Progressing is True with Reconciling: ingresscontroller "default" is progressing: IngressControllerProgressing: One or more status conditions indicate progressing: DeploymentRollingOut=True (DeploymentRollingOut: Waiting for router deployment rollout to finish: 0 of 2 updated replica(s) are available... 
INFO ).                                           
INFO Not all ingress controllers are available.   
ERROR Cluster operator ingress Degraded is True with IngressDegraded: The "default" ingress controller reports Degraded=True: DegradedConditions: One or more other status conditions indicate a degraded state: PodsScheduled=False (PodsNotScheduled: Some pods are not scheduled: Pod "router-default-756d8b77f9-qxqhm" cannot be scheduled: 0/1 nodes are available: 1 node(s) didn't match Pod's node affinity/selector, 1 node(s) had untolerated taint {node-role.kubernetes.io/master: }. preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling. Pod "router-default-756d8b77f9-g777c" cannot be scheduled: 0/1 nodes are available: 1 node(s) didn't match Pod's node affinity/selector, 1 node(s) had untolerated taint {node-role.kubernetes.io/master: }. preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling. Make sure you have sufficient worker nodes.), DeploymentAvailable=False (DeploymentUnavailable: The deployment has Available status condition set to False (reason: MinimumReplicasUnavailable) with message: Deployment does not have minimum availability.), DeploymentReplicasMinAvailable=False (DeploymentMinimumReplicasNotMet: 0/2 of replicas are available, max unavailable is 1), CanaryChecksSucceeding=Unknown (CanaryRouteNotAdmitted: Canary route is not admitted by the default ingress controller) 
INFO Cluster operator ingress EvaluationConditionsDetected is False with AsExpected:  
INFO Cluster operator insights ClusterTransferAvailable is False with NoClusterTransfer: no available cluster transfer 
INFO Cluster operator insights Disabled is False with AsExpected:  
INFO Cluster operator insights SCAAvailable is Unknown with :  
ERROR Cluster operator kube-controller-manager Degraded is True with GarbageCollector_Error: GarbageCollectorDegraded: error fetching rules: Get "https://thanos-querier.openshift-monitoring.svc:9091/api/v1/rules": dial tcp: lookup thanos-querier.openshift-monitoring.svc on 172.30.0.10:53: no such host 
ERROR Cluster operator monitoring Available is False with UpdatingPrometheusOperatorFailed: reconciling Prometheus Operator Admission Webhook Deployment failed: updating Deployment object failed: waiting for DeploymentRollout of openshift-monitoring/prometheus-operator-admission-webhook: got 2 unavailable replicas 
ERROR Cluster operator monitoring Degraded is True with UpdatingPrometheusOperatorFailed: reconciling Prometheus Operator Admission Webhook Deployment failed: updating Deployment object failed: waiting for DeploymentRollout of openshift-monitoring/prometheus-operator-admission-webhook: got 2 unavailable replicas 
INFO Cluster operator monitoring Progressing is True with RollOutInProgress: Rolling out the stack. 
INFO Cluster operator network ManagementStateDegraded is False with :  
INFO Cluster operator network Progressing is True with Deploying: Deployment "/openshift-network-diagnostics/network-check-source" is waiting for other operators to become ready 
ERROR Cluster operator operator-lifecycle-manager-packageserver Available is False with ClusterServiceVersionNotSucceeded: ClusterServiceVersion openshift-operator-lifecycle-manager/packageserver observed in phase Failed with reason: InstallCheckFailed, message: install failed: deployment packageserver not ready before timeout: deployment "packageserver" exceeded its progress deadline 
INFO Use the following commands to gather logs from the cluster 
INFO openshift-install gather bootstrap --help    
ERROR Bootstrap failed to complete: timed out waiting for the condition 
ERROR Failed to wait for bootstrapping to complete. This error usually happens when there is a problem with control plane hosts that prevents the control plane operators from creating the control plane. 
make: *** [Makefile:99: wait_bootstrap] Error 5
bash-5.1# 
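The installer's hint above is the supported way to collect logs from the bootstrap and control plane hosts; a sketch for a platform "none" install like this one (the IP addresses are placeholders):

bash-5.1# openshift-install gather bootstrap --dir=ignition/ \
    --bootstrap <bootstrap-ip> \
    --master <master-ip>
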
bash-5.1# make wait_bootstrap
openshift-install --dir=ignition/ wait-for bootstrap-complete --log-level=debug
DEBUG OpenShift Installer 4.12.0-0.okd-2023-03-05-022504 
DEBUG Built from commit 7c2530226516a12c37f10bc14e070f66c0f27930 
INFO Waiting up to 20m0s (until 10:02PM) for the Kubernetes API at https://api.prod-hnbg-public-services.k8s.hnbg.elsysweyr.com:6443... 
DEBUG Loading Agent Config...                      
INFO API v1.25.0-2655+18eadcaadf0be7-dirty up     
DEBUG Loading Install Config...                    
DEBUG   Loading SSH Key...                         
DEBUG   Loading Base Domain...                     
DEBUG     Loading Platform...                      
DEBUG   Loading Cluster Name...                    
DEBUG     Loading Base Domain...                   
DEBUG     Loading Platform...                      
DEBUG   Loading Networking...                      
DEBUG     Loading Platform...                      
DEBUG   Loading Pull Secret...                     
DEBUG   Loading Platform...                        
DEBUG Using Install Config loaded from state file  
INFO Waiting up to 30m0s (until 10:12PM) for bootstrapping to complete... 
^Cmake: *** [Makefile:99: wait_bootstrap] Interrupt

bash-5.1# make wait_bootstrap
openshift-install --dir=ignition/ wait-for bootstrap-complete --log-level=debug
DEBUG OpenShift Installer 4.12.0-0.okd-2023-03-05-022504 
DEBUG Built from commit 7c2530226516a12c37f10bc14e070f66c0f27930 
INFO Waiting up to 20m0s (until 10:03PM) for the Kubernetes API at https://api.prod-hnbg-public-services.k8s.hnbg.elsysweyr.com:6443... 
DEBUG Loading Agent Config...                      
INFO API v1.25.0-2655+18eadcaadf0be7-dirty up     
DEBUG Loading Install Config...                    
DEBUG   Loading SSH Key...                         
DEBUG   Loading Base Domain...                     
DEBUG     Loading Platform...                      
DEBUG   Loading Cluster Name...                    
DEBUG     Loading Base Domain...                   
DEBUG     Loading Platform...                      
DEBUG   Loading Networking...                      
DEBUG     Loading Platform...                      
DEBUG   Loading Pull Secret...                     
DEBUG   Loading Platform...                        
DEBUG Using Install Config loaded from state file  
INFO Waiting up to 30m0s (until 10:13PM) for bootstrapping to complete... 

Running this command against the extracted logs should give you a good overview:

[core@master01 ~]$ sudo tail -f /var/log/containers/* | grep -e "\(error\|fail\)"
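
(A case-insensitive variant using extended regular expressions, as a sketch:)

[core@master01 ~]$ sudo tail -f /var/log/containers/* | grep -iE 'error|fail'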
INFO Waiting up to 40m0s (until 11:29AM) for the cluster at https://api.prod-hnbg-public-services.k8s.hnbg.elsysweyr.com:6443 to initialize... 
DEBUG Still waiting for the cluster to initialize: Multiple errors are preventing progress: 
DEBUG * Could not update imagestream "openshift/driver-toolkit" (553 of 810): the server is down or not responding 
DEBUG * Could not update oauthclient "console" (499 of 810): the server does not recognize this resource, check extension API servers 
DEBUG * Could not update role "openshift-console-operator/prometheus-k8s" (730 of 810): resource may have been deleted 
DEBUG * Could not update role "openshift-console/prometheus-k8s" (733 of 810): resource may have been deleted 
DEBUG Still waiting for the cluster to initialize: Working towards 4.11.0-0.okd-2022-10-15-073651 
DEBUG Still waiting for the cluster to initialize: Working towards 4.11.0-0.okd-2022-10-15-073651: 1 of 810 done (0% complete) 
DEBUG Still waiting for the cluster to initialize: Working towards 4.11.0-0.okd-2022-10-15-073651: 25 of 810 done (3% complete) 
DEBUG Still waiting for the cluster to initialize: Working towards 4.11.0-0.okd-2022-10-15-073651: 53 of 810 done (6% complete) 
DEBUG Still waiting for the cluster to initialize: Working towards 4.11.0-0.okd-2022-10-15-073651: 54 of 810 done (6% complete) 
DEBUG Still waiting for the cluster to initialize: Working towards 4.11.0-0.okd-2022-10-15-073651: 582 of 810 done (71% complete) 
DEBUG Still waiting for the cluster to initialize: Working towards 4.11.0-0.okd-2022-10-15-073651: 583 of 810 done (71% complete) 
DEBUG Still waiting for the cluster to initialize: Working towards 4.11.0-0.okd-2022-10-15-073651: 784 of 810 done (96% complete) 

DEBUG Still waiting for the cluster to initialize: Some cluster operators are still updating: authentication, console, ingress, insights, monitoring 
DEBUG Still waiting for the cluster to initialize: Working towards 4.11.0-0.okd-2022-10-15-073651: 786 of 810 done (97% complete

Once you realize that you need to provision the correct number of nodes using Terraform, it starts to work. :D
Embarrassing, but I'm happy it's working! Thanks for your great work!
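
For anyone hitting the same symptom: the ReadyIngressNodesAvailable error above ("Got 0 worker nodes, 1 master nodes") already points at the missing machines. A quick sanity check that the nodes Terraform created match the replicas in install-config.yaml, as a sketch (assuming the installer's kubeconfig under ignition/auth/):

bash-5.1# export KUBECONFIG=ignition/auth/kubeconfig
bash-5.1# oc get nodes -l node-role.kubernetes.io/master --no-headers | wc -l   # should print 3
bash-5.1# oc get nodes -l node-role.kubernetes.io/worker --no-headers | wc -l   # should print 3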

Hi @elysweyr,

good to hear that you could solve the issue by yourself. I haven't had much time for this project lately, so it's great to know that everything still works with OKD 4.11. 👍