FAILED - RETRYING: Wait for control plane pods to appear
Description
I'm trying a new installation of the master branch and v3.10.0; it fails when installing the master with:
Control plane install failed.
Version
Ansible: ansible 2.6.2
openshift_release=v3.10.0
openshift_image_tag=v3.10.0
openshift_pkg_version=-3.10.0-1.el7.git.0.0c4577e
RPM: package openshift-ansible is not installed
Steps To Reproduce
Follow all prerequisites
git clone https://github.com/openshift/openshift-ansible
cd openshift-ansible
ansible-playbook playbooks/prerequisites.yml
ansible-playbook playbooks/deploy_cluster.yml
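If the summary alone is not enough to pinpoint the failure, re-running the failing playbook with extra verbosity (a standard ansible-playbook flag) usually surfaces the underlying oc or docker error:
$ ansible-playbook -vvv playbooks/deploy_cluster.yml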
TASK [openshift_control_plane : Wait for control plane pods to appear] ************************************************************************************************************************************************************
Tuesday 14 August 2018 16:39:24 +0800 (0:00:00.086) 0:22:42.301 ********
FAILED - RETRYING: Wait for control plane pods to appear (60 retries left).
FAILED - RETRYING: Wait for control plane pods to appear (59 retries left).
...............
FAILED - RETRYING: Wait for control plane pods to appear (1 retries left).
failed: [10.10.244.212] (item=__omit_place_holder__5e245b7f796113e2f9ba55e6c4a882ef0471a251) => {"attempts": 60, "changed": false, "item": "__omit_place_holder__5e245b7f796113e2f9ba55e6c4a882ef0471a251", "msg": {"cmd": "/bin/oc get pod master-__omit_place_holder__5e245b7f796113e2f9ba55e6c4a882ef0471a251-10.10.244.212 -o json -n kube-system", "results": [{}], "returncode": 1, "stderr": "The connection to the server 10.10.244.212:8443 was refused - did you specify the right host or port?\n", "stdout": ""}}
journalctl -flu docker.service
Aug 14 16:46:15 10-10-244-212 dockerd-current[26428]: F0814 08:46:15.849128 1 start_api.go:68] could not load config file "/etc/origin/master/master-config.yaml" due to an error: error reading config: only encoded map or array can be decoded into a struct
Aug 14 16:46:15 10-10-244-212 dockerd-current[26428]: time="2018-08-14T16:46:15.911550511+08:00" level=error msg="containerd: deleting container" error="exit status 1: \"container 30483e504b05f46127fb81b73dab375fb5429096535b0611a07bcdae7505a25c does not exist\\none or more of the container deletions failed\\n\""
Aug 14 16:46:15 10-10-244-212 dockerd-current[26428]: time="2018-08-14T16:46:15.924794132+08:00" level=warning msg="30483e504b05f46127fb81b73dab375fb5429096535b0611a07bcdae7505a25c cleanup: failed to unmount secrets: invalid argument"
Aug 14 16:46:18 10-10-244-212 dockerd-current[26428]: time="2018-08-14T16:46:18.576959765+08:00" level=warning msg="Unknown healthcheck type 'NONE' (expected 'CMD') in container 75cfd311e6f33a696b4935b380294e3f6158723a9352357f8aaff3b9da14d31f"
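The fatal line above points at master-config.yaml itself failing to parse, not at anything network-related. A quick way to see which top-level key was written out as a plain string instead of a mapping (a sketch, assuming PyYAML is available on the master, as it is wherever the playbooks have run):
# python -c 'import yaml; c = yaml.safe_load(open("/etc/origin/master/master-config.yaml")); print({k: type(v).__name__ for k, v in c.items()})'
A key that should be a nested mapping but prints as str is the one the API server is choking on.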
We have the same issue deploying origin-3.10 with the openshift-ansible-3.10.27 RPM installer.
@jhaohai @aland-zhang tomorrow morning after 09:00 UTC there should be a new RPM, as mentioned here. Can you please give that a go and let us know?
Between openshift-ansible-3.10.27-1 and openshift-ansible-3.10.28-1 a few fixes went in which I think touched the parsing error.
@DanyC97 Thanks, but I tried a fresh installation and this issue still exists.
The audit config from the example inventory,
openshift_master_audit_config={"enabled": "true"}
is generated as below:
auditConfig:
enabled: 'true'
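For comparison, the example inventory documents this variable with an unquoted JSON boolean; assuming that is the intended form, the minimal line would be
openshift_master_audit_config={"enabled": true}
with any additional audit keys (auditFilePath, maximumFileRetentionDays, and so on) passed the same way.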
@DanyC97 I installed openshift-ansible-3.10.28-1.git.0.9242c73.noarch.rpm and it still shows the same error:
TASK [openshift_control_plane : Report control plane errors] **********************************************************************
fatal: [10.10.244.212]: FAILED! => {"changed": false, "msg": "Control plane pods didn't come up"}
Failure summary:
1. Hosts: 10.10.244.212
Play: Configure masters
Task: Report control plane errors
Message: Control plane pods didn't come up
These Docker images have been pulled successfully:
$ docker images
REPOSITORY TAG IMAGE ID CREATED SIZE
docker.io/openshift/origin-node v3.10.0 9f1821bc44c6 35 hours ago 1.27 GB
docker.io/openshift/origin-control-plane v3.10.0 7454912cb385 35 hours ago 816 MB
docker.io/openshift/origin-pod v3.10.0 105d0a0070c3 36 hours ago 223 MB
docker.io/openshift/origin-web-console v3.10.0 7098d8b0a928 13 days ago 337 MB
registry.access.redhat.com/rhel7/etcd latest bb2f1d4dd3a7 7 weeks ago 256 MB
I downloaded https://github.com/openshift/openshift-ansible/archive/openshift-ansible-3.10.29-1.zip and it still shows the same error!
@aland-zhang I wonder if you are suffering from this issue: #9620
I didn't have issue #9620; I encountered the "Control plane pods didn't come up" error and the install stopped.
@aland-zhang The detailed logs are in the failed pods. Check the logs with:
docker ps -a
docker logs -f --tail 10 <container-id>
@DanyC97 Tailing the logs of k8s_api_master-api-10-10-244-212_kube-system_9ca23c5815da8ed1d3dca61d87e1f6ab_33:
docker logs -f --tail 10 cc799619cbe4
F0816 07:06:26.568393 1 start_api.go:68] could not load config file "/etc/origin/master/master-config.yaml" due to an error: error reading config: only encoded map or array can be decoded into a struct
Tailing the logs of k8s_controllers_master-controllers-10-10-244-212_kube-system_360556e3b47fddd253ba2104e990214f_33:
docker logs -f --tail 10 f5b3cbaba0e8
F0816 07:06:11.572360 1 start_controllers.go:67] could not load config file "/etc/origin/master/master-config.yaml" due to an error: error reading config: only encoded map or array can be decoded into a struct
cat /etc/origin/master/master-config.yaml
admissionConfig:
pluginConfig:
BuildDefaults:
configuration:
apiVersion: v1
env: []
kind: BuildDefaultsConfig
resources:
limits: {}
requests: {}
BuildOverrides:
configuration:
apiVersion: v1
kind: BuildOverridesConfig
openshift.io/ImagePolicy:
configuration:
apiVersion: v1
executionRules:
- matchImageAnnotations:
- key: images.openshift.io/deny-execution
value: 'true'
name: execution-denied
onResources:
- resource: pods
- resource: builds
reject: true
skipOnResolutionFailure: true
kind: ImagePolicyConfig
aggregatorConfig:
proxyClientInfo:
certFile: aggregator-front-proxy.crt
keyFile: aggregator-front-proxy.key
apiLevels:
- v1
apiVersion: v1
auditConfig: '{"enabled": true, "auditFilePath": "/data/log/openpaas-oscp-audit/openpaas-oscp-audit.log",
"maximumFileRetentionDays": 14, "maximumFileSizeMegabytes": 500, "maximumRetainedFiles":
5}'
authConfig:
requestHeader:
clientCA: front-proxy-ca.crt
clientCommonNames:
- aggregator-front-proxy
extraHeaderPrefixes:
- X-Remote-Extra-
groupHeaders:
- X-Remote-Group
usernameHeaders:
- X-Remote-User
controllerConfig:
election:
lockName: openshift-master-controllers
serviceServingCert:
signer:
certFile: service-signer.crt
keyFile: service-signer.key
controllers: '*'
corsAllowedOrigins:
- (?i)//127\.0\.0\.1(:|\z)
- (?i)//localhost(:|\z)
- (?i)//10\.10\.244\.212(:|\z)
- (?i)//kubernetes\.default(:|\z)
- (?i)//kubernetes\.default\.svc\.cluster\.local(:|\z)
- (?i)//kubernetes(:|\z)
- (?i)//openshift\.default\.svc(:|\z)
- (?i)//openshift\.default(:|\z)
- (?i)//172\.30\.0\.1(:|\z)
- (?i)//openshift\.default\.svc\.cluster\.local(:|\z)
- (?i)//kubernetes\.default\.svc(:|\z)
- (?i)//openshift(:|\z)
- (?i)//\*(:|\z)
dnsConfig:
bindAddress: 0.0.0.0:8053
bindNetwork: tcp4
etcdClientInfo:
ca: master.etcd-ca.crt
certFile: master.etcd-client.crt
keyFile: master.etcd-client.key
urls:
- https://10.10.234.215:2379
etcdStorageConfig:
kubernetesStoragePrefix: kubernetes.io
kubernetesStorageVersion: v1
openShiftStoragePrefix: openshift.io
openShiftStorageVersion: v1
imageConfig:
format: docker.io/openshift/origin-${component}:${version}
latest: false
imagePolicyConfig:
disableScheduledImport: true
internalRegistryHostname: docker-registry.default.svc:5000
maxImagesBulkImportedPerRepository: 3
kind: MasterConfig
kubeletClientInfo:
ca: ca-bundle.crt
certFile: master.kubelet-client.crt
keyFile: master.kubelet-client.key
port: 10250
kubernetesMasterConfig:
apiServerArguments:
feature-gates:
- PersistentLocalVolumes=true
- VolumeScheduling=true
storage-backend:
- etcd3
storage-media-type:
- application/vnd.kubernetes.protobuf
controllerArguments:
cluster-signing-cert-file:
- /etc/origin/master/ca.crt
cluster-signing-key-file:
- /etc/origin/master/ca.key
feature-gates:
- PersistentLocalVolumes=true
- VolumeScheduling=true
masterCount: 1
masterIP: 10.10.244.212
podEvictionTimeout: 5m
proxyClientInfo:
certFile: master.proxy-client.crt
keyFile: master.proxy-client.key
schedulerArguments: null
schedulerConfigFile: /etc/origin/master/scheduler.json
servicesNodePortRange: 30000-50000
servicesSubnet: 172.30.0.0/16
staticNodeNames: []
masterClients:
externalKubernetesClientConnectionOverrides:
acceptContentTypes: application/vnd.kubernetes.protobuf,application/json
burst: 400
contentType: application/vnd.kubernetes.protobuf
qps: 200
externalKubernetesKubeConfig: ''
openshiftLoopbackClientConnectionOverrides:
acceptContentTypes: application/vnd.kubernetes.protobuf,application/json
burst: 600
contentType: application/vnd.kubernetes.protobuf
qps: 300
openshiftLoopbackKubeConfig: openshift-master.kubeconfig
masterPublicURL: https://10.10.244.212:8443
networkConfig:
clusterNetworks:
- cidr: 10.128.0.0/8
hostSubnetLength: 8
externalIPNetworkCIDRs:
- 0.0.0.0/0
ingressIPNetworkCIDR: 172.46.0.0/16
networkPluginName: redhat/openshift-ovs-multitenant
serviceNetworkCIDR: 172.30.0.0/16
oauthConfig:
assetPublicURL: https://10.10.244.212:8443/console/
grantConfig:
method: auto
identityProviders:
- challenge: true
login: true
mappingMethod: claim
name: htpasswd_auth
provider:
apiVersion: v1
file: /etc/origin/master/htpasswd
kind: HTPasswdPasswordIdentityProvider
masterCA: ca-bundle.crt
masterPublicURL: https://10.10.244.212:8443
masterURL: https://10.10.244.212:8443
sessionConfig:
sessionMaxAgeSeconds: 3600
sessionName: ssn
sessionSecretsFile: /etc/origin/master/session-secrets.yaml
tokenConfig:
accessTokenMaxAgeSeconds: 86400
authorizeTokenMaxAgeSeconds: 86400
pauseControllers: false
policyConfig:
bootstrapPolicyFile: /etc/origin/master/policy.json
openshiftInfrastructureNamespace: openshift-infra
openshiftSharedResourcesNamespace: openshift
projectConfig:
defaultNodeSelector: node-role.kubernetes.io/compute=true
projectRequestMessage: ''
projectRequestTemplate: ''
securityAllocator:
mcsAllocatorRange: s0:/2
mcsLabelsPerProject: 5
uidAllocatorRange: 1000000000-1999999999/10000
routingConfig:
subdomain: svc.cluster.local
serviceAccountConfig:
limitSecretReferences: false
managedNames:
- default
- builder
- deployer
masterCA: ca-bundle.crt
privateKeyFile: serviceaccounts.private.key
publicKeyFiles:
- serviceaccounts.public.key
servingInfo:
bindAddress: 0.0.0.0:8443
bindNetwork: tcp4
certFile: master.server.crt
clientCA: ca.crt
keyFile: master.server.key
maxRequestsInFlight: 500
requestTimeoutSeconds: 3600
volumeConfig:
dynamicProvisioningEnabled: true
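Note that in the dump above auditConfig is rendered as a single quoted JSON string rather than a nested mapping, which lines up with the "only encoded map or array can be decoded into a struct" error. A sketch of the shape the API server expects, reusing the field names and values from the dump:
auditConfig:
  enabled: true
  auditFilePath: /data/log/openpaas-oscp-audit/openpaas-oscp-audit.log
  maximumFileRetentionDays: 14
  maximumFileSizeMegabytes: 500
  maximumRetainedFiles: 5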
@aland-zhang there is the same issue in Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1589063#c3
Hi All,
I noticed the same behavior with openshift-ansible-3.10.33-1.
On the master nodes the k8s_api_master-api... pod has exited, and the problem is related to missing connectivity to etcd.
[root@ocp-devmaster01 openshift-ansible]# docker logs -f --tail=10 750c9e7631eb
I0822 06:45:20.892219 1 plugins.go:84] Registered admission plugin "PodTolerationRestriction"
I0822 06:45:20.892233 1 plugins.go:84] Registered admission plugin "ResourceQuota"
I0822 06:45:20.892246 1 plugins.go:84] Registered admission plugin "PodSecurityPolicy"
I0822 06:45:20.892258 1 plugins.go:84] Registered admission plugin "Priority"
I0822 06:45:20.892273 1 plugins.go:84] Registered admission plugin "SecurityContextDeny"
I0822 06:45:20.892286 1 plugins.go:84] Registered admission plugin "ServiceAccount"
I0822 06:45:20.892305 1 plugins.go:84] Registered admission plugin "DefaultStorageClass"
I0822 06:45:20.892318 1 plugins.go:84] Registered admission plugin "PersistentVolumeClaimResize"
I0822 06:45:20.892334 1 plugins.go:84] Registered admission plugin "StorageObjectInUseProtection"
F0822 06:45:50.897544 1 start_api.go:68] dial tcp 192.168.98.208:2379: getsockopt: connection refused
The etcd container isn't running on the master nodes.
Marcello
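A quick way to confirm the etcd state on the master itself (plain docker and ss commands; 2379 is the etcd client port the API server is dialing above):
# docker ps -a | grep etcd
# ss -tlnp | grep 2379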
Nearly but not quite the same problem here. Dumping the log of the master API container shows a file is missing, causing the service to fail on startup.
# docker logs -f --tail 10 k8s_api_master-api-server7_kube-system_5946c1f644096161a1242b3de0ee5875_3085
Invalid MasterConfig /etc/origin/master/master-config.yaml
oauthConfig.identityProvider[0].provider.file: Invalid value: "/etc/origin/openshift-passwd": could not read file: stat /etc/origin/openshift-passwd: no such file or directory
#
Inspecting the logs for the master controllers container shows the same missing file as well.
# docker logs -f --tail 10 k8s_controllers_master-controllers-benchserver7_kube-system_8e879171c85e221fb0a023e3f10ca276_3084
Invalid MasterConfig /etc/origin/master/master-config.yaml
oauthConfig.identityProvider[0].provider.file: Invalid value: "/etc/origin/openshift-passwd": could not read file: stat /etc/origin/openshift-passwd: no such file or directory
#
Dropping into the master API image to double-check the missing file shows that the directory the file should be located in is missing:
# docker run -it --entrypoint /bin/bash registry.access.redhat.com/openshift3/ose-control-plane
Unable to find image 'registry.access.redhat.com/openshift3/ose-control-plane:latest' locally
Trying to pull repository registry.access.redhat.com/openshift3/ose-control-plane ...
latest: Pulling from registry.access.redhat.com/openshift3/ose-control-plane
Digest: sha256:fb05ab7dad76f91201660824a5e88d4e17989fb2ef34ce0522eafd7604cf41f0
Status: Downloaded newer image for registry.access.redhat.com/openshift3/ose-control-plane:latest
[root@0f1f30e36bfd origin]# ls /etc/origin
ls: cannot access /etc/origin: No such file or directory
[root@0f1f30e36bfd origin]#
Shouldn't the image(s) contain this directory?
# git describe
openshift-ansible-3.10.27-2-171-g481000a0b
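For what it's worth, the control-plane image is not expected to ship /etc/origin; the static pod definition bind-mounts the host's /etc/origin/master into the container. That can be checked against the exited master-api container (container ID is a placeholder):
# docker inspect --format '{{ json .Mounts }}' <container-id>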
oauthConfig.identityProvider[0].provider.file: Invalid value: "/etc/origin/openshift-passwd": could not read file: stat /etc/origin/openshift-passwd: no such file or directory
Looks like your openshift_master_identity_providers has an incorrect path to the htpasswd file. Is this file present on the host?
yes
Does it work if it's moved to /etc/origin/master?
Correction. My last comment was incorrect. The file is not present on the host.
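If the provider file is meant to live on the master, a minimal way to create it and retest (htpasswd comes from httpd-tools; user, password, and path are placeholders to match against your inventory):
# yum install -y httpd-tools
# htpasswd -c -b /etc/origin/master/htpasswd <user> <password>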
Folks, 'Wait for control plane pods to appear' failing means the API server failed to start. There might be a billion reasons for that - an unreachable pod image pullspec, wrong API server configuration, something wrong with docker, etc.
Let's not post 'I have this issue too' comments, because the same symptom doesn't mean the same issue is causing it - or that the ansible playbooks are incorrect.
Facing the same issue. I suspect it's a DNS issue: I checked before and after the config.yml playbook execution, and the DNS nameserver changes.
[root@master ~]# cat /etc/resolv.conf
# nameserver updated by /etc/NetworkManager/dispatcher.d/99-origin-dns.sh
# Generated by NetworkManager
search cluster.local example.com
nameserver 10.0.0.11
[root@master ~]# cat /etc/sysconfig/network-scripts/ifcfg-eth0|grep -i DNS
DNS1=10.0.0.1
OpenShift error:
TASK [openshift_control_plane : Report control plane errors] **************************************************************************************
fatal: [master.example.com]: FAILED! => {"changed": false, "msg": "Control plane pods didn't come up"}
fatal: [master.example.com]: FAILED! => {"changed": true, "cmd": ["oc", "get", "events", "--config=/etc/origin/master/admin.kubeconfig", "-n", "kube-system"], "delta": "0:00:00.208492", "end": "2018-10-15 16:36:02.875601", "msg": "non-zero return code", "rc": 1, "start": "2018-10-15 16:36:02.667109", "stderr": "The connection to the server master.example.com:8443 was refused - did you specify the right host or port?", "stderr_lines": ["The connection to the server master.example.com:8443 was refused - did you specify the right host or port?"], "stdout": "", "stdout_lines": []}
Please suggest a resolution.
- Make sure the hostname command on your host gives the FQDN; if not, set it with
hostnamectl set-hostname <FQDN hostname>
- Add the properties below in /etc/sysconfig/network-scripts/ifcfg-<interface_name>:
NM_CONTROLLED=yes
PEERDNS=yes
- Also, to fix upstream DNS quickly you can follow the steps below (suppose your DNS addresses are 10.99.1.2 and 10.99.1.3):
# nmcli con mod ens192 ipv4.dns 10.99.1.2,10.99.1.3
# systemctl restart NetworkManager
# systemctl restart dnsmasq
# cat /etc/dnsmasq.d/origin-upstream-dns.conf
# cat /etc/resolv.conf
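As a quick sanity check after the restarts, the node's FQDN should resolve through the local resolver:
# hostname -f
# getent hosts $(hostname -f)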
The following fixed my issue. I use a proxy in my environment; I had to add the hostnames to no_proxy.
$ cat <<EOF > /etc/environment
http_proxy=http://10.xx.xx.xx:8080
https_proxy=http://10.xx.xx.xx:8080
ftp_proxy=http://10.xx.xx.xx:8080
no_proxy=127.0.0.1,localhost,172.17.240.84,172.17.240.85,172.17.240.86,172.17.240.87,10.96.0.0/12,10.244.0.0/16,v-openshift1-lnx1,v-node01-lnx1,v-node02-lnx1,console,console.inet.co.za
EOF
$ cat <<EOF > /etc/systemd/system/docker.service.d/no-proxy.conf
[Service]
Environment="NO_PROXY=artifactory-za.devel.iress.com.au, 172.30.9.71, 172.17.240.84, 172.17.240.85, 172.17.240.86, 172.17.240.87"
Environment="HTTP_PROXY=http://10.xx.xx.xx:8080/"
Environment="HTTPS_PROXY=http://10.xx.xx.xx:8080/"
EOF
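For the docker drop-in to take effect, systemd needs a daemon-reload followed by a docker restart:
$ systemctl daemon-reload
$ systemctl restart docker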