openshift/openshift-ansible

FAILED - RETRYING: Wait for control plane pods to appear

Closed this issue · 22 comments

Description
I'm trying a new installation from the master branch with v3.10.0, and it's failing when installing the master with:

Control plane install failed.
Version
Ansible: ansible 2.6.2
openshift_release=v3.10.0
openshift_image_tag=v3.10.0
openshift_pkg_version=-3.10.0-1.el7.git.0.0c4577e
RPM: package openshift-ansible is not installed
Steps To Reproduce
Follow all prerequisites
git clone https://github.com/openshift/openshift-ansible
cd openshift-ansible
ansible-playbook playbooks/prerequisites.yml
ansible-playbook playbooks/deploy_cluster.yml

TASK [openshift_control_plane : Wait for control plane pods to appear] ************************************************************************************************************************************************************
Tuesday 14 August 2018  16:39:24 +0800 (0:00:00.086)       0:22:42.301 ******** 
FAILED - RETRYING: Wait for control plane pods to appear (60 retries left).
FAILED - RETRYING: Wait for control plane pods to appear (59 retries left).
...............
FAILED - RETRYING: Wait for control plane pods to appear (1 retries left).
failed: [10.10.244.212] (item=__omit_place_holder__5e245b7f796113e2f9ba55e6c4a882ef0471a251) => {"attempts": 60, "changed": false, "item": "__omit_place_holder__5e245b7f796113e2f9ba55e6c4a882ef0471a251", "msg": {"cmd": "/bin/oc get pod master-__omit_place_holder__5e245b7f796113e2f9ba55e6c4a882ef0471a251-10.10.244.212 -o json -n kube-system", "results": [{}], "returncode": 1, "stderr": "The connection to the server 10.10.244.212:8443 was refused - did you specify the right host or port?\n", "stdout": ""}}

journalctl -flu docker.service

Aug 14 16:46:15 10-10-244-212 dockerd-current[26428]: F0814 08:46:15.849128       1 start_api.go:68] could not load config file "/etc/origin/master/master-config.yaml" due to an error: error reading config: only encoded map or array can be decoded into a struct
Aug 14 16:46:15 10-10-244-212 dockerd-current[26428]: time="2018-08-14T16:46:15.911550511+08:00" level=error msg="containerd: deleting container" error="exit status 1: \"container 30483e504b05f46127fb81b73dab375fb5429096535b0611a07bcdae7505a25c does not exist\\none or more of the container deletions failed\\n\""
Aug 14 16:46:15 10-10-244-212 dockerd-current[26428]: time="2018-08-14T16:46:15.924794132+08:00" level=warning msg="30483e504b05f46127fb81b73dab375fb5429096535b0611a07bcdae7505a25c cleanup: failed to unmount secrets: invalid argument"
Aug 14 16:46:18 10-10-244-212 dockerd-current[26428]: time="2018-08-14T16:46:18.576959765+08:00" level=warning msg="Unknown healthcheck type 'NONE' (expected 'CMD') in container 75cfd311e6f33a696b4935b380294e3f6158723a9352357f8aaff3b9da14d31f"
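
For reference, a quick way to confirm the API server never came up on the master (just a sketch; the IP and port are the ones from the task output above):

$ curl -k https://10.10.244.212:8443/healthz          # connection refused = apiserver is not running
$ docker ps -a --filter name=k8s_api_master           # look for an Exited api container
$ docker logs --tail 20 <container-id>                # the fatal start_api.go error shows up here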

We have the same issue deploying origin-3.10 with the openshift-ansible-3.10.27 rpm installer.

@jhaohai @aland-zhang Tomorrow morning after 09:00 UTC there should be a new rpm, as mentioned here. Can you please give that a go and let us know?

Between openshift-ansible-3.10.27-1 and openshift-ansible-3.10.28-1 a few fixes went in which I think touched the parsing error.

@DanyC97 Thanks, but I tried a fresh installation and this issue still exists.
The audit config from the example inventory
openshift_master_audit_config={"enabled": "true"}
is generated as below:

auditConfig:
  enabled: 'true'

@DanyC97 I installed openshift-ansible-3.10.28-1.git.0.9242c73.noarch.rpm and it still shows the same error:

TASK [openshift_control_plane : Report control plane errors] **********************************************************************
fatal: [10.10.244.212]: FAILED! => {"changed": false, "msg": "Control plane pods didn't come up"}
Failure summary:
  1. Hosts:    10.10.244.212
     Play:     Configure masters
     Task:     Report control plane errors
     Message:  Control plane pods didn't come up

These Docker images have been pulled successfully:

$ docker images                                                                                  
REPOSITORY                                 TAG                 IMAGE ID            CREATED             SIZE
docker.io/openshift/origin-node            v3.10.0             9f1821bc44c6        35 hours ago        1.27 GB
docker.io/openshift/origin-control-plane   v3.10.0             7454912cb385        35 hours ago        816 MB
docker.io/openshift/origin-pod             v3.10.0             105d0a0070c3        36 hours ago        223 MB
docker.io/openshift/origin-web-console     v3.10.0             7098d8b0a928        13 days ago         337 MB
registry.access.redhat.com/rhel7/etcd      latest              bb2f1d4dd3a7        7 weeks ago         256 MB

I got https://github.com/openshift/openshift-ansible/archive/openshift-ansible-3.10.29-1.zip and it still shows the same error!

@aland-zhang I wonder if you are suffering from issue #9620.

I didn't hit issue #9620; I encountered the "Control plane pods didn't come up" error and the install stopped.

@aland-zhang The detailed logs are in the failed pods. Check the logs with:

docker ps -a
docker logs -f --tail 10 <container-id>

@DanyC97 Tail of the docker container k8s_api_master-api-10-10-244-212_kube-system_9ca23c5815da8ed1d3dca61d87e1f6ab_33:

 docker logs -f --tail 10 cc799619cbe4
F0816 07:06:26.568393       1 start_api.go:68] could not load config file "/etc/origin/master/master-config.yaml" due to an error: error reading config: only encoded map or array can be decoded into a struct

Tail of the docker container k8s_controllers_master-controllers-10-10-244-212_kube-system_360556e3b47fddd253ba2104e990214f_33:

docker logs -f --tail 10 f5b3cbaba0e8           
F0816 07:06:11.572360       1 start_controllers.go:67] could not load config file "/etc/origin/master/master-config.yaml" due to an error: error reading config: only encoded map or array can be decoded into a struct

cat /etc/origin/master/master-config.yaml

admissionConfig:
  pluginConfig:
    BuildDefaults:
      configuration:
        apiVersion: v1
        env: []
        kind: BuildDefaultsConfig
        resources:
          limits: {}
          requests: {}
    BuildOverrides:
      configuration:
        apiVersion: v1
        kind: BuildOverridesConfig
    openshift.io/ImagePolicy:
      configuration:
        apiVersion: v1
        executionRules:
        - matchImageAnnotations:
          - key: images.openshift.io/deny-execution
            value: 'true'
          name: execution-denied
          onResources:
          - resource: pods
          - resource: builds
          reject: true
          skipOnResolutionFailure: true
        kind: ImagePolicyConfig
aggregatorConfig:
  proxyClientInfo:
    certFile: aggregator-front-proxy.crt
    keyFile: aggregator-front-proxy.key
apiLevels:
- v1
apiVersion: v1
auditConfig: '{"enabled": true, "auditFilePath": "/data/log/openpaas-oscp-audit/openpaas-oscp-audit.log",
  "maximumFileRetentionDays": 14, "maximumFileSizeMegabytes": 500, "maximumRetainedFiles":
  5}'
authConfig:
  requestHeader:
    clientCA: front-proxy-ca.crt
    clientCommonNames:
    - aggregator-front-proxy
    extraHeaderPrefixes:
    - X-Remote-Extra-
    groupHeaders:
    - X-Remote-Group
    usernameHeaders:
    - X-Remote-User
controllerConfig:
  election:
    lockName: openshift-master-controllers
  serviceServingCert:
    signer:
      certFile: service-signer.crt
      keyFile: service-signer.key
controllers: '*'
corsAllowedOrigins:
- (?i)//127\.0\.0\.1(:|\z)
- (?i)//localhost(:|\z)
- (?i)//10\.10\.244\.212(:|\z)
- (?i)//kubernetes\.default(:|\z)
- (?i)//kubernetes\.default\.svc\.cluster\.local(:|\z)
- (?i)//kubernetes(:|\z)
- (?i)//openshift\.default\.svc(:|\z)
- (?i)//openshift\.default(:|\z)
- (?i)//172\.30\.0\.1(:|\z)
- (?i)//openshift\.default\.svc\.cluster\.local(:|\z)
- (?i)//kubernetes\.default\.svc(:|\z)
- (?i)//openshift(:|\z)
- (?i)//\*(:|\z)
dnsConfig:
  bindAddress: 0.0.0.0:8053
  bindNetwork: tcp4
etcdClientInfo:
  ca: master.etcd-ca.crt
  certFile: master.etcd-client.crt
  keyFile: master.etcd-client.key
  urls:
  - https://10.10.234.215:2379
etcdStorageConfig:
  kubernetesStoragePrefix: kubernetes.io
  kubernetesStorageVersion: v1
  openShiftStoragePrefix: openshift.io
  openShiftStorageVersion: v1
imageConfig:
  format: docker.io/openshift/origin-${component}:${version}
  latest: false
imagePolicyConfig:
  disableScheduledImport: true
  internalRegistryHostname: docker-registry.default.svc:5000
  maxImagesBulkImportedPerRepository: 3
kind: MasterConfig
kubeletClientInfo:
  ca: ca-bundle.crt
  certFile: master.kubelet-client.crt
  keyFile: master.kubelet-client.key
  port: 10250
kubernetesMasterConfig:
  apiServerArguments:
    feature-gates:
    - PersistentLocalVolumes=true
    - VolumeScheduling=true
    storage-backend:
    - etcd3
    storage-media-type:
    - application/vnd.kubernetes.protobuf
  controllerArguments:
    cluster-signing-cert-file:
    - /etc/origin/master/ca.crt
    cluster-signing-key-file:
    - /etc/origin/master/ca.key
    feature-gates:
    - PersistentLocalVolumes=true
    - VolumeScheduling=true
  masterCount: 1
  masterIP: 10.10.244.212
  podEvictionTimeout: 5m
  proxyClientInfo:
    certFile: master.proxy-client.crt
    keyFile: master.proxy-client.key
  schedulerArguments: null
  schedulerConfigFile: /etc/origin/master/scheduler.json
  servicesNodePortRange: 30000-50000
  servicesSubnet: 172.30.0.0/16
  staticNodeNames: []
masterClients:
  externalKubernetesClientConnectionOverrides:
    acceptContentTypes: application/vnd.kubernetes.protobuf,application/json
    burst: 400
    contentType: application/vnd.kubernetes.protobuf
    qps: 200
  externalKubernetesKubeConfig: ''
  openshiftLoopbackClientConnectionOverrides:
    acceptContentTypes: application/vnd.kubernetes.protobuf,application/json
    burst: 600
    contentType: application/vnd.kubernetes.protobuf
    qps: 300
  openshiftLoopbackKubeConfig: openshift-master.kubeconfig
masterPublicURL: https://10.10.244.212:8443
networkConfig:
  clusterNetworks:
  - cidr: 10.128.0.0/8
    hostSubnetLength: 8
  externalIPNetworkCIDRs:
  - 0.0.0.0/0
  ingressIPNetworkCIDR: 172.46.0.0/16
  networkPluginName: redhat/openshift-ovs-multitenant
  serviceNetworkCIDR: 172.30.0.0/16
oauthConfig:
  assetPublicURL: https://10.10.244.212:8443/console/
  grantConfig:
    method: auto
  identityProviders:
  - challenge: true
    login: true
    mappingMethod: claim
    name: htpasswd_auth
    provider:
      apiVersion: v1
      file: /etc/origin/master/htpasswd
      kind: HTPasswdPasswordIdentityProvider
  masterCA: ca-bundle.crt
  masterPublicURL: https://10.10.244.212:8443
  masterURL: https://10.10.244.212:8443
  sessionConfig:
    sessionMaxAgeSeconds: 3600
    sessionName: ssn
    sessionSecretsFile: /etc/origin/master/session-secrets.yaml
  tokenConfig:
    accessTokenMaxAgeSeconds: 86400
    authorizeTokenMaxAgeSeconds: 86400
pauseControllers: false
policyConfig:
  bootstrapPolicyFile: /etc/origin/master/policy.json
  openshiftInfrastructureNamespace: openshift-infra
  openshiftSharedResourcesNamespace: openshift
projectConfig:
  defaultNodeSelector: node-role.kubernetes.io/compute=true
  projectRequestMessage: ''
  projectRequestTemplate: ''
  securityAllocator:
    mcsAllocatorRange: s0:/2
    mcsLabelsPerProject: 5
    uidAllocatorRange: 1000000000-1999999999/10000
routingConfig:
  subdomain: svc.cluster.local
serviceAccountConfig:
  limitSecretReferences: false
  managedNames:
  - default
  - builder
  - deployer
  masterCA: ca-bundle.crt
  privateKeyFile: serviceaccounts.private.key
  publicKeyFiles:
  - serviceaccounts.public.key
servingInfo:
  bindAddress: 0.0.0.0:8443
  bindNetwork: tcp4
  certFile: master.server.crt
  clientCA: ca.crt
  keyFile: master.server.key
  maxRequestsInFlight: 500
  requestTimeoutSeconds: 3600
volumeConfig:
  dynamicProvisioningEnabled: true
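
Note the auditConfig block in the dump above: it is rendered as a single quoted JSON string instead of a YAML mapping, which matches the "only encoded map or array can be decoded into a struct" error the API server logs. A quick way to check a master for this (just a sketch):

grep -A 3 '^auditConfig' /etc/origin/master/master-config.yaml
# a quoted '{"enabled": true, ...}' string here reproduces the decode error;
# a fixed install renders auditConfig as an indented YAML mapping instead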

Duplicate of #9619

@vrutkovs cheers, so I was right in saying the PR does fix the issue ;)

Hi All,
I noticed the same behavior with the openshift-ansible-3.10.33-1.

On the master nodes the k8s_api_master-api... pod has exited, and the problem is related to missing connectivity to etcd.

[root@ocp-devmaster01 openshift-ansible]# docker logs -f --tail=10 750c9e7631eb
I0822 06:45:20.892219       1 plugins.go:84] Registered admission plugin "PodTolerationRestriction"
I0822 06:45:20.892233       1 plugins.go:84] Registered admission plugin "ResourceQuota"
I0822 06:45:20.892246       1 plugins.go:84] Registered admission plugin "PodSecurityPolicy"
I0822 06:45:20.892258       1 plugins.go:84] Registered admission plugin "Priority"
I0822 06:45:20.892273       1 plugins.go:84] Registered admission plugin "SecurityContextDeny"
I0822 06:45:20.892286       1 plugins.go:84] Registered admission plugin "ServiceAccount"
I0822 06:45:20.892305       1 plugins.go:84] Registered admission plugin "DefaultStorageClass"
I0822 06:45:20.892318       1 plugins.go:84] Registered admission plugin "PersistentVolumeClaimResize"
I0822 06:45:20.892334       1 plugins.go:84] Registered admission plugin "StorageObjectInUseProtection"
F0822 06:45:50.897544       1 start_api.go:68] dial tcp 192.168.98.208:2379: getsockopt: connection refused

The etcd container isn't running on the master nodes.
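
A quick way to confirm etcd itself is unreachable (a sketch; the cert paths assume the default /etc/origin/master locations shown in the master-config earlier in this thread, and the IP is the one from the log above):

curl --cacert /etc/origin/master/master.etcd-ca.crt \
     --cert /etc/origin/master/master.etcd-client.crt \
     --key /etc/origin/master/master.etcd-client.key \
     https://192.168.98.208:2379/health
docker ps -a | grep etcd        # check whether the etcd container exists at all / has exited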

Marcello

Nearly but not quite the same problem here. Dumping the log of the master API container shows a file is missing, causing the service to fail on startup.

# docker logs -f --tail 10 k8s_api_master-api-server7_kube-system_5946c1f644096161a1242b3de0ee5875_3085
Invalid MasterConfig /etc/origin/master/master-config.yaml
  oauthConfig.identityProvider[0].provider.file: Invalid value: "/etc/origin/openshift-passwd": could not read file: stat /etc/origin/openshift-passwd: no such file or directory
#

Inspecting the logs for the master controller image shows the same missing file as well.

# docker logs -f --tail 10 k8s_controllers_master-controllers-benchserver7_kube-system_8e879171c85e221fb0a023e3f10ca276_3084
Invalid MasterConfig /etc/origin/master/master-config.yaml
  oauthConfig.identityProvider[0].provider.file: Invalid value: "/etc/origin/openshift-passwd": could not read file: stat /etc/origin/openshift-passwd: no such file or directory
#

Dropping into the master API image to double-check the missing file shows that the directory the file should be located in is missing:

# docker run -it --entrypoint /bin/bash  registry.access.redhat.com/openshift3/ose-control-plane
Unable to find image 'registry.access.redhat.com/openshift3/ose-control-plane:latest' locally
Trying to pull repository registry.access.redhat.com/openshift3/ose-control-plane ... 
latest: Pulling from registry.access.redhat.com/openshift3/ose-control-plane
Digest: sha256:fb05ab7dad76f91201660824a5e88d4e17989fb2ef34ce0522eafd7604cf41f0
Status: Downloaded newer image for registry.access.redhat.com/openshift3/ose-control-plane:latest
[root@0f1f30e36bfd origin]# ls /etc/origin
ls: cannot access /etc/origin: No such file or directory
[root@0f1f30e36bfd origin]#

Shouldn't the image(s) contain this directory?

# git describe
openshift-ansible-3.10.27-2-171-g481000a0b
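
As the ls above shows, the image itself doesn't ship /etc/origin; as far as I understand, the config is supplied through host paths mounted into the static pods, so the htpasswd file has to live on the host under a mounted directory. Which host paths the api container actually mounts can be checked with something like (a sketch):

docker inspect --format '{{ json .Mounts }}' <api-container-id>
# /etc/origin/master should show up as a bind mount; files outside the mounted
# directories are not visible inside the pod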

oauthConfig.identityProvider[0].provider.file: Invalid value: "/etc/origin/openshift-passwd": could not read file: stat /etc/origin/openshift-passwd: no such file or directory

Looks like your openshift_master_identity_providers has an incorrect path to the htpasswd file. Is this file present on the host?

Looks like your openshift_master_identity_providers has an incorrect path to the htpasswd file. Is this file present on the host?

yes

Does it work if it's moved to /etc/origin/master?
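
In case it helps, a minimal sketch of that (paths taken from the error above; the provider.file entry in openshift_master_identity_providers would need to point at the new location too):

ls -l /etc/origin/openshift-passwd                     # confirm the file actually exists on the host
mv /etc/origin/openshift-passwd /etc/origin/master/    # /etc/origin/master is where the master pods read their config from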

Correction. My last comment was incorrect. The file is not present on the host.

Folks, 'Wait for control plane pods to appear' failing means the API server failed to start. There might be a billion reasons for that: an unreachable pod image pullspec, a wrong API server configuration, something wrong with Docker, etc.

Let's not post 'I have this issue too' comments, because the same symptom doesn't mean the same issue is causing it, or that the Ansible playbooks are incorrect.

@vrutkovs You're absolutely right. Sorry.

Facing the same issue. I suspect it's a DNS issue: I checked before and after the config.yml script execution and the DNS nameserver configuration changes...

[root@master ~]# cat /etc/resolv.conf

# nameserver updated by /etc/NetworkManager/dispatcher.d/99-origin-dns.sh
# Generated by NetworkManager
search cluster.local example.com
nameserver 10.0.0.11
[root@master ~]# cat /etc/sysconfig/network-scripts/ifcfg-eth0|grep -i DNS
DNS1=10.0.0.1

OpenShift error:
TASK [openshift_control_plane : Report control plane errors] **************************************************************************************
fatal: [master.example.com]: FAILED! => {"changed": false, "msg": "Control plane pods didn't come up"}

fatal: [master.example.com]: FAILED! => {"changed": true, "cmd": ["oc", "get", "events", "--config=/etc/origin/master/admin.kubeconfig", "-n", "kube-system"], "delta": "0:00:00.208492", "end": "2018-10-15 16:36:02.875601", "msg": "non-zero return code", "rc": 1, "start": "2018-10-15 16:36:02.667109", "stderr": "The connection to the server master.example.com:8443 was refused - did you specify the right host or port?", "stderr_lines": ["The connection to the server master.example.com:8443 was refused - did you specify the right host or port?"], "stdout": "", "stdout_lines": []}

Please suggest any resolution.

1. Make sure the hostname command on your host gives an FQDN; if not, set it with hostnamectl set-hostname <FQDN hostname>

Add the properties below in /etc/sysconfig/network-scripts/ifcfg-<interface_name>:

NM_CONTROLLED=yes
PEERDNS=yes

Also, to fix upstream DNS quickly, you can run the following, supposing your DNS addresses are 10.99.1.2 and 10.99.1.3:

# nmcli con mod ens192 ipv4.dns 10.99.1.2,10.99.1.3
# systemctl restart NetworkManager
# systemctl restart dnsmasq
# cat /etc/dnsmasq.d/origin-upstream-dns.conf
# cat /etc/resolv.conf
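
After that, a quick check that the node resolves its own FQDN through the local dnsmasq set up by 99-origin-dns.sh (a sketch; dig comes from bind-utils, getent is the fallback if it isn't installed):

# hostname -f
# dig +short $(hostname -f)
# getent hosts $(hostname -f)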

The below fixed my issue. I use a proxy in my environment, and I had to add the hostnames to no_proxy.

$ cat <<EOF > /etc/environment
http_proxy=http://10.xx.xx.xx:8080
https_proxy=http://10.xx.xx.xx:8080
ftp_proxy=http://10.xx.xx.xx:8080
no_proxy=127.0.0.1,localhost,172.17.240.84,172.17.240.85,172.17.240.86,172.17.240.87,10.96.0.0/12,10.244.0.0/16,v-openshift1-lnx1,v-node01-lnx1,v-node02-lnx1,console,console.inet.co.za
EOF

$ cat <<EOF > /etc/systemd/system/docker.service.d/no-proxy.conf
[Service]
Environment="NO_PROXY=artifactory-za.devel.iress.com.au, 172.30.9.71, 172.17.240.84, 172.17.240.85, 172.17.240.86, 172.17.240.87"
Environment="HTTP_PROXY=http://10.xx.xx.xx:8080/"
Environment="HTTPS_PROXY=http://10.xx.xx.xx:8080/"
EOF
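
One thing worth adding: a systemd drop-in like the no-proxy.conf above only takes effect after a daemon reload and a docker restart, something like:

$ systemctl daemon-reload
$ systemctl restart docker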