openshift/openshift-ansible

OKD 3.11 install failure when Approving Node Certificates


Description

On a multi-master install of OKD 3.11 ansible fails with message
Could not find csr for nodes: ip-10-131-15-14.my-domain.xyz, ip-10-131-15-13.my-domain.xyz, ip-10-131-15-12.my-domain.xyz
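
For context, the failing task appears to retry an `oc get csr` style lookup from the first master and match pending CSRs against the inventory node names; a rough manual equivalent (my assumption about the mechanism, and assuming /etc/origin/master/admin.kubeconfig is the valid cluster-admin kubeconfig on that master) is:

```
# On the first master: list CSRs using the installer's admin kubeconfig.
oc --config=/etc/origin/master/admin.kubeconfig get csr

# Manually approve anything still pending (workaround only; the playbook is
# supposed to do this automatically).
oc --config=/etc/origin/master/admin.kubeconfig get csr -o name \
  | xargs --no-run-if-empty oc --config=/etc/origin/master/admin.kubeconfig adm certificate approve
```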

Version


  • Your ansible version per ansible --version
    Ansible version 2.8.5

  • openshift-ansible version
    Unsure; it was integrated into a new repository about two months ago.

Steps To Reproduce

Ansible is run through a Jenkins build. Uninstalling and reinstalling, and even rebuilding the machines, produces the same result.

Expected Results

Install to complete

Observed Results


PLAY [Approve any pending CSR requests from inventory nodes] *******************

TASK [Dump all candidate bootstrap hostnames] **********************************
ok: [ip-10-131-15-10.my-domain.xyz] => {
    "msg": [
        "ip-10-131-15-10.my-domain.xyz", 
        "ip-10-131-15-11.my-domain.xyz", 
        "ip-10-131-15-12.my-domain.xyz", 
        "ip-10-131-15-13.my-domain.xyz", 
        "ip-10-131-15-14.my-domain.xyz"
    ]
}

TASK [Find all hostnames for bootstrapping] ************************************
ok: [ip-10-131-15-10.my-domain.xyz]

TASK [Dump the bootstrap hostnames] ********************************************
ok: [ip-10-131-15-10.my-domain.xyz] => {
    "msg": [
        "ip-10-131-15-10.my-domain.xyz", 
        "ip-10-131-15-11.my-domain.xyz", 
        "ip-10-131-15-12.my-domain.xyz", 
        "ip-10-131-15-13.my-domain.xyz", 
        "ip-10-131-15-14.my-domain.xyz"
    ]
}

TASK [Approve node certificates when bootstrapping] ****************************
FAILED - RETRYING: Approve node certificates when bootstrapping (30 retries left).
FAILED - RETRYING: Approve node certificates when bootstrapping (29 retries left).
FAILED - RETRYING: Approve node certificates when bootstrapping (28 retries left).
.....
Failure summary:


  1. Hosts:    ip-10-131-15-10.my-domain.xyz
     Play:     Approve any pending CSR requests from inventory nodes
     Task:     Approve node certificates when bootstrapping
     Message:  Could not find csr for nodes: ip-10-131-15-14.my-domain.xyz, ip-10-131-15-13.my-domain.xyz, ip-10-131-15-12.my-domain.xyz
Build step 'Execute shell' marked build as failure
Finished: FAILURE
[ip-10-131-15-10]> oc get csr
Unable to connect to the server: x509: certificate signed by unknown authority
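
The x509 error pointed at a stale local kubeconfig rather than a broken API server; a quick check (hedged, standard 3.11 paths assumed) is to compare the two configs and re-run the query against the installer's copy:

```
# Compare the local kubeconfig with the one maintained by the installer.
md5sum ~/.kube/config /etc/origin/master/admin.kubeconfig

# Query the API using the installer's kubeconfig instead of ~/.kube/config.
KUBECONFIG=/etc/origin/master/admin.kubeconfig oc get csr
```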

Referencing the issue linked here, I see that my .kube/config and /etc/origin/master/admin.kubeconfig are different and produce different md5sums.

[root@ip-10-131-15-10]> diff config /etc/origin/master/admin.kubeconfig 
4c4
<     certificate-authority-data: <ABC>
---
>     certificate-authority-data: <XYZ>
19,20c19,20
<     client-certificate-data: <ABC>
<     client-key-data: <XYZ>
---
>     client-certificate-data: <ABC>
>     client-key-data: <XYZ>
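
One way to fix that mismatch (a manual cleanup sketch, not something the playbooks do for you) is to back up the stale root kubeconfig and replace it with the installer's current admin.kubeconfig:

```
# Back up the stale kubeconfig, then take the installer's current copy.
cp ~/.kube/config ~/.kube/config.bak
cp /etc/origin/master/admin.kubeconfig ~/.kube/config

# The API should now answer without the x509 error.
oc get nodes
```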
Additional Information
  • OS version
    CentOS 7
    Hardened with CIS Level 3 hardening scripts


  • Inventory file
[OSEv3:children]
masters
etcd
nodes

# Set variables common for all OSEv3 hosts
[OSEv3:vars]
# SSH user, this user should allow ssh based auth without requiring a password
ansible_ssh_user=ec2-user

# If ansible_ssh_user is not root, ansible_become must be set to true
ansible_become=true

# Deploy OKD 3.11.
openshift_deployment_type=origin
openshift_release=v3.11

# We need a wildcard DNS setup for our public access to services, fortunately
# we can use the superb xip.io to get one for free.
openshift_public_hostname=ip-10-131-15-10.my-domain.xyz
openshift_master_default_subdomain=ip-10-131-15-10.my-domain.xyz

# Use an htpasswd file as the identity provider.
openshift_master_identity_providers=[{'name': 'htpasswd_auth', 'login': 'true', 'challenge': 'true', 'kind': 'HTPasswdPasswordIdentityProvider'}]

# Uncomment the line below to enable metrics for the cluster.
# openshift_hosted_metrics_deploy=true

# Use API keys rather than instance roles so that tenant containers don't get
# Openshift's EC2/EBS permissions
openshift_cloudprovider_kind=aws
openshift_cloudprovider_aws_access_key=<XYZ>
openshift_cloudprovider_aws_secret_key=<ABC>

# Set the cluster_id.
openshift_clusterid=openshift-cluster-aws

# Disable image availability health check... For some reason, skopeo does not work when using sudo
openshift_disable_check="docker_image_availability"

# Global Proxy Configuration
# These options configure HTTP_PROXY, HTTPS_PROXY, and NOPROXY environment
# variables for docker and master services.
openshift_http_proxy=http://forwardproxy.my-domain.xyz:9081
openshift_https_proxy=http://forwardproxy.my-domain.xyz:9081
openshift_no_proxy='127.0.0.1,localhost,.my-domain.xyz,.service.consul,169.254.169.254'

#
# Most environments do not require a proxy between OpenShift masters, nodes, and
# etcd hosts. So automatically add those hostnames to the openshift_no_proxy list.
# If all of your hosts share a common domain you may wish to disable this and
# specify that domain above.
# openshift_generate_no_proxy_hosts=True

osm_cluster_network_cidr=10.128.0.0/16

os_firewall_use_firewalld=false
openshift_master_bootstrap_auto_approve=True 

openshift_node_groups=[{'name': 'node-config-master', 'labels': ['node-role.kubernetes.io/master=true']}, {'name': 'node-config-infra', 'labels': ['node-role.kubernetes.io/infra=true']}, {'name': 'node-config-compute', 'labels': ['node-role.kubernetes.io/compute=true']}, {'name': 'node-config-master-infra', 'labels': ['node-role.kubernetes.io/infra=true,node-role.kubernetes.io/master=true']}, {'name': 'node-config-all-in-one', 'labels': ['node-role.kubernetes.io/infra=true,node-role.kubernetes.io/master=true,node-role.kubernetes.io/compute=true']}]

# Create the masters host group. Note that due to:
#   https://github.com/dwmkerr/terraform-aws-openshift/issues/40
# We cannot use the internal DNS names (such as master.openshift.local) as there
# is a bug with the installer when using the AWS cloud provider.
# Note that we use the master node as an infra node as well, which is not recommended for production use.
[masters]
ip-10-131-15-10.my-domain.xyz hostname=ip-10-131-15-10.my-domain.xyz ip=10.131.15.10
ip-10-131-15-11.my-domain.xyz hostname=ip-10-131-15-11.my-domain.xyz ip=10.131.15.11

# host group for etcd
[etcd]
ip-10-131-15-10.my-domain.xyz hostname=ip-10-131-15-10.my-domain.xyz ip=10.131.15.10
ip-10-131-15-11.my-domain.xyz hostname=ip-10-131-15-11.my-domain.xyz ip=10.131.15.11

# all nodes - along with their openshift_node_groups.
[nodes]
ip-10-131-15-10.my-domain.xyz hostname=ip-10-131-15-10.my-domain.xyz ip=10.131.15.10 openshift_node_group_name='node-config-master-infra' openshift_schedulable=true
ip-10-131-15-11.my-domain.xyz hostname=ip-10-131-15-11.my-domain.xyz ip=10.131.15.11 openshift_node_group_name='node-config-master-infra' openshift_schedulable=true
ip-10-131-15-12.my-domain.xyz hostname=ip-10-131-15-11.my-domain.xyz ip=10.131.15.12 openshift_node_group_name='node-config-compute'
ip-10-131-15-13.my-domain.xyz hostname=ip-10-131-15-13.my-domain.xyz ip=10.131.15.13 openshift_node_group_name='node-config-compute'
ip-10-131-15-14.my-domain.xyz hostname=ip-10-131-15-14.my-domain.xyz ip=10.131.15.14 openshift_node_group_name='node-config-compute'
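
Since the three hosts in the failure message are exactly the compute nodes, a useful node-side check is whether the node service ever started and requested a certificate, and whether the hostname it reports matches the inventory entry (assuming the OKD 3.11 node service is named origin-node):

```
# Run on each compute node that never got a CSR approved.
hostname -f                    # should match the inventory hostname= value for this node
systemctl status origin-node   # the node service has to be running to submit a CSR
journalctl -u origin-node | grep -iE 'csr|certificate|bootstrap'
```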

I tried the advice from the issue referenced above and re-ran deploy_cluster.yml, but it still fails at the same spot. However, oc commands now work when run as root.

$ oc status
In project default on server https://ip-10-131-15-10.my-domain.xyz:8443

svc/kubernetes - 172.30.0.1 ports 443->8443, 53->8053, 53->8053

View details with 'oc describe <resource>/<name>' or list everything with 'oc get all'.

$ oc get csr
No resources found.

$ oc get nodes
NAME                             STATUS    ROLES          AGE       VERSION
ip-10-131-15-10.my-domain.xyz   Ready     infra,master   16h       v1.11.0+d4cacc0
ip-10-131-15-11.my-domain.xyz   Ready     infra,master   16h       v1.11.0+d4cacc0

It seems my openshift_clusterid may not have matched the cluster tag on the AWS instances. After updating the clusterid in the inventory file, rebuilding the instances, and re-running the ansible playbooks, the install completed cleanly.
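
For anyone else hitting this, one hedged way to verify the tag/clusterid relationship (assuming the instances are tagged with kubernetes.io/cluster/<clusterid>, which is what the AWS cloud provider matches against; older setups may use the legacy KubernetesCluster tag instead) is:

```
# List the instances that carry the cluster tag matching openshift_clusterid
# (openshift-cluster-aws in the inventory above); any node missing here is
# tagged for a different cluster id than the inventory expects.
aws ec2 describe-instances \
  --filters "Name=tag-key,Values=kubernetes.io/cluster/openshift-cluster-aws" \
  --query 'Reservations[].Instances[].[InstanceId,PrivateIpAddress,PrivateDnsName]' \
  --output table
```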