OKD 3.11 install failure when Approving Node Certificates
Description
On a multi-master install of OKD 3.11, Ansible fails with the message:

```
Could not find csr for nodes: ip-10-131-15-14.my-domain.xyz, ip-10-131-15-13.my-domain.xyz, ip-10-131-15-12.my-domain.xyz
```
Version
- Ansible version (`ansible --version`): 2.8.5
- openshift-ansible version: unsure; it was integrated into the new repository about 2 months ago
Steps To Reproduce
Ansible is run through a Jenkins build. Uninstalling and reinstalling, and even rebuilding the machines, produces the same result.
Expected Results
Install to complete
Observed Results
The play that approves pending CSRs retries until it runs out of attempts and then fails:
```
PLAY [Approve any pending CSR requests from inventory nodes] *******************
TASK [Dump all candidate bootstrap hostnames] **********************************
ok: [ip-10-131-15-10.my-domain.xyz] => {
"msg": [
"ip-10-131-15-10.my-domain.xyz",
"ip-10-131-15-11.my-domain.xyz",
"ip-10-131-15-12.my-domain.xyz",
"ip-10-131-15-13.my-domain.xyz",
"ip-10-131-15-14.my-domain.xyz"
]
}
TASK [Find all hostnames for bootstrapping] ************************************
ok: [ip-10-131-15-10.my-domain.xyz]
TASK [Dump the bootstrap hostnames] ********************************************
ok: [ip-10-131-15-10.my-domain.xyz] => {
"msg": [
"ip-10-131-15-10.my-domain.xyz",
"ip-10-131-15-11.my-domain.xyz",
"ip-10-131-15-12.my-domain.xyz",
"ip-10-131-15-13.my-domain.xyz",
"ip-10-131-15-14.my-domain.xyz"
]
}
TASK [Approve node certificates when bootstrapping] ****************************
FAILED - RETRYING: Approve node certificates when bootstrapping (30 retries left).
FAILED - RETRYING: Approve node certificates when bootstrapping (29 retries left).
FAILED - RETRYING: Approve node certificates when bootstrapping (28 retries left).
.....
Failure summary:
1. Hosts: ip-10-131-15-10.my-domain.xyz
Play: Approve any pending CSR requests from inventory nodes
Task: Approve node certificates when bootstrapping
Message: Could not find csr for nodes: ip-10-131-15-14.my-domain.xyz, ip-10-131-15-13.my-domain.xyz, ip-10-131-15-12.my-domain.xyz
Build step 'Execute shell' marked build as failure
Finished: FAILURE
```
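For anyone debugging the same failure, what the task is waiting on can be checked by hand on the first master. This is only a sketch: the `--config` path is the default OKD 3.11 admin kubeconfig location, and blanket-approving CSRs like this is only appropriate on a fresh, trusted install:

```sh
# List node CSRs the installer is trying to approve (run on the first master)
oc --config=/etc/origin/master/admin.kubeconfig get csr

# Approve anything still pending; -r skips the approve call if nothing is pending
oc --config=/etc/origin/master/admin.kubeconfig get csr -o name \
  | xargs -r oc --config=/etc/origin/master/admin.kubeconfig adm certificate approve
```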
```
[ip-10-131-15-10]> oc get csr
Unable to connect to the server: x509: certificate signed by unknown authority
```
Referencing a related issue, I see that my `.kube/config` and `/etc/origin/master/admin.kubeconfig` are different and produce different md5sums:
```
[root@ip-10-131-15-10]> diff config /etc/origin/master/admin.kubeconfig
4c4
< certificate-authority-data: <ABC>
---
> certificate-authority-data: <XYZ>
19,20c19,20
< client-certificate-data: <ABC>
< client-key-data: <XYZ>
---
> client-certificate-data: <ABC>
> client-key-data: <XYZ>
```
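As a sketch of the workaround (assuming the regenerated `/etc/origin/master/admin.kubeconfig` is the authoritative config, which is the OKD 3.11 default on a master), the stale client config can be replaced or bypassed like this:

```sh
# Keep a backup of the old client config, then replace it with the current admin config
cp ~/.kube/config ~/.kube/config.bak
cp /etc/origin/master/admin.kubeconfig ~/.kube/config

# Alternatively, point oc at the admin kubeconfig for the current shell only
export KUBECONFIG=/etc/origin/master/admin.kubeconfig
oc get csr
```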
Additional Information
- OS version: `CentOS 7`, hardened with CIS Level 3 hardening scripts
- Inventory file:

```
[OSEv3:children]
masters
etcd
nodes
# Set variables common for all OSEv3 hosts
[OSEv3:vars]
# SSH user, this user should allow ssh based auth without requiring a password
ansible_ssh_user=ec2-user
# If ansible_ssh_user is not root, ansible_become must be set to true
ansible_become=true
# Deploy OKD 3.11.
openshift_deployment_type=origin
openshift_release=v3.11
# We need a wildcard DNS setup for our public access to services, fortunately
# we can use the superb xip.io to get one for free.
openshift_public_hostname=ip-10-131-15-10.my-domain.xyz
openshift_master_default_subdomain=ip-10-131-15-10.my-domain.xyz
# Use an htpasswd file as the identity provider.
openshift_master_identity_providers=[{'name': 'htpasswd_auth', 'login': 'true', 'challenge': 'true', 'kind': 'HTPasswdPasswordIdentityProvider'}]
# Uncomment the line below to enable metrics for the cluster.
# openshift_hosted_metrics_deploy=true
# Use API keys rather than instance roles so that tenant containers don't get
# Openshift's EC2/EBS permissions
openshift_cloudprovider_kind=aws
openshift_cloudprovider_aws_access_key=<XYZ>
openshift_cloudprovider_aws_secret_key=<ABC>
# Set the cluster_id.
openshift_clusterid=openshift-cluster-aws
# Disable image availability health check... For some reason, skopeo does not work when using sudo
openshift_disable_check="docker_image_availability"
# Global Proxy Configuration
# These options configure HTTP_PROXY, HTTPS_PROXY, and NOPROXY environment
# variables for docker and master services.
openshift_http_proxy=http://forwardproxy.my-domain.xyz:9081
openshift_https_proxy=http://forwardproxy.my-domain.xyz:9081
openshift_no_proxy='127.0.0.1,localhost,.my-domain.xyz,.service.consul,169.254.169.254'
#
# Most environments do not require a proxy between OpenShift masters, nodes, and
# etcd hosts. So automatically add those hostnames to the openshift_no_proxy list.
# If all of your hosts share a common domain you may wish to disable this and
# specify that domain above.
# openshift_generate_no_proxy_hosts=True
osm_cluster_network_cidr=10.128.0.0/16
os_firewall_use_firewalld=false
openshift_master_bootstrap_auto_approve=True
openshift_node_groups=[{'name': 'node-config-master', 'labels': ['node-role.kubernetes.io/master=true']}, {'name': 'node-config-infra', 'labels': ['node-role.kubernetes.io/infra=true']}, {'name': 'node-config-compute', 'labels': ['node-role.kubernetes.io/compute=true']}, {'name': 'node-config-master-infra', 'labels': ['node-role.kubernetes.io/infra=true,node-role.kubernetes.io/master=true']}, {'name': 'node-config-all-in-one', 'labels': ['node-role.kubernetes.io/infra=true,node-role.kubernetes.io/master=true,node-role.kubernetes.io/compute=true']}]
# Create the masters host group. Note that due to:
# https://github.com/dwmkerr/terraform-aws-openshift/issues/40
# We cannot use the internal DNS names (such as master.openshift.local) as there
# is a bug with the installer when using the AWS cloud provider.
# Note that we use the master node as an infra node as well, which is not recommended for production use.
[masters]
ip-10-131-15-10.my-domain.xyz hostname=ip-10-131-15-10.my-domain.xyz ip=10.131.15.10
ip-10-131-15-11.my-domain.xyz hostname=ip-10-131-15-11.my-domain.xyz ip=10.131.15.11
# host group for etcd
[etcd]
ip-10-131-15-10.my-domain.xyz hostname=ip-10-131-15-10.my-domain.xyz ip=10.131.15.10
ip-10-131-15-11.my-domain.xyz hostname=ip-10-131-15-11.my-domain.xyz ip=10.131.15.11
# all nodes - along with their openshift_node_groups.
[nodes]
ip-10-131-15-10.my-domain.xyz hostname=ip-10-131-15-10.my-domain.xyz ip=10.131.15.10 openshift_node_group_name='node-config-master-infra' openshift_schedulable=true
ip-10-131-15-11.my-domain.xyz hostname=ip-10-131-15-11.my-domain.xyz ip=10.131.15.11 openshift_node_group_name='node-config-master-infra' openshift_schedulable=true
ip-10-131-15-12.my-domain.xyz hostname=ip-10-131-15-12.my-domain.xyz ip=10.131.15.12 openshift_node_group_name='node-config-compute'
ip-10-131-15-13.my-domain.xyz hostname=ip-10-131-15-13.my-domain.xyz ip=10.131.15.13 openshift_node_group_name='node-config-compute'
ip-10-131-15-14.my-domain.xyz hostname=ip-10-131-15-14.my-domain.xyz ip=10.131.15.14 openshift_node_group_name='node-config-compute'
```
I tried this advice and re-ran `deploy_cluster.yml`, and it still fails in the same spot. However, `oc` commands now work when run as root.
```
$ oc status
In project default on server https://ip-10-131-15-10.my-domain.xyz:8443
svc/kubernetes - 172.30.0.1 ports 443->8443, 53->8053, 53->8053
View details with 'oc describe <resource>/<name>' or list everything with 'oc get all'.
$ oc get csr
No resources found.
$ oc get nodes
NAME                            STATUS    ROLES          AGE       VERSION
ip-10-131-15-10.my-domain.xyz   Ready     infra,master   16h       v1.11.0+d4cacc0
ip-10-131-15-11.my-domain.xyz   Ready     infra,master   16h       v1.11.0+d4cacc0
```
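Since the compute nodes never show up (and so never submit a CSR), it is also worth checking whether the node service on those hosts ever started and tried to bootstrap. A rough diagnostic sketch, assuming the OKD 3.11 defaults of the `origin-node` service and the `/etc/origin/node/bootstrap.kubeconfig` path:

```sh
# Run on one of the missing compute nodes, e.g. ip-10-131-15-12
systemctl status origin-node                        # is the node service running at all?
journalctl -u origin-node --no-pager | tail -n 50   # look for bootstrap/CSR errors

# The kubeconfig the node uses to submit its CSR (OKD 3.11 default path)
ls -l /etc/origin/node/bootstrap.kubeconfig
```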
It seems the `openshift_clusterid` in my inventory file may not have matched the tag on the AWS instances. After rebuilding the instances and re-running the Ansible install with an updated clusterid, it installed cleanly.
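To double-check this before rebuilding, the instance tags can be compared against `openshift_clusterid`. A hedged example with a placeholder instance id; my understanding is that the AWS cloud provider expects a `kubernetes.io/cluster/<clusterid>` tag (or the legacy `KubernetesCluster` tag) that lines up with `openshift_clusterid`:

```sh
# Replace i-0123456789abcdef0 with the real instance id of one of the cluster nodes
aws ec2 describe-tags \
  --filters "Name=resource-id,Values=i-0123456789abcdef0" \
  --output table
# Expect a kubernetes.io/cluster/openshift-cluster-aws (or KubernetesCluster) tag here
```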