playbooks/adhoc/uninstall.yml breaks DNS resolution on the host
Description
After running playbooks/adhoc/uninstall.yml, DNS resolution breaks on the host (both master and worker nodes).
Version
- Your ansible version per ansible --version:
ansible 2.8.2
config file = /opt/okd/openshift-ansible/ansible.cfg
configured module search path = [u'/root/.ansible/plugins/modules', u'/usr/share/ansible/plugins/modules']
ansible python module location = /usr/lib/python2.7/site-packages/ansible
executable location = /usr/bin/ansible
python version = 2.7.5 (default, Apr 9 2019, 14:30:50) [GCC 4.8.5 20150623 (Red Hat 4.8.5-36)]
- The output of git describe:
openshift-ansible-3.11.136-1-2-gc62475b
Steps To Reproduce
(NOTE: I'm running ansible-playbook on the MASTER node instead of my local dev machine because the SSH connection from my dev machine to the cluster hosts is slow. I don't think this matters here; running it from my local dev machine yields the same results.)
- $ ansible-playbook -i inventory playbooks/prerequisites.yml
- $ ansible-playbook -i inventory playbooks/deploy_cluster.yml
- $ ansible-playbook -i inventory playbooks/adhoc/uninstall.yml
- $ ansible-playbook -i inventory playbooks/prerequisites.yml
...
TASK [Gathering Facts] ******************************************************************************************************************************************************
[WARNING]: Unhandled error in Python interpreter discovery for host worker-1.xxxx.com: Failed to connect to the host via ssh: ssh: Could not resolve hostname worker-1.okd.xxxx.com: Name or service not known
fatal: [worker-1.okd.xxxx.com]: UNREACHABLE! => {"changed": false, "msg": "Data could not be sent to remote host \"worker-1.okd.xxxx.com\". Make sure this host can be reached over ssh: ssh: Could not resolve hostname worker-1.okd.xxxx.com: Name or service not known\r\n", "unreachable": true}
...
$ nslookup www.google.com
Server: 167.71.200.159
Address: 167.71.200.159#53
** server can't find www.google.com.cluster.local: REFUSED
- (SSH into the worker nodes shows the same thing: DNS can't resolve.)
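The REFUSED answer and the appended .cluster.local search suffix suggest the queries are still being answered by the node-local dnsmasq rather than an upstream resolver. A quick way to confirm that (generic commands; 8.8.8.8 is just an example public resolver, not part of my setup):
$ cat /etc/resolv.conf              # which nameserver/search domains the host is using now
$ nslookup www.google.com 8.8.8.8   # query an upstream resolver directly; this should still work if only the local resolver config is broken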
Expected Results
I expect the cluster can be deployed again successfully.
Observed Results
DNS on the hosts is broken, so I basically can't do anything on them except destroy and re-provision all of them.
Additional Information
Provide any additional information which may help us diagnose the issue.
- Your operating system and version, ie: RHEL 7.2, Fedora 23 ($ cat /etc/redhat-release):
CentOS Linux release 7.6.1810 (Core)
- Your inventory file (especially any non-standard configuration parameters):
# Create an OSEv3 group that contains the masters, nodes, and etcd groups
[OSEv3:children]
masters
nodes
etcd
# Set variables common for all OSEv3 hosts
[OSEv3:vars]
# SSH user, this user should allow ssh based auth without requiring a password
ansible_ssh_user=root
ansible_ssh_common_args='-o StrictHostKeyChecking=no'
# If ansible_ssh_user is not root, ansible_become must be set to true
#ansible_become=true
openshift_deployment_type=origin
openshift_version="3.11.0"
openshift_image_tag="v3.11.0"
openshift_disable_check="memory_availability,disk_availability,docker_image_availability,docker_storage"
# uncomment the following to enable htpasswd authentication; defaults to AllowAllPasswordIdentityProvider
#openshift_master_identity_providers=[{'name': 'htpasswd_auth', 'login': 'true', 'challenge': 'true', 'kind': 'HTPasswdPasswordIdentityProvider'}]
openshift_master_default_subdomain=app.okd.xxxx.com
###openshift_master_api_port=443
###openshift_master_console_port=443
openshift_console_install=true
# These two hostnames should differ, otherwise custom named certificates will not work.
openshift_master_cluster_hostname=master.okd.xxxx.com
openshift_master_cluster_public_hostname=public.okd.xxxx.com
# certs for master and wildcard app router
openshift_certificate_expiry_fail_on_warn=false
openshift_master_overwrite_named_certificates=true
openshift_master_named_certificates=[ {"certfile": "{{inventory_dir}}/certs/public.server.pem", "keyfile": "{{inventory_dir}}/certs/public.privkey.pem", "cafile": "{{inventory_dir}}/certs/public.ca.pem"} ]
openshift_hosted_router_certificate={"certfile": "{{inventory_dir}}/certs/app.server.pem", "keyfile": "{{inventory_dir}}/certs/app.privkey.pem", "cafile": "{{inventory_dir}}/certs/app.ca.pem"}
# host group for masters
[masters]
master.okd.xxxx.com
# host group for etcd
[etcd]
master.okd.xxxx.com
# host group for nodes, includes region info
[nodes]
master.okd.xxxx.com openshift_node_group_name='node-config-master-infra'
worker-1.okd.xxxx.com openshift_node_group_name='node-config-compute'
- netstat on the master node:
$ netstat -ltnp
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address            Foreign Address  State   PID/Program name
tcp        0      0 0.0.0.0:111              0.0.0.0:*        LISTEN  1/systemd
tcp        0      0 172.17.0.1:53            0.0.0.0:*        LISTEN  3068/dnsmasq
tcp        0      0 167.71.200.159:53        0.0.0.0:*        LISTEN  3068/dnsmasq
tcp        0      0 10.15.0.6:53             0.0.0.0:*        LISTEN  3068/dnsmasq
tcp        0      0 0.0.0.0:22               0.0.0.0:*        LISTEN  3460/sshd
tcp        0      0 127.0.0.1:25             0.0.0.0:*        LISTEN  3392/master
tcp6       0      0 :::111                   :::*             LISTEN  1/systemd
tcp6       0      0 fe80::44bb:1cff:fefb:53  :::*             LISTEN  3068/dnsmasq
tcp6       0      0 :::22                    :::*             LISTEN  3460/sshd
tcp6       0      0 ::1:25                   :::*             LISTEN  3392/master
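dnsmasq is still bound to port 53 on the host interfaces here, so it is also worth checking which drop-in configuration it is serving. The file names below are the ones openshift-ansible 3.11 normally writes; they may differ on other releases:
$ ls /etc/dnsmasq.d/                             # e.g. origin-dns.conf, origin-upstream-dns.conf
$ cat /etc/dnsmasq.d/origin-upstream-dns.conf    # the upstream servers the installer saved
$ cat /etc/origin/node/resolv.conf               # copy of the pre-install resolver settings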
I am facing the same issue. Any updates would be appreciated.
When OpenShift is deployed, NetworkManager updates /etc/resolv.conf to point at localhost (the node-local dnsmasq). The upstream DNS entries that existed before deployment are moved to a file under /etc/dnsmasq.d and to /etc/origin/node/resolv.conf.
When you run the uninstall, /etc/resolv.conf is not restored with the upstream DNS servers. After running the uninstall, update /etc/resolv.conf with your DNS servers manually.
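A minimal recovery sketch based on that description, run as root on each affected host. The nameserver addresses are placeholders for whatever upstream servers the hosts used before the install, and /etc/origin/node/resolv.conf is where the installer normally keeps a copy of the pre-install settings (it may already have been removed by the uninstall):
$ cat /etc/origin/node/resolv.conf    # recover the original upstream servers, if the file survived
$ cat > /etc/resolv.conf <<'EOF'
search okd.xxxx.com
nameserver 8.8.8.8
nameserver 8.8.4.4
EOF
$ systemctl stop dnsmasq              # optional: stop the leftover dnsmasq still bound to port 53
$ nslookup www.google.com             # should resolve again
If /etc/resolv.conf gets rewritten again afterwards, check whether the origin NetworkManager dispatcher script (typically /etc/NetworkManager/dispatcher.d/99-origin-dns.sh) was left behind and remove it before restarting NetworkManager.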
Issues go stale after 90d of inactivity.
Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.
If this issue is safe to close now please do so with /close.
/lifecycle stale
Stale issues rot after 30d of inactivity.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.
If this issue is safe to close now please do so with /close.
/lifecycle rotten
/remove-lifecycle stale
Rotten issues close after 30d of inactivity.
Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.
/close
@openshift-bot: Closing this issue.
In response to this:
Rotten issues close after 30d of inactivity.
Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.
/close
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.