openshift/openshift-ansible

playbooks/adhoc/uninstall.yml breaks DNS resolution on the host

Closed this issue · 6 comments

ccll commented

Description

After running playbooks/adhoc/uninstall.yml, DNS resolution breaks on the host (on both master and worker nodes).

Version
  • Your ansible version per ansible --version
ansible 2.8.2
  config file = /opt/okd/openshift-ansible/ansible.cfg
  configured module search path = [u'/root/.ansible/plugins/modules', u'/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/lib/python2.7/site-packages/ansible
  executable location = /usr/bin/ansible
  python version = 2.7.5 (default, Apr  9 2019, 14:30:50) [GCC 4.8.5 20150623 (Red Hat 4.8.5-36)]

  • The output of git describe
    openshift-ansible-3.11.136-1-2-gc62475b
Steps To Reproduce

(NOTE: I'm running ansible-playbook on the MASTER node instead of my local dev machine because my SSH connection to the cluster hosts is slow. I don't think that matters here; running Ansible from my local dev machine yields the same results.)

  1. $ ansible-playbook -i inventory playbooks/prerequisites.yml

  2. $ ansible-playbook -i inventory playbooks/deploy_cluster.yml

  3. $ ansible-playbook -i inventory playbooks/adhoc/uninstall.yml

  4. $ ansible-playbook -i inventory playbooks/prerequisites.yml

...
TASK [Gathering Facts] ******************************************************************************************************************************************************
 [WARNING]: Unhandled error in Python interpreter discovery for host worker-1.xxxx.com: Failed to connect to the host via ssh: ssh: Could not resolve hostname
worker-1.okd.xxxx.com: Name or service not known

fatal: [worker-1.okd.xxxx.com]: UNREACHABLE! => {"changed": false, "msg": "Data could not be sent to remote host \"worker-1.okd.xxxx.com\". Make sure this host can be reached over ssh: ssh: Could not resolve hostname worker-1.okd.xxxx.com: Name or service not known\r\n", "unreachable": true}
...
  5. $ nslookup www.google.com
Server:		167.71.200.159
Address:	167.71.200.159#53

** server can't find www.google.com.cluster.local: REFUSED
  6. SSH into the worker nodes: same thing, DNS can't resolve there either (see the diagnostic sketch below).
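
For reference, a quick check that shows only the local resolver path is broken, not the host's connectivity (8.8.8.8 below is just an example upstream resolver, substitute your own):

# /etc/resolv.conf now points at the local dnsmasq address, which refuses the query
$ cat /etc/resolv.conf

# Querying an upstream resolver directly still works, so the breakage is purely
# the local DNS configuration left behind by the uninstall playbook
$ nslookup www.google.com 8.8.8.8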
Expected Results

I expect to be able to deploy the cluster again successfully.

Observed Results

DNS on the hosts is broken, so I basically can't do anything with them any more, except destroy and re-provision all of them.

Additional Information


  • Your operating system and version, ie: RHEL 7.2, Fedora 23 ($ cat /etc/redhat-release)
    CentOS Linux release 7.6.1810 (Core)

  • Your inventory file (especially any non-standard configuration parameters)

# Create an OSEv3 group that contains the masters, nodes, and etcd groups
[OSEv3:children]
masters
nodes
etcd


# Set variables common for all OSEv3 hosts
[OSEv3:vars]
# SSH user, this user should allow ssh based auth without requiring a password
ansible_ssh_user=root
ansible_ssh_common_args='-o StrictHostKeyChecking=no'

# If ansible_ssh_user is not root, ansible_become must be set to true
#ansible_become=true

openshift_deployment_type=origin
openshift_version="3.11.0"
openshift_image_tag="v3.11.0"

openshift_disable_check="memory_availability,disk_availability,docker_image_availability,docker_storage"

# uncomment the following to enable htpasswd authentication; defaults to AllowAllPasswordIdentityProvider
#openshift_master_identity_providers=[{'name': 'htpasswd_auth', 'login': 'true', 'challenge': 'true', 'kind': 'HTPasswdPasswordIdentityProvider'}]

openshift_master_default_subdomain=app.okd.xxxx.com

###openshift_master_api_port=443
###openshift_master_console_port=443

openshift_console_install=true

# These two hostnames should differ, otherwise custom named certificates will not work.
openshift_master_cluster_hostname=master.okd.xxxx.com
openshift_master_cluster_public_hostname=public.okd.xxxx.com

# certs for master and wildcard app router
openshift_certificate_expiry_fail_on_warn=false
openshift_master_overwrite_named_certificates=true 
openshift_master_named_certificates=[ {"certfile": "{{inventory_dir}}/certs/public.server.pem", "keyfile": "{{inventory_dir}}/certs/public.privkey.pem", "cafile": "{{inventory_dir}}/certs/public.ca.pem"} ]
openshift_hosted_router_certificate={"certfile": "{{inventory_dir}}/certs/app.server.pem", "keyfile": "{{inventory_dir}}/certs/app.privkey.pem", "cafile": "{{inventory_dir}}/certs/app.ca.pem"} 


# host group for masters
[masters]
master.okd.xxxx.com


# host group for etcd
[etcd]
master.okd.xxxx.com


# host group for nodes, includes region info
[nodes]
master.okd.xxxx.com openshift_node_group_name='node-config-master-infra'
worker-1.okd.xxxx.com openshift_node_group_name='node-config-compute'

netstat on the master node:

$ netstat -ltnp
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name    
tcp        0      0 0.0.0.0:111             0.0.0.0:*               LISTEN      1/systemd           
tcp        0      0 172.17.0.1:53           0.0.0.0:*               LISTEN      3068/dnsmasq        
tcp        0      0 167.71.200.159:53       0.0.0.0:*               LISTEN      3068/dnsmasq        
tcp        0      0 10.15.0.6:53            0.0.0.0:*               LISTEN      3068/dnsmasq        
tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN      3460/sshd           
tcp        0      0 127.0.0.1:25            0.0.0.0:*               LISTEN      3392/master         
tcp6       0      0 :::111                  :::*                    LISTEN      1/systemd           
tcp6       0      0 fe80::44bb:1cff:fefb:53 :::*                    LISTEN      3068/dnsmasq        
tcp6       0      0 :::22                   :::*                    LISTEN      3460/sshd           
tcp6       0      0 ::1:25                  :::*                    LISTEN      3392/master

I am facing the same issue. Any updates would be appreciated.

When OpenShift is deployed, NetworkManager updates /etc/resolv.conf to point at the local dnsmasq. The upstream DNS entries that existed before deployment are moved to a file under /etc/dnsmasq.d and to /etc/origin/node/resolv.conf.
When you run uninstall, /etc/resolv.conf is not restored with the upstream DNS servers, so after running uninstall you have to put the upstream DNS servers back into /etc/resolv.conf yourself.
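
A rough sketch of the manual recovery, assuming the files the openshift-ansible dnsmasq setup normally leaves behind (the exact file names and the search/nameserver entries below are examples, check what actually exists on your hosts):

# The pre-deployment upstream resolvers are usually still recorded in one of these files
$ cat /etc/origin/node/resolv.conf
$ cat /etc/dnsmasq.d/origin-upstream-dns.conf

# Write those upstream servers back into /etc/resolv.conf
# (replace the search domain and nameserver lines with your own values)
$ cat > /etc/resolv.conf <<'EOF'
search okd.xxxx.com
nameserver 8.8.8.8
nameserver 8.8.4.4
EOF

# If NetworkManager keeps rewriting resolv.conf to point back at the local
# dnsmasq, remove the leftover dispatcher script and restart NetworkManager
$ rm -f /etc/NetworkManager/dispatcher.d/99-origin-dns.sh
$ systemctl restart NetworkManager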

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

@openshift-bot: Closing this issue.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.