openshift/openshift-ansible

External DNS nameservers are removed from /etc/resolv.conf

rludva opened this issue · 4 comments

Description

During the installation of OCP 3.11.404, the nameservers are removed from /etc/resolv.conf, and the deployment of a new cluster fails because external network resources become unreachable. Before deploying this cluster, the previous cluster was uninstalled via ansible-playbook and a "yum update" was run.

$ cd /usr/share/ansible/openshift-ansible && ansible-playbook -i /etc/ansible/hosts ./playbooks/prerequisites.yml
$ cd /usr/share/ansible/openshift-ansible && ansible-playbook -i /etc/ansible/hosts ./playbooks/deploy_cluster.yml
[root@torii-ichi-node ~]# cat /etc/resolv.conf 
# nameserver updated by /etc/NetworkManager/dispatcher.d/99-origin-dns.sh
# Generated by NetworkManager
search cluster.local home local.nutius.com
nameserver 192.168.0.31
<<---- !!! The nameserver lines are removed here; they are available on device em1 from the DHCP configuration !!!
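Note that the single remaining nameserver (192.168.0.31) is how 99-origin-dns.sh is meant to work: the dispatcher script points /etc/resolv.conf at a dnsmasq instance running on the node itself and is supposed to hand the DHCP-provided nameservers to that dnsmasq as upstream forwarders. A minimal diagnostic sketch, assuming the standard 3.11 dnsmasq setup (the IP is the node address from the output above):

$ systemctl status dnsmasq                       # the local resolver that resolv.conf now points at
$ cat /etc/dnsmasq.d/origin-upstream-dns.conf    # should contain the DHCP-provided upstream servers
$ dig @192.168.0.31 cdn.redhat.com +short        # forwarding test through the node-local dnsmasq

If dnsmasq is down or the upstream file is empty, every lookup on the node fails even though /etc/resolv.conf looks "correct" from the script's point of view.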

The output of nmcli dev show for device em1 shows that DNS is configured correctly: https://gist.github.com/rludva/80cfef57f1656a82f33c18b758514f99
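(For reference, the DNS-related fields in that gist can be reproduced directly with a standard nmcli field filter:)

$ nmcli -f IP4.DNS,IP4.DOMAIN dev show em1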

DNS nameservers are removed from all nodes except the load balancer and the external Gluster storage for the cluster. I tried reinstalling RHEL 7.9 on the worker node, but the issue persists on that node as well as on the others. After running the playbook, the network is not accessible because of DNS problems: the nameservers are neither configured nor used.
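A minimal way to confirm the failure mode from an affected node (the hostname is just the one already mentioned in this report):

$ getent hosts cdn.redhat.com    # fails while no usable nameserver is configured
$ dig cdn.redhat.com +short      # same check, using the servers listed in /etc/resolv.conf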

Version
-bash-4.2# rpm -q openshift-ansible
openshift-ansible-3.11.404-1.git.0.d161108.el7.noarch

-bash-4.2# ansible --version
ansible 2.9.19
  config file = /usr/share/ansible/openshift-ansible/ansible.cfg
  configured module search path = [u'/root/.ansible/plugins/modules', u'/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/lib/python2.7/site-packages/ansible
  executable location = /usr/bin/ansible
  python version = 2.7.5 (default, Aug 13 2020, 02:51:10) [GCC 4.8.5 20150623 (Red Hat 4.8.5-39)]

-bash-4.2# rpm -qa | grep openshift
atomic-openshift-clients-3.11.404-1.git.0.dd58619.el7.x86_64
openshift-ansible-docs-3.11.404-1.git.0.d161108.el7.noarch
openshift-ansible-3.11.404-1.git.0.d161108.el7.noarch
openshift-ansible-playbooks-3.11.404-1.git.0.d161108.el7.noarch
openshift-ansible-roles-3.11.404-1.git.0.d161108.el7.noarch

-bash-4.2# rpm -qa | grep release
redhat-release-server-7.9-6.el7_9.x86_64
redhat-release-eula-7.8-0.el7.noarch

Expected Results

The cluster should be deployed without any issue.

Observed Results

External resources are not accessible (for example, cdn.redhat.com).

The DNS nameservers that come from the DHCP configuration are not written to /etc/resolv.conf, so everything that needs to resolve a name to an IP address fails. The failure already occurred during the prerequisites.yml playbook, but I added the nameservers to /etc/resolv.conf manually to get past it. During deploy_cluster.yml, however, NetworkManager (or the node itself) is probably restarted, and the manual configuration does not persist. The problems started only after I uninstalled the previous OCP instance, ran yum update, and began the deployment again.
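Since any direct edit of /etc/resolv.conf is overwritten by the dispatcher script, a possible stop-gap is to put the nameservers into the NetworkManager connection profile so they survive restarts; this assumes the dispatcher re-derives its upstream list from the device configuration, which I have not verified. "System em1" and 192.168.0.1 below are placeholders for the real connection name and the DHCP-provided nameserver:

$ nmcli con mod "System em1" ipv4.dns "192.168.0.1"
$ nmcli con up "System em1"
$ cat /etc/dnsmasq.d/origin-upstream-dns.conf    # recheck after NetworkManager reapplies the profile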

The output of ansible-playbook: https://gist.github.com/rludva/59962884fbbc354f4d90a5af7b6867b4
Inventory file: https://gist.github.com/rludva/9b595520c56a8e0a1f2139a49c3e0c2a

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

@openshift-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.