napalm-automation/napalm-ansible

Error with napalm_install_config for nxos_ssh OS - Search pattern never detected in send_command_expect: [hostname]

AdamSmith-Mtl opened this issue · 4 comments

I am using napalm-ansible to install configuration on various equipment, and I am getting an error on one specific Nexus 7K host.

Everything works fine on my other three Nexus switches, on other types of equipment, and on all equipment in my dev environment. So this should not be a problem with the way I use the module.

Here is the part in my playbook where I use the napalm_install_config module:

---

- name: "{{ equipment_family }} - Install the configuration using the napalm-ansible library"
  napalm_install_config:
    hostname: "{{ ansible_ssh_host }}"
    username: "{{ smi_username }}"
    password: "{{ smi_password }}"
    dev_os: "{{ napalm_install_config_dev_os }}"
    config_file: "{{ master_config_file }}"
    commit_changes: "{{ not ansible_check_mode }}"
    replace_config: False
    get_diffs: True
    diff_file: "{{ diff_file }}"
    timeout: "{{ napalm_install_config_timeout }}"
  any_errors_fatal: true

And here is the verbose output I get when this part runs on the problematic host:

The full traceback is:
  File "/tmp/ansible_gJc7b8/ansible_module_napalm_install_config.py", line 311, in main
    device.commit_config()
  File "/usr/lib/python2.7/site-packages/napalm/nxos/nxos.py", line 54, in commit_config
    self._save_to_checkpoint(self.backup_file)
  File "/usr/lib/python2.7/site-packages/napalm/nxos_ssh/nxos_ssh.py", line 645, in _save_to_checkpoint
    self.device.send_command(command)
  File "/usr/lib/python2.7/site-packages/netmiko/base_connection.py", line 1112, in send_command
    search_pattern))

fatal: [QCDRVLS202]: FAILED! => {
    "changed": false, 
    "invocation": {
        "module_args": {
            "archive_file": null, 
            "candidate_file": null, 
            "commit_changes": true, 
            "config": null, 
            "config_file": "/var/lib/awx/projects/_79__customer_provisioning_v72/playbooks/../files/configs/2018-08-24_11:35:30/QCDRVLS202/___master.conf", 
            "dev_os": "nxos_ssh", 
            "diff_file": "/var/lib/awx/projects/_79__customer_provisioning_v72/playbooks/../files/diffs/2018-08-24_11:35:30/QCDRVLS202.diff", 
            "get_diffs": true, 
            "hostname": "10.199.254.47", 
            "optional_args": null, 
            "password": "VALUE_SPECIFIED_IN_NO_LOG_PARAMETER", 
            "provider": {
                "hostname": "10.199.254.47", 
                "password": "VALUE_SPECIFIED_IN_NO_LOG_PARAMETER", 
                "timeout": 60, 
                "username": "removed"
            }, 
            "replace_config": false, 
            "timeout": 60, 
            "username": "removed"
        }
    }, 
    "msg": "cannot install config: Search pattern never detected in send_command_expect: QCDRVLS202\\#"
}
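The message is raised by Netmiko's send_command (the last frame of the traceback) when the expected prompt does not show up in the device output before it gives up. Below is a simplified, hypothetical model of that kind of prompt wait. It is not Netmiko's actual implementation: FakeChannel, wait_for_prompt and PromptNotFound are illustrative names, and the fake channel stands in for a real SSH session. It also shows that the doubled backslash in QCDRVLS202\\# is consistent with a regex-escaped prompt (re.escape turns # into \#) that was then escaped again when printed as JSON.

```python
import re
import time


class PromptNotFound(Exception):
    """Raised when the expected prompt never shows up before the deadline."""


class FakeChannel:
    """Hypothetical stand-in for an SSH channel: returns pre-canned output chunks."""

    def __init__(self, chunks):
        self._chunks = list(chunks)

    def read(self):
        # Return the next chunk of output, or "" when nothing new has arrived.
        return self._chunks.pop(0) if self._chunks else ""


def wait_for_prompt(channel, prompt, timeout=1.0, poll=0.01):
    """Read from channel until prompt appears or timeout seconds pass (simplified model)."""
    search_pattern = re.escape(prompt)  # "QCDRVLS202#" -> r"QCDRVLS202\#"
    output = ""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        output += channel.read()
        if re.search(search_pattern, output):
            return output
        time.sleep(poll)
    raise PromptNotFound(
        "Search pattern never detected in send_command_expect: " + search_pattern
    )


if __name__ == "__main__":
    # The device answers and then prints its prompt: success.
    ok = FakeChannel(["checkpoint done\n", "QCDRVLS202# "])
    print(wait_for_prompt(ok, "QCDRVLS202#"))

    # The device never prints the prompt: the same kind of error as in this issue.
    stuck = FakeChannel(["..."])
    try:
        wait_for_prompt(stuck, "QCDRVLS202#", timeout=0.1)
    except PromptNotFound as exc:
        print(exc)
```

In this model a longer timeout only helps if the prompt eventually arrives; if it never does, the wait fails however long it is.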

As I said, it always happens on the QCDRVLS202 Nexus, even though the configuration I send to this host is the same as the one I send to the other hosts.

This error doesn't happen in check mode. From what I gathered while searching on Google, this specific host might be slower to respond than the three other similar hosts, so the command times out while waiting for the "QCDRVLS202#" prompt on the command line. However, I can't find a way to make it wait longer.
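One plausible explanation for check mode being unaffected: the playbook passes commit_changes: "{{ not ansible_check_mode }}", and the traceback shows that the checkpoint is created from inside commit_config(). Below is a simplified, hypothetical model of that flow, not the module's real source. It assumes that when commit_changes is false the candidate is discarded instead of committed. RecordingDevice and install_config are illustrative names.

```python
class RecordingDevice:
    """Hypothetical device stub that records which operations were called."""

    def __init__(self):
        self.calls = []

    def load_merge_candidate(self, filename):
        self.calls.append("load_merge_candidate")

    def compare_config(self):
        self.calls.append("compare_config")
        return "+vlan 2200"

    def _save_to_checkpoint(self, filename):
        # On the real nxos_ssh driver this sends "checkpoint file <filename>".
        self.calls.append("checkpoint")

    def commit_config(self):
        # Per the traceback, commit_config() calls _save_to_checkpoint().
        self._save_to_checkpoint("config_backup")
        self.calls.append("commit_config")

    def discard_config(self):
        self.calls.append("discard_config")


def install_config(device, config_file, check_mode):
    """Simplified, hypothetical model of the install flow."""
    commit_changes = not check_mode  # mirrors "{{ not ansible_check_mode }}"
    device.load_merge_candidate(filename=config_file)
    diff = device.compare_config()
    if commit_changes:
        device.commit_config()  # only this path creates a checkpoint
    else:
        device.discard_config()
    return diff


if __name__ == "__main__":
    dev = RecordingDevice()
    install_config(dev, "___master.conf", check_mode=True)
    print(dev.calls)  # ['load_merge_candidate', 'compare_config', 'discard_config']
```

Under this model, a dry run never sends the checkpoint command, so a problem with checkpoint creation can only surface on a real commit.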

If you require additional information, I will be glad to provide it, as solving this issue is very important to me.

Thank you!

Can you show what you are changing for the device that is failing?

           "config_file": "/var/lib/awx/projects/_79__customer_provisioning_v72/playbooks/../files/configs/2018-08-24_11:35:30/QCDRVLS202/___master.conf",

Here is the content of the config file:


vlan 2200
  name VLAB-SMI

interface Po8
  switchport trunk allowed vlan add 2200

interface Po30
  switchport trunk allowed vlan add 2200

interface Po200
  switchport trunk allowed vlan add 2200


ip access-list tools-pbr-vlab-smi
  fragments deny-all

route-map tools-pbr-policy-556 permit 6000
  match ip address tools-pbr-vlab-smi
  set vrf MCMTRL

vlan 2800
no vlan 2800

interface Po8
  switchport trunk allowed vlan remove 2800

interface Po30
  switchport trunk allowed vlan remove 2800

interface Po200
  switchport trunk allowed vlan remove 2800

interface Po18
  switchport trunk allowed vlan remove 2800

interface Po19
  switchport trunk allowed vlan remove 2800

interface Po20
  switchport trunk allowed vlan remove 2800

interface Po21
  switchport trunk allowed vlan remove 2800

router bgp 65099
  no vrf vlab-smi-dmz
no interface Ethernet10/16.2800
no interface Vlan2800
no vrf context vlab-smi-dmz

object-group ip address ca-spectrum-vlab-smi
  host 69.158.230.99


ip access-list tools-pbr-vlab-smi
  permit ip addrgroup ca-spectrum-vlab-smi any

object-group ip address paco-vlab-smi
  host 1.1.1.1
  host 2.2.2.2


ip access-list paco-pbr-vlab-smi
  fragments deny-all
  permit ip addrgroup paco addrgroup paco-vlab-smi

route-map paco-pbr-policy permit 6000
  match ip address paco-pbr-vlab-smi
  set vrf MCMTRL

object-group ip address smarts-vlab-smi
  host 69.158.230.99


ip access-list tools-pbr-vlab-smi
  permit ip addrgroup smarts-vlab-smi any

It is exactly the same configuration as the one I send to my other N7K devices in the same environment.

This is what it is failing on, in napalm/nxos_ssh/nxos_ssh.py (line 645 in the traceback):

def _save_to_checkpoint(self, filename):
    """Save the current running config to the given file."""
    command = 'checkpoint file {}'.format(filename)
    self.device.send_command(command)

Basically, it is trying to create a checkpoint, and that is possibly taking longer than 60 seconds. You could try increasing the timeout you pass in and see what happens.

Here is how the file name passed as filename is generated:

self.backup_file = 'config_' + str(datetime.now()).replace(' ', '_')
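Combining the two quoted lines shows the exact command sent to the device. Here is a minimal sketch that uses a fixed timestamp instead of datetime.now() so the result is reproducible. checkpoint_command is an illustrative helper, not part of NAPALM.

```python
from datetime import datetime


def checkpoint_command(now):
    """Illustrative helper: build the backup file name and CLI command like the quoted code."""
    backup_file = 'config_' + str(now).replace(' ', '_')
    return 'checkpoint file {}'.format(backup_file)


if __name__ == "__main__":
    print(checkpoint_command(datetime(2018, 8, 24, 11, 35, 30, 123456)))
    # checkpoint file config_2018-08-24_11:35:30.123456
```

Because str() omits the fractional seconds when the microsecond value is exactly zero, the file name is not always the same length. Pasting the resulting command into the device CLI is one way to reproduce this step by hand.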

You could also try to manually create the checkpoint from the CLI on the device and see what happens (i.e. how long it takes and whether anything goes wrong).

Thank you for your help; it works now! The problem was that this specific Nexus had already reached the maximum number of checkpoints specified in the configuration (10 in our case).

Running show checkpoint summary showed that the problematic Nexus had 10 checkpoints, while the other ones had fewer. Removing old, unused checkpoints solved the problem, and everything works flawlessly now.

Thanks again!
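Since the root cause was the checkpoint limit, a pre-flight check before committing could catch it early. Here is a minimal sketch. It assumes you have already collected the existing checkpoint names oldest first (for example from show checkpoint summary, whose output is not parsed here) and that you know the configured maximum (10 in this case). checkpoints_to_remove is a hypothetical helper, not a NAPALM or NX-OS API. Deciding whether the returned checkpoints are really unused is still up to the operator.

```python
def checkpoints_to_remove(existing, limit, needed=1):
    """Hypothetical helper: oldest checkpoint names to delete so `needed` new ones fit.

    `existing` must be ordered oldest first; `limit` is the configured maximum.
    """
    if limit < needed:
        raise ValueError("limit is smaller than the number of checkpoints needed")
    excess = len(existing) + needed - limit
    return list(existing[:excess]) if excess > 0 else []


if __name__ == "__main__":
    existing = ["cp%02d" % i for i in range(10)]  # hypothetical names, oldest first
    print(checkpoints_to_remove(existing, limit=10))  # ['cp00']
```

With 10 checkpoints and a limit of 10, one old checkpoint has to go before NAPALM's commit can save its own. With fewer than 10, nothing needs to be removed, which matches the other three switches working.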