fgci-org/ansible-role-cuda

Reboot node after CUDA install fails when using a bastion host

Closed this issue · 4 comments

The GPU nodes and compute nodes in the FGCI test environment are accessed via a bastion SSH host.

The wait_for with local_action in this role's handler doesn't work, because it runs on the machine where Ansible runs, and that machine can't SSH directly to the compute/GPU nodes (I think).

Solution(s):

  • Run ansible on the bastion host.
  • Is there another way to get around this?

https://github.com/CSC-IT-Center-for-Science/ansible-role-cuda/blob/master/handlers/main.yml
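One possible way around this, as a minimal sketch rather than this role's actual handler: let the reboot wait go through Ansible's normal connection to the node instead of a local_action port check from the controller. The `reboot` module (Ansible 2.7 and later) reconnects over the same connection settings used for every other task, so it follows whatever bastion routing is already configured. The handler name and the timeout value below are assumptions for illustration.

```yaml
# handlers/main.yml (sketch; the handler name is hypothetical)
# Assumes Ansible >= 2.7, where the `reboot` module is available.
- name: reboot after cuda install
  reboot:
    # Seconds to wait for the node to come back and accept connections.
    # 600 is an illustrative value, not taken from the role.
    reboot_timeout: 600
  become: true
```

This only helps if the controller's SSH connection to the nodes is already routed through the bastion, for example with a ProxyCommand or ProxyJump option set in `ansible_ssh_common_args` or in `~/.ssh/config`. On Ansible 2.3 to 2.6, a fire-and-forget reboot followed by `wait_for_connection` would be the closest equivalent.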

Enabled rebooting of the GPU node again in ecc2d6a.
The reboot works, but unfortunately the playbook still fails.

As we're now using ansible-pull this is less of a problem.

This is not actually tested: in our current provisioning, ansible-pull runs via rc.local as soon as the node boots for the first time, so there's no window to test this without disabling ansible-pull on a node.