Reboot node after CUDA install fails when using a bastion host

Question

Closed this issue 9 years ago · 4 comments

The GPU nodes and compute nodes in the FGCI test environment are accessed via a bastion SSH host.

The wait_for with local_action in this role's handler doesn't work, because the machine where Ansible runs can't SSH directly to the compute/GPU nodes (I think).
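
For illustration, a reboot-and-wait handler of this kind usually looks roughly like the sketch below. This is not copied from the role's handlers/main.yml; the handler names, the port, and the timings are assumptions. The point is that local_action makes wait_for open a TCP connection from the machine running Ansible straight to the node, which fails when the node is only reachable through the bastion.

```yaml
# Illustrative sketch only, not the role's actual handlers.
# Handler names, port, delay and timeout are assumptions.
- name: reboot node
  # Fire and forget, so the dropped SSH session does not fail the play.
  shell: sleep 2 && shutdown -r now "Rebooting after CUDA install"
  async: 1
  poll: 0
  ignore_errors: true

- name: wait for node to come back
  # local_action runs wait_for on the control machine, so the control
  # machine itself must be able to reach port 22 on the node.
  local_action: wait_for host={{ inventory_hostname }} port=22 state=started delay=60 timeout=600
```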

Solution(s):

Run ansible on the bastion host.
Is there another way to get around this?
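
One other option might be to run the wait from the bastion instead of from the control machine. A minimal, untested sketch, assuming a hypothetical variable `bastion_host` that names the bastion (it is not defined in this role), and assuming the bastion can reach the nodes on port 22 under their inventory names:

```yaml
# Untested sketch. bastion_host is a hypothetical variable, not part of the role.
- name: wait for node to come back (checked from the bastion)
  wait_for:
    host: "{{ inventory_hostname }}"  # still the rebooted node, even when delegated
    port: 22
    state: started
    delay: 60      # assumed: give the node time to go down first
    timeout: 600   # assumed upper bound for the reboot
  delegate_to: "{{ bastion_host }}"
```

With delegate_to, the wait_for module runs on the bastion, so the port check starts from a host that can actually reach the node.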

https://github.com/CSC-IT-Center-for-Science/ansible-role-cuda/blob/master/handlers/main.yml

Answer 1 · 2015-11-27T11:13:56.000Z

Enabled rebooting of the GPU node again in ecc2d6a.
The reboot works, but unfortunately the playbook still fails.

Answer 2 · 2015-12-10T12:53:39.000Z

As we're now using ansible-pull this is less of a problem.

Answer 3 · 2016-02-17T10:16:10.000Z

Similar fix to the one in fgci-org/fgci-ansible#15

Answer 4 · 2016-02-22T06:19:30.000Z

This is not actually tested, because in our current provisioning ansible-pull runs (via rc.local) directly after the first boot, so there's no window to test this without disabling ansible-pull on a node.