ECP-VeloC/VELOC

restart-in-place: record number of nodes used in first run, so restart logic knows whether enough healthy nodes exist

adammoody opened this issue · 1 comment

To know whether enough healthy nodes are left for a restart, it's useful to have the first run of a job record the number of nodes it used in a file. The restart scripts can then read that file to find out how many nodes are needed and compare that against the nodes still available. We can work around this today by having the user set a variable or config param stating the number of nodes they need, like VELOC_MIN_NODES. However, it would be nice to automate this, since it's one less setting for the user.
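A minimal sketch of the idea, not an existing VELOC feature: the first run writes its node count to a file, and the restart logic reads it back, falling back to `VELOC_MIN_NODES` when no file exists. The file name `nodes_used.txt` and the function names are hypothetical, and the order of precedence (recorded value first, variable second) is an assumption.

```python
import os

# Hypothetical file name; VELOC does not define this today.
NODE_COUNT_FILE = "nodes_used.txt"


def record_node_count(directory, num_nodes):
    """Write the node count on the first run only; later runs keep the original value."""
    path = os.path.join(directory, NODE_COUNT_FILE)
    if not os.path.exists(path):
        with open(path, "w") as f:
            f.write(f"{num_nodes}\n")
    return path


def nodes_needed(directory, env=None):
    """Return the required node count, or None if it cannot be determined."""
    env = os.environ if env is None else env
    path = os.path.join(directory, NODE_COUNT_FILE)
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (FileNotFoundError, ValueError):
        pass  # no usable record; fall back to the user setting
    value = env.get("VELOC_MIN_NODES")
    return int(value) if value else None


def can_restart(directory, healthy_nodes, env=None):
    """True if the number of healthy nodes covers what the first run used."""
    needed = nodes_needed(directory, env)
    return needed is not None and healthy_nodes >= needed
```

With this layout, a job that first ran on 8 nodes would be restarted in place only while at least 8 healthy nodes remain, and the user would not need to set anything by hand.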

This issue stayed inactive for a long time. Please reopen if still relevant.