oracle-quickstart/oci-hpc

Potential problem handling array jobs

cbutakoff opened this issue · 2 comments

I have an array job limited to 2 jobs at a time:

   2145_[4-190%2]   compute   EP_108      opc PD       0:00     10 (JobArrayTaskLimit)
   2145_3   compute   EP_108      opc  R      48:42     10 compute-hpc-node-[100,373,397,421,425,429,455,457,813,896]
   2145_2   compute   EP_108      opc  R    4:13:06     10 compute-hpc-node-[69,237,245,272,347,553,724,817,931,993]

However, Slurm or the OCI autoscaling still tries to provision a third cluster. The attempt fails because no nodes are available, but it keeps retrying. For example:
(screenshot: Selection_022)
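
For reference, the `%2` suffix in the pending entry `2145_[4-190%2]` is Slurm's array throttle: at most 2 tasks of the array may run at once. A minimal Python sketch of what that implies for node demand follows. The helper name `parse_array_job_id` is hypothetical, it only handles the simple `first-last%limit` form shown above, and the figure of 10 nodes per task is read from the squeue output in this report.

```python
import re


def parse_array_job_id(job_id):
    """Split a pending squeue array ID such as '2145_[4-190%2]' into its parts.

    Hypothetical helper: only the simple 'first-last' range with an
    optional '%limit' suffix is handled.
    """
    m = re.fullmatch(r"(\d+)_\[(\d+)-(\d+)(?:%(\d+))?\]", job_id)
    if m is None:
        raise ValueError(f"unsupported array job ID: {job_id!r}")
    base, first, last, limit = m.groups()
    return {
        "job_id": int(base),
        "first_task": int(first),
        "last_task": int(last),
        # None means the array has no throttle
        "max_running": int(limit) if limit else None,
    }


info = parse_array_job_id("2145_[4-190%2]")

# Each task in this report requests 10 nodes (NODES column of squeue).
nodes_per_task = 10
max_nodes_needed = info["max_running"] * nodes_per_task

print(info)
print("clusters needed at once:", info["max_running"])
print("nodes needed at once:", max_nodes_needed)
```

Under these assumptions the array never needs more than 2 clusters (20 nodes) at the same time, so a third provisioning attempt is not required by the job itself.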

Worked around, in principle, by editing queues.conf and setting the maximum number of clusters to 2.

Autoscaling with job arrays should be fixed in 2.10.4.