Potential problem handling array jobs
cbutakoff opened this issue · 2 comments
cbutakoff commented
I have an array job limited to 2 tasks running at a time:
2145_[4-190%2] compute EP_108 opc PD 0:00 10 (JobArrayTaskLimit)
2145_3 compute EP_108 opc R 48:42 10 compute-hpc-node-[100,373,397,421,425,429,455,457,813,896]
2145_2 compute EP_108 opc R 4:13:06 10 compute-hpc-node-[69,237,245,272,347,553,724,817,931,993]
But Slurm or OCI still tries to provision a 3rd cluster, which fails for lack of available nodes, and it just keeps retrying. E.g.:
cbutakoff commented
Fixed, in principle, by modifying queues.conf and setting max clusters to 2.
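For anyone reading along: the limit in question is the `%2` suffix in the pending job ID `2145_[4-190%2]`, which in Slurm caps how many tasks of the array may run at once. Below is a minimal sketch of how a script could read that limit and bound the number of clusters it asks for. The helper names `array_task_limit` and `clusters_to_request` are hypothetical and this is not the autoscaler's actual code, just an illustration of the idea.

```python
import re
from typing import Optional

# Matches the "%N" suffix inside the brackets of a pending array job ID,
# e.g. "2145_[4-190%2]". Hypothetical helper, not part of any real tool.
_LIMIT_RE = re.compile(r"_\[[^\]%]*%(\d+)\]$")


def array_task_limit(job_id: str) -> Optional[int]:
    """Return the simultaneous-task limit from an array job ID, or None."""
    match = _LIMIT_RE.search(job_id)
    return int(match.group(1)) if match else None


def clusters_to_request(job_id: str, running: int, pending: int) -> int:
    """Number of new clusters worth provisioning for the pending tasks.

    Assumes one cluster per array task, as in the squeue output above.
    """
    limit = array_task_limit(job_id)
    if limit is None:
        # No "%N" limit on the array: every pending task may start.
        return pending
    # Never request more than the limit leaves room for.
    return max(0, min(pending, limit - running))


if __name__ == "__main__":
    # Two tasks already running, limit is 2, so nothing more should start.
    print(clusters_to_request("2145_[4-190%2]", running=2, pending=187))
```

With two tasks already running and a limit of 2, the sketch requests no further clusters, which is the behaviour the max clusters setting enforces by hand.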
arnaudfroidmont commented
Autoscaling with array jobs should be fixed in 2.10.4.