MFlowCode/MFC

Frontier CI sporadically failed 2-rank test in test suite

sbryngelson opened this issue · 1 comments

Frontier CI sporadically fails a 2-rank test in the test suite.

One such example: https://github.com/MFlowCode/MFC/actions/runs/9230003014/job/25397349146

Note that requesting a node on Frontier gives us exclusive access, so we automatically get all of its GPUs, though I'm not sure how many we actually use.

I pass this argument in the batch job:

https://github.com/MFlowCode/MFC/blob/4f89f33739da7df6a74151afc7ef89c0f41f2bc9/.github/workflows/frontier/test.sh#L3C1-L3C37

which I guess tries to run 4 tests at once, so there should always be enough GPUs available. Maybe the tests end up overlapping on the same GPUs in the 2-rank case?
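To make the overlap question concrete, here is a rough sketch of the arithmetic, not something MFC does today. It assumes each MPI rank wants its own GPU, which I have not verified for the test suite, and `max_parallel_tests` is a hypothetical helper name.

```shell
# Hypothetical helper (not part of MFC): how many tests can run at once
# without two ranks sharing a GPU, ASSUMING one GPU per MPI rank.
#   $1 = number of GPUs visible on the node
#   $2 = MPI ranks used by each test
max_parallel_tests() {
    mpt_gpus=$1
    mpt_ranks=$2
    mpt_jobs=$((mpt_gpus / mpt_ranks))   # integer division rounds down
    if [ "$mpt_jobs" -lt 1 ]; then
        mpt_jobs=1                       # always allow at least one test
    fi
    echo "$mpt_jobs"
}
```

For example, `max_parallel_tests 8 2` gives 4, while `max_parallel_tests 4 2` gives 2. If the GPU count actually visible to the job is smaller than expected, running 4 two-rank tests at once would oversubscribe it under this assumption.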

Notably, the Frontier setup is very different from the clever setup that @henryleberre wrote for the Phoenix case:

n_test_threads=8
if [ "$job_device" = "gpu" ]; then
    gpu_count=$(nvidia-smi -L | wc -l)             # number of GPUs on the node
    gpu_ids=$(seq -s ' ' 0 $((gpu_count - 1)))     # "0 1 2 ... gpu_count-1" (space-separated)
    device_opts="-g $gpu_ids"
    n_test_threads=$((gpu_count * 2))              # two test threads per GPU
fi
# $device_opts is left unquoted on purpose so each GPU id becomes its own argument
./mfc.sh test -a -j $n_test_threads $device_opts -- -c phoenix
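The same logic could be factored so it does not depend on `nvidia-smi`, which would be a first step toward reusing it on Frontier. This is only a sketch: `gpu_id_list` and `test_args` are hypothetical helpers, the GPU count is passed in by the caller rather than detected, and only the `-a`, `-j`, and `-g` options already shown above are used.

```shell
# Hypothetical helper: print GPU ids 0..N-1 separated by single spaces.
gpu_id_list() {
    gil_n=$1
    gil_i=0
    gil_ids=""
    while [ "$gil_i" -lt "$gil_n" ]; do
        gil_ids="${gil_ids:+$gil_ids }$gil_i"   # prepend a space only after the first id
        gil_i=$((gil_i + 1))
    done
    echo "$gil_ids"
}

# Hypothetical helper: build the test arguments the Phoenix script uses
# (two test threads per GPU, every GPU id passed to -g).
test_args() {
    ta_gpus=$1
    echo "-a -j $((ta_gpus * 2)) -g $(gpu_id_list "$ta_gpus")"
}
```

With 2 GPUs, `test_args 2` prints `-a -j 4 -g 0 1`. How to obtain the GPU count on Frontier is left open here.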

This issue seems stale and less relevant these days. Closing.