Frontier CI sporadically failed 2-rank test in test suite
sbryngelson opened this issue · 1 comment
Frontier CI sporadically fails a 2-rank test in the test suite.
One such example: https://github.com/MFlowCode/MFC/actions/runs/9230003014/job/25397349146
One note is that we automatically get all GPUs when we request a node (exclusive access) on Frontier, though I'm not sure how many we actually use.
I pass this argument in the batch job:
which I guess tries to run 4 tests at once. So, there should always be enough GPUs available. Maybe they overlap in the 2-rank case?
Notably, this is quite different from the clever setup that @henryleberre wrote for the Phoenix case:
MFC/.github/workflows/phoenix/test.sh
Lines 10 to 19 in 4f89f33
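To make the overlap concern concrete, here is a minimal sketch of how concurrent tests could be given disjoint GPU sets so that a 2-rank case never shares a device with another running test. It is not the Phoenix script referenced above and not what MFC's test runner does. The function names (`assign_gpus`, `visible_devices`), the job list, and the count of 8 visible GPUs are all assumptions made for illustration.

```python
from collections import deque


def assign_gpus(jobs, gpu_ids):
    """Pack jobs onto disjoint GPU sets for one scheduling wave.

    jobs:    list of (name, n_ranks) tuples (hypothetical test cases).
    gpu_ids: GPU indices assumed to be visible on the node.

    Returns (assignments, deferred). `assignments` maps each job name to
    the GPU ids reserved for it; `deferred` holds jobs that did not fit
    and would have to wait for a later wave.
    """
    free = deque(gpu_ids)
    assignments, deferred = {}, []
    for name, n_ranks in jobs:
        if n_ranks <= len(free):
            # One GPU per rank, taken from the pool so no two jobs share one.
            assignments[name] = [free.popleft() for _ in range(n_ranks)]
        else:
            deferred.append((name, n_ranks))
    return assignments, deferred


def visible_devices(gpu_ids):
    """Format GPU ids as a comma-separated list, e.g. for an environment
    variable such as ROCR_VISIBLE_DEVICES (an assumed way to pin a test)."""
    return ",".join(str(g) for g in gpu_ids)


if __name__ == "__main__":
    # Four tests at once, two of them 2-rank, on an assumed 8-GPU node.
    jobs = [("case_a", 1), ("case_b", 2), ("case_c", 2), ("case_d", 1)]
    assignments, deferred = assign_gpus(jobs, list(range(8)))
    for name, gpus in assignments.items():
        print(name, visible_devices(gpus))
    print("deferred:", deferred)
```

With an explicit pool like this, a 2-rank case reserves two devices up front, so running several tests at once cannot oversubscribe a GPU. Whether the sporadic failure was actually caused by such an overlap is not established in this thread.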
This issue seems stale and less relevant these days. Closing.