“cluster_machine_list” critical issue -- Machine GPU counts do NOT match the job log
Qinghao-Hu opened this issue · 3 comments
I recently analyzed the trace data and found that “cluster_machine_list” does not match "cluster_job_log".
For instance, the job log entry below shows an 8-GPU job placed on machine "m51". However, according to “cluster_machine_list”, "m51" has only 2 GPUs:
m51,2, 12GB
{
"status": "Pass",
"vc": "2869ce",
"jobid": "application_1506638472019_12703",
"attempts": [
{
"start_time": "2017-10-06 14:40:02",
"end_time": "2017-10-09 05:19:16",
"detail": [
{
"ip": "**m51**",
"gpus": [
"gpu0",
"gpu1",
"gpu2",
"gpu3",
"gpu4",
"gpu5",
"gpu6",
"gpu7"
]
}
]
}
]
}
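The check described above can be sketched roughly as follows. This is only an illustration, not the trace authors' tooling: it assumes the machine list is made of comma-separated lines shaped like `m51,2, 12GB` and that job records have the shape of the entry shown here. The helper names `parse_machine_list` and `find_mismatches` are hypothetical, and the `**` around `m51` in the excerpt is treated as emphasis markup rather than part of the machine id.

```python
import json

def parse_machine_list(text):
    """Map machine id -> listed GPU count, from lines like 'm51,2, 12GB'."""
    listed = {}
    for line in text.strip().splitlines():
        fields = [f.strip() for f in line.split(",")]
        if len(fields) < 2 or not fields[1].isdigit():
            continue  # skip a header row or a malformed line
        listed[fields[0]] = int(fields[1])
    return listed

def find_mismatches(jobs, listed_gpus):
    """Return (jobid, machine, gpus_used, gpus_listed) wherever a job
    attempt uses more GPUs on a machine than the machine list reports."""
    mismatches = []
    for job in jobs:
        for attempt in job.get("attempts", []):
            for detail in attempt.get("detail", []):
                machine = detail["ip"]
                used = len(detail["gpus"])
                listed = listed_gpus.get(machine)
                if listed is not None and used > listed:
                    mismatches.append((job["jobid"], machine, used, listed))
    return mismatches

# Data copied from this issue (timestamps omitted for brevity).
machine_list_text = "m51,2, 12GB"
job = json.loads("""
{
  "status": "Pass",
  "vc": "2869ce",
  "jobid": "application_1506638472019_12703",
  "attempts": [
    {"detail": [{"ip": "m51",
                 "gpus": ["gpu0", "gpu1", "gpu2", "gpu3",
                          "gpu4", "gpu5", "gpu6", "gpu7"]}]}
  ]
}
""")

mismatches = find_mismatches([job], parse_machine_list(machine_list_text))
for jobid, machine, used, listed in mismatches:
    print(f"{jobid}: {used} GPUs used on {machine}, but only {listed} listed")
```

Running the same check over the whole job log would show whether m51 is an isolated case or whether the mismatch is systematic.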
Furthermore, I analyzed "cluster_gpu_log" and found that the per-machine GPU counts it implies are completely different from those in “cluster_machine_list”:
Machine details from “cluster_machine_list”
Total machines | 2-GPU machines (12GB) | 8-GPU machines (24GB) |
---|---|---|
552 | 321 | 231 |
However,
Machine details derived from “cluster_gpu_log”
Total machines | 8-GPU machines | 4-GPU machines | 0-GPU machines | Others (2 or 3 GPUs) |
---|---|---|---|---|
552 | 264 | 271 | 13 | 4 |
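As a quick sanity check on the two tables, the sketch below confirms that both tallies cover the same 552 machines and lists the machine sizes whose counts disagree. The numbers are copied from the tables above; the names `total_machines` and `differing_sizes` are hypothetical helpers introduced only for this illustration.

```python
# Counts copied from the two tables above; keys are GPUs per machine.
machine_list_tally = {2: 321, 8: 231}
# The "other" bucket mixes 2- and 3-GPU machines, so the 2-GPU row
# is not directly comparable between the two sources.
gpu_log_tally = {8: 264, 4: 271, 0: 13, "other (2 or 3)": 4}

def total_machines(tally):
    """Total number of machines covered by a tally."""
    return sum(tally.values())

def differing_sizes(a, b):
    """Machine sizes whose counts differ between two tallies.
    A size missing from one tally is treated as a count of 0."""
    return {
        size: (a.get(size, 0), b.get(size, 0))
        for size in a.keys() | b.keys()
        if a.get(size, 0) != b.get(size, 0)
    }

print(total_machines(machine_list_tally), total_machines(gpu_log_tally))
for size, (listed, logged) in differing_sizes(machine_list_tally, gpu_log_tally).items():
    print(f"{size}: machine list = {listed}, gpu log = {logged}")
```

Both totals come to 552, so the two files appear to describe the same set of machines. Only the per-machine GPU counts differ.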
I am really confused by the trace. Could you please explain this discrepancy?
@Tonyhao96 any clues about the issue you mentioned here?
@kzhang28 As I mentioned above, you can check, for instance, the job "jobid": "application_1506638472019_12703". It uses 8 GPUs allocated on the machine "ip": "**m51**". However, m51 is equipped with only 2 GPUs.
Sorry, I should have made my question clearer. I meant to ask whether you know the reason for this mismatch.