“cluster_machine_list” critical issue -- machine GPU counts do NOT match the job log

Question

Qinghao-Hu opened this issue 4 years ago · 3 comments

Recently, I analyzed the trace data and found that “cluster_machine_list” does not match "cluster_job_log".

For instance, the job log entry below shows an 8-GPU job placed on machine "m51". However, according to “cluster_machine_list”, "m51" has only 2 GPUs:

m51,2, 12GB

{
    "status": "Pass",
    "vc": "2869ce",
    "jobid": "application_1506638472019_12703",
    "attempts": [
        {
            "start_time": "2017-10-06 14:40:02",
            "end_time": "2017-10-09 05:19:16",
            "detail": [
                {
                    "ip": "m51",
                    "gpus": [
                        "gpu0",
                        "gpu1",
                        "gpu2",
                        "gpu3",
                        "gpu4",
                        "gpu5",
                        "gpu6",
                        "gpu7"
                    ]
                }
            ]
        }
    ]
}
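
For anyone who wants to reproduce this check across the whole trace, here is a minimal sketch in Python. It assumes the machine list is comma-separated text in the `machine,gpu_count,memory` form shown above and that the job log is a list of JSON objects shaped like the entry above. The function names `parse_machine_list` and `find_oversubscribed` are my own and are not part of the trace tooling.

```python
def parse_machine_list(text):
    """Parse lines like 'm51,2, 12GB' into {machine_id: gpu_count}."""
    capacity = {}
    for line in text.splitlines():
        parts = [p.strip() for p in line.split(",")]
        if len(parts) < 2 or not parts[1].isdigit():
            continue  # skip blank lines and any header row
        capacity[parts[0]] = int(parts[1])
    return capacity


def find_oversubscribed(jobs, capacity):
    """Yield (jobid, machine, gpus_used, gpus_listed) for every placement
    that uses more GPUs on a machine than the machine list reports."""
    for job in jobs:
        for attempt in job.get("attempts", []):
            for placement in attempt.get("detail", []):
                machine = placement["ip"]
                used = len(placement["gpus"])
                listed = capacity.get(machine)
                if listed is not None and used > listed:
                    yield job["jobid"], machine, used, listed


# The single example from this issue; in practice the jobs would come
# from json.load() on the job log file.
machines = parse_machine_list("m51,2, 12GB\n")
jobs = [{
    "jobid": "application_1506638472019_12703",
    "attempts": [
        {"detail": [{"ip": "m51", "gpus": [f"gpu{i}" for i in range(8)]}]}
    ],
}]
mismatches = list(find_oversubscribed(jobs, machines))
print(mismatches)
# [('application_1506638472019_12703', 'm51', 8, 2)]
```

Machines that appear in the job log but not in the machine list are skipped here; they could be reported separately if needed.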

Furthermore, I analyzed "cluster_gpu_log" and found that the per-machine GPU counts are completely different from those in “cluster_machine_list”:

Machine details from “cluster_machine_list”

| Total machines | 2-GPU machines (12GB) | 8-GPU machines (24GB) |
| --- | --- | --- |
| 552 | 321 | 231 |

However,

Machine details derived from “cluster_gpu_log”

| Total machines | 8-GPU machines | 4-GPU machines | 0-GPU machines | Others (3 or 2 GPUs) |
| --- | --- | --- | --- | --- |
| 552 | 264 | 271 | 13 | 4 |
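
The two summaries above are tallies of machines grouped by GPU count. A rough sketch of that tally, plus a per-machine comparison between two sources, is shown below. It assumes each source has already been reduced to a `{machine_id: gpu_count}` dictionary; how to extract that from "cluster_gpu_log" depends on the file layout and is not shown. The machine IDs and counts in the example are made-up toy values, not trace data, and the helper names are hypothetical.

```python
from collections import Counter


def tally_by_gpu_count(gpus_per_machine):
    """Return a Counter mapping gpu_count -> number of machines."""
    return Counter(gpus_per_machine.values())


def disagreements(listed, observed):
    """Return {machine: (listed_count, observed_count)} for machines
    present in both sources whose GPU counts differ."""
    shared = listed.keys() & observed.keys()
    return {m: (listed[m], observed[m]) for m in shared
            if listed[m] != observed[m]}


# Toy values for illustration only.
listed = {"mA": 2, "mB": 8, "mC": 2}
observed = {"mA": 8, "mB": 8, "mC": 4}

print(tally_by_gpu_count(listed))       # two 2-GPU machines, one 8-GPU machine
print(tally_by_gpu_count(observed))     # two 8-GPU machines, one 4-GPU machine
print(disagreements(listed, observed))  # mA and mC differ between the sources
```

Listing the disagreeing machines individually, rather than only comparing the totals, makes it easier to see whether the mismatch is systematic (for example, every listed 2-GPU machine reporting more GPUs) or limited to a few hosts.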

I am really confused about the trace. Could you please explain this discrepancy?

Answer 1 · 2021-09-03T19:05:31.000Z

@Tonyhao96 any clues about the issue you mentioned here?

Answer 2 · 2021-09-04T03:38:57.000Z

@kzhang28 As I mentioned above, you can check, for instance, the job with "jobid": "application_1506638472019_12703". It uses 8 GPUs allocated on the machine with "ip": "m51". However, m51 is equipped with only 2 GPUs.

Answer 3 · 2021-09-04T15:44:14.000Z

> @kzhang28 As I mentioned above, you can check, for instance, the job with "jobid": "application_1506638472019_12703". It uses 8 GPUs allocated on the machine with "ip": "m51". However, m51 is equipped with only 2 GPUs.

Sorry, I should have made my question clearer. I meant to ask whether you know the reason for this mismatch.