msr-fiddle/philly-traces

Confusion about the data and advice for the notebook code

Panlichen opened this issue · 2 comments

Confusion about the data

The example cluster_job_log entry in README.md shows the first attempt's end time as "2017-10-08 21:08:07", but I find that it is actually None in the dataset.

The reason I wanted to look into this example is that the job's two attempts appear to run simultaneously, which would mean the job uses 16 GPUs at the same time. That does not match the way num_gpus is calculated in the notebook.
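
To make the concern concrete, here is a rough sketch of the overlap check I have in mind; the start_time/end_time field names and the timestamp format are taken from the README example, so treat them as assumptions:

from datetime import datetime

TS_FMT = "%Y-%m-%d %H:%M:%S"  # format of the timestamps in the README example

def parse_ts(ts):
    """Parse a trace timestamp, tolerating missing (None) values."""
    return datetime.strptime(ts, TS_FMT) if ts else None

def attempts_overlap(a, b):
    """True only if both attempts have complete timestamps and their
    [start, end] intervals intersect."""
    times = [parse_ts(a["start_time"]), parse_ts(a["end_time"]),
             parse_ts(b["start_time"]), parse_ts(b["end_time"])]
    if any(t is None for t in times):
        return False  # incomplete logging; cannot conclude overlap
    a_start, a_end, b_start, b_end = times
    return a_start < b_end and b_start < a_end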

Advice for the notebook code

I got TypeError: '>=' not supported between instances of 'NoneType' and 'int' when execution reached bucket = get_bucket_from_num_gpus(num_gpus), so I added a guard to the get_bucket_from_num_gpus function to fix it:

def get_bucket_from_num_gpus(num_gpus):
    """Maps GPU count to a bucket for plotting purposes."""
    if num_gpus is None:
        return None  # propagate the missing count; callers skip None buckets
    elif num_gpus == 1:
        return 0
    ...
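
With that guard in place, the call site only needs to skip the None result; a hypothetical sketch (the Counter here is a placeholder, not the notebook's actual data structure):

from collections import Counter

buckets = Counter()
for num_gpus in (None, 1):           # e.g. values pulled from the trace
    bucket = get_bucket_from_num_gpus(num_gpus)
    if bucket is not None:           # jobs with a missing count are skipped
        buckets[bucket] += 1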

Besides, when I ran the code in the GPU Utilization (Figures 5, 6) chapter, my PC crashed; I suspect my 16 GB of memory cannot hold the whole data structure. Perhaps the code could be made friendlier to machines with less memory?
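
As one possible direction, the heavy parts could stream the log in chunks and keep only running aggregates instead of holding everything in memory. A rough sketch, where the file path and column name are placeholders for whatever the notebook actually loads:

import pandas as pd

total, count = 0.0, 0
# Process the utilization log 100k rows at a time; only two scalars
# ever stay resident, so peak memory is bounded by the chunk size.
for chunk in pd.read_csv("cluster_gpu_util.csv", chunksize=100_000):
    util = pd.to_numeric(chunk["gpu_util"], errors="coerce").dropna()
    total += util.sum()
    count += len(util)

if count:
    print("mean GPU utilization: %.2f%%" % (total / count))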

Hi Panlichen,

Thanks for bringing these issues to our attention! Regarding the example in the README: you're right that the end time of the first attempt is actually None, but this does not mean the two attempts were executing concurrently. It is most likely an artifact of a logging error that left the true end time missing. As noted in the README, some attempts may have missing start or end times; the example you saw was one such case. To reiterate, the total number of GPUs used by a job remains fixed across all of its attempts (though they may be spread across more machines). We have updated the example to use a better-behaved job in #4.
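
For clarity: since the count is fixed across attempts, it can be derived from any single attempt. A minimal sketch, with the attempts/detail/gpus field names taken from the README example:

def num_gpus_for_job(job):
    """GPU count for a job, read from its first attempt (the count is
    the same for every attempt, even if the machine placement differs)."""
    for attempt in job["attempts"]:
        return sum(len(machine["gpus"]) for machine in attempt["detail"])
    return None  # job was never scheduled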

Regarding the TypeError you were seeing, we have confirmed that this was a Python 2.7/Python 3.7 incompatibility issue. We have fixed this as well in #4.
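
For anyone else who runs into this: under Python 2.7, None >= 1 silently evaluates to False (None orders below every int), while Python 3 raises the TypeError reported above. A quick reproduction under Python 3:

num_gpus = None
try:
    num_gpus >= 1
except TypeError as e:
    # '>=' not supported between instances of 'NoneType' and 'int'
    print(e)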

And finally, regarding the memory issues you were seeing: we will take your suggestion into consideration, but we would also welcome a PR addressing this if you would like to contribute! :)

Closing as this was addressed by #4.