JUGGHM/PENet_ICRA2021

Runtime measurement

MarSpit opened this issue · 4 comments

Hi there,
thank you very much for your excellent work and for publishing it.

I am trying to implement a "lightweight" version of ENet that aims to be faster in computation.
In order to have the runtime of ENet on my hardware (Tesla V100 GPU) as a benchmark, I tried to measure it. Being aware of issue #4, I took the torch.cuda.synchronize() command into account.
Measuring the time this way, I obtained a runtime of 10.5 ms for processing a single image.
However, I realized that I cannot compute more than 6 depth images per second (while the GPU load is at 100 %), which indicated to me that something was wrong.
Doing further investigation, I came across the PyTorch profiler, which seems to be the official tool for correct GPU time measurement. Measuring the time that way, I got 150 ms, which is consistent with my maximum frame rate, as data preprocessing on the CPU comes on top.
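
For reference, here is a minimal sketch of the two timing approaches being compared. The stand-in module and the input resolution are assumptions for illustration, not the actual ENet interface; only torch.cuda.synchronize(), time.time(), and torch.profiler come from the discussion above.

```python
import time
import torch

device = torch.device("cuda")
model = torch.nn.Conv2d(3, 32, 3, padding=1).to(device).eval()  # stand-in for ENet
x = torch.randn(1, 3, 352, 1216, device=device)                 # assumed KITTI-like resolution

with torch.no_grad():
    # Warm-up so CUDA context creation / cuDNN autotuning does not pollute the measurement
    for _ in range(10):
        model(x)

    # (1) Wall-clock timing with explicit synchronization
    torch.cuda.synchronize()
    t0 = time.time()
    model(x)
    torch.cuda.synchronize()
    print(f"time.time(): {(time.time() - t0) * 1e3:.2f} ms")

    # (2) The PyTorch profiler, which reports CPU and CUDA time separately
    with torch.profiler.profile(
        activities=[torch.profiler.ProfilerActivity.CPU,
                    torch.profiler.ProfilerActivity.CUDA],
    ) as prof:
        model(x)
    print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```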

Is it possible that the times you measured are not the actual execution times of the network, but rather the kernel launch times?


Thanks for your interest! I have not used the PyTorch profiler before. I will try it soon, but it may take some days.

Hi MarSpit, I have just tested the runtime with a short script in which the same tensor is fed into ENet 100 times. The model is warmed up first and gradient calculation is disabled. As I no longer have a 2080 Ti, the experiments were conducted on a 3090 with no other workload (see the sketch after the results below). The results are:

(1) Measured by time.time() (synchronized, of course): 2.9389 s
(2) Measured with torch.cuda.Event: 2.9519 s
(3) Measured with torch.profiler.profile(activities=[torch.profiler.ProfilerActivity.CPU, torch.profiler.ProfilerActivity.CUDA]): GPU time 2.946 s, CPU time 2.575 s
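
A rough sketch of such a benchmark could look like the following. The 100-iteration loop, the warm-up, and the disabled gradients follow the description above, while the placeholder module and input shape are assumptions standing in for the real ENet.

```python
import time
import torch

device = torch.device("cuda")
model = torch.nn.Conv2d(3, 32, 3, padding=1).to(device).eval()  # placeholder for ENet
x = torch.randn(1, 3, 352, 1216, device=device)                 # assumed input shape
n_iters = 100

with torch.no_grad():
    for _ in range(10):          # warm-up
        model(x)
    torch.cuda.synchronize()

    # (1) Synchronized wall-clock time over 100 iterations
    t0 = time.time()
    for _ in range(n_iters):
        model(x)
    torch.cuda.synchronize()
    print(f"time.time(): {time.time() - t0:.4f} s")

    # (2) CUDA events recorded on the default stream
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(n_iters):
        model(x)
    end.record()
    torch.cuda.synchronize()
    print(f"cuda events: {start.elapsed_time(end) / 1e3:.4f} s")  # elapsed_time returns ms
```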

So I don't think our measurement leads to a speed inconsistency as large as 50%. From my own experience, I/O interaction could be one performance bottleneck if other programs are being executed at the same time.

Hi JUGGHM,
many thanks for checking my point and doing the new measurements. The times you measured in different ways seem pretty similar; I am not sure why I get such a large difference between the profiler and time.time() with the synchronize command.
However, there seems to be a big difference from the time you state for ENet on the 2080 Ti GPU, which is 0.064 s vs. the ~2.9 s you measured now.
Do you have a guess what the reason for this is? ~2.9 s seems pretty long on a GPU.

The 2.9 s is for 100 repetitions on one 3090, which indicates an inference speed of ~30 ms per sample.