princeton-vl/CornerNet-Lite

Question about training CornerNet-squeeze on Tesla v-100

Opened this issue · 0 comments

When training with 4 or more GPUs (Tesla V100), training seems slower than with only one or two:
1. Using 2× 2080Ti with batch-size 24 and chunk-sizes [12,12] is the fastest: 1.22 it/s.
2. Using 1× V100 with batch-size 16 roughly doubles the training time, and GPU memory usage is quite low: 2.41 s/it.
3. So I tried batch-size 128 with chunk-sizes [32,32,32,32]; it turned out to be even slower than a single V100, and GPU utilization is very low: 6.75 s/it.
4. Batch-size 320 with chunk-sizes [40,40,40,40,40,40,40,40] turned out to be the slowest; even though GPU memory usage is high, GPU utilization is the lowest.

It seems the bottleneck is the period when the CPU loads the data (correct me if I am wrong).
So I wonder: what is the suggested config for the Tesla V100? Also, I found that using a DataLoader might help. Is it possible to use this method with CornerNet-Lite?
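In case it helps clarify the question, here is a minimal sketch of what I mean by using PyTorch's `DataLoader` with multiple worker processes to overlap CPU-side data loading with GPU compute. The dataset class and tensor sizes below are placeholders I made up for illustration, not CornerNet-Lite's actual sampling code:

```python
import torch
from torch.utils.data import DataLoader, Dataset


class DummyDetectionDataset(Dataset):
    """Hypothetical stand-in for CornerNet-Lite's sampling function."""

    def __init__(self, n=64):
        self.n = n

    def __len__(self):
        return self.n

    def __getitem__(self, idx):
        # Placeholder tensors; real code would read and augment an image
        # and build corner heatmaps here.
        image = torch.randn(3, 511, 511)
        heatmap = torch.zeros(2, 128, 128)
        return image, heatmap


loader = DataLoader(
    DummyDetectionDataset(),
    batch_size=16,
    shuffle=True,
    num_workers=2,    # worker processes prefetch batches on the CPU
    pin_memory=True,  # page-locked memory speeds up host-to-GPU copies
)

for images, heatmaps in loader:
    # training step would go here; batches arrive already collated
    pass
```

With `num_workers > 0`, batch preparation runs in background processes, so the GPUs are less likely to sit idle waiting on the CPU, which is my guess at what is happening in the multi-GPU runs above.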