bearpaw/pytorch-pose

Data loader is slow

Closed this issue · 7 comments

The data loader seems to be extremely slow for some batches. After every few batches (roughly every 10 or 20), it takes a few seconds (up to 15 s) to load the data. I have tried increasing the number of data loader workers (via the option -j 12) and increasing the training batch size, but the issue persists. Is this expected? Is it caused by the data transforms? The problem becomes severe when I run the code on more than one GPU: most of the time the GPUs remain idle, which increases the overall time for one epoch (for me, 1 hr 20 min).

My machine configurations are:
4x1080Ti, Intel Xeon E5-2640, and I am loading the data from an SSD.

weigq commented

Here it takes about 15 min per epoch with a 1080 Ti and an E5, so I think your issue may not be caused by the data loader.
BTW, have you used CUDA and cuDNN?

@weigq Yes, I am using CUDA 8.0 and cuDNN 6. I don't think CUDA/cuDNN is the issue, as all my other code runs just fine.

I have just tested it on another machine with the same configuration and the slowness is still there. It occurs only during the data loading part, not during the forward pass (which takes a few milliseconds).
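For reference, here is roughly how I would time it; a minimal sketch, assuming the dataset yields (input, target, meta) tuples and that train_loader and model are whatever the training script builds:

```python
import time
import torch

def time_loader(train_loader, model, n_batches=100):
    """Rough check: how long do we wait on the loader vs. the forward pass?

    `train_loader` and `model` are placeholders for the objects built in the
    training script; the unpacking assumes (input, target, meta) tuples.
    """
    data_time, gpu_time = 0.0, 0.0
    end = time.time()
    for i, (inputs, target, meta) in enumerate(train_loader):
        t_data = time.time() - end          # time spent waiting for the next batch
        data_time += t_data

        model(inputs.cuda())                # forward pass only, just for timing
        torch.cuda.synchronize()            # finish GPU work before reading the clock
        gpu_time += time.time() - end - t_data

        end = time.time()
        if i + 1 == n_batches:
            break
    print('waited on data: %.1fs, forward pass: %.1fs over %d batches'
          % (data_time, gpu_time, i + 1))
```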

This is a known issue, and I'm trying to fix it.

Basically, when all the already-loaded batches have been forwarded but the next batch is not ready, the program has to wait for the data loader.

I think the augmentation part should be optimized (e.g. pose/datasets/mpii.py). I am also trying to figure out which part is the most time-consuming. If you have suggestions, don't hesitate to open a pull request or leave your comments here.
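If anyone wants to help profile it, something like the following should show where the time goes. This is only a sketch: the Mpii constructor arguments below are assumptions, so substitute whatever the training script actually passes.

```python
import cProfile
import pstats

from pose.datasets.mpii import Mpii   # the dataset class in this repo

# Assumed constructor arguments -- use whatever example/mpii.py passes.
dataset = Mpii('data/mpii/mpii_annotations.json', 'data/mpii/images', train=True)

def read_samples(n=200):
    # Exercise __getitem__ (image loading + augmentation) with no GPU work at all.
    for i in range(n):
        dataset[i % len(dataset)]

profiler = cProfile.Profile()
profiler.runcall(read_samples)
pstats.Stats(profiler).sort_stats('cumulative').print_stats(20)  # top 20 by cumulative time
```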

@bearpaw
I used the Torch version, and it has the same problem. Putting the augmentation ahead of the crop operation, as this PyTorch version does, seems more appropriate but is also more time-consuming. I tried removing the augmentation part and got a bit of a speedup.

Also, the np.linalg.inv operation is called 17 times, and image reading is definitely time-consuming. So the two main parts, image reading and image preprocessing, are both expensive, and as we have seen, they sometimes wait for each other.
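For example, if those 17 calls come from inverting the same transform once per joint, the inverse could probably be hoisted out of the per-joint loop. A rough sketch, assuming get_transform in pose/utils/transforms.py is the helper that builds the 3x3 matrix (the function name transform_points is just illustrative):

```python
import numpy as np
from pose.utils.transforms import get_transform   # assumed: builds the 3x3 crop/scale/rot matrix

def transform_points(pts, center, scale, res, rot=0):
    """Map all joints with a single matrix inverse (sketch only).

    The per-joint transform(..., invert=1) calls invert the same 3x3 matrix
    once per joint; here it is inverted once and applied to all joints.
    """
    t_inv = np.linalg.inv(get_transform(center, scale, res, rot=rot))
    homo = np.concatenate([pts[:, :2], np.ones((len(pts), 1))], axis=1)  # (N, 3) homogeneous coords
    return homo.dot(t_inv.T)[:, :2]                                      # back to (N, 2)
```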

It would be better to separate the two parts into two producer-consumer stages instead of just one.
For the implementation, I think setting up a dedicated loader around load_image (mpii.py#L90) and load_annot (mpii.py#L70-#L73) for the training phase should work, roughly as in the sketch below.
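Roughly what I mean, as a toy sketch not tied to the repo's actual classes (load_image and augment are placeholders for the real functions):

```python
import queue
import threading

def reader(paths, out_q, load_image):
    # Producer 1: disk I/O + image decoding only.
    for p in paths:
        out_q.put((p, load_image(p)))
    out_q.put(None)                      # sentinel: no more images

def preprocessor(in_q, out_q, augment):
    # Producer 2 / consumer 1: CPU-side augmentation, decoupled from disk I/O.
    while True:
        item = in_q.get()
        if item is None:
            out_q.put(None)
            break
        path, img = item
        out_q.put((path, augment(img)))

def pipeline(paths, load_image, augment, maxsize=32):
    raw_q, ready_q = queue.Queue(maxsize), queue.Queue(maxsize)
    threading.Thread(target=reader, args=(paths, raw_q, load_image), daemon=True).start()
    threading.Thread(target=preprocessor, args=(raw_q, ready_q, augment), daemon=True).start()
    while True:
        item = ready_q.get()
        if item is None:
            break
        yield item                       # (path, augmented image) ready for batching
```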

For the evaluation phase in the above framework, an extra image_id-to-prediction map would also need to be added.

This is just my personal opinion; I hope it helps.

@xmyqsh This is a very comprehensive analysis. Thanks very much!

To @adityaarun1 @weigq @xmyqsh, here is a quick fix (at the very least, it allows you to train at the same speed with fewer workers).

As suggested by @xingyizhou, I had missed a speed-up implemented in the original hourglass code: https://github.com/anewell/pose-hg-train/blob/master/src/util/img.lua#L91-L105
(because I had followed the Python code here: https://github.com/anewell/pose-hg-train/blob/master/src/pypose/img.py#L48-L78).

This has been fixed in my latest commit: 88a2294

The original code was a bit difficult for me to understand, so I implemented this part in my own way. The result seems correct (140 epochs of hg-stack2-block1 achieve 86.3 PCKh).
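The idea, in rough form (my own paraphrase, not the exact code in 88a2294): shrink the whole image before cropping whenever the person box is much larger than the network input, so the expensive copy happens on a small array.

```python
import numpy as np
import scipy.misc

def shrink_before_crop(img, center, scale, res=256):
    """Sketch of the speed-up: downscale the whole image *before* cropping.

    If the person box is sf times larger than the network input, shrink the
    image by sf first, so the later crop copies from a small array instead
    of the full-resolution one.  Returns the adjusted image/center/scale
    for the usual crop code to use.  The threshold below is illustrative.
    """
    sf = scale * 200.0 / res                 # MPII scale is person height / 200
    if sf >= 2:                              # only worth it when the box is much bigger than the input
        new_ht = int(np.floor(img.shape[0] / sf))
        new_wd = int(np.floor(img.shape[1] / sf))
        img = scipy.misc.imresize(img, [new_ht, new_wd])
        center = np.asarray(center) / sf
        scale = scale / sf
    return img, center, scale
```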

@bearpaw
In other words, downscaling the image with img = scipy.misc.imresize(img, [new_ht, new_wd]) (a subsampling operation) is much faster than the copy operation new_img[new_y[0]:new_y[1], new_x[0]:new_x[1]] = img[old_y[0]:old_y[1], old_x[0]:old_x[1]] on data of the same size.

So the big copy operation on the original image is the bottleneck, right?

@bearpaw the fix works. 👍
This thread on Twitter might be of interest for further improving the efficiency of the data loader. People from fast.ai have developed a multi-processing + thread-pool data loader for PyTorch which works well, but using it would add an extra dependency to this repo.
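Not the fast.ai implementation, just the general idea as I understand it: a thread pool that decodes the samples of a batch in parallel (threads release the GIL during image decoding). A sketch with placeholder names:

```python
from concurrent.futures import ThreadPoolExecutor

def threaded_batches(dataset, batch_size, n_threads=8):
    """Sketch of a thread-pool loader: decode the samples of a batch in parallel.

    Not the fast.ai code -- just the general idea of using a thread pool
    instead of (or in addition to) extra worker processes.
    """
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        for start in range(0, len(dataset), batch_size):
            idxs = range(start, min(start + batch_size, len(dataset)))
            samples = list(pool.map(dataset.__getitem__, idxs))  # parallel __getitem__ calls
            yield samples                                        # collate into tensors as needed
```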