princeton-vl/pose-hg-train

bad argument #6 to 'sub'

wydges opened this issue · 7 comments

Hi, anewell! I've been hitting this error while training from scratch, at different epochs (4, 5, 12), since the latest commits.
==> Starting epoch: 12/100
torch/install/bin/luajit: /opt/torch/install/share/lua/5.1/threads/threads.lua:183: [thread 3 callback] pose-hg-train/src/util/img.lua:115: bad argument #6 to 'sub' (out of range at torch/pkg/torch/generic/Tensor.c:330)

Any ideas about what causes it and how it could be fixed?
Thank you in advance.
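
For readers following along: the message itself says that an index passed to `sub` was out of range. This thread does not confirm what produces the bad index at img.lua:115, so the following is only a minimal Python sketch of one general way to avoid such errors when cropping: clip the requested window to the valid range and pad the rest. The names `clamp_window` and `crop_row` are hypothetical and are not taken from the repository.

```python
def clamp_window(start, end, size):
    """Clip the 1-based inclusive range [start, end] to [1, size].

    Returns (src_start, src_end, dst_offset), where dst_offset is the
    0-based position in the output at which the first valid element
    lands, or None when the window lies entirely outside [1, size].
    """
    src_start = max(start, 1)
    src_end = min(end, size)
    if src_start > src_end:
        return None
    return src_start, src_end, src_start - start


def crop_row(row, start, end, fill=0):
    """Crop a 1-D list with a 1-based inclusive window, padding with `fill`
    wherever the window extends past the data."""
    out = [fill] * (end - start + 1)
    window = clamp_window(start, end, len(row))
    if window is None:
        return out  # nothing overlaps: return padding only
    s, e, offset = window
    out[offset:offset + (e - s + 1)] = row[s - 1:e]
    return out
```

Under these assumptions, a window that starts before the first element or ends after the last one yields a padded result instead of an out-of-range access.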

I am having the same problem. Have you managed to figure out the cause?

@yxchng Not yet. I see this problem with both the 4-stack and 8-stack models.
I have cuDNN version 5.1 and CUDA 7.5, and it's all running on Ubuntu 16.04.

As a temporary workaround, try using the crop2 function instead (edit the call to crop in pose.lua). I'll try to get a proper fix up soon.

@anewell Hi, crop2 is failing at epoch 37 with the same error.

Sorry you ran into that. I've looked into it further and made some modifications that should take care of the problem once and for all. I just pushed the update; let me know if it still has issues. I've also added a safeguard so that, on the off chance there is still a bug, it won't crash the whole run.
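
The thread does not show how this safeguard is implemented. As an illustration only, here is a minimal Python sketch of the general idea: catch the failure for a single sample, record it, and keep going rather than letting one bad sample abort training. `load_batch` and `flaky_loader` are hypothetical names, and the exception types are assumptions chosen for the example.

```python
def load_batch(indices, load_sample):
    """Load each sample, skipping any that raise an index or value error.

    Returns (batch, skipped), where skipped holds (index, message) pairs
    so that failures remain visible instead of being silently dropped.
    """
    batch, skipped = [], []
    for idx in indices:
        try:
            batch.append(load_sample(idx))
        except (IndexError, ValueError) as err:
            skipped.append((idx, str(err)))
    return batch, skipped


def flaky_loader(idx):
    """Stand-in loader for the example: sample 3 always fails."""
    if idx == 3:
        raise IndexError("out of range")
    return idx * 2


batch, skipped = load_batch([1, 2, 3, 4], flaky_loader)
```

Recording the skipped indices matters here: a guard that hides failures completely would make the underlying bug harder to find.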

@anewell Just wondering, did you try running the code on your own machine? My training actually gets stuck without any error (it just freezes). I'm not sure whether it's a problem with my server, though.

@anewell, thank you! For now everything seems to be running okay. Also, I didn't run into the problems with the crop2 function that @yxchng had.