Does not run in PyTorch 1.3.1
michaelklachko opened this issue
I just cloned your repo, and when I launch this command:
CUDA_VISIBLE_DEVICES=2,3,4,5 python imagenet.py -a mobilenetv2 -d /path/to/dataset/ImageNet2012/ --epochs 150 --lr-decay cos --lr 0.05 --wd 4e-5 -c checkpoints --width-mult 1 --input-size 224 -j 12
It gets stuck at this point:
=> creating model 'mobilenetv2'
Epoch: [1 | 150]
Processing
<Ctrl+C pressed after 10 minutes of nothing happening>
^CTraceback (most recent call last):
  File "imagenet.py", line 403, in <module>
    main()
  File "imagenet.py", line 224, in main
    train_loss, train_acc = train(train_loader, train_loader_len, model, criterion, optimizer, epoch)
  File "imagenet.py", line 271, in train
    for i, (input, target) in enumerate(train_loader):
  File "/home/michael/mobilenetv2.pytorch/utils/dataloaders.py", line 190, in prefetched_loader
    for next_input, next_target in loader:
  File "/home/michael/miniconda2/envs/pt/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 804, in __next__
    idx, data = self._get_data()
  File "/home/michael/miniconda2/envs/pt/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 761, in _get_data
    success, data = self._try_get_data()
  File "/home/michael/miniconda2/envs/pt/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 724, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/home/michael/miniconda2/envs/pt/lib/python3.7/queue.py", line 179, in get
    self.not_empty.wait(remaining)
  File "/home/michael/miniconda2/envs/pt/lib/python3.7/threading.py", line 300, in wait
    gotit = waiter.acquire(True, timeout)
KeyboardInterrupt
Nothing happens at this point: nvidia-smi shows a single GPU consuming ~500 MB of memory, and the CPU cores are ~60% busy, but it's not clear what they are doing. I waited 10 minutes before aborting. I also tried it on a single GPU - same issue.
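For reference, here is the kind of minimal check I would use to tell whether the hang is inside the DataLoader worker processes (a sketch assuming a plain torchvision ImageFolder pipeline rather than your prefetched_loader; the path and transforms are placeholders):

```python
# Minimal sketch to check whether the hang comes from DataLoader workers.
# Assumes a plain torchvision ImageFolder pipeline, not the repo's
# prefetched_loader; train_dir is a placeholder path.
import torch
import torchvision.datasets as datasets
import torchvision.transforms as transforms

train_dir = '/path/to/dataset/ImageNet2012/train'
dataset = datasets.ImageFolder(
    train_dir,
    transforms.Compose([
        transforms.RandomResizedCrop(224),
        transforms.ToTensor(),
    ]))

# num_workers=0 loads everything in the main process; if this iterates
# fine but num_workers > 0 hangs, the problem is in the worker processes.
loader = torch.utils.data.DataLoader(
    dataset, batch_size=64, shuffle=True, num_workers=0)

for i, (input, target) in enumerate(loader):
    print(i, input.shape, target.shape)
    if i == 2:
        break
```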
If I switch to --data-backend dali-cpu (using nvidia-dali version 0.16), it fails with the following error:
=> creating model 'mobilenetv2'
Traceback (most recent call last):
  File "imagenet.py", line 403, in <module>
    main()
  File "imagenet.py", line 194, in main
    train_loader, train_loader_len = get_train_loader(args.data, args.batch_size, workers=args.workers, input_size=args.input_size)
TypeError: gdtl() got an unexpected keyword argument 'input_size'
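My guess (unverified) is that for the DALI backend, get_train_loader is bound to a function (gdtl in the traceback) whose signature doesn't accept the input_size keyword the script now passes. A hypothetical shim that swallows the extra keyword, purely as a sketch; gdtl's real signature in utils/dataloaders.py may differ:

```python
# Hypothetical workaround sketch: wrap the DALI loader-getter so the
# input_size keyword the training script passes is dropped instead of
# raising a TypeError. 'gdtl' is the name from the traceback; its real
# signature in utils/dataloaders.py may differ, so treat this as a guess.
def ignore_input_size(loader_fn):
    def wrapped(*args, input_size=224, **kwargs):
        # Drop input_size; presumably the DALI pipeline fixes the crop
        # size elsewhere (assumption, not verified against the repo).
        return loader_fn(*args, **kwargs)
    return wrapped

# e.g. in main(): get_train_loader = ignore_input_size(gdtl)
```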
I'm using PyTorch 1.3.1 with 4x Titan Xp cards. The only thing I had to change in your code was to replace cuda(async=True) with cuda(non_blocking=True), since async is a reserved keyword in Python 3.7. Changing to non_blocking=False does not help.
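Concretely, the change I made looks like this (a sketch; the exact call sites in imagenet.py may differ):

```python
# Before (a SyntaxError on Python 3.7+, where 'async' is reserved):
#     input = input.cuda(async=True)
# After:
input = input.cuda(non_blocking=True)
target = target.cuda(non_blocking=True)
```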
Can you please try cloning your repo into a clean PyTorch 1.3.1 environment and see if you can run it? Any idea what's going on?