Runtime error during phase 0 training
peiyunh opened this issue · 10 comments
Hi @dianchen96 and @bradyz
I am at stage 0 of training an image agent. There is a runtime error that looks related to a PyTorch bug with Python 3.5. I am able to train once I set num_workers=0, but I am wondering if you know another workaround that does not sacrifice training speed. Thanks!
Please find the error messages below.
(lbc) peiyunh@ubuntu:~/code/lbc/training$ CUDA_VISIBLE_DEVICES=0 PYTHONPATH="/home/peiyunh/software/CARLA_0.9.6/PythonAPI" python train_image_phase0.py --log_dir ../ckpts/image_phase0 --pretrained --teacher_path ../ckpts/priveleged/model-128.th --dataset_dir ../data
pygame 1.9.6
Hello from the pygame community. https://www.pygame.org/contribute.html
augment with None
Finished loading ../data/train. Length: 167789
augment with None
Finished loading ../data/val. Length: 52600
Loading ResNet weights from : https://download.pytorch.org/models/resnet34-333f7ec4.pth
Epoch: 0%| | 0/3 [00:00<?, ?it/sException in thread Thread-4: | 0/10 [00:00<?, ?it/s]
Traceback (most recent call last):
File "/home/peiyunh/miniconda3/envs/lbc/lib/python3.5/threading.py", line 914, in _bootstrap_inner
self.run()
File "/home/peiyunh/miniconda3/envs/lbc/lib/python3.5/threading.py", line 862, in run
self._target(*self._args, **self._kwargs)
File "/home/peiyunh/miniconda3/envs/lbc/lib/python3.5/multiprocessing/resource_sharer.py", line 139, in _serve
signal.pthread_sigmask(signal.SIG_BLOCK, range(1, signal.NSIG))
File "/home/peiyunh/miniconda3/envs/lbc/lib/python3.5/signal.py", line 60, in pthread_sigmask
sigs_set = _signal.pthread_sigmask(how, mask)
ValueError: signal number 32 out of range
[The same traceback repeats seven more times.]
Also, what FPS should we expect to reach during training? With num_workers=0, I get about 0.4 FPS for training and 0.8 FPS for validation. At this speed, phase 1 of training (256 epochs) takes 6 days to finish. Does that sound too slow? In your experience, how long does it take to train each phase? Thanks!
What PyTorch version do you have?
I tried both 1.5.1 and 1.0.0. Both run into this error when num_workers is set above zero.
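One way to check whether a given interpreter is affected, without starting a training run, is to replay the call that fails in the traceback (signal.pthread_sigmask over range(1, signal.NSIG)) inside a throwaway thread. The following is only a minimal sketch: sigmask_probe_ok and pick_num_workers are hypothetical helper names, not part of this repository or of PyTorch, and the fallback to 0 workers simply mirrors the workaround described above.

```python
import signal
import threading


def sigmask_probe_ok():
    """Return True if blocking signals 1..NSIG-1 succeeds in a helper thread.

    This repeats the call made by multiprocessing/resource_sharer.py in the
    traceback. The mask only applies to the helper thread, which exits
    immediately, so the main thread's signal handling is left untouched.
    """
    if not hasattr(signal, "pthread_sigmask"):
        # The call does not exist on this platform, so this particular
        # failure cannot occur here.
        return True

    result = []

    def probe():
        try:
            signal.pthread_sigmask(signal.SIG_BLOCK, range(1, signal.NSIG))
            result.append(True)
        except ValueError:
            # e.g. "signal number 32 out of range"
            result.append(False)

    thread = threading.Thread(target=probe)
    thread.start()
    thread.join()
    # Treat any unexpected failure of the probe as "not OK".
    return bool(result and result[0])


def pick_num_workers(requested):
    """Hypothetical helper: use `requested` workers only if the probe passes."""
    return requested if sigmask_probe_ok() else 0


num_workers = pick_num_workers(8)
print("num_workers =", num_workers)
```

The resulting value could then be passed as the num_workers argument of the data loader. If the probe fails, switching interpreters (as suggested in the reply below) is the real fix; the fallback only keeps training runnable in the meantime.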
That's odd. I was using Ubuntu 14.04/16.04 with Python 3.5 + PyTorch 1.2 when working on this project and never ran into this issue. Maybe try upgrading Python to 3.6?
Would I need a new egg file for using Python 3.6?
The 3.5 egg should be compatible with 3.6.
Thanks so much @dianchen96. Switching to Python 3.6 solves the issue. I am now able to train with num_workers=8. There seems to be a 4x speedup. This means phase 1 training will likely take 1.5 days to finish. Does that sound right to you?
That looks good. I'd recommend first trying a lower-epoch (e.g. 32) phase 1 model and seeing how it works. The phase 1 numbers listed on index.md come from a 32-epoch model. P.S. You might need to slightly tune the steering PID parameters.
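For reference, the time estimates in this exchange can be reproduced with a few lines of arithmetic. This is a rough sketch that assumes wall-clock time scales linearly with the number of epochs and inversely with the observed speedup; estimate_days is a hypothetical helper, and the inputs are the figures quoted in the thread (6 days for 256 epochs with num_workers=0, and a roughly 4x speedup with num_workers=8).

```python
def estimate_days(baseline_days, speedup, epochs, baseline_epochs):
    """Scale a measured training time to a new speedup and epoch count.

    Assumes time is proportional to epochs and inversely proportional to
    the speedup, which is only an approximation.
    """
    return baseline_days / speedup * epochs / baseline_epochs


# Full 256-epoch phase 1 run after the 4x speedup.
full_run_days = estimate_days(6.0, 4.0, 256, 256)

# Shorter 32-epoch phase 1 run at the same speed.
short_run_days = estimate_days(6.0, 4.0, 32, 256)

print("256 epochs: %.2f days" % full_run_days)
print("32 epochs: %.1f hours" % (short_run_days * 24))
```

Under these assumptions the 256-epoch run comes out at 1.5 days, matching the estimate above, and a 32-epoch run at about 4.5 hours.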
Great to know. Will try that.
Do you by any chance plan to release a checkpoint model for each phase? I am very interested in reproducing the performance and running diagnostics on the intermediate models. Having a reference would really help me make sure I am on the right track.
We have released our birdview and phase 2 checkpoints. We do not benchmark the phase 0 model, as its sole purpose is to make sure the gradients for phase 1 do not go NaN (due to the reprojection). For phase 1 model performance, you can refer to the numbers on index.md.