dotchen/LearningByCheating

Runtime error during phase 0 training

peiyunh opened this issue · 10 comments

Hi @dianchen96 and @bradyz

I am at phase 0 of training an image agent and hit a runtime error that looks related to a PyTorch bug with Python 3.5. I am able to train once I set num_workers=0, but I am wondering if you know of a workaround that does not sacrifice training speed. Thanks!

Please find the error messages below.

(lbc) peiyunh@ubuntu:~/code/lbc/training$ CUDA_VISIBLE_DEVICES=0 PYTHONPATH="/home/peiyunh/software/CARLA_0.9.6/PythonAPI" python train_image_phase0.py --log_dir ../ckpts/image_phase0 --pretrained --teacher_path ../ckpts/priveleged/model-128.th --dataset_dir ../data
pygame 1.9.6
Hello from the pygame community. https://www.pygame.org/contribute.html
augment with  None
Finished loading ../data/train. Length: 167789
augment with  None
Finished loading ../data/val. Length: 52600
Loading ResNet weights from : https://download.pytorch.org/models/resnet34-333f7ec4.pth
Epoch:   0%|          | 0/3 [00:00<?, ?it/s]
  0%|          | 0/10 [00:00<?, ?it/s]
Exception in thread Thread-4:
Traceback (most recent call last):
  File "/home/peiyunh/miniconda3/envs/lbc/lib/python3.5/threading.py", line 914, in _bootstrap_inner
    self.run()
  File "/home/peiyunh/miniconda3/envs/lbc/lib/python3.5/threading.py", line 862, in run
    self._target(*self._args, **self._kwargs)
  File "/home/peiyunh/miniconda3/envs/lbc/lib/python3.5/multiprocessing/resource_sharer.py", line 139, in _serve
    signal.pthread_sigmask(signal.SIG_BLOCK, range(1, signal.NSIG))
  File "/home/peiyunh/miniconda3/envs/lbc/lib/python3.5/signal.py", line 60, in pthread_sigmask
    sigs_set = _signal.pthread_sigmask(how, mask)
ValueError: signal number 32 out of range

Also, what FPS should we expect to reach during training? With num_workers=0, I get about 0.4 FPS for training and 0.8 FPS for validation. At this speed, phase 1 of training (256 epochs) takes 6 days to finish. Does that sound too slow? In your experience, how long does it take to train each phase? Thanks!

What PyTorch version do you have?

I tried both 1.5.1 and 1.0.0. Both run into this error when num_workers is set above zero.

That's odd. I was using Ubuntu 14.04/16.04 with Python 3.5 + PyTorch 1.2 when working on this project and never ran into this issue. Maybe try upgrading Python to 3.6?
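Every frame in the traceback above is in the Python standard library (threading.py, multiprocessing/resource_sharer.py, signal.py), so the failing call can be checked without PyTorch. Below is a minimal sketch that repeats that call and restores the signal mask afterwards. The helper names `sigmask_call_works` and `pick_num_workers` are hypothetical and not part of this repository, and the fallback to 0 workers simply mirrors the workaround mentioned in the issue.

```python
import signal
import sys


def sigmask_call_works():
    """Repeat the call from the traceback and report whether it succeeds.

    Returns a (ok, message) tuple. The signal mask is restored afterwards.
    """
    if not hasattr(signal, "pthread_sigmask"):
        # pthread_sigmask only exists on Unix-like platforms.
        return True, "pthread_sigmask not available; check not applicable"
    try:
        # Same arguments as multiprocessing/resource_sharer.py in the traceback.
        previous = signal.pthread_sigmask(
            signal.SIG_BLOCK, range(1, signal.NSIG)
        )
    except ValueError as exc:
        # On the affected setup this is "signal number 32 out of range".
        return False, str(exc)
    # Put the original mask back so the check has no lasting effect.
    signal.pthread_sigmask(signal.SIG_SETMASK, previous)
    return True, "ok"


def pick_num_workers(requested):
    """Hypothetical helper: use 0 workers when the check fails."""
    ok, _ = sigmask_call_works()
    return requested if ok else 0


if __name__ == "__main__":
    ok, message = sigmask_call_works()
    print("Python", sys.version.split()[0], "->", message)
    print("num_workers:", pick_num_workers(8))
```

If the check fails in the training environment, the problem is in that interpreter rather than in the training script, and a different Python version is worth trying before changing any code.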

Would I need a new egg file to use Python 3.6?

The 3.5 egg should be compatible with 3.6.

Thanks so much @dianchen96. Switching to Python 3.6 solves the issue. I am now able to train with num_workers=8. There seems to be a 4x speedup. This means phase 1 training will likely take 1.5 days to finish. Does that sound right to you?

That looks good. I'd recommend first training a phase 1 model for fewer epochs (e.g. 32) and seeing how it performs. The phase 1 numbers listed in index.md come from a 32-epoch model. P.S. You might need to slightly tune the steering PID parameters.
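The time estimates in this thread can be reproduced with simple arithmetic. The sketch below assumes that training time scales linearly with the number of epochs and inversely with the data-loading speedup; both are simplifications, and the function name `estimate_training_days` is hypothetical. The inputs (6 days for 256 epochs, a 4x speedup, a 32-epoch run) are the figures quoted above.

```python
def estimate_training_days(reference_days, reference_epochs, target_epochs, speedup):
    """Scale a measured training time to a new epoch count and speedup.

    Assumes time is proportional to epochs and inversely proportional
    to the speedup factor.
    """
    if reference_epochs <= 0 or speedup <= 0:
        raise ValueError("reference_epochs and speedup must be positive")
    return reference_days * (target_epochs / reference_epochs) / speedup


if __name__ == "__main__":
    # 256 epochs took about 6 days with num_workers=0; num_workers=8 was ~4x faster.
    full_run = estimate_training_days(6.0, 256, 256, 4.0)
    print("256 epochs:", full_run, "days")

    # A 32-epoch run under the same assumptions.
    short_run = estimate_training_days(6.0, 256, 32, 4.0)
    print("32 epochs:", short_run * 24, "hours")
```

Under these assumptions the full 256-epoch run comes to 1.5 days, matching the estimate in the thread, and a 32-epoch run comes to 4.5 hours.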

Great to know. Will try that.

Do you by any chance plan to release a checkpoint model for each phase? I am very interested in reproducing the performance and running diagnostics on the intermediate models. Having a reference would be really helpful for me to make sure I am on the right track.

We have released our birdview and phase 2 checkpoints. We do not benchmark the phase 0 model, as its sole purpose is to make sure the gradients for phase 1 do not go NaN (due to the reprojection). For phase 1 model performance, you can refer to the numbers in index.md.