crack detection: runtime errors with train_model()
DogmaF opened this issue · 2 comments
DogmaF commented
RE the Jupyter file for the crack detection project: I'm getting runtime errors at cell [34] when I try to train the model. It seems to have something to do with signal handling. The last item in the error hierarchy is:
RuntimeError: DataLoader worker (pid 83316) is killed by signal: Unknown signal: 0.
To simplify debugging, I tried running it with zero epochs. Here are the error statements generated when I do that.
(I also found that I needed to add a line to import torchsummary, and move %matplotlib inline to the top of the import list to overcome other errors.)
This is on a Mac (macOS 10.15.4) with Python 3.7.6 and PyTorch 1.4.0.
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-34-51af14cba900> in <module>
1 base_model = train_model(resnet50, criterion, optimizer, exp_lr_scheduler, num_epochs=0)
----> 2 visualize_model(base_model)
3 plt.show()
<ipython-input-25-8be992550be9> in visualize_model(model, num_images)
6
7 with torch.no_grad():
----> 8 for i, (inputs, labels) in enumerate(dataloaders['val']):
9 inputs = inputs.to(device)
10 labels = labels.to(device)
~/opt/anaconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py in __iter__(self)
277 return _SingleProcessDataLoaderIter(self)
278 else:
--> 279 return _MultiProcessingDataLoaderIter(self)
280
281 @property
~/opt/anaconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py in __init__(self, loader)
744 # prime the prefetch loop
745 for _ in range(2 * self._num_workers):
--> 746 self._try_put_index()
747
748 def _try_get_data(self, timeout=_utils.MP_STATUS_CHECK_INTERVAL):
~/opt/anaconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py in _try_put_index(self)
870 return
871
--> 872 self._index_queues[worker_queue_idx].put((self._send_idx, index))
873 self._task_info[self._send_idx] = (worker_queue_idx,)
874 self._tasks_outstanding += 1
~/opt/anaconda3/lib/python3.7/multiprocessing/queues.py in put(self, obj, block, timeout)
85 with self._notempty:
86 if self._thread is None:
---> 87 self._start_thread()
88 self._buffer.append(obj)
89 self._notempty.notify()
~/opt/anaconda3/lib/python3.7/multiprocessing/queues.py in _start_thread(self)
157
158 # Start thread which transfers data from buffer to pipe
--> 159 self._buffer.clear()
160 self._thread = threading.Thread(
161 target=Queue._feed,
~/opt/anaconda3/lib/python3.7/site-packages/torch/utils/data/_utils/signal_handling.py in handler(signum, frame)
64 # This following call uses `waitid` with WNOHANG from C side. Therefore,
65 # Python can still get and update the process status successfully.
---> 66 _error_if_any_worker_fails()
67 if previous_handler is not None:
68 previous_handler(signum, frame)
RuntimeError: DataLoader worker (pid 83316) is killed by signal: Unknown signal: 0.
priya-dwivedi commented
The "DataLoader worker killed by signal" error is a catch-all that just indicates the worker process died; sorry, I know that doesn't help much. From my previous experience I strongly suspect a dependency issue. Try setting up a fresh environment with the latest PyTorch version and trying again.
…On Wed, Apr 1, 2020 at 5:11 PM Mike Fuller wrote:
DogmaT2 commented
Okay, thanks for the quick response! I will give it a try.