Broken pipe error
amiltonwong opened this issue · 3 comments
Hi all,
I got the following broken pipe error when training reached the final epoch (epoch=499):
in epoch 498
max_epoch 500
**** EPOCH 498 ****
2019-02-26 05:38:35.735413
Progress: [##########] 100%mean loss: 0.082965
Overall accuracy : 0.991698
Average IoU : 0.963038
IoU of man-made terrain : 0.970558
IoU of natural terrain : 0.981679
IoU of high vegetation : 0.994769
IoU of low vegetation : 0.937876
IoU of buildings : 0.993800
IoU of hard scape : 0.939500
IoU of scanning artifact : 0.923614
IoU of cars : 0.962506
in epoch 499
max_epoch 500
**** EPOCH 499 ****
2019-02-26 05:39:40.278005
Progress: [##########] 100%mean loss: 0.077413
Overall accuracy : 0.992196
Average IoU : 0.962089
IoU of man-made terrain : 0.971449
IoU of natural terrain : 0.982647
IoU of high vegetation : 0.996347
IoU of low vegetation : 0.935048
IoU of buildings : 0.994343
IoU of hard scape : 0.937210
IoU of scanning artifact : 0.921430
IoU of cars : 0.958242
Process ForkPoolWorker-1:1:
Traceback (most recent call last):
  File "/root/anaconda3/envs/tf/lib/python3.6/multiprocessing/pool.py", line 125, in worker
    put((job, i, result))
  File "/root/anaconda3/envs/tf/lib/python3.6/multiprocessing/queues.py", line 347, in put
    self._writer.send_bytes(obj)
  File "/root/anaconda3/envs/tf/lib/python3.6/multiprocessing/connection.py", line 200, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/root/anaconda3/envs/tf/lib/python3.6/multiprocessing/connection.py", line 397, in _send_bytes
    self._send(header)
  File "/root/anaconda3/envs/tf/lib/python3.6/multiprocessing/connection.py", line 368, in _send
    n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/root/anaconda3/envs/tf/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/root/anaconda3/envs/tf/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/root/anaconda3/envs/tf/lib/python3.6/multiprocessing/pool.py", line 130, in worker
    put((job, i, (False, wrapped)))
  File "/root/anaconda3/envs/tf/lib/python3.6/multiprocessing/queues.py", line 347, in put
    self._writer.send_bytes(obj)
  File "/root/anaconda3/envs/tf/lib/python3.6/multiprocessing/connection.py", line 200, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/root/anaconda3/envs/tf/lib/python3.6/multiprocessing/connection.py", line 404, in _send_bytes
    self._send(header + buf)
  File "/root/anaconda3/envs/tf/lib/python3.6/multiprocessing/connection.py", line 368, in _send
    n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe
(tf) root@milton-ThinkCentre-M93p:/data/code8/Open3D-PointNet2-Semantic3D#
Is something misconfigured, or can I safely ignore this error?
My environment is:
tensorflow 1.12
cuda 9.0 + cudnn 7.5
Yeah, this is a known issue, potentially due to a subprocess not being properly terminated. The results should be fine, though, as it only happens after the final iteration.
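If the cause is indeed a worker that is still writing a result while the parent process shuts down, a cleaner shutdown might look like the sketch below. It is not taken from train.py: `run_jobs` is a hypothetical helper, the built-in `pow` stands in for real work, and the "fork" start method is assumed only because the traceback names a `ForkPoolWorker`.

```python
import multiprocessing as mp


def run_jobs(values, processes=2):
    """Hypothetical helper: square each value in a worker pool, then shut down cleanly."""
    # Assumption: "fork" start method (POSIX only), as suggested by
    # the ForkPoolWorker name in the traceback.
    ctx = mp.get_context("fork")
    pool = ctx.Pool(processes)
    try:
        # Submit all jobs; pow is a placeholder for the real workload.
        pending = [pool.apply_async(pow, (v, 2)) for v in values]
        # Collect every result before shutting the pool down, so no worker
        # is left writing into a pipe that the parent has already closed.
        results = [job.get(timeout=30) for job in pending]
    finally:
        pool.close()  # no further tasks will be submitted
        pool.join()   # wait for the worker processes to exit
    return results


if __name__ == "__main__":
    print(run_jobs([1, 2, 3]))  # [1, 4, 9]
```

The point of the sketch is the ordering: results are read first, then `close()` and `join()` let the workers exit on their own instead of being cut off when the parent exits.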
I also encountered the same problem, but it happened at epoch 25. How can I solve it?
One solution may be to use ThreadPool instead of Pool in the training script, for example in the fill_queues function in train.py:
from multiprocessing.pool import ThreadPool
import multiprocessing as mp

# Portion with the fill_queues function
def fill_queues(
    stack_train, stack_validation, num_train_batches, num_validation_batches
):
    """
    Args:
        stack_train: mp.Queue to be filled asynchronously
        stack_validation: mp.Queue to be filled asynchronously
        num_train_batches: total number of training batches
        num_validation_batches: total number of validation batches
    """
    # Use a thread pool instead of a process pool
    pool = ThreadPool(mp.cpu_count())
    # Fill in remaining code
See if that works.
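To illustrate why the swap can help: a ThreadPool runs its workers as threads inside the training process, so results are not sent back through an inter-process pipe. The following is only a rough sketch under that assumption, not the actual train.py code; `fill_queue` and `produce` are hypothetical names, and plain integers stand in for batches.

```python
import queue
from multiprocessing.pool import ThreadPool


def fill_queue(q, num_batches):
    """Hypothetical producer: push placeholder batch indices into q."""
    for i in range(num_batches):
        q.put(i)
    return num_batches


def produce(num_train_batches, num_validation_batches):
    """Fill a training and a validation queue from two worker threads."""
    train_q, val_q = queue.Queue(), queue.Queue()
    with ThreadPool(2) as pool:
        jobs = [
            pool.apply_async(fill_queue, (train_q, num_train_batches)),
            pool.apply_async(fill_queue, (val_q, num_validation_batches)),
        ]
        # Wait for both producers before the pool is torn down.
        counts = [job.get(timeout=30) for job in jobs]
    return counts, train_q.qsize(), val_q.qsize()


if __name__ == "__main__":
    print(produce(3, 2))  # ([3, 2], 3, 2)
```

One trade-off to keep in mind: threads share the interpreter lock, so if batch preparation is mostly pure-Python computation, a ThreadPool may be slower than a process Pool.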