MLOPTPSU/FedTorch

Does BrokenPipeError Matter?


Hi, I am very interested in your proposed method, and I appreciate you sharing such a helpful and clear repo. I followed the repo and used your Docker images to run the code, but I keep getting the following error, which first appears at a random round of training:

Traceback (most recent call last):
  File "/usr/lib/python3.8/multiprocessing/queues.py", line 245, in _feed
    send_bytes(obj)
  File "/usr/lib/python3.8/multiprocessing/connection.py", line 200, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/usr/lib/python3.8/multiprocessing/connection.py", line 411, in _send_bytes
    self._send(header + buf)
  File "/usr/lib/python3.8/multiprocessing/connection.py", line 368, in _send
    n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe

Training still finishes and I get a result, but I wonder whether the BrokenPipeError matters. Are the results I am getting still correct?
Hope to hear from you soon.

Hi, thanks for your interest. Generally, this error does not affect the training result, but it may slow down the training process. The cause is that you are requesting more processes (nodes) than your CPU can handle well. The code uses MPI's oversubscription option to allow more processes to run than there are cores, and that can delay training overall. Consider running on a CPU with more cores or on multiple CPU nodes.
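If it helps, here is a rough pre-flight check you could run before launching. It is only a sketch and not part of FedTorch: the function names and the example process count are hypothetical, and it simply compares the number of processes you plan to request against the cores Python can see.

```python
import os


def oversubscription_factor(num_processes, num_cores=None):
    """Return requested processes per core; values above 1.0 mean oversubscription.

    num_cores defaults to os.cpu_count(), which reports logical CPUs on the host.
    """
    if num_cores is None:
        # os.cpu_count() can return None if the count cannot be determined.
        num_cores = os.cpu_count() or 1
    if num_processes < 1 or num_cores < 1:
        raise ValueError("num_processes and num_cores must both be >= 1")
    return num_processes / num_cores


def is_oversubscribed(num_processes, num_cores=None):
    """True when more processes are requested than there are cores."""
    return oversubscription_factor(num_processes, num_cores) > 1.0


if __name__ == "__main__":
    requested = 20  # hypothetical number of MPI processes you plan to launch
    factor = oversubscription_factor(requested)
    if is_oversubscribed(requested):
        print(f"{requested} processes on {os.cpu_count()} cores "
              f"({factor:.1f}x): expect slower rounds.")
    else:
        print(f"{requested} processes fit on {os.cpu_count()} cores.")
```

Note that `os.cpu_count()` counts logical CPUs and does not necessarily reflect CPU limits applied to a Docker container, so treat the result as an upper bound on what is actually available.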