lcswillems/rl-starter-files

Broken pipe when training a model on CPU

oceank opened this issue · 11 comments

Hi,

I followed the instructions in README.md to train an A2C agent in the DoorKey environment using the following command (Python 3.7.3) on Ubuntu 18.04 with 8 CPUs.

python scripts/train.py --algo a2c --env MiniGrid-DoorKey-5x5-v0 --model DoorKey --save-interval 10 --frames 80000

The training went well initially but ended with a BrokenPipeError exception that crashed the training process. The error message is copied below. According to scripts/train.py, the above command runs with 16 processes. Initially, I thought the error occurred because the training spawned too many processes, but even with --procs=6 the same exception happened again. Only with --procs=1 did the training run successfully. Is there any special setting I need to enable training with multiple processes?

(I just realized that the error originates in torch_ac.)

Error Message

Exception ignored in: <function ParallelEnv.__del__ at 0x7f2df3411a60>
Traceback (most recent call last):
  File "~/torch-ac/torch_ac/utils/penv.py", line 41, in __del__
  File "~/anaconda3/lib/python3.7/multiprocessing/connection.py", line 206, in send
  File "~/anaconda3/lib/python3.7/multiprocessing/connection.py", line 404, in _send_bytes
  File "~/anaconda3/lib/python3.7/multiprocessing/connection.py", line 368, in _send
BrokenPipeError: [Errno 32] Broken pipe
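
For reference, the symptom itself (a BrokenPipeError raised by Connection.send) just means that the process on the other end of the multiprocessing pipe has already exited. A minimal, standalone Python sketch that reproduces the same error, using hypothetical code rather than anything from torch_ac:

import multiprocessing as mp

def worker(conn):
    conn.recv()    # handle a single message, then exit
    conn.close()

if __name__ == "__main__":
    parent_conn, child_conn = mp.Pipe()
    p = mp.Process(target=worker, args=(child_conn,))
    p.start()
    child_conn.close()             # the parent no longer needs its copy of the child end
    parent_conn.send("step")       # delivered while the worker is still alive
    p.join()                       # the worker exits and closes its end of the pipe
    parent_conn.send("terminate")  # expected to raise BrokenPipeError: [Errno 32] Broken pipe

So the traceback above suggests that ParallelEnv.__del__ tries to send to worker processes that are already gone.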

Thank you @oceank for raising this issue!!

However, I can't reproduce it...
Could you tell me at which point in the training it fails? Is it always at the same point? It seems there is an issue with the parallelization.
Could you also tell me which OS you are using? Which version of Python? Do you run it on a GPU?

I also don't currently have time to investigate this issue in depth, so if the error persists, I would advise you to try another library.

Hi, @lcswillems,

I ran the code on Ubuntu 18.04 with Python 3.7.3 and no GPU. I cannot yet tell where in the training the error is triggered. I will check it out.

Hi,

Just adding that I reproduced this with the same command on Ubuntu 18.04.4, Python 3.8.5 without GPU; I believe the broken pipe happens right at the end of training, as the output right before the exception is

U 40 | F 081920 | FPS 0362 | D 294 | rR:μσmM 0.93 0.03 0.81 0.97 | F:μσmM 20.8 8.5 8.0 52.0 | H 1.335 | V 0.807 | pL -0.016 | vL 0.002 | ∇ 0.035
Status saved

which is past the 80000-frame target, so training had already finished. I don't know much about this, but it might suggest that this is a minor bug (?).
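
If the crash really does happen only at the very end of training, one plausible pattern (a hedged sketch under my own assumptions, not the actual torch-ac code) is a ParallelEnv.__del__ that notifies its workers over their pipes during interpreter shutdown, after the worker processes have already exited. Guarding the send would be one defensive option:

import multiprocessing as mp

class ParallelEnvSketch:
    # Hypothetical stand-in for torch_ac's ParallelEnv, for illustration only.

    def __init__(self, num_workers, worker_fn):
        self.locals = []
        for _ in range(num_workers):
            local, remote = mp.Pipe()
            mp.Process(target=worker_fn, args=(remote,), daemon=True).start()
            self.locals.append(local)

    def __del__(self):
        for local in self.locals:
            try:
                local.send(("terminate", None))  # hypothetical shutdown message
            except OSError:  # includes BrokenPipeError
                pass  # the worker is already gone; nothing left to shut down

Again, the message name and class layout here are assumptions; the point is only that a send inside __del__ needs to tolerate workers that have already exited.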

I am sorry but I can't reproduce the bug... I tried @oceank's command, but there was no problem for me.
If somebody could give me the exact command that fails for them, along with their configuration, that would be great!

I recently started getting this error, and it happens right at the end of training.
I didn't get this error before. Not sure what the problem is.

@bharatprakash What do you mean by "recently"? Did it start yesterday?

@lcswillems Sorry I should have been more clear.
I have a copy of the repo (both this one and torch-ac) that I cloned a few months ago, and it works fine.

I cloned this repo (and torch-ac) again last week on the same server for a different experiment I'm doing, and now I see this error.

I see that there is a new commit on torch-ac, and I believe that's where @oceank and I see the error:
lcswillems/torch-ac@64833c6

Thank you for these details! It could be related to the commit you linked, but I am not able to reproduce the issue.

Do you have a way to reproduce it? If so, could you check out the commit before this one, lcswillems/torch-ac@64833c6, and tell me if you still get the error?

@lcswillems lcswillems/torch-ac@64833c6 works well. The broken pipe error is gone in my local run. Thanks for the help.

@oceank I reverted the commit. Could you tell me if the latest version of torch-ac works fine for you?

I am closing this issue because I think it is fixed. @oceank, if it isn't, please tell me and I will reopen it.