dkkim93/meta-mapg

TypeError: can't pickle _thread.RLock objects

Waiting-TT opened this issue · 5 comments

When training the model, I encountered the following error.
Traceback (most recent call last):
File "/home/xxx/ww/meta-mapg-main/main.py", line 163, in
main(args=args)
File "/home/xxx/ww/meta-mapg-main/main.py", line 60, in main
p.start() # TypeError: can't pickle _thread.RLock objects
File "/home/xxx/anaconda3/envs/mapg/lib/python3.6/multiprocessing/process.py", line 105, in start
self._popen = self._Popen(self)
File "/home/xxx/anaconda3/envs/mapg/lib/python3.6/multiprocessing/context.py", line 223, in _Popen
return _default_context.get_context().Process._Popen(process_obj)
File "/home/xxx/anaconda3/envs/mapg/lib/python3.6/multiprocessing/context.py", line 284, in _Popen
return Popen(process_obj)
File "/home/xxx/anaconda3/envs/mapg/lib/python3.6/multiprocessing/popen_spawn_posix.py", line 32, in init
super().init(process_obj)
File "/home/xxx/anaconda3/envs/mapg/lib/python3.6/multiprocessing/popen_fork.py", line 19, in init
self._launch(process_obj)
File "/home/xxx/anaconda3/envs/mapg/lib/python3.6/multiprocessing/popen_spawn_posix.py", line 47, in _launch
reduction.dump(process_obj, fp)
File "/home/xxx/anaconda3/envs/mapg/lib/python3.6/multiprocessing/reduction.py", line 60, in dump
ForkingPickler(file, protocol).dump(obj)
TypeError: can't pickle _thread.RLock objects
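
For reference, the same error can be reproduced in isolation: under the spawn start method (note popen_spawn_posix.py in the traceback), every argument handed to a Process is pickled, and pickling fails as soon as one of them owns a thread lock. Below is a minimal standalone sketch; the Logger class is hypothetical, standing in for any lock-holding object such as a log handler or TensorBoard writer.

import threading
import multiprocessing as mp

class Logger:
    """Hypothetical stand-in for any object that owns a thread lock."""
    def __init__(self):
        self._lock = threading.RLock()

def worker(log):
    pass

if __name__ == "__main__":
    ctx = mp.get_context("spawn")  # spawn pickles every Process argument
    p = ctx.Process(target=worker, args=(Logger(),))
    p.start()  # raises: TypeError: can't pickle _thread.RLock objects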

Could you please tell me how to solve this problem? Thanks a lot!

Hello! I appreciate your interest in our paper. :-)
I just ran the code on a GCP server with virtualenv (python version 3.6) and did not experience the above issue.

Instead of conda, would it be possible to try again with virtualenv (python version 3.6)?
Additionally, could you share the requirements.txt file from your virtual environment with me and check whether the versions match the ones in this file?

The above issue might be caused by a version mismatch between libraries. Thanks!

Thank you so much! This is my environment:
requirements.txt

I found that "shared_meta_agent" and "log" in main.py line 48 can't use pickle. I wonder if this is the reason, and how to solve it? Thanks again!
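
For reference, here is roughly how one can check which objects fail to pickle before starting the workers. This is a throwaway sketch; check_picklable is a hypothetical helper, and the commented usage names the objects from main.py.

import pickle

def check_picklable(**objects):
    """Try to pickle each named object and report the ones that fail."""
    for name, obj in objects.items():
        try:
            pickle.dumps(obj)
            print(name, "is picklable")
        except (TypeError, pickle.PicklingError) as err:
            print(name, "is NOT picklable:", err)

# Hypothetical usage inside main.py:
# check_picklable(shared_meta_agent=shared_meta_agent, log=log)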

absl-py==1.4.0
cachetools==4.2.4
certifi==2021.5.30
cffi==1.15.1
charset-normalizer==2.0.12
Cython==0.29.34
dataclasses==0.8
distlib==0.3.6
fasteners==0.18
filelock==3.4.1
gitdb==4.0.9
gitdb2==4.0.2
GitPython==3.0.8
glfw==2.5.9
google-auth==2.17.3
google-auth-oauthlib==0.4.6
grpcio==1.48.2
gym==0.12.5
idna==3.4
imageio==2.15.0
importlib-metadata==4.8.3
importlib-resources==5.4.0
Markdown==3.3.7
mujoco-py==2.1.2.14
numpy==1.19.5
oauthlib==3.2.2
Pillow==8.4.0
platformdirs==2.4.0
protobuf==3.19.6
pyasn1==0.5.0
pyasn1-modules==0.3.0
pycparser==2.21
pyglet==2.0.5
PyYAML==3.12
requests==2.27.1
requests-oauthlib==1.3.1
rsa==4.9
scipy==1.5.4
six==1.16.0
smmap==5.0.0
tensorboard==2.10.1
tensorboard-data-server==0.6.1
tensorboard-plugin-wit==1.8.1
tensorboardX==1.2
torch==1.4.0
typing_extensions==4.1.1
urllib3==1.26.15
virtualenv==20.17.1
Werkzeug==2.0.3
zipp==3.6.0

My Python version is 3.6.5.

In our paper, we use distributed training to speed up the meta-optimization: the shared_meta_agent is shared between multiple processes and updated asynchronously.
The issue you experienced above is related to this multiprocessing part.
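
To make the pattern concrete, here is a minimal hogwild-style sketch of a model shared across processes and updated asynchronously. It is not the actual meta-mapg training loop; the linear model, sizes, and dummy loss are placeholders.

import torch
import torch.nn as nn
import torch.multiprocessing as mp  # PyTorch's wrapper, needed for shared models

def worker(rank, shared_model):
    # Each worker trains a local copy and pushes its gradients into the
    # shared parameters, then steps without any locking (hogwild style).
    local_model = nn.Linear(4, 2)
    local_model.load_state_dict(shared_model.state_dict())
    optimizer = torch.optim.SGD(shared_model.parameters(), lr=1e-2)

    loss = local_model(torch.randn(8, 4)).pow(2).mean()  # dummy objective
    loss.backward()
    for shared_p, local_p in zip(shared_model.parameters(), local_model.parameters()):
        shared_p.grad = local_p.grad  # copy gradients into the shared model
    optimizer.step()  # asynchronous update of the shared parameters

if __name__ == "__main__":
    shared_model = nn.Linear(4, 2)
    shared_model.share_memory()  # move parameters into shared memory
    processes = [mp.Process(target=worker, args=(rank, shared_model)) for rank in range(2)]
    for p in processes:
        p.start()
    for p in processes:
        p.join()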

I would like to ask the following questions to better understand why our code is not working in your environment:

  1. We tested our code on a Linux server (Ubuntu 20.04). Which OS are you using?
  2. Judging by the error message above, the pickle issue arises directly from Python's multiprocessing library (/home/xxx/anaconda3/envs/mapg/lib/python3.6/multiprocessing/process.py) and does not go through PyTorch's multiprocessing library. In main.py, could you double-check whether you are using import torch.multiprocessing as mp instead of import multiprocessing as mp? Because we share the PyTorch model across processes, we need to use torch.multiprocessing (see the snippet after this list).
  3. Lastly, the distributed training part is implemented based on the popular A3C code (repository). Could you double-check whether your environment can run that A3C code? Like ours, the A3C code also uses torch.multiprocessing (link) and share_memory (link) to enable distributed training.
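
For point 2, the import at the top of main.py should look like the first line below. torch.multiprocessing is a drop-in replacement for the standard module that registers picklers aware of shared-memory tensors (a sketch, not a quote from the repository):

import torch.multiprocessing as mp  # understands shared tensors and models
# import multiprocessing as mp      # standard pickler; fails on shared models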

Thanks!

I will close this issue :) If this issue remains, please feel free to re-open. Thank you.