tristandeleu/pytorch-maml-rl

"terminate called after throwing an instance of 'c10::Error'"

Closed this issue · 9 comments

wyshi commented

Hello,

Thanks for sharing the code. When I ran python train.py --config configs/maml/halfcheetah-vel.yaml --output-folder maml-halfcheetah-vel --seed 1 --num-workers 8, it gave me this error:

"terminate called after throwing an instance of 'c10::Error'"

I checked that all the requirements are satisfied. What could be the problem?

Thanks

Hi, I have never encountered this error before, and unfortunately I have no idea what could cause it.
Does it happen right away when you launch the script, or after running for a while?

wyshi commented

Hello,

It happened a little while after I launched the script.

I think it has something to do with multiprocessing. Any guesses? Thanks

This could be a multiprocessing issue. All the search results I get relate to C++ code, so I'm afraid the error may be internal to PyTorch (and to how it interacts with multiprocessing). I'm sorry I'm not of much help; I'll keep looking.

@wyshi, I am getting the same error. Have you resolved the issue?

Can you give more details about your setup (OS/Python version/PyTorch version)?

Can someone provide the full traceback for this error, including, if possible, whatever follows it (frame #1 ...)?

This is the full traceback. Thanks!

terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: initialization error (setDevice at /opt/conda/conda-bld/pytorch_1579040055865/work/c10/cuda/impl/CUDAGuardImpl.h:42)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x47 (0x7efcc564c627 in /efs/qinsun/anaconda3/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: + 0xecf2 (0x7efcc5880cf2 in /efs/qinsun/anaconda3/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #2: torch::autograd::Engine::set_device(int) + 0x159 (0x7efccaf3c419 in /efs/qinsun/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch.so)
frame #3: torch::autograd::Engine::thread_init(int) + 0x1a (0x7efccaf3cd9a in /efs/qinsun/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch.so)
frame #4: torch::autograd::python::PythonEngine::thread_init(int) + 0x2a (0x7efcf6a98faa in /efs/qinsun/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #5: + 0xc819d (0x7efcf638519d in /efs/qinsun/anaconda3/lib/python3.7/site-packages/torch/../../../libstdc++.so.6)
frame #6: + 0x76ba (0x7efd057c56ba in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #7: clone + 0x6d (0x7efd054fb41d in /lib/x86_64-linux-gnu/libc.so.6)

It looks like this is a CUDA error. It could be a problem with the multiprocessing context, and maybe adding mp.set_start_method('spawn') would solve this issue.
I would suggest running the code on CPU instead (the networks are small enough that this shouldn't be a bottleneck); this code was not tested on GPU.
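
For anyone who wants to try both suggestions, here is a minimal sketch. It is not taken from this repository: the helper names use_spawn_start_method and hide_gpus are hypothetical, and where exactly to call them in train.py is an assumption. It uses the standard-library multiprocessing module; torch.multiprocessing wraps that module and should accept the same set_start_method call. The CPU part relies on CUDA_VISIBLE_DEVICES, which hides all GPUs from CUDA when set to an empty string before CUDA is initialized.

```python
import multiprocessing as mp
import os


def use_spawn_start_method():
    """Hypothetical helper: start workers as fresh interpreters, not forked copies."""
    # A forked child that touches CUDA after the parent has initialized it is a
    # commonly reported cause of "CUDA error: initialization error".
    # force=True allows the call even if a start method was already chosen.
    mp.set_start_method("spawn", force=True)
    return mp.get_start_method()


def hide_gpus():
    """Hypothetical helper: make CUDA see no devices, so everything runs on CPU."""
    # Must be set before CUDA is initialized (safest: before importing torch).
    os.environ["CUDA_VISIBLE_DEVICES"] = ""
    return os.environ["CUDA_VISIBLE_DEVICES"]


if __name__ == "__main__":
    # Call both before any worker process, queue, or CUDA tensor is created.
    hide_gpus()
    print(use_spawn_start_method())
```

Either call on its own may be enough; the CPU route is the one closer to how this code was tested.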