"terminate called after throwing an instance of 'c10::Error'"
Hello,
Thanks for sharing the code. When I ran python train.py --config configs/maml/halfcheetah-vel.yaml --output-folder maml-halfcheetah-vel --seed 1 --num-workers 8, it gave me this error:
"terminate called after throwing an instance of 'c10::Error'"
I checked that all the requirements are satisfied. What could be the problem?
Thanks
Hi, I have never encountered this error before, and I have no idea what could cause it unfortunately.
Does it happen right away when you launch the script, or after running for a while?
Hello,
It happened a little while after I launched the script.
I think it has something to do with the multiprocessing; any guesses? Thanks
This could be a multiprocessing issue. All the search results I get are related to some C++ code, so I'm afraid it could be internal to PyTorch (possibly in how it interacts with multiprocessing). I'm sorry I'm not of much help; I'll keep looking.
Can you give more details about your setup (OS/Python version/PyTorch version)?
Can someone provide the full traceback for this error, including what follows it (frame #1 ...)?
This is the full traceback. Thanks!
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: initialization error (setDevice at /opt/conda/conda-bld/pytorch_1579040055865/work/c10/cuda/impl/CUDAGuardImpl.h:42)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x47 (0x7efcc564c627 in /efs/qinsun/anaconda3/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: + 0xecf2 (0x7efcc5880cf2 in /efs/qinsun/anaconda3/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #2: torch::autograd::Engine::set_device(int) + 0x159 (0x7efccaf3c419 in /efs/qinsun/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch.so)
frame #3: torch::autograd::Engine::thread_init(int) + 0x1a (0x7efccaf3cd9a in /efs/qinsun/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch.so)
frame #4: torch::autograd::python::PythonEngine::thread_init(int) + 0x2a (0x7efcf6a98faa in /efs/qinsun/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #5: + 0xc819d (0x7efcf638519d in /efs/qinsun/anaconda3/lib/python3.7/site-packages/torch/../../../libstdc++.so.6)
frame #6: + 0x76ba (0x7efd057c56ba in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #7: clone + 0x6d (0x7efd054fb41d in /lib/x86_64-linux-gnu/libc.so.6)
It looks like this is a CUDA error. It could be a problem with the multiprocessing context, and adding mp.set_start_method('spawn') might solve this issue.
I would suggest running the code on CPU instead (the networks are small enough that this shouldn't be a bottleneck); this code was not tested on GPU.