"terminate called after throwing an instance of 'c10::Error'"

Question

Closed this issue 4 years ago · 9 comments

Hello,

Thanks for sharing the code. When I ran

python train.py --config configs/maml/halfcheetah-vel.yaml --output-folder maml-halfcheetah-vel --seed 1 --num-workers 8

it gave me this error:
"terminate called after throwing an instance of 'c10::Error'"

I checked that all the requirements are satisfied. What could be the problem?

Thanks

Answer 1 · 2020-01-21T21:12:01.000Z

Hi, I have never encountered this error before, and unfortunately I have no idea what could cause it.
Does it happen right away when you launch the script, or after running for a while?

Answer 2 · 2020-01-22T00:29:49.000Z

Hello,

It happened a little while after I launched the script.

I think it has something to do with multiprocessing. Any guesses? Thanks

Answer 3 · 2020-01-25T02:25:29.000Z

This could be a multiprocessing issue. All the search results I get point to C++ code, so I'm afraid the error is internal to PyTorch, possibly in how it interacts with multiprocessing. I'm sorry I'm not of much help; I'll keep looking.

Answer 4 · 2020-04-16T03:44:01.000Z

@wyshi, I am getting the same error. Have you resolved the issue?

Answer 5 · 2020-04-17T14:29:50.000Z

Can you give more details about your setup (OS/Python version/PyTorch version)?

Answer 6 · 2020-04-21T16:40:21.000Z

Ubuntu 16.04.6 LTS / Python 3.7.6 / PyTorch 1.3.1

Answer 7 · 2020-04-24T13:10:32.000Z

Can someone provide the full traceback for this error, including whatever follows the first line (frame #1 ...)?

Answer 8 · 2020-05-22T03:18:57.000Z

This is the full traceback. Thanks!

terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: initialization error (setDevice at /opt/conda/conda-bld/pytorch_1579040055865/work/c10/cuda/impl/CUDAGuardImpl.h:42)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x47 (0x7efcc564c627 in /efs/qinsun/anaconda3/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: + 0xecf2 (0x7efcc5880cf2 in /efs/qinsun/anaconda3/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #2: torch::autograd::Engine::set_device(int) + 0x159 (0x7efccaf3c419 in /efs/qinsun/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch.so)
frame #3: torch::autograd::Engine::thread_init(int) + 0x1a (0x7efccaf3cd9a in /efs/qinsun/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch.so)
frame #4: torch::autograd::python::PythonEngine::thread_init(int) + 0x2a (0x7efcf6a98faa in /efs/qinsun/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #5: + 0xc819d (0x7efcf638519d in /efs/qinsun/anaconda3/lib/python3.7/site-packages/torch/../../../libstdc++.so.6)
frame #6: + 0x76ba (0x7efd057c56ba in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #7: clone + 0x6d (0x7efd054fb41d in /lib/x86_64-linux-gnu/libc.so.6)

Answer 9 · 2020-05-22T09:36:23.000Z

It looks like this is a CUDA error: the traceback reports "CUDA error: initialization error" raised from setDevice. It could be a problem with the multiprocessing context, and adding mp.set_start_method('spawn') might solve it.
I would suggest running the code on CPU instead (the networks are small enough that this shouldn't be a bottleneck). This code was not tested on GPU.
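
For anyone landing here later, the two suggestions above could look roughly like the sketch below. This is not code from this repository: square, run_workers and use_cpu_only are hypothetical names made up for illustration, and the standard multiprocessing module stands in for torch.multiprocessing, which exposes the same set_start_method and get_context calls. The assumption is that the crash comes from worker processes being forked after CUDA was already initialized in the parent, which the traceback suggests but does not prove.

```python
import multiprocessing as mp  # torch.multiprocessing mirrors this API
import os


def use_cpu_only():
    # Hide all GPUs from this process and its children. This has to run
    # before anything initializes CUDA, otherwise it has no effect.
    os.environ["CUDA_VISIBLE_DEVICES"] = ""


def square(x):
    # Hypothetical stand-in for the per-worker job. It lives at module level
    # because 'spawn' pickles the function by name.
    return x * x


def run_workers(num_workers=2):
    # An explicit 'spawn' context starts each worker as a fresh interpreter
    # instead of forking the parent process.
    ctx = mp.get_context("spawn")
    with ctx.Pool(num_workers) as pool:
        return pool.map(square, range(4))


if __name__ == "__main__":
    use_cpu_only()
    # Set the global start method once, before any worker is created.
    mp.set_start_method("spawn", force=True)
    print(run_workers())
```

With 'spawn', the entry point has to sit under the if __name__ == "__main__": guard, because every worker re-imports the main module on startup. Either change alone may be enough: hiding the GPUs avoids CUDA entirely, while 'spawn' keeps the workers from inheriting the parent's state.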