aws-samples/parallelize-ml-inference

Error in Running

Opened this issue · 2 comments

The GitHub code raises an error when run:

INFO:main:will process 10 images
INFO:utils:24 CPUs detected
INFO:utils:1 GPUs detected
INFO:main:Using GPU 0 on pid 11039
INFO:mxnet_model.mxnet_model_factory:MXNet model init
INFO:mxnet_model.mxnet_model_factory:Loading network parameters with prefix: ./resources/model/deploy_model_algo_1
[14:58:45] src/nnvm/legacy_json_util.cc:209: Loading symbol saved by previous version v1.4.0. Attempting to upgrade...
[14:58:45] src/nnvm/legacy_json_util.cc:217: Symbol successfully upgraded!
INFO:mxnet_model.mxnet_model_factory:Loading network into MXNet module and binding corresponding parameters
Process ForkPoolWorker-1:
Traceback (most recent call last):
File "/home/ctr_rantipov/incubator-mxnet/python/mxnet/symbol/symbol.py", line 1733, in simple_bind
ctypes.byref(exe_handle)))
File "/home/ctr_rantipov/incubator-mxnet/python/mxnet/base.py", line 254, in check_call
raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [14:58:45] src/engine/./../common/cuda_utils.h:305: Check failed: e == cudaSuccess || e == cudaErrorCudartUnloading: CUDA: initialization error
Stack trace:
[bt] (0) /home/ctr_rantipov/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x22) [0x7fb22475b1f2]
[bt] (1) /home/ctr_rantipov/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::common::cuda::DeviceStore::DeviceStore(int, bool)+0xc6) [0x7fb228393ba6]
[bt] (2) /home/ctr_rantipov/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::storage::GPUPooledStorageManager::Alloc(mxnet::Storage::Handle*)+0xd9) [0x7fb2283b2af9]

Could you help me please?

r2d3 commented

I found a solution: CUDA must not be initialized before the process fork, so you should import mxnet only inside the worker (computing) process.

The other change is the way the processes are launched; I added:

multiprocessing.set_start_method('forkserver', force=True)

to get my code working.
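A minimal sketch of the pattern described above (the function names and the stand-in import are illustrative, not the repo's actual code): the framework import is deferred into the worker so CUDA is first touched there, and the pool is created with the 'forkserver' start method so workers do not inherit an already-initialized CUDA context from the parent.

```python
import multiprocessing
import os

def predict(image_path):
    # Import the GPU framework lazily, inside the worker only. In the
    # real code this line would be `import mxnet as mx`; a stdlib module
    # stands in here so the sketch runs without a GPU.
    import json  # stand-in for `import mxnet as mx`
    # ... bind the model on mx.gpu(0) and run inference here ...
    return (os.getpid(), image_path)  # placeholder result

def main(image_paths):
    # 'forkserver' launches each worker from a fresh server process, so
    # no CUDA state created in the parent is inherited across a fork.
    multiprocessing.set_start_method('forkserver', force=True)
    with multiprocessing.Pool(processes=2) as pool:
        return pool.map(predict, image_paths)

if __name__ == '__main__':
    results = main(['img0.jpg', 'img1.jpg'])
    print([path for _, path in results])
```

Note that `set_start_method` must be called before the pool is created, and `force=True` lets it override a start method set earlier in the program.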

Thanks for your help.