CUDNN_STATUS_INTERNAL_ERROR while running main.py
Rubikplayer opened this issue · 4 comments
Hi, thanks for the previous feedback in another thread. After I set up CUDA 8.0 / cuDNN 5.1 and Theano 0.9, I can run part of main.py. But there is still an error when executing the patch2embedding() function in the early-rejection stage.
More specifically:
Traceback (most recent call last):
File "./main.py", line 27, in <module>
save_npz_file_path = main_reconstruct.reconstruction(datasetFolder, _model, imgNamePattern, poseNamePattern, outputFolder, N_viewPairs4inference, resol, BB, viewList)
File "/home/ICT2000/tli/Workspace/SurfaceNet/main_reconstruct.py", line 77, in reconstruction
cubeCenter_hw = np.stack([img_h_cubesCenter, img_w_cubesCenter], axis=0)) # (N_cubes, N_views, D_embedding), (N_cubes, N_views)
File "./utils/earlyRejection.py", line 31, in patch2embedding
patches_embedding[:,:] = patch2embedding_fn(patch_allBlack)[0] # don't use np.repeat (out of memory)
File "/home/ICT2000/tli/.conda/envs/SurfaceNet/lib/python2.7/site-packages/theano/compile/function_module.py", line 898, in __call__
storage_map=getattr(self.fn, 'storage_map', None))
File "/home/ICT2000/tli/.conda/envs/SurfaceNet/lib/python2.7/site-packages/theano/gof/link.py", line 325, in raise_with_op
reraise(exc_type, exc_value, exc_trace)
File "/home/ICT2000/tli/.conda/envs/SurfaceNet/lib/python2.7/site-packages/theano/compile/function_module.py", line 884, in __call__
self.fn() if output_subset is None else\
RuntimeError: error doing operation: CUDNN_STATUS_INTERNAL_ERROR
Apply node that caused the error: GpuDnnConv{algo='small', inplace=False}(GpuContiguous.0, GpuContiguous.0, GpuAllocEmpty{dtype='float32', context_name=None}.0, GpuDnnConvDesc{border_mode=(1, 1), subsample=(1, 1), conv_mode='cross', precision='float32'}.0, Cast{float32}.0, Cast{float32}.0)
Toposort index: 276
Inputs types: [GpuArrayType<None>(float32, (False, False, False, False)), GpuArrayType<None>(float32, (False, False, False, False)), GpuArrayType<None>(float32, (False, False, False, False)), <theano.gof.type.CDataType object at 0x7fbd6848bc90>, Scalar(float32), Scalar(float32)]
Inputs shapes: [(1, 3, 64, 64), (64, 3, 3, 3), (1, 64, 64, 64), 'No shapes', (), ()]
Inputs strides: [(49152, 16384, 256, 4), (108, 36, 12, 4), (1048576, 16384, 256, 4), 'No strides', (), ()]
Inputs values: ['not shown', 'not shown', 'not shown', <capsule object NULL at 0x7fbb43bd10c0>, 1.0, 0.0]
Inputs type_num: [11, 11, 11, '', 11, 11]
Outputs clients: [[HostFromGpu(gpuarray)(GpuDnnConv{algo='small', inplace=False}.0)]]
A detailed error log can be seen here:
err_log.txt
I have tried:
- Deleting the Theano cache: theano-cache purge or rm -rf ./.theano
- Adjusting CNMeM: (https://devtalk.nvidia.com/default/topic/950158/cnmem-limitations-when-using-cudnn/)
- Removing the NVIDIA cache and rebooting: (https://stackoverflow.com/questions/45810356/runtimeerror-cudnn-status-internal-error)
None of these has worked so far.
Have you seen this type of error before? Or did I not set up my machine correctly?
I noticed you have a params.py to specify all parameters. Some have mentioned this error can result from a lack of memory (link), and it seems your code already does some batch processing.
Info of my setting:
- Ubuntu 16.04
- CUDA 8.0 / CuDNN 5.1
- GPU: Nvidia 1080 Ti (11GB memory) --- I also tried on another machine with a Titan X; it did not work either
- theano 0.9
My ~/.theanorc:
[global]
floatX=float32
device=cuda0
optimizer=None
allow_gc=True
#gpuarray.preallocate=0.95
gcc.cxxflags=-Wno-narrowing
exception_verbosity=high
[lib]
cnmem=0.75
[nvcc]
nvcc.fastmath=True
[cuda]
root=/usr/local/cuda-8.0
If you have any suggestions, please let me know! Thanks for your help and support!
Update:
After I removed the other versions of cuDNN (https://groups.google.com/forum/#!topic/theano-users/w4M3Xy0ec60), the error changed to the following.
Traceback (most recent call last):
File "./main.py", line 27, in <module>
save_npz_file_path = main_reconstruct.reconstruction(datasetFolder, _model, imgNamePattern, poseNamePattern, outputFolder, N_viewPairs4inference, resol, BB, viewList)
File "/home/ICT2000/tli/Workspace/SurfaceNet/main_reconstruct.py", line 77, in reconstruction
cubeCenter_hw = np.stack([img_h_cubesCenter, img_w_cubesCenter], axis=0)) # (N_cubes, N_views, D_embedding), (N_cubes, N_views)
File "./utils/earlyRejection.py", line 48, in patch2embedding
_patches_embedding_inScope[_batch] = patch2embedding_fn(_patches_preprocessed[_batch]) # (N_batch, 3/1, patchSize, patchSize) --> (N_batch, D_embedding). similarityNet: patch --> embedding
File "/home/ICT2000/tli/.conda/envs/SurfaceNet/lib/python2.7/site-packages/theano/compile/function_module.py", line 898, in __call__
storage_map=getattr(self.fn, 'storage_map', None))
File "/home/ICT2000/tli/.conda/envs/SurfaceNet/lib/python2.7/site-packages/theano/gof/link.py", line 325, in raise_with_op
reraise(exc_type, exc_value, exc_trace)
File "/home/ICT2000/tli/.conda/envs/SurfaceNet/lib/python2.7/site-packages/theano/compile/function_module.py", line 884, in __call__
self.fn() if output_subset is None else\
File "pygpu/gpuarray.pyx", line 676, in pygpu.gpuarray.pygpu_empty
File "pygpu/gpuarray.pyx", line 290, in pygpu.gpuarray.array_empty
pygpu.gpuarray.GpuArrayException: cuMemAlloc: CUDA_ERROR_OUT_OF_MEMORY: out of memory
Apply node that caused the error: GpuDnnConv{algo='small', inplace=False}(GpuContiguous.0, GpuContiguous.0, GpuAllocEmpty{dtype='float32', context_name=None}.0, GpuDnnConvDesc{border_mode=(1, 1), subsample=(1, 1), conv_mode='cross', precision='float32'}.0, Cast{float32}.0, Cast{float32}.0)
Toposort index: 276
Inputs types: [GpuArrayType<None>(float32, (False, False, False, False)), GpuArrayType<None>(float32, (False, False, False, False)), GpuArrayType<None>(float32, (False, False, False, False)), <theano.gof.type.CDataType object at 0x7f703a019c90>, Scalar(float32), Scalar(float32)]
Inputs shapes: [(1100, 3, 64, 64), (64, 3, 3, 3), (1100, 64, 64, 64), 'No shapes', (), ()]
Inputs strides: [(49152, 16384, 256, 4), (108, 36, 12, 4), (1048576, 16384, 256, 4), 'No strides', (), ()]
Inputs values: ['not shown', 'not shown', 'not shown', <capsule object NULL at 0x7f6e133930c0>, 1.0, 0.0]
Inputs type_num: [11, 11, 11, '', 11, 11]
Outputs clients: [[HostFromGpu(gpuarray)(GpuDnnConv{algo='small', inplace=False}.0)]]
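(For reference, some back-of-envelope arithmetic on the shapes reported in the failing node --- my own estimate, not anything from the repo's code --- already puts this single conv over 1 GiB in float32, before counting the rest of the network or the cuDNN workspace:)

```python
# Rough float32 memory footprint of the failing conv node alone,
# using the shapes reported in the error message.
def tensor_bytes(shape, dtype_bytes=4):
    n = 1
    for d in shape:
        n *= d
    return n * dtype_bytes

inp  = tensor_bytes((1100, 3, 64, 64))    # input patches
filt = tensor_bytes((64, 3, 3, 3))        # conv filters
out  = tensor_bytes((1100, 64, 64, 64))   # output feature maps
total_gib = (inp + filt + out) / 1024.0 ** 3
print("%.2f GiB" % total_gib)  # ~1.12 GiB for this one node
```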
@Rubikplayer
For the updated error log, it mentions: pygpu.gpuarray.GpuArrayException: cuMemAlloc: CUDA_ERROR_OUT_OF_MEMORY: out of memory. Can you change cnmem=0.75 --> cnmem=0.95 in .theanorc, OR change __GPUMemoryGB = 11 to a safe value, say __GPUMemoryGB = 6, in params.py, and let's see what it prints out.
Also, for the Theano installation, please refer to #3 (comment)
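(A minimal sketch of the kind of batching that a memory cap like __GPUMemoryGB implies --- the function names here are hypothetical and only illustrative; the actual logic lives in params.py and the repo's batch loops:)

```python
import numpy as np

def run_in_batches(fn, x, max_batch):
    """Apply fn to x in chunks of at most max_batch samples, so no
    single call allocates full-size intermediate buffers on the GPU."""
    outs = [fn(x[i:i + max_batch]) for i in range(0, len(x), max_batch)]
    return np.concatenate(outs, axis=0)

def max_batch_from_memory(gpu_mem_gb, bytes_per_sample, headroom=0.5):
    """Hypothetical sizing helper: derive a safe batch size from a GPU
    memory budget, leaving headroom for workspace and other tensors."""
    budget = gpu_mem_gb * 1024 ** 3 * headroom
    return max(1, int(budget // bytes_per_sample))
```

With such a scheme, lowering the memory parameter shrinks the per-call batch instead of the total workload, which is why a smaller value can avoid cuMemAlloc failures at the cost of more (smaller) kernel launches.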
@mjiUST
The code seems to be running after I set gpuarray.preallocate=0.8 (and commented out cnmem=0.75). (This was before I saw your feedback; I will try your suggested values a bit later.)
May I confirm two questions with you:
- Theano/Lasagne is quite new to me, and I wasn't quite sure about the difference between gpuarray.preallocate and cnmem. According to the Theano doc link, it seems gpuarray.preallocate was designed for the new GPU backend, and cnmem for the old one. Since we are using version 0.9, I suppose I should set cnmem instead of gpuarray.preallocate? If so, then what I just set did not actually impose any limit.
- With my setting above, it seems to run on the example dinosaur data. After about 2 hours, it has finished 68% of the SurfaceNet inference. Is this typical, or is there any way to make it faster?
My setting changes: __GPUMemoryGB = 11 and __cube_D = 32.
Also, my GPU (1080 Ti) should be slower than a Titan X.
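(On the first question, my understanding from the Theano configuration docs --- not verified against 0.9 specifically --- is that the two flags live in different .theanorc sections, one per backend, and only the flag matching your device= setting takes effect:)

```ini
# Old CUDA backend (device=gpu*): CNMeM pool fraction
[lib]
cnmem=0.95

# New gpuarray backend (device=cuda*): preallocation fraction
[gpuarray]
preallocate=0.95
```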
Thanks for the help!!
@Rubikplayer
Thanks for your feedback. It's great to know the code is running.
- For the Theano memory preallocation, the link you mentioned says that after you set the Theano flag allow_gc to False (so Theano will not collect GPU memory garbage), CNMeM will not affect GPU speed anymore. In my opinion, CNMeM and gpuarray.preallocate are the same thing for the older and newer backends. Just use whichever one gets the GPU memory preallocated at the very beginning (you can use the command watch nvidia-smi to check that the majority of the memory is reserved).
- For the speed of SurfaceNet: the setting __cube_D = 64 could result in a slightly faster process. Before that, you can check whether your .theanorc includes optimizer=fast_run for the fast-run mode, as mentioned in Line 40 in 149f6e0. If everything goes well, the dinosaur dataset should finish in one hour.
@mjiUST
Thanks for the suggestion! Setting optimizer=fast_run indeed accelerates the process, but with __cube_D = 64 I still got an out-of-memory issue. I've sent an email to your school address with detailed questions.
Thanks again!