sony/ai-research-code

【NVC-Net】RuntimeError: target_specific error in backward_impl. Failed `status == CUDNN_STATUS_SUCCESS`: UNKNOWN

Kanraaaaa opened this issue · 1 comment

Hi, I am trying to train NVC-Net on a single GPU, but I run into the following errors:

value error in query
/home/gitlab-runner/builds/jmdP2aBr/1/nnabla/builders/all/nnabla/include/nbla/function_registry.hpp:69
Failed it != items_.end(): Any of [cudnn:float, cuda:float, cpu:float] could not be found in []

No communicator found. Running with a single process. If you run this with MPI processes, all processes will perform totally same.
2022-02-15 17:16:13,887 [nnabla][INFO]: Training data with 100 speakers.
2022-02-15 17:16:13,888 [nnabla][INFO]: DataSource with shuffle(True)
2022-02-15 17:16:13,934 [nnabla][INFO]: Using DataIterator
Running epoch=1 lr=0.00010
Error during backward propagation:
Add2CudaCudnn
Add2CudaCudnn
Add2CudaCudnn
MulScalarCuda
MeanCudaCudnn
SquaredErrorCuda
Div2Cuda
PowScalarCuda
SumCuda
AddScalarCuda
PowScalarCuda
ConvolutionCudaCudnn
PadCuda
GELUCuda
ConvolutionCudaCudnn
PadCuda
GELUCuda
ConvolutionCudaCudnn
GELUCuda
Add2CudaCudnn
ConvolutionCudaCudnn
Mul2Cuda
TanhCudaCudnn <-- ERROR
Traceback (most recent call last):
  File "main.py", line 99, in <module>
    run(args)
  File "main.py", line 70, in run
    Trainer(gen, gen_optim, dis, dis_optim, dataloader, rng, hp).run()
  File "11_ai-research-code-master/nvcnet/train.py", line 157, in run
    self.train_on_batch(i)
  File "11_ai-research-code-master/nvcnet/train.py", line 197, in train_on_batch
    p['g_loss'].backward(clear_buffer=True)
  File "_variable.pyx", line 826, in nnabla._variable.Variable.backward
RuntimeError: target_specific error in backward_impl
/home/gitlab-runner/builds/-phDBBa6/0/nnabla/builders/all/nnabla-ext-cuda/src/nbla/cuda/cudnn/function/./generic/tanh.cu:79
Failed status == CUDNN_STATUS_SUCCESS: UNKNOWN
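
For reference, the failing operation can be isolated from the training script with a minimal backward pass through F.tanh on the cudnn context (a sketch, assuming nnabla and nnabla-ext-cuda import cleanly; the input shape is arbitrary):

import numpy as np
import nnabla as nn
import nnabla.functions as F
from nnabla.ext_utils import get_extension_context

# Same context the training script requests with -c cudnn -d 0.
nn.set_default_context(get_extension_context('cudnn', device_id='0'))

x = nn.Variable((1, 256), need_grad=True)
x.d = np.random.randn(*x.shape)
y = F.tanh(x)
y.forward()
y.backward()  # should hit the same CUDNN_STATUS_SUCCESS failure if the cuDNN kernel is at fault
print(x.g.mean())

If this tiny script fails too, the problem is in the CUDA/cuDNN setup rather than in NVC-Net itself.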

I followed the install page (https://nnabla.org/install/), but it does not work. Could you please give some suggestions?
My environment is as follows:
CUDA 11.0, cuDNN 8.1.0, Python 3.6.8
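
As a quick sanity check of which packages the runtime actually picks up (a minimal sketch; import_extension_module is nnabla's standard extension loader, and it raises if nnabla-ext-cuda is missing or built against a different CUDA/cuDNN):

import nnabla
from nnabla.ext_utils import import_extension_module

print("nnabla", nnabla.__version__)
import_extension_module("cudnn")  # raises if the CUDA extension is missing or mismatched
print("cudnn extension loaded OK")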

Thank you! I look forward to your kind reply.

Thank you for checking.
Forward propagation seems to be working, so the nnabla installation itself is probably fine...
If you can use Docker, could you please try running with it?

cd nvcnet
./scripts/docker_build.sh  # build the nvcnet image
docker run --gpus all -u $(id -u):$(id -g) -v $HOME:$HOME -w $(pwd) --rm -it nvcnet:latest /bin/bash
export NUMBA_CACHE_DIR=/tmp  # give numba a writable cache dir inside the container
python main.py -c cudnn -d 0  # train with the cudnn context on GPU 0

If you see the same error, could you please provide your GPU information via nvidia-smi -L?
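
If it is easier, the same information can be collected from Python using only the standard library (a trivial sketch wrapping the nvidia-smi call):

import subprocess

# Equivalent to running `nvidia-smi -L` in a shell; prints one line per visible GPU.
print(subprocess.check_output(["nvidia-smi", "-L"], universal_newlines=True))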