CL_OUT_OF_RESOURCES error when training a classifier

Question

wanfade opened this issue 4 years ago · 5 comments

Hello @sowson, thank you for your great work.
I used this project to train yolov3 with OpenCL, and it worked very well. But when training a classifier, I get an OpenCL error: "opencl backward_scale_kernel error: CL_OUT_OF_RESOURCES". I would like to know whether there is a problem with my setup.

I used a custom dataset with 2 classes, and modified the filters of the last convolutional layer in cfg/darknet19.cfg accordingly.

Here are my specs and the training output:
$ ./darknet classifier train dogs/dogs.data cfg/darknet19.cfg
Device IDs: 2
Device ID: 0
Device name: Tesla V100-PCIE-16GB
Device vendor: NVIDIA Corporation
Device opencl availability: OpenCL 1.2 CUDA
Device opencl used: 440.33.01
Device double precision: YES
Device max group size: 1024
Device address bits: 64
darknet19
1
layer     filters    size              input                output
    0 conv     32  3 x 3 / 1   256 x 256 x   3   ->   256 x 256 x  32  0.113 BFLOPs
    1 max          2 x 2 / 2   256 x 256 x  32   ->   128 x 128 x  32
    2 conv     64  3 x 3 / 1   128 x 128 x  32   ->   128 x 128 x  64  0.604 BFLOPs
    3 max          2 x 2 / 2   128 x 128 x  64   ->    64 x  64 x  64
    4 conv    128  3 x 3 / 1    64 x  64 x  64   ->    64 x  64 x 128  0.604 BFLOPs
    5 conv     64  1 x 1 / 1    64 x  64 x 128   ->    64 x  64 x  64  0.067 BFLOPs
    6 conv    128  3 x 3 / 1    64 x  64 x  64   ->    64 x  64 x 128  0.604 BFLOPs
    7 max          2 x 2 / 2    64 x  64 x 128   ->    32 x  32 x 128
    8 conv    256  3 x 3 / 1    32 x  32 x 128   ->    32 x  32 x 256  0.604 BFLOPs
    9 conv    128  1 x 1 / 1    32 x  32 x 256   ->    32 x  32 x 128  0.067 BFLOPs
   10 conv    256  3 x 3 / 1    32 x  32 x 128   ->    32 x  32 x 256  0.604 BFLOPs
   11 max          2 x 2 / 2    32 x  32 x 256   ->    16 x  16 x 256
   12 conv    512  3 x 3 / 1    16 x  16 x 256   ->    16 x  16 x 512  0.604 BFLOPs
   13 conv    256  1 x 1 / 1    16 x  16 x 512   ->    16 x  16 x 256  0.067 BFLOPs
   14 conv    512  3 x 3 / 1    16 x  16 x 256   ->    16 x  16 x 512  0.604 BFLOPs
   15 conv    256  1 x 1 / 1    16 x  16 x 512   ->    16 x  16 x 256  0.067 BFLOPs
   16 conv    512  3 x 3 / 1    16 x  16 x 256   ->    16 x  16 x 512  0.604 BFLOPs
   17 max          2 x 2 / 2    16 x  16 x 512   ->     8 x   8 x 512
   18 conv   1024  3 x 3 / 1     8 x   8 x 512   ->     8 x   8 x1024  0.604 BFLOPs
   19 conv    512  1 x 1 / 1     8 x   8 x1024   ->     8 x   8 x 512  0.067 BFLOPs
   20 conv   1024  3 x 3 / 1     8 x   8 x 512   ->     8 x   8 x1024  0.604 BFLOPs
   21 conv    512  1 x 1 / 1     8 x   8 x1024   ->     8 x   8 x 512  0.067 BFLOPs
   22 conv   1024  3 x 3 / 1     8 x   8 x 512   ->     8 x   8 x1024  0.604 BFLOPs
   23 conv      2  1 x 1 / 1     8 x   8 x1024   ->     8 x   8 x   2  0.000 BFLOPs
   24 avg                        8 x   8 x   2   ->     2
   25 softmax                                          2
Learning Rate: 0.1, Momentum: 0.9, Decay: 0.0005
384
128 448
Saving weights to dogs/backup/darknet19.start.conv.weights
Loaded: 0.000083 seconds
opencl backward_scale_kernel error: CL_OUT_OF_RESOURCES

I set a breakpoint in opencl.c, and the failure was finally traced to line 837:
clErr = clEnqueueNDRangeKernel(opencl_queues[opencl_device_id_t], kernel, 2, globalOffser, globalItems, localItems, 0, NULL, NULL);
The call is reached from the backward_gpu function of a convolutional layer.

Thanks a lot
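
One check that can be made while stopped at that breakpoint is to compare the launch geometry (globalItems, localItems) with the device limit from the log, assuming "Device max group size: 1024" reports CL_DEVICE_MAX_WORK_GROUP_SIZE. Under the OpenCL 1.2 specification, a geometry violation should be reported as CL_INVALID_WORK_GROUP_SIZE, while CL_OUT_OF_RESOURCES is described there as a lack of resources such as registers or local memory for the requested local size. So if the geometry passes, the per-kernel resource limits are the more likely suspect. The sketch below is plain C with no OpenCL dependency; check_launch_geometry is a hypothetical helper written for illustration and is not part of darknet or OpenCL.

```c
#include <stddef.h>

/* Hypothetical helper (not part of darknet or OpenCL): checks a launch
 * geometry against the rules OpenCL 1.2 applies to clEnqueueNDRangeKernel
 * when an explicit local size is passed.
 * Returns 0 if the geometry is acceptable, or a positive code:
 *   1 = a dimension has a zero size,
 *   2 = a global size is not a multiple of the matching local size,
 *   3 = the work-group (product of local sizes) exceeds max_group_size. */
int check_launch_geometry(size_t dims, const size_t *global_items,
                          const size_t *local_items, size_t max_group_size)
{
    size_t group = 1;
    for (size_t i = 0; i < dims; ++i) {
        if (global_items[i] == 0 || local_items[i] == 0) return 1;
        if (global_items[i] % local_items[i] != 0) return 2;
        group *= local_items[i];  /* work-items per work-group so far */
    }
    return group > max_group_size ? 3 : 0;
}
```

Calling it with dims = 2, the globalItems and localItems values seen in the debugger, and max_group_size = 1024 returns 0 when the geometry is valid. It does not inspect register or local memory usage, so a result of 0 narrows the search rather than explaining the error.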

Answer 1 · 2020-09-14T11:22:10.000Z

@wanfade, can you please compile and use https://github.com/clMathLibraries/clBLAS? Also, in blas_kernels.c there are 7 places with int tuning = 16; you may change that value to 8 or 4. On my macOS machine with 2 Radeon VII cards there is no issue, but I am using "my" clBLAS mentioned above. Thanks!

Answer 2 · 2020-09-15T02:00:30.000Z

Thanks for your reply. I compiled darknet with the clBLAS project, and changed 'int tuning = 16' in blas_kernels.c to 8, 4, and 2. But I still get 'opencl backward_scale_kernel error: CL_OUT_OF_RESOURCES'.

I tried the same code on a MacBook without the clBLAS mentioned above, and it works well.

Maybe it is a problem with 'OpenCL 1.2 CUDA'?

Answer 3 · 2020-09-15T07:27:39.000Z

@wanfade, please try this patch. If the problem really is in this kernel, I have an older, slightly slower version of it that I tested on a Titan RTX.

nvcl.patch.txt

git apply nvcl.patch.txt

Let me know if that works. Thanks!

Answer 4 · 2020-09-15T07:46:57.000Z

Fine. It works!
The speed is still fast.
Thanks for your help! @sowson

Answer 5 · 2020-09-15T08:36:48.000Z

@wanfade, can we close it now? The solution commit is in the repository. Thanks!