CL_OUT_OF_RESOURCES error when training a classifier
wanfade opened this issue · 5 comments
Hello @sowson, thank you for your great work.
I used this project to train yolov3 with OpenCL, and it worked very well. But when training a classifier, I get an OpenCL error from backward_scale_kernel: CL_OUT_OF_RESOURCES. Is there something wrong with the way I set this up?
I used a custom dataset with 2 classes and modified the filters of the last convolutional layer in cfg/darknet19.cfg.
Here is my setup and the full output:
$ ./darknet classifier train dogs/dogs.data cfg/darknet19.cfg
Device IDs: 2
Device ID: 0
Device name: Tesla V100-PCIE-16GB
Device vendor: NVIDIA Corporation
Device opencl availability: OpenCL 1.2 CUDA
Device opencl used: 440.33.01
Device double precision: YES
Device max group size: 1024
Device address bits: 64
darknet19
1
layer filters size input output
0 conv 32 3 x 3 / 1 256 x 256 x 3 -> 256 x 256 x 32 0.113 BFLOPs
1 max 2 x 2 / 2 256 x 256 x 32 -> 128 x 128 x 32
2 conv 64 3 x 3 / 1 128 x 128 x 32 -> 128 x 128 x 64 0.604 BFLOPs
3 max 2 x 2 / 2 128 x 128 x 64 -> 64 x 64 x 64
4 conv 128 3 x 3 / 1 64 x 64 x 64 -> 64 x 64 x 128 0.604 BFLOPs
5 conv 64 1 x 1 / 1 64 x 64 x 128 -> 64 x 64 x 64 0.067 BFLOPs
6 conv 128 3 x 3 / 1 64 x 64 x 64 -> 64 x 64 x 128 0.604 BFLOPs
7 max 2 x 2 / 2 64 x 64 x 128 -> 32 x 32 x 128
8 conv 256 3 x 3 / 1 32 x 32 x 128 -> 32 x 32 x 256 0.604 BFLOPs
9 conv 128 1 x 1 / 1 32 x 32 x 256 -> 32 x 32 x 128 0.067 BFLOPs
10 conv 256 3 x 3 / 1 32 x 32 x 128 -> 32 x 32 x 256 0.604 BFLOPs
11 max 2 x 2 / 2 32 x 32 x 256 -> 16 x 16 x 256
12 conv 512 3 x 3 / 1 16 x 16 x 256 -> 16 x 16 x 512 0.604 BFLOPs
13 conv 256 1 x 1 / 1 16 x 16 x 512 -> 16 x 16 x 256 0.067 BFLOPs
14 conv 512 3 x 3 / 1 16 x 16 x 256 -> 16 x 16 x 512 0.604 BFLOPs
15 conv 256 1 x 1 / 1 16 x 16 x 512 -> 16 x 16 x 256 0.067 BFLOPs
16 conv 512 3 x 3 / 1 16 x 16 x 256 -> 16 x 16 x 512 0.604 BFLOPs
17 max 2 x 2 / 2 16 x 16 x 512 -> 8 x 8 x 512
18 conv 1024 3 x 3 / 1 8 x 8 x 512 -> 8 x 8 x1024 0.604 BFLOPs
19 conv 512 1 x 1 / 1 8 x 8 x1024 -> 8 x 8 x 512 0.067 BFLOPs
20 conv 1024 3 x 3 / 1 8 x 8 x 512 -> 8 x 8 x1024 0.604 BFLOPs
21 conv 512 1 x 1 / 1 8 x 8 x1024 -> 8 x 8 x 512 0.067 BFLOPs
22 conv 1024 3 x 3 / 1 8 x 8 x 512 -> 8 x 8 x1024 0.604 BFLOPs
23 conv 2 1 x 1 / 1 8 x 8 x1024 -> 8 x 8 x 2 0.000 BFLOPs
24 avg 8 x 8 x 2 -> 2
25 softmax 2
Learning Rate: 0.1, Momentum: 0.9, Decay: 0.0005
384
128 448
Saving weights to dogs/backup/darknet19.start.conv.weights
Loaded: 0.000083 seconds
opencl backward_scale_kernel error: CL_OUT_OF_RESOURCES
I set a breakpoint in opencl.c, and it stopped at line 837:
clErr = clEnqueueNDRangeKernel(opencl_queues[opencl_device_id_t], kernel, 2, globalOffser, globalItems, localItems, 0, NULL, NULL);
The call is made from the backward_gpu function of a convolutional layer.
Thanks a lot
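For anyone debugging a similar failure, one thing that can be ruled out cheaply is the launch geometry passed to the clEnqueueNDRangeKernel call quoted above. The sketch below is not part of darknet; check_launch and the launch_check codes are hypothetical names introduced here for illustration. It assumes OpenCL 1.2 rules: each global size must be a multiple of the matching local size, and the product of the local sizes must not exceed the work-group limit (for example the "Device max group size: 1024" reported in the log, or the per-kernel value from clGetKernelWorkGroupInfo with CL_KERNEL_WORK_GROUP_SIZE). Passing these checks does not rule out CL_OUT_OF_RESOURCES, which the OpenCL specification also associates with insufficient registers or local memory.

```c
#include <stddef.h>

/* Result codes for this sketch only; they are NOT OpenCL error codes. */
enum launch_check {
    LAUNCH_OK = 0,
    LAUNCH_BAD_DIM,         /* work_dim outside 1..3, or a size of zero */
    LAUNCH_GROUP_TOO_LARGE, /* product of local sizes exceeds the limit */
    LAUNCH_NOT_DIVISIBLE    /* a global size is not a multiple of its local size */
};

/*
 * Hypothetical helper: validate an NDRange launch configuration before
 * enqueueing it. max_group_size is the limit to test against; the caller is
 * assumed to have queried it from the device or the kernel beforehand.
 * The upper bound of 3 dimensions is an assumption that matches common devices.
 */
enum launch_check check_launch(unsigned work_dim,
                               const size_t *global_items,
                               const size_t *local_items,
                               size_t max_group_size)
{
    if (work_dim < 1 || work_dim > 3)
        return LAUNCH_BAD_DIM;

    size_t group = 1; /* running product of local sizes */
    for (unsigned d = 0; d < work_dim; ++d) {
        if (global_items[d] == 0 || local_items[d] == 0)
            return LAUNCH_BAD_DIM;
        /* Division form avoids overflow in group * local_items[d]. */
        if (local_items[d] > max_group_size / group)
            return LAUNCH_GROUP_TOO_LARGE;
        group *= local_items[d];
        if (global_items[d] % local_items[d] != 0)
            return LAUNCH_NOT_DIVISIBLE;
    }
    return LAUNCH_OK;
}
```

Calling check_launch(2, globalItems, localItems, 1024) just before the enqueue and printing the result would show whether the sizes themselves are at fault. If it returns LAUNCH_OK, the cause more likely lies in per-kernel resource usage or in the driver.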
@wanfade, can you please compile and use https://github.com/clMathLibraries/clBLAS? Also, in blas_kernels.c there are 7 places with 'int tuning = 16'; you may change the value there to 8 or 4. On my macOS machine with 2 Radeon VII cards there is no issue, but I am using "my" build of the clBLAS mentioned above. Thanks!
Thanks for your reply. I compiled darknet with the clBLAS project and changed 'int tuning = 16' in blas_kernels.c to 8, 4, and 2. But I still get 'opencl backward_scale_kernel error: CL_OUT_OF_RESOURCES'.
I tried the same code on a MacBook without the clBLAS mentioned above, and it works well.
Maybe it is an issue with 'OpenCL 1.2 CUDA'?