sowson/darknet

Classifier training behaves differently on AMD and Intel GPUs

wanfade opened this issue · 2 comments

Hello, @sowson, thanks for the yolov4 update.

I have a problem: training behaves differently on the two GPUs. When training on the AMD Radeon Pro 455, the loss value decreases normally, but when using the Intel(R) HD Graphics 530 on the same MacBook Pro, the loss value becomes nan or inf at the first or second batch. I switch between the GPUs with '-i 0' and '-i 1'.

It's a 2-class classification task with 383 images in total. The base model is darknet19.conv.23, extracted from darknet19.weights.
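
The training command I use looks roughly like this (dogs.data and dogs.cfg are just my local file names; only the -i device index changes between the two runs):

./darknet classifier train dogs/dogs.data dogs/dogs.cfg darknet19.conv.23 -i 0
./darknet classifier train dogs/dogs.data dogs/dogs.cfg darknet19.conv.23 -i 1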

I also trained some yolov3 models on the two GPUs, and there the loss values did not differ much.

Here are the training logs:

Device IDs: 2
Device ID: 0
Device name: Intel(R) HD Graphics 530
Device vendor: Intel Inc.
Device opencl availability: OpenCL 1.2
Device opencl used: 1.2(Aug 31 2020 22:26:30)
Device double precision: NO
Device max group size: 256
Device address bits: 64
dogs
1
layer     filters    size              input                output
    0 conv     32  3 x 3 / 1   256 x 256 x   3   ->   256 x 256 x  32  0.113 BFLOPs
    1 max          2 x 2 / 2   256 x 256 x  32   ->   128 x 128 x  32
    2 conv     64  3 x 3 / 1   128 x 128 x  32   ->   128 x 128 x  64  0.604 BFLOPs
    3 max          2 x 2 / 2   128 x 128 x  64   ->    64 x  64 x  64
    4 conv    128  3 x 3 / 1    64 x  64 x  64   ->    64 x  64 x 128  0.604 BFLOPs
    5 conv     64  1 x 1 / 1    64 x  64 x 128   ->    64 x  64 x  64  0.067 BFLOPs
    6 conv    128  3 x 3 / 1    64 x  64 x  64   ->    64 x  64 x 128  0.604 BFLOPs
    7 max          2 x 2 / 2    64 x  64 x 128   ->    32 x  32 x 128
    8 conv    256  3 x 3 / 1    32 x  32 x 128   ->    32 x  32 x 256  0.604 BFLOPs
    9 conv    128  1 x 1 / 1    32 x  32 x 256   ->    32 x  32 x 128  0.067 BFLOPs
   10 conv    256  3 x 3 / 1    32 x  32 x 128   ->    32 x  32 x 256  0.604 BFLOPs
   11 max          2 x 2 / 2    32 x  32 x 256   ->    16 x  16 x 256
   12 conv    512  3 x 3 / 1    16 x  16 x 256   ->    16 x  16 x 512  0.604 BFLOPs
   13 conv    256  1 x 1 / 1    16 x  16 x 512   ->    16 x  16 x 256  0.067 BFLOPs
   14 conv    512  3 x 3 / 1    16 x  16 x 256   ->    16 x  16 x 512  0.604 BFLOPs
   15 conv    256  1 x 1 / 1    16 x  16 x 512   ->    16 x  16 x 256  0.067 BFLOPs
   16 conv    512  3 x 3 / 1    16 x  16 x 256   ->    16 x  16 x 512  0.604 BFLOPs
   17 max          2 x 2 / 2    16 x  16 x 512   ->     8 x   8 x 512
   18 conv   1024  3 x 3 / 1     8 x   8 x 512   ->     8 x   8 x1024  0.604 BFLOPs
   19 conv    512  1 x 1 / 1     8 x   8 x1024   ->     8 x   8 x 512  0.067 BFLOPs
   20 conv   1024  3 x 3 / 1     8 x   8 x 512   ->     8 x   8 x1024  0.604 BFLOPs
   21 conv    512  1 x 1 / 1     8 x   8 x1024   ->     8 x   8 x 512  0.067 BFLOPs
   22 conv   1024  3 x 3 / 1     8 x   8 x 512   ->     8 x   8 x1024  0.604 BFLOPs
   23 conv      2  1 x 1 / 1     8 x   8 x1024   ->     8 x   8 x   2  0.000 BFLOPs
   24 avg                        8 x   8 x   2   ->     2
   25 softmax                                           2
Learning Rate: 0.1, Momentum: 0.9, Decay: 0.0005
383
128 448
Saving weights to dogs/backup/dogs.start.conv.weights
Loaded: 0.000046 seconds
1, 0.334: 13.241982, 13.241982 avg, 0.000000 rate, 47.324300 seconds, 128 images
Loaded: 0.000050 seconds
2, 0.668: nan, nan avg, 0.000000 rate, 61.340651 seconds, 256 images
Loaded: 0.000050 seconds
3, 1.003: nan, nan avg, 0.000000 rate, 61.352529 seconds, 384 images
Saving weights to dogs/backup/dogs_1.weights
Loaded: 0.000044 seconds

And on the AMD GPU:

Device IDs: 2
Device ID: 1
Device name: AMD Radeon Pro 455 Compute Engine
Device vendor: AMD
Device opencl availability: OpenCL 1.2
Device opencl used: 1.2 (Sep 11 2020 22:04:49)
Device double precision: YES
Device max group size: 256
Device address bits: 32
dogs
1
layer     filters    size              input                output
    0 conv     32  3 x 3 / 1   256 x 256 x   3   ->   256 x 256 x  32  0.113 BFLOPs
    1 max          2 x 2 / 2   256 x 256 x  32   ->   128 x 128 x  32
    2 conv     64  3 x 3 / 1   128 x 128 x  32   ->   128 x 128 x  64  0.604 BFLOPs
    3 max          2 x 2 / 2   128 x 128 x  64   ->    64 x  64 x  64
    4 conv    128  3 x 3 / 1    64 x  64 x  64   ->    64 x  64 x 128  0.604 BFLOPs
    5 conv     64  1 x 1 / 1    64 x  64 x 128   ->    64 x  64 x  64  0.067 BFLOPs
    6 conv    128  3 x 3 / 1    64 x  64 x  64   ->    64 x  64 x 128  0.604 BFLOPs
    7 max          2 x 2 / 2    64 x  64 x 128   ->    32 x  32 x 128
    8 conv    256  3 x 3 / 1    32 x  32 x 128   ->    32 x  32 x 256  0.604 BFLOPs
    9 conv    128  1 x 1 / 1    32 x  32 x 256   ->    32 x  32 x 128  0.067 BFLOPs
   10 conv    256  3 x 3 / 1    32 x  32 x 128   ->    32 x  32 x 256  0.604 BFLOPs
   11 max          2 x 2 / 2    32 x  32 x 256   ->    16 x  16 x 256
   12 conv    512  3 x 3 / 1    16 x  16 x 256   ->    16 x  16 x 512  0.604 BFLOPs
   13 conv    256  1 x 1 / 1    16 x  16 x 512   ->    16 x  16 x 256  0.067 BFLOPs
   14 conv    512  3 x 3 / 1    16 x  16 x 256   ->    16 x  16 x 512  0.604 BFLOPs
   15 conv    256  1 x 1 / 1    16 x  16 x 512   ->    16 x  16 x 256  0.067 BFLOPs
   16 conv    512  3 x 3 / 1    16 x  16 x 256   ->    16 x  16 x 512  0.604 BFLOPs
   17 max          2 x 2 / 2    16 x  16 x 512   ->     8 x   8 x 512
   18 conv   1024  3 x 3 / 1     8 x   8 x 512   ->     8 x   8 x1024  0.604 BFLOPs
   19 conv    512  1 x 1 / 1     8 x   8 x1024   ->     8 x   8 x 512  0.067 BFLOPs
   20 conv   1024  3 x 3 / 1     8 x   8 x 512   ->     8 x   8 x1024  0.604 BFLOPs
   21 conv    512  1 x 1 / 1     8 x   8 x1024   ->     8 x   8 x 512  0.067 BFLOPs
   22 conv   1024  3 x 3 / 1     8 x   8 x 512   ->     8 x   8 x1024  0.604 BFLOPs
   23 conv      2  1 x 1 / 1     8 x   8 x1024   ->     8 x   8 x   2  0.000 BFLOPs
   24 avg                        8 x   8 x   2   ->     2
   25 softmax                                           2
Learning Rate: 0.1, Momentum: 0.9, Decay: 0.0005
383
128 448
Saving weights to dogs/backup/dogs.start.conv.weights
Loaded: 0.000053 seconds
1, 0.334: 0.724708, 0.724708 avg, 0.000000 rate, 20.728899 seconds, 128 images
Loaded: 0.000023 seconds
2, 0.668: 0.699131, 0.722151 avg, 0.000000 rate, 28.956159 seconds, 256 images
Loaded: 0.000042 seconds
3, 1.003: 0.740397, 0.723975 avg, 0.000000 rate, 29.074495 seconds, 384 images
Saving weights to dogs/backup/dogs_1.weights
Loaded: 0.000050 seconds
4, 1.337: 0.764463, 0.728024 avg, 0.000000 rate, 21.115983 seconds, 512 images
Loaded: 0.000041 seconds

Hello @wanfade, the issue you have is a smaller VRAM memory issue. Try using 64x64 in your net cfg instead of the 256x256 network size for a moment; the amount of allocated memory will then be smaller and you should be able to use the Intel GPU, I presume. Let me know the effect of that experiment. Thanks!
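
In the [net] section of the cfg, that test is roughly just:

[net]
width=64
height=64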

You are right. Maybe the network structure is too big for my Intel GPU.

I changed the input image size to 32x32 and to 64x64 (and also modified min crop and max crop accordingly); the first batch then looks normal, but from the second batch on the loss value becomes nan.
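
For the 32x32 run, the [net] section looked roughly like this (I believe the '32 32' line in the log below is the min/max crop being printed):

[net]
width=32
height=32
min_crop=32
max_crop=32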

Device IDs: 2
Device ID: 0
Device name: Intel(R) HD Graphics 530
Device vendor: Intel Inc.
Device opencl availability: OpenCL 1.2
Device opencl used: 1.2(Aug 31 2020 22:26:30)
Device double precision: NO
Device max group size: 256
Device address bits: 64
dogs
1
layer     filters    size              input                output
    0 conv     32  3 x 3 / 1    32 x  32 x   3   ->    32 x  32 x  32  0.002 BFLOPs
    1 max          2 x 2 / 2    32 x  32 x  32   ->    16 x  16 x  32
    2 conv     64  3 x 3 / 1    16 x  16 x  32   ->    16 x  16 x  64  0.009 BFLOPs
    3 max          2 x 2 / 2    16 x  16 x  64   ->     8 x   8 x  64
    4 conv    128  3 x 3 / 1     8 x   8 x  64   ->     8 x   8 x 128  0.009 BFLOPs
    5 conv     64  1 x 1 / 1     8 x   8 x 128   ->     8 x   8 x  64  0.001 BFLOPs
    6 conv    128  3 x 3 / 1     8 x   8 x  64   ->     8 x   8 x 128  0.009 BFLOPs
    7 max          2 x 2 / 2     8 x   8 x 128   ->     4 x   4 x 128
    8 conv    256  3 x 3 / 1     4 x   4 x 128   ->     4 x   4 x 256  0.009 BFLOPs
    9 conv    128  1 x 1 / 1     4 x   4 x 256   ->     4 x   4 x 128  0.001 BFLOPs
   10 conv    256  3 x 3 / 1     4 x   4 x 128   ->     4 x   4 x 256  0.009 BFLOPs
   11 max          2 x 2 / 2     4 x   4 x 256   ->     2 x   2 x 256
   12 conv    512  3 x 3 / 1     2 x   2 x 256   ->     2 x   2 x 512  0.009 BFLOPs
   13 conv    256  1 x 1 / 1     2 x   2 x 512   ->     2 x   2 x 256  0.001 BFLOPs
   14 conv    512  3 x 3 / 1     2 x   2 x 256   ->     2 x   2 x 512  0.009 BFLOPs
   15 conv    256  1 x 1 / 1     2 x   2 x 512   ->     2 x   2 x 256  0.001 BFLOPs
   16 conv    512  3 x 3 / 1     2 x   2 x 256   ->     2 x   2 x 512  0.009 BFLOPs
   17 max          2 x 2 / 2     2 x   2 x 512   ->     1 x   1 x 512
   18 conv   1024  3 x 3 / 1     1 x   1 x 512   ->     1 x   1 x1024  0.009 BFLOPs
   19 conv    512  1 x 1 / 1     1 x   1 x1024   ->     1 x   1 x 512  0.001 BFLOPs
   20 conv   1024  3 x 3 / 1     1 x   1 x 512   ->     1 x   1 x1024  0.009 BFLOPs
   21 conv    512  1 x 1 / 1     1 x   1 x1024   ->     1 x   1 x 512  0.001 BFLOPs
   22 conv   1024  3 x 3 / 1     1 x   1 x 512   ->     1 x   1 x1024  0.009 BFLOPs
   23 conv      2  1 x 1 / 1     1 x   1 x1024   ->     1 x   1 x   2  0.000 BFLOPs
   24 avg                        1 x   1 x   2   ->     2
   25 softmax                                           2
Loading weights from ../Darknet_Opencl/darknet19.conv.23...Done!
Learning Rate: 0.1, Momentum: 0.9, Decay: 0.0005
383
32 32
Saving weights to dogs/backup/dogs.start.conv.weights
Loaded: 0.000047 seconds
1, 0.334: 0.895304, 0.895304 avg, 0.000000 rate, 12.894553 seconds, 128 images
Loaded: 0.000045 seconds
2, 0.668: nan, nan avg, 0.000000 rate, 13.318852 seconds, 256 images
Loaded: 0.000041 seconds
3, 1.003: nan, nan avg, 0.000000 rate, 13.335070 seconds, 384 images
Saving weights to dogs/backup/dogs_1.weights
Loaded: 0.000052 seconds
4, 1.337: nan, nan avg, 0.000000 rate, 12.436153 seconds, 512 images
Loaded: 0.000043 seconds
5, 1.671: nan, nan avg, 0.000000 rate, 13.279034 seconds, 640 images
Loaded: 0.000044 seconds

Then I changed the network structure so that no convolutional layer has more than 256 filters, kept the input size at 256x256, and trained without the base model; now the training log looks normal.
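
The edit was basically replacing every 512- or 1024-filter convolutional section in the cfg with a 256-filter one, roughly like this for each of the later blocks:

[convolutional]
batch_normalize=1
filters=256
size=3
stride=1
pad=1
activation=leaky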

Device IDs: 2
Device ID: 0
Device name: Intel(R) HD Graphics 530
Device vendor: Intel Inc.
Device opencl availability: OpenCL 1.2
Device opencl used: 1.2(Aug 31 2020 22:26:30)
Device double precision: NO
Device max group size: 256
Device address bits: 64
dogs
1
layer     filters    size              input                output
    0 conv     32  3 x 3 / 1   256 x 256 x   3   ->   256 x 256 x  32  0.113 BFLOPs
    1 max          2 x 2 / 2   256 x 256 x  32   ->   128 x 128 x  32
    2 conv     64  3 x 3 / 1   128 x 128 x  32   ->   128 x 128 x  64  0.604 BFLOPs
    3 max          2 x 2 / 2   128 x 128 x  64   ->    64 x  64 x  64
    4 conv    128  3 x 3 / 1    64 x  64 x  64   ->    64 x  64 x 128  0.604 BFLOPs
    5 conv     64  1 x 1 / 1    64 x  64 x 128   ->    64 x  64 x  64  0.067 BFLOPs
    6 conv    128  3 x 3 / 1    64 x  64 x  64   ->    64 x  64 x 128  0.604 BFLOPs
    7 max          2 x 2 / 2    64 x  64 x 128   ->    32 x  32 x 128
    8 conv    256  3 x 3 / 1    32 x  32 x 128   ->    32 x  32 x 256  0.604 BFLOPs
    9 conv    128  1 x 1 / 1    32 x  32 x 256   ->    32 x  32 x 128  0.067 BFLOPs
   10 conv    256  3 x 3 / 1    32 x  32 x 128   ->    32 x  32 x 256  0.604 BFLOPs
   11 max          2 x 2 / 2    32 x  32 x 256   ->    16 x  16 x 256
   12 conv    256  3 x 3 / 1    16 x  16 x 256   ->    16 x  16 x 256  0.302 BFLOPs
   13 conv    256  1 x 1 / 1    16 x  16 x 256   ->    16 x  16 x 256  0.034 BFLOPs
   14 conv    256  3 x 3 / 1    16 x  16 x 256   ->    16 x  16 x 256  0.302 BFLOPs
   15 conv    256  1 x 1 / 1    16 x  16 x 256   ->    16 x  16 x 256  0.034 BFLOPs
   16 conv    256  3 x 3 / 1    16 x  16 x 256   ->    16 x  16 x 256  0.302 BFLOPs
   17 max          2 x 2 / 2    16 x  16 x 256   ->     8 x   8 x 256
   18 conv    256  3 x 3 / 1     8 x   8 x 256   ->     8 x   8 x 256  0.075 BFLOPs
   19 conv    256  1 x 1 / 1     8 x   8 x 256   ->     8 x   8 x 256  0.008 BFLOPs
   20 conv    256  3 x 3 / 1     8 x   8 x 256   ->     8 x   8 x 256  0.075 BFLOPs
   21 conv    256  1 x 1 / 1     8 x   8 x 256   ->     8 x   8 x 256  0.008 BFLOPs
   22 conv    256  3 x 3 / 1     8 x   8 x 256   ->     8 x   8 x 256  0.075 BFLOPs
   23 conv      2  1 x 1 / 1     8 x   8 x 256   ->     8 x   8 x   2  0.000 BFLOPs
   24 avg                        8 x   8 x   2   ->     2
   25 softmax                                           2
Learning Rate: 0.1, Momentum: 0.9, Decay: 0.0005
383
256 256
Saving weights to dogs/backup/dogs.start.conv.weights
Loaded: 0.000048 seconds
1, 0.334: 0.693147, 0.693147 avg, 0.000000 rate, 42.996210 seconds, 128 images
Loaded: 0.000038 seconds
2, 0.668: 0.693147, 0.693147 avg, 0.000000 rate, 46.077885 seconds, 256 images
Loaded: 0.000030 seconds
3, 1.003: 0.693147, 0.693147 avg, 0.000000 rate, 46.072171 seconds, 384 images
Saving weights to dogs/backup/dogs_1.weights