Classifier training behaves differently on AMD and Intel GPUs
wanfade opened this issue · 2 comments
Hello @sowson, thanks for the yolov4 update.
I have a problem: training behaves differently on two GPUs. When training on the AMD Radeon Pro 455, the loss decreases normally, but with the Intel(R) HD Graphics 530 on the same MacBook Pro, the loss becomes nan or inf at the first or second batch. I switch between the GPUs with '-i 0' and '-i 1'.
It is a 2-class classification task with 383 images in total. The base model is darknet19.conv.23, extracted from darknet19.weights.
I also trained some yolov3 models on the two GPUs, and there the loss values did not differ much.
Here are the training logs. On the Intel GPU:
Device IDs: 2
Device ID: 0
Device name: Intel(R) HD Graphics 530
Device vendor: Intel Inc.
Device opencl availability: OpenCL 1.2
Device opencl used: 1.2(Aug 31 2020 22:26:30)
Device double precision: NO
Device max group size: 256
Device address bits: 64
dogs
1
layer filters size input output
0 conv 32 3 x 3 / 1 256 x 256 x 3 -> 256 x 256 x 32 0.113 BFLOPs
1 max 2 x 2 / 2 256 x 256 x 32 -> 128 x 128 x 32
2 conv 64 3 x 3 / 1 128 x 128 x 32 -> 128 x 128 x 64 0.604 BFLOPs
3 max 2 x 2 / 2 128 x 128 x 64 -> 64 x 64 x 64
4 conv 128 3 x 3 / 1 64 x 64 x 64 -> 64 x 64 x 128 0.604 BFLOPs
5 conv 64 1 x 1 / 1 64 x 64 x 128 -> 64 x 64 x 64 0.067 BFLOPs
6 conv 128 3 x 3 / 1 64 x 64 x 64 -> 64 x 64 x 128 0.604 BFLOPs
7 max 2 x 2 / 2 64 x 64 x 128 -> 32 x 32 x 128
8 conv 256 3 x 3 / 1 32 x 32 x 128 -> 32 x 32 x 256 0.604 BFLOPs
9 conv 128 1 x 1 / 1 32 x 32 x 256 -> 32 x 32 x 128 0.067 BFLOPs
10 conv 256 3 x 3 / 1 32 x 32 x 128 -> 32 x 32 x 256 0.604 BFLOPs
11 max 2 x 2 / 2 32 x 32 x 256 -> 16 x 16 x 256
12 conv 512 3 x 3 / 1 16 x 16 x 256 -> 16 x 16 x 512 0.604 BFLOPs
13 conv 256 1 x 1 / 1 16 x 16 x 512 -> 16 x 16 x 256 0.067 BFLOPs
14 conv 512 3 x 3 / 1 16 x 16 x 256 -> 16 x 16 x 512 0.604 BFLOPs
15 conv 256 1 x 1 / 1 16 x 16 x 512 -> 16 x 16 x 256 0.067 BFLOPs
16 conv 512 3 x 3 / 1 16 x 16 x 256 -> 16 x 16 x 512 0.604 BFLOPs
17 max 2 x 2 / 2 16 x 16 x 512 -> 8 x 8 x 512
18 conv 1024 3 x 3 / 1 8 x 8 x 512 -> 8 x 8 x1024 0.604 BFLOPs
19 conv 512 1 x 1 / 1 8 x 8 x1024 -> 8 x 8 x 512 0.067 BFLOPs
20 conv 1024 3 x 3 / 1 8 x 8 x 512 -> 8 x 8 x1024 0.604 BFLOPs
21 conv 512 1 x 1 / 1 8 x 8 x1024 -> 8 x 8 x 512 0.067 BFLOPs
22 conv 1024 3 x 3 / 1 8 x 8 x 512 -> 8 x 8 x1024 0.604 BFLOPs
23 conv 2 1 x 1 / 1 8 x 8 x1024 -> 8 x 8 x 2 0.000 BFLOPs
24 avg 8 x 8 x 2 -> 2
25 softmax 2
Learning Rate: 0.1, Momentum: 0.9, Decay: 0.0005
383
128 448
Saving weights to dogs/backup/dogs.start.conv.weights
Loaded: 0.000046 seconds
1, 0.334: 13.241982, 13.241982 avg, 0.000000 rate, 47.324300 seconds, 128 images
Loaded: 0.000050 seconds
2, 0.668: nan, nan avg, 0.000000 rate, 61.340651 seconds, 256 images
Loaded: 0.000050 seconds
3, 1.003: nan, nan avg, 0.000000 rate, 61.352529 seconds, 384 images
Saving weights to dogs/backup/dogs_1.weights
Loaded: 0.000044 seconds
And on the AMD GPU:
Device IDs: 2
Device ID: 1
Device name: AMD Radeon Pro 455 Compute Engine
Device vendor: AMD
Device opencl availability: OpenCL 1.2
Device opencl used: 1.2 (Sep 11 2020 22:04:49)
Device double precision: YES
Device max group size: 256
Device address bits: 32
dogs
1
layer filters size input output
0 conv 32 3 x 3 / 1 256 x 256 x 3 -> 256 x 256 x 32 0.113 BFLOPs
1 max 2 x 2 / 2 256 x 256 x 32 -> 128 x 128 x 32
2 conv 64 3 x 3 / 1 128 x 128 x 32 -> 128 x 128 x 64 0.604 BFLOPs
3 max 2 x 2 / 2 128 x 128 x 64 -> 64 x 64 x 64
4 conv 128 3 x 3 / 1 64 x 64 x 64 -> 64 x 64 x 128 0.604 BFLOPs
5 conv 64 1 x 1 / 1 64 x 64 x 128 -> 64 x 64 x 64 0.067 BFLOPs
6 conv 128 3 x 3 / 1 64 x 64 x 64 -> 64 x 64 x 128 0.604 BFLOPs
7 max 2 x 2 / 2 64 x 64 x 128 -> 32 x 32 x 128
8 conv 256 3 x 3 / 1 32 x 32 x 128 -> 32 x 32 x 256 0.604 BFLOPs
9 conv 128 1 x 1 / 1 32 x 32 x 256 -> 32 x 32 x 128 0.067 BFLOPs
10 conv 256 3 x 3 / 1 32 x 32 x 128 -> 32 x 32 x 256 0.604 BFLOPs
11 max 2 x 2 / 2 32 x 32 x 256 -> 16 x 16 x 256
12 conv 512 3 x 3 / 1 16 x 16 x 256 -> 16 x 16 x 512 0.604 BFLOPs
13 conv 256 1 x 1 / 1 16 x 16 x 512 -> 16 x 16 x 256 0.067 BFLOPs
14 conv 512 3 x 3 / 1 16 x 16 x 256 -> 16 x 16 x 512 0.604 BFLOPs
15 conv 256 1 x 1 / 1 16 x 16 x 512 -> 16 x 16 x 256 0.067 BFLOPs
16 conv 512 3 x 3 / 1 16 x 16 x 256 -> 16 x 16 x 512 0.604 BFLOPs
17 max 2 x 2 / 2 16 x 16 x 512 -> 8 x 8 x 512
18 conv 1024 3 x 3 / 1 8 x 8 x 512 -> 8 x 8 x1024 0.604 BFLOPs
19 conv 512 1 x 1 / 1 8 x 8 x1024 -> 8 x 8 x 512 0.067 BFLOPs
20 conv 1024 3 x 3 / 1 8 x 8 x 512 -> 8 x 8 x1024 0.604 BFLOPs
21 conv 512 1 x 1 / 1 8 x 8 x1024 -> 8 x 8 x 512 0.067 BFLOPs
22 conv 1024 3 x 3 / 1 8 x 8 x 512 -> 8 x 8 x1024 0.604 BFLOPs
23 conv 2 1 x 1 / 1 8 x 8 x1024 -> 8 x 8 x 2 0.000 BFLOPs
24 avg 8 x 8 x 2 -> 2
25 softmax 2
Learning Rate: 0.1, Momentum: 0.9, Decay: 0.0005
383
128 448
Saving weights to dogs/backup/dogs.start.conv.weights
Loaded: 0.000053 seconds
1, 0.334: 0.724708, 0.724708 avg, 0.000000 rate, 20.728899 seconds, 128 images
Loaded: 0.000023 seconds
2, 0.668: 0.699131, 0.722151 avg, 0.000000 rate, 28.956159 seconds, 256 images
Loaded: 0.000042 seconds
3, 1.003: 0.740397, 0.723975 avg, 0.000000 rate, 29.074495 seconds, 384 images
Saving weights to dogs/backup/dogs_1.weights
Loaded: 0.000050 seconds
4, 1.337: 0.764463, 0.728024 avg, 0.000000 rate, 21.115983 seconds, 512 images
Loaded: 0.000041 seconds
Hello @wanfade, the issue you have is a VRAM issue: the Intel GPU has less memory available. For a moment, try a 64x64 net size in your net cfg instead of 256x256. The amount of allocated memory will then be smaller and, I presume, you will be able to use the Intel GPU. Let me know the result of that experiment. Thanks!
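To give a rough idea of why the net size matters for memory, here is a minimal sketch that counts the forward output buffers of layers 0-23 from the layer table in the logs. The `LAYERS` list and the `activation_floats` helper are hypothetical names made up for this illustration, and the estimate assumes float32 outputs only. The real allocations in the OpenCL build (deltas, workspace, weights) are not modeled, and weights do not shrink with the input size.

```python
# (downsample factor relative to the input, output channels) for layers 0-23,
# read off the layer table printed in the training logs.
LAYERS = [
    (1, 32), (2, 32), (2, 64), (4, 64), (4, 128), (4, 64), (4, 128),
    (8, 128), (8, 256), (8, 128), (8, 256),
    (16, 256), (16, 512), (16, 256), (16, 512), (16, 256), (16, 512),
    (32, 512), (32, 1024), (32, 512), (32, 1024), (32, 512), (32, 1024), (32, 2),
]

def activation_floats(size, batch=1):
    """Number of output values over layers 0-23 for a square input (assumed model)."""
    return batch * sum((size // d) ** 2 * c for d, c in LAYERS)

for size in (256, 64):
    mib = activation_floats(size) * 4 / 2**20  # 4 bytes per float32
    print(f"{size}x{size}: about {mib:.1f} MiB of layer outputs per image")

# Every spatial dimension shrinks by 4 in each direction, so the total shrinks by 16.
print(activation_floats(256) / activation_floats(64))
```

Under these assumptions, going from 256x256 to 64x64 cuts the per-image output buffers by a factor of 16, which is why it is a quick way to test whether memory is the limiting factor.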
You are right. Maybe the network is too big for my Intel GPU.
I changed the input image size to 32x32 and to 64x64 (and also adjusted min crop and max crop), and the first batch became normal. But from the second batch on, the loss is nan.
Device IDs: 2
Device ID: 0
Device name: Intel(R) HD Graphics 530
Device vendor: Intel Inc.
Device opencl availability: OpenCL 1.2
Device opencl used: 1.2(Aug 31 2020 22:26:30)
Device double precision: NO
Device max group size: 256
Device address bits: 64
dogs
1
layer filters size input output
0 conv 32 3 x 3 / 1 32 x 32 x 3 -> 32 x 32 x 32 0.002 BFLOPs
1 max 2 x 2 / 2 32 x 32 x 32 -> 16 x 16 x 32
2 conv 64 3 x 3 / 1 16 x 16 x 32 -> 16 x 16 x 64 0.009 BFLOPs
3 max 2 x 2 / 2 16 x 16 x 64 -> 8 x 8 x 64
4 conv 128 3 x 3 / 1 8 x 8 x 64 -> 8 x 8 x 128 0.009 BFLOPs
5 conv 64 1 x 1 / 1 8 x 8 x 128 -> 8 x 8 x 64 0.001 BFLOPs
6 conv 128 3 x 3 / 1 8 x 8 x 64 -> 8 x 8 x 128 0.009 BFLOPs
7 max 2 x 2 / 2 8 x 8 x 128 -> 4 x 4 x 128
8 conv 256 3 x 3 / 1 4 x 4 x 128 -> 4 x 4 x 256 0.009 BFLOPs
9 conv 128 1 x 1 / 1 4 x 4 x 256 -> 4 x 4 x 128 0.001 BFLOPs
10 conv 256 3 x 3 / 1 4 x 4 x 128 -> 4 x 4 x 256 0.009 BFLOPs
11 max 2 x 2 / 2 4 x 4 x 256 -> 2 x 2 x 256
12 conv 512 3 x 3 / 1 2 x 2 x 256 -> 2 x 2 x 512 0.009 BFLOPs
13 conv 256 1 x 1 / 1 2 x 2 x 512 -> 2 x 2 x 256 0.001 BFLOPs
14 conv 512 3 x 3 / 1 2 x 2 x 256 -> 2 x 2 x 512 0.009 BFLOPs
15 conv 256 1 x 1 / 1 2 x 2 x 512 -> 2 x 2 x 256 0.001 BFLOPs
16 conv 512 3 x 3 / 1 2 x 2 x 256 -> 2 x 2 x 512 0.009 BFLOPs
17 max 2 x 2 / 2 2 x 2 x 512 -> 1 x 1 x 512
18 conv 1024 3 x 3 / 1 1 x 1 x 512 -> 1 x 1 x1024 0.009 BFLOPs
19 conv 512 1 x 1 / 1 1 x 1 x1024 -> 1 x 1 x 512 0.001 BFLOPs
20 conv 1024 3 x 3 / 1 1 x 1 x 512 -> 1 x 1 x1024 0.009 BFLOPs
21 conv 512 1 x 1 / 1 1 x 1 x1024 -> 1 x 1 x 512 0.001 BFLOPs
22 conv 1024 3 x 3 / 1 1 x 1 x 512 -> 1 x 1 x1024 0.009 BFLOPs
23 conv 2 1 x 1 / 1 1 x 1 x1024 -> 1 x 1 x 2 0.000 BFLOPs
24 avg 1 x 1 x 2 -> 2
25 softmax 2
Loading weights from ../Darknet_Opencl/darknet19.conv.23...Done!
Learning Rate: 0.1, Momentum: 0.9, Decay: 0.0005
383
32 32
Saving weights to dogs/backup/dogs.start.conv.weights
Loaded: 0.000047 seconds
1, 0.334: 0.895304, 0.895304 avg, 0.000000 rate, 12.894553 seconds, 128 images
Loaded: 0.000045 seconds
2, 0.668: nan, nan avg, 0.000000 rate, 13.318852 seconds, 256 images
Loaded: 0.000041 seconds
3, 1.003: nan, nan avg, 0.000000 rate, 13.335070 seconds, 384 images
Saving weights to dogs/backup/dogs_1.weights
Loaded: 0.000052 seconds
4, 1.337: nan, nan avg, 0.000000 rate, 12.436153 seconds, 512 images
Loaded: 0.000043 seconds
5, 1.671: nan, nan avg, 0.000000 rate, 13.279034 seconds, 640 images
Loaded: 0.000044 seconds
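When comparing runs like the ones above, it can help to locate the first batch with a non-finite loss automatically. The following is a minimal sketch, assuming the progress lines keep the format shown in these logs. `LINE_RE` and `first_bad_batch` are hypothetical names for this illustration and are not part of darknet.

```python
import math
import re

# Matches classifier progress lines such as:
# "2, 0.668: nan, nan avg, 0.000000 rate, 13.318852 seconds, 256 images"
LINE_RE = re.compile(r"^\s*(\d+),\s*[\d.]+:\s*(\S+),\s*(\S+) avg,")

def first_bad_batch(log_lines):
    """Return the batch number of the first non-finite loss, or None."""
    for line in log_lines:
        m = LINE_RE.match(line)
        if not m:
            continue  # skip device info, layer table, "Loaded:" lines, etc.
        loss = float(m.group(2))  # float() also parses "nan" and "inf"
        if not math.isfinite(loss):
            return int(m.group(1))
    return None

sample = [
    "Loaded: 0.000047 seconds",
    "1, 0.334: 0.895304, 0.895304 avg, 0.000000 rate, 12.894553 seconds, 128 images",
    "Loaded: 0.000045 seconds",
    "2, 0.668: nan, nan avg, 0.000000 rate, 13.318852 seconds, 256 images",
]
print(first_bad_batch(sample))  # 2
```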
Then I changed the network structure so that no convolutional layer has more than 256 filters, kept the input size at 256x256, and trained without the base model. The training log is normal:
Device IDs: 2
Device ID: 0
Device name: Intel(R) HD Graphics 530
Device vendor: Intel Inc.
Device opencl availability: OpenCL 1.2
Device opencl used: 1.2(Aug 31 2020 22:26:30)
Device double precision: NO
Device max group size: 256
Device address bits: 64
dogs
1
layer filters size input output
0 conv 32 3 x 3 / 1 256 x 256 x 3 -> 256 x 256 x 32 0.113 BFLOPs
1 max 2 x 2 / 2 256 x 256 x 32 -> 128 x 128 x 32
2 conv 64 3 x 3 / 1 128 x 128 x 32 -> 128 x 128 x 64 0.604 BFLOPs
3 max 2 x 2 / 2 128 x 128 x 64 -> 64 x 64 x 64
4 conv 128 3 x 3 / 1 64 x 64 x 64 -> 64 x 64 x 128 0.604 BFLOPs
5 conv 64 1 x 1 / 1 64 x 64 x 128 -> 64 x 64 x 64 0.067 BFLOPs
6 conv 128 3 x 3 / 1 64 x 64 x 64 -> 64 x 64 x 128 0.604 BFLOPs
7 max 2 x 2 / 2 64 x 64 x 128 -> 32 x 32 x 128
8 conv 256 3 x 3 / 1 32 x 32 x 128 -> 32 x 32 x 256 0.604 BFLOPs
9 conv 128 1 x 1 / 1 32 x 32 x 256 -> 32 x 32 x 128 0.067 BFLOPs
10 conv 256 3 x 3 / 1 32 x 32 x 128 -> 32 x 32 x 256 0.604 BFLOPs
11 max 2 x 2 / 2 32 x 32 x 256 -> 16 x 16 x 256
12 conv 256 3 x 3 / 1 16 x 16 x 256 -> 16 x 16 x 256 0.302 BFLOPs
13 conv 256 1 x 1 / 1 16 x 16 x 256 -> 16 x 16 x 256 0.034 BFLOPs
14 conv 256 3 x 3 / 1 16 x 16 x 256 -> 16 x 16 x 256 0.302 BFLOPs
15 conv 256 1 x 1 / 1 16 x 16 x 256 -> 16 x 16 x 256 0.034 BFLOPs
16 conv 256 3 x 3 / 1 16 x 16 x 256 -> 16 x 16 x 256 0.302 BFLOPs
17 max 2 x 2 / 2 16 x 16 x 256 -> 8 x 8 x 256
18 conv 256 3 x 3 / 1 8 x 8 x 256 -> 8 x 8 x 256 0.075 BFLOPs
19 conv 256 1 x 1 / 1 8 x 8 x 256 -> 8 x 8 x 256 0.008 BFLOPs
20 conv 256 3 x 3 / 1 8 x 8 x 256 -> 8 x 8 x 256 0.075 BFLOPs
21 conv 256 1 x 1 / 1 8 x 8 x 256 -> 8 x 8 x 256 0.008 BFLOPs
22 conv 256 3 x 3 / 1 8 x 8 x 256 -> 8 x 8 x 256 0.075 BFLOPs
23 conv 2 1 x 1 / 1 8 x 8 x 256 -> 8 x 8 x 2 0.000 BFLOPs
24 avg 8 x 8 x 2 -> 2
25 softmax 2
Learning Rate: 0.1, Momentum: 0.9, Decay: 0.0005
383
256 256
Saving weights to dogs/backup/dogs.start.conv.weights
Loaded: 0.000048 seconds
1, 0.334: 0.693147, 0.693147 avg, 0.000000 rate, 42.996210 seconds, 128 images
Loaded: 0.000038 seconds
2, 0.668: 0.693147, 0.693147 avg, 0.000000 rate, 46.077885 seconds, 256 images
Loaded: 0.000030 seconds
3, 1.003: 0.693147, 0.693147 avg, 0.000000 rate, 46.072171 seconds, 384 images
Saving weights to dogs/backup/dogs_1.weights
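A side note on the log above: 0.693147 equals ln 2 to six decimals, which is the cross-entropy of a two-class softmax that assigns probability 0.5 to the true class. Assuming the reported loss is a plain softmax cross-entropy (an assumption, not checked against the darknet source), a loss that stays at exactly this value on every batch would suggest that the outputs are not changing between batches, so it may be worth watching a few more iterations. A minimal check of the arithmetic, with `cross_entropy` as a hypothetical helper:

```python
import math

def cross_entropy(p_true_class):
    """Cross-entropy loss for one sample, given the probability of its true class."""
    return -math.log(p_true_class)

# A two-class softmax with equal outputs gives each class probability 0.5.
print(round(cross_entropy(0.5), 6))  # 0.693147
```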