Multi-GPU training causes a large drop in accuracy

Question

coldlarry opened this issue 3 years ago · 5 comments

coldlarry commented 3 years ago

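Hello, I am training on multiple GPUs (4 cards), and I also increased the default per-GPU batch size that you set.

During training, the mAP at the first evaluation is close to 0. (With single-GPU training, the first evaluation usually gives about 0.6.)

Could you tell me what causes this? Is it a learning-rate problem?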

Answer 1 · 2021-10-22T02:21:18.000Z

With single-GPU training, increasing the batch size also makes the accuracy drop significantly. Why is that?

Answer 2 · 2021-10-22T08:27:24.000Z

Did you adjust the learning rate when you increased the batch size?
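
For reference, one common heuristic for adjusting the learning rate with the batch size is linear scaling: multiply the learning rate by the same factor as the total batch size. The sketch below only illustrates that heuristic. The function name, its parameters, and the example numbers are hypothetical and are not taken from this repository, and it assumes the configured batch size is per GPU, so the total batch size is the per-GPU value times the number of GPUs.

```python
def scale_learning_rate(base_lr, base_batch_size, per_gpu_batch_size, num_gpus=1):
    """Linear scaling heuristic (an assumption, not this repo's documented method).

    base_lr            learning rate that works with base_batch_size
    base_batch_size    total batch size the base_lr was tuned for
    per_gpu_batch_size batch size used on each GPU in the new run
    num_gpus           number of GPUs in the new run
    """
    if base_lr <= 0:
        raise ValueError("base_lr must be positive")
    if base_batch_size <= 0 or per_gpu_batch_size <= 0 or num_gpus <= 0:
        raise ValueError("batch sizes and num_gpus must be positive")

    # Total number of samples that contribute to one optimizer step.
    effective_batch_size = per_gpu_batch_size * num_gpus

    # Scale the learning rate by the same factor as the batch size.
    return base_lr * effective_batch_size / base_batch_size


if __name__ == "__main__":
    # Hypothetical numbers: defaults tuned for batch size 8 at lr 1e-3,
    # new run uses 4 GPUs with 16 samples each.
    new_lr = scale_learning_rate(1e-3, base_batch_size=8,
                                 per_gpu_batch_size=16, num_gpus=4)
    print(f"scaled learning rate: {new_lr:g}")
```

The heuristic is only a starting point. A large scaled learning rate is often combined with a few warm-up iterations, and the best value may still need tuning for a given model.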

Answer 3 · 2021-10-22T08:29:57.000Z

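Hello, indeed I did not adjust the learning rate (I did not expect it to matter this much).
One more question: does the current code support multi-GPU training?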

Answer 4 · 2021-10-22T08:31:44.000Z

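When I train on multiple GPUs, the run stops printing anything partway through training, with no error message, and in the end I killed the process.
I saw an earlier issue asking about multi-GPU training, so I would like to know whether multi-GPU training is supported now.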

Answer 5 · 2021-10-25T01:46:45.000Z

Could you share the output log? One possible cause is that the processes on the different GPUs are out of sync, so the job hangs with GPU utilization stuck at 100%.

For now, the code is best run on a single GPU. If you want to use multiple GPUs, see the third question in the Training and Test section of the FAQ, and Issue #11.
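
As an illustration of how processes can get out of sync, one frequent cause in data-parallel training in general (an assumption here, not a diagnosis of this repository) is that the ranks run different numbers of iterations per epoch. A rank with an extra batch then waits forever in a collective operation that the other ranks never join. The sketch below is plain Python with hypothetical helper names. It models a simple round-robin split of the dataset without padding and checks whether every rank would take the same number of steps.

```python
import math


def batches_per_rank(num_samples, world_size, batch_size, drop_last=False):
    """Steps per epoch for each rank under a hypothetical round-robin split.

    Sample i goes to rank i % world_size and no padding is added, so ranks
    may receive different numbers of samples.
    """
    if num_samples < 0 or world_size <= 0 or batch_size <= 0:
        raise ValueError("num_samples must be >= 0; other arguments must be > 0")

    counts = []
    for rank in range(world_size):
        # Number of samples assigned to this rank.
        n = len(range(rank, num_samples, world_size))
        if drop_last:
            steps = n // batch_size            # incomplete final batch is dropped
        else:
            steps = math.ceil(n / batch_size)  # incomplete final batch is kept
        counts.append(steps)
    return counts


def ranks_are_aligned(counts):
    """True if every rank runs the same number of steps."""
    return len(set(counts)) == 1


if __name__ == "__main__":
    # Hypothetical dataset of 1001 samples on 4 GPUs with batch size 25.
    for drop_last in (False, True):
        counts = batches_per_rank(1001, 4, 25, drop_last=drop_last)
        print(f"drop_last={drop_last}: steps per rank {counts}, "
              f"aligned={ranks_are_aligned(counts)}")
```

If the step counts differ, padding the dataset so that every rank gets the same number of samples, or dropping the incomplete final batch, are two usual ways to restore alignment. The output log requested above would help confirm whether this is what is happening.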