[Bug] With multi-GPU training, accuracy from eval after training is not guaranteed to match offline test
whlook opened this issue · 0 comments
Prerequisite
- I have searched Issues and Discussions but cannot get the expected help.
- The bug has not been fixed in the latest version (https://github.com/open-mmlab/mmengine).
Environment
Reproduces the problem - code sample
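If a model contains BN (not SyncBN) and is trained on multiple GPUs (2 GPUs), then during the eval that follows training each rank holds different BN running statistics, so the final eval accuracy does not match test. Offline test reloads the same pth on every rank, so its results are identical on every run.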
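A minimal single-process sketch of the effect, not taken from my training code: two deep copies of one model stand in for two DDP ranks (an assumption made so it runs without a launcher), and each copy sees differently distributed data. The names `base` and `replicas` are made up for illustration.

```python
import copy

import torch
from torch import nn

torch.manual_seed(0)

# One model, copied twice to emulate two ranks that start from identical weights.
base = nn.Sequential(nn.Linear(4, 4), nn.BatchNorm1d(4))
replicas = [copy.deepcopy(base) for _ in range(2)]

# Each emulated rank sees a different data shard. Plain BN updates its
# running statistics from the local batch only, so they drift apart.
for rank, model in enumerate(replicas):
    model.train()
    with torch.no_grad():
        model(torch.randn(8, 4) + rank)  # shifted data per emulated rank

bn0, bn1 = replicas[0][1], replicas[1][1]
stats_differ = not torch.allclose(bn0.running_mean, bn1.running_mean)

# In eval mode BN normalizes with the running statistics, so the same
# input produces different outputs on the two replicas.
for model in replicas:
    model.eval()
x = torch.randn(3, 4)
with torch.no_grad():
    outputs_differ = not torch.allclose(replicas[0](x), replicas[1](x))

print(stats_differ, outputs_differ)
```

In a real DDP run the learnable weights stay in sync through gradient all-reduce, but BN buffers such as `running_mean` and `running_var` are updated locally on each rank, which is the part this sketch imitates.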
Reproduces the problem - command or script
Always reproducible: it occurs in a DDP environment whenever BN is used.
Reproduces the problem - error message
None
Additional information
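- Eval after train should be as reliable as test.
- In test, all ranks use identical weights.
- In the eval after train, each rank uses different BN parameters.
- The model should be synchronized across ranks before val (TODO).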
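One possible shape for the TODO above, as a sketch only: broadcast every buffer from rank 0 right before validation. `sync_buffers` is a hypothetical helper, not an existing MMEngine function, and it assumes the checkpoint used by offline test is the one saved from rank 0, so that eval and test see the same BN statistics.

```python
import torch
import torch.distributed as dist
from torch import nn


@torch.no_grad()
def sync_buffers(model: nn.Module, src: int = 0) -> None:
    """Copy all buffers (e.g. BN running_mean / running_var) from `src` to every rank.

    Hypothetical helper: does nothing when no process group is initialized,
    so it is safe to call in single-GPU runs as well.
    """
    if not (dist.is_available() and dist.is_initialized()):
        return
    for buf in model.buffers():
        # In-place broadcast: after this call every rank holds src's values.
        dist.broadcast(buf, src=src)
```

It would be called on the wrapped model once training has finished for the epoch and before the validation loop starts, for example from a hook's before-validation callback. Averaging the buffers across ranks with `dist.all_reduce` would be another option, but then the values used in eval would only match offline test if the averaged buffers are also what gets written to the checkpoint.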