open-mmlab/mmengine

[Bug] With multiple GPUs, eval-after-train accuracy and offline test accuracy are not guaranteed to match

whlook opened this issue 2 months ago · 0 comments

whlook commented 2 months ago

Prerequisite

I have searched Issues and Discussions but cannot get the expected help.
The bug has not been fixed in the latest version (https://github.com/open-mmlab/mmengine).

Environment

Reproduces the problem - code sample

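If a model that contains BN (not SyncBN) is trained on multiple GPUs (2 GPUs) and then evaluated, the BN layers differ from rank to rank, so the final eval accuracy does not match the test accuracy. Offline test reloads the same pth on every rank, so every test run gives the same result.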
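To illustrate why the ranks drift apart, here is a minimal pure-Python sketch of a BN-style running-mean update. It is a simulation, not mmengine code: `update_running_mean` and `simulate_rank` are hypothetical helpers, the momentum of 0.1 is assumed from PyTorch's BatchNorm default, and a different random seed stands in for the different data shard each rank sees.

```python
import random

MOMENTUM = 0.1  # assumed: PyTorch's default BatchNorm momentum


def update_running_mean(running_mean, batch, momentum=MOMENTUM):
    """Apply one BN-style running-mean update from a single batch."""
    batch_mean = sum(batch) / len(batch)
    return (1 - momentum) * running_mean + momentum * batch_mean


def simulate_rank(seed, steps=100, batch_size=8):
    """Accumulate the running mean on one rank that sees its own data shard."""
    rng = random.Random(seed)  # a different seed stands in for a different shard
    running_mean = 0.0
    for _ in range(steps):
        batch = [rng.gauss(1.0, 2.0) for _ in range(batch_size)]
        running_mean = update_running_mean(running_mean, batch)
    return running_mean


# Without SyncBN, each rank keeps its own statistics.
rank_means = [simulate_rank(seed=rank) for rank in range(2)]
print("per-rank running_mean:", rank_means)

# Averaging across ranks gives every rank the same value again.
synced = sum(rank_means) / len(rank_means)
print("after averaging:", synced)
```

The two printed per-rank values differ because each rank only ever sees its own batches, which is the mismatch described above. A checkpoint stores one set of statistics, so reloading it puts every rank back on identical values.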

Reproduces the problem - command or script

Always reproducible: it occurs in a DDP environment whenever BN is used.

Reproduces the problem - error message

None

Additional information

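Eval after train should be as reliable as test.
In test, all ranks use the same weights.
In the eval after train, each rank uses different BN parameters.
The model should be synchronized across ranks before val (TODO).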
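One possible shape for the synchronization step in the TODO, sketched with plain `torch.distributed` calls. `sync_buffers` is a hypothetical helper and not an mmengine API. Averaging the floating-point buffers and broadcasting the integer ones from rank 0 is an assumption about what "synchronized" should mean here. Note that averaged statistics are not the same as rank 0's saved statistics, so matching an offline test on the saved pth exactly would require broadcasting every buffer from the rank that writes the checkpoint instead.

```python
import torch
import torch.distributed as dist
from torch import nn


@torch.no_grad()
def sync_buffers(model: nn.Module) -> None:
    """Make module buffers (e.g. BN running stats) identical on all ranks.

    Hypothetical helper: does nothing when torch.distributed is not
    initialized or when only one process is running.
    """
    if not (dist.is_available() and dist.is_initialized()):
        return
    world_size = dist.get_world_size()
    if world_size == 1:
        return
    for buf in model.buffers():
        if buf.is_floating_point():
            # running_mean / running_var: average over all ranks
            dist.all_reduce(buf, op=dist.ReduceOp.SUM)
            buf.div_(world_size)
        else:
            # integer buffers such as num_batches_tracked: copy rank 0's value
            dist.broadcast(buf, src=0)


# Single-process usage: the call is a no-op and leaves the buffers untouched.
bn = nn.BatchNorm1d(4)
before = bn.running_mean.clone()
sync_buffers(bn)
print(torch.equal(before, bn.running_mean))
```

In a DDP run, every rank would have to call this right before validation starts, since `all_reduce` and `broadcast` are collective operations.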