dvlab-research/PFENet

results fluctuate greatly even when all random seeds are set

chunbolang opened this issue · 5 comments

Hi, thanks for sharing your work!
There's a question that troubles me: I set the same random seed every time, but the results still fluctuate greatly (by about 1%).
Is this due to the "Dropout" operation, or something I neglected?
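
For reference, I fix the seeds roughly as follows (a minimal sketch of the usual PyTorch recipe, not the exact code from my script):

```python
import random

import numpy as np
import torch

def set_seed(seed: int) -> None:
    # Fix every RNG the training script touches.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)            # seeds CPU (and CUDA on recent versions)
    torch.cuda.manual_seed_all(seed)   # explicit for older PyTorch versions

set_seed(321)  # the actual seed value is arbitrary
```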

Regards,
Lang

@chunbolang

Thanks for your attention.

Dropout might be one of the reasons, but it is used to combat overfitting, which is rather important for few-shot segmentation models, since they are prone to overfitting the base classes. You can set the dropout probability to a smaller value to check its effect.
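
As a rough illustration (the head below is hypothetical, not PFENet's actual module), you can compare runs with different `p` values to isolate dropout's contribution to the variance:

```python
import torch.nn as nn

# Hypothetical prediction head; PFENet's actual modules may differ.
# Compare runs with p=0.5, p=0.1, and p=0.0 to see how much of the
# fluctuation disappears as dropout gets weaker.
head = nn.Sequential(
    nn.Conv2d(256, 256, kernel_size=3, padding=1),
    nn.ReLU(inplace=True),
    nn.Dropout2d(p=0.1),
    nn.Conv2d(256, 2, kernel_size=1),
)
```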

Also, we have found that some performance variance is inevitable for all segmentation models (e.g., DeepLab-V3+ and PSPNet). Some people say that the internal implementation of F.interpolate may cause performance variance even when all seeds are fixed. Classification models do not need F.interpolate, so their results are identical as long as the seeds are fixed.
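
One way to probe this (a sketch, assuming a CUDA build of PyTorch >= 1.8) is to ask PyTorch to error out on any op that lacks a deterministic implementation; on many versions, the CUDA backward of bilinear interpolation is one of them:

```python
import torch
import torch.nn.functional as F

# Raise a RuntimeError whenever a nondeterministic op is used.
torch.use_deterministic_algorithms(True)

x = torch.randn(1, 3, 32, 32, device="cuda", requires_grad=True)
y = F.interpolate(x, scale_factor=2, mode="bilinear", align_corners=True)

# The forward pass is deterministic; the CUDA backward of bilinear
# interpolation is not, so this typically raises a RuntimeError.
y.sum().backward()
```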

Therefore, to minimize the variance, we propose using more testing episodes with the same initial random seeds.
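
Something along these lines (a sketch; `sample_episode`, `compute_iou`, and `model` are hypothetical placeholders for the actual test code):

```python
import torch

NUM_EPISODES = 5000        # e.g., 5000 instead of 1000 to shrink the variance

torch.manual_seed(321)     # same initial seed before every evaluation run
ious = []
for _ in range(NUM_EPISODES):
    support, query, target = sample_episode()   # hypothetical episode sampler
    pred = model(support, query)
    ious.append(compute_iou(pred, target))      # hypothetical metric helper
print(sum(ious) / len(ious))
```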

@tianzhuotao

Thanks for the quick reply.

The interpolation operation (e.g., the bilinear interpolation adopted in the paper) is a deterministic process without parameters. How could it affect the experimental results? I am puzzled.

Sorry, I have no idea about the exact reason why fixed seeds still yield different results. Perhaps you can check the source code for more details.

Honestly, that explanation regarding F.interpolate was found via Google.

@tianzhuotao

Hi,
I tried my best to control the factors that may introduce uncertainty into the experiments.

As observed, the interpolation operation (F.interpolate) does cause unexpected fluctuations, which may be similar to the nondeterminism of the accelerated convolution operations in cuDNN. However, the latter can be eliminated by setting "cudnn.benchmark=False" and "cudnn.deterministic=True". As for the former, I think we can only run the training script multiple times and take the average test accuracy as the final result, as sketched below.
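
For completeness, here are those cuDNN settings together with the run-averaging workaround (a sketch; `train_and_test` is a hypothetical entry point, not a function from the repo):

```python
import statistics

import torch.backends.cudnn as cudnn

# Remove cuDNN as a source of nondeterminism: disable autotuned kernel
# selection and force deterministic convolution implementations.
cudnn.benchmark = False
cudnn.deterministic = True

# F.interpolate's CUDA backward can still vary, so run training several
# times and report the mean (train_and_test is a hypothetical entry point).
scores = [train_and_test(seed=321) for _ in range(5)]
print(f"mIoU: {statistics.mean(scores):.2f} ± {statistics.stdev(scores):.2f}")
```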

Anyway, thanks for your reply; it really helped me find the potential problem.

Regards,
Lang

@chunbolang Thank you for sharing your valuable findings!