Getting different accuracy on different GPUs

Question

jimmylin0979 opened this issue 2 years ago · 4 comments

Hi, I trained the mobilevit-xxs model on two different machines and got different results: the accuracy on the Titan RTX is consistently 0.5% lower than on the RTX 2080 Ti.

Below are the specs of the two machines:

Machine 1: Titan RTX, PyTorch 1.10.0
Machine 2: RTX 2080 Ti, PyTorch 1.11.0

After checking the code, the only potential cause I can think of is AMP, but both GPUs use the TU102 chip, so they should support the same floating-point precisions.

Do you have any idea what might be causing the problem?

Thank you

Answer 1 · 2022-10-30T02:05:33.000Z

Besides AMP, TF32 matmul could also be a culprit.
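
To check whether TF32 is actually in play, it may help to run a small report on both machines and compare the output. The sketch below is only an illustration: the function name `precision_report` is made up here, and it assumes a PyTorch version that exposes the `allow_tf32` flags. As far as I know, TF32 tensor cores exist only on Ampere or newer GPUs (compute capability 8.0 and up), so on a card that reports a lower compute capability these flags would likely not change the numerics.

```python
import importlib.util
import platform


def precision_report():
    """Collect settings that commonly differ between two training machines."""
    report = {"python": platform.python_version()}
    if importlib.util.find_spec("torch") is None:
        report["torch"] = None  # PyTorch is not installed in this environment
        return report

    import torch

    report["torch"] = torch.__version__
    report["cuda_runtime"] = torch.version.cuda      # None on CPU-only builds
    report["cudnn"] = torch.backends.cudnn.version()  # None if cuDNN is unavailable
    # TF32 switches; they only matter on GPUs that have TF32 tensor cores.
    report["tf32_matmul"] = torch.backends.cuda.matmul.allow_tf32
    report["tf32_cudnn"] = torch.backends.cudnn.allow_tf32
    if torch.cuda.is_available():
        report["gpu"] = torch.cuda.get_device_name(0)
        report["compute_capability"] = torch.cuda.get_device_capability(0)
    return report


if __name__ == "__main__":
    for key, value in precision_report().items():
        print(f"{key}: {value}")
```

If the two reports differ only in the PyTorch, CUDA, or cuDNN versions, that version gap is another candidate to rule out before blaming the hardware.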

Answer 2 · 2022-11-02T00:29:59.000Z

Thanks for the fast reply!

> Besides AMP, TF32 matmul could also be a culprit.

In that case, which mode were your experiments trained in, TF32 or FP32?
Thank you!

Answer 3 · 2022-11-02T00:47:03.000Z

Try to enable TF32. Also, use a longer warmup of 20 epochs.
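
For the warmup part, here is a minimal sketch of what a 20-epoch warmup could look like. Only the 20-epoch length comes from the suggestion above; the linear ramp, the cosine decay afterwards, and the values of `base_lr`, `min_lr`, and `total_epochs` are placeholder assumptions, not the settings of any particular training recipe.

```python
import math


def lr_at_epoch(epoch, base_lr=2e-3, warmup_epochs=20, total_epochs=300, min_lr=2e-4):
    """Linear warmup to base_lr, then cosine decay to min_lr (placeholder values)."""
    if epoch < warmup_epochs:
        # Ramp linearly from min_lr up to base_lr over the warmup epochs.
        return min_lr + (base_lr - min_lr) * epoch / warmup_epochs
    # Fraction of the post-warmup schedule that has elapsed, in [0, 1].
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))


if __name__ == "__main__":
    for epoch in (0, 10, 20, 160, 300):
        print(epoch, lr_at_epoch(epoch))
```

The TF32 switches themselves are `torch.backends.cuda.matmul.allow_tf32` and `torch.backends.cudnn.allow_tf32`; setting both to `True` before training enables TF32 on hardware that supports it.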

Answer 4 · 2022-11-02T00:54:18.000Z

> Try to enable TF32. Also, use a longer warmup of 20 epochs.

Thanks! I will run an experiment with those settings.