[PyTorch 2.2.2-1.5.11-SNAPSHOT] Training produces poor MNIST model on Windows
haifengl opened this issue · 7 comments
On macOS and Linux, PyTorch 2.2.2 training with MNIST produces good models with training accuracy > 90%. However, the same code reports very low accuracy (about 11%) on Windows. The same code works fine with 2.2.1 on Windows. You can run your sample code to reproduce the issue. The code below calculates the accuracy:
net.eval();
int correct = 0;
int n = 0;
for (ExampleIterator it = dataLoader.begin(); !it.equals(dataLoader.end()); it = it.increment()) {
    Example batch = it.access();
    var output = net.forward(batch.data());
    var prediction = output.argmax(new LongOptional(1), false);
    correct += prediction.eq(batch.target()).sum().item_int();
    n += batch.target().size(0);
    // Explicitly free native memory
    batch.close();
}
System.out.format("Training accuracy = %.2f%%\n", 100.0 * correct / n);
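For reference, the metric computed above is simply correct predictions over total examples, with the prediction taken as the argmax of each output row. A minimal plain-Java sketch of the same computation, without the native tensors (the numbers here are made up for illustration):

```java
public class Accuracy {
    // Index of the largest logit in one output row (the predicted class).
    static int argmax(double[] logits) {
        int best = 0;
        for (int i = 1; i < logits.length; i++) {
            if (logits[i] > logits[best]) best = i;
        }
        return best;
    }

    public static void main(String[] args) {
        // Fake model outputs (3 examples, 3 classes) and their true labels.
        double[][] outputs = {{0.1, 2.0, 0.3}, {1.5, 0.2, 0.1}, {0.0, 0.1, 3.0}};
        int[] targets = {1, 0, 1};
        int correct = 0;
        for (int i = 0; i < outputs.length; i++) {
            if (argmax(outputs[i]) == targets[i]) correct++;
        }
        System.out.format("Training accuracy = %.2f%%\n", 100.0 * correct / targets.length);
    }
}
```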
Is it the same with 2.3.0-1.5.11-SNAPSHOT?
2.3.0 doesn't work on Windows: it fails to load jnitorch.dll. I think it is the same issue as #1500.
That's strange. The missing library has been added to JavaCPP, not to the PyTorch presets.
Anyway, I see that convergence is still abnormal on Windows with 2.3.0, just like with 2.2.2. It was good on 2.2.1.
I'll try to find out what's happening.
No idea why PyTorch 2.2.2 would find libomp140 but PyTorch 2.3.0 would not.
Anyway, there is obviously a problem with this library. The sample MNIST code gives sensible results only if we set OMP_NUM_THREADS to 1.
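Since OMP_NUM_THREADS must be in the environment before jnitorch.dll is loaded, it cannot be set from inside the running JVM; one option is a small launcher that re-runs the training class with the variable set. A sketch (the SimpleMNIST class name is an assumption based on the presets sample; adapt to your entry point):

```java
import java.util.Map;

public class OmpWorkaround {
    public static void main(String[] args) throws Exception {
        // Re-launch the training program with OMP_NUM_THREADS=1 in its environment.
        ProcessBuilder pb = new ProcessBuilder(
                "java", "-cp", System.getProperty("java.class.path"), "SimpleMNIST");
        Map<String, String> env = pb.environment();
        env.put("OMP_NUM_THREADS", "1");  // must be set before the native library loads
        System.out.println(env.get("OMP_NUM_THREADS"));
        // pb.inheritIO().start().waitFor();  // uncomment to actually launch training
    }
}
```

Setting the variable in the shell before starting the JVM (e.g. `set OMP_NUM_THREADS=1` on Windows) achieves the same thing without a launcher.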
I see that the PyTorch team recommends the Intel version (iomp), and that's what they ship with the official libtorch.
So I think we must tweak the Windows build to link with iomp instead of what CMake finds by default.
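One way to steer that, sketched below and untested, is to override the stock FindOpenMP hint variables before `find_package(OpenMP)` runs so CMake resolves Intel's libiomp5md; the compiler flag and library path here are placeholders, not the actual presets build configuration:

```cmake
# Untested sketch: force CMake's FindOpenMP toward Intel's iomp on Windows.
# The flag and the path to libiomp5md.lib are assumptions to be adapted.
set(OpenMP_CXX_FLAGS "/openmp")
set(OpenMP_CXX_LIB_NAMES "libiomp5md")
set(OpenMP_libiomp5md_LIBRARY "C:/path/to/libiomp5md.lib")
find_package(OpenMP REQUIRED)
```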
So, after many experiments and code investigations, it turns out that when the GitHub runner was upgraded to a new version of Visual Studio (about 2 months ago, when we merged PyTorch 2.2.2), the Windows build of libtorch linked against both the legacy vcomp and the newer (SIMD-compatible) libomp libraries. This is the reason for the wrong computation results.
I included a fix in PR #1510 that consists of removing, on Windows, PyTorch's FindOpenMP.cmake adaptation and using the stock CMake version instead. As a result, the binary links only against the legacy vcomp. This works, but probably doesn't give the best performance.
The official build uses MKL, which includes OpenMP. We could do the same (linking dynamically instead of statically), but this would require adding a dependency on MKL, even for people using the GPU only. Also, PyTorch uses a 2022 version of MKL; I'm not sure it would work with the 2024 version in the current MKL presets.
I also noticed that OpenBLAS is not detected by the PyTorch build. I don't know if, or how, it is supposed to be found during the build.
Thanks for the hard work!