[PyTorch 2.2.2-1.5.11-SNAPSHOT] Training produces poor MNIST model on Windows
haifengl opened this issue · 7 comments
On macOS and Linux, PyTorch 2.2.2 training with MNIST produces good models with training accuracy > 90%. However, the same code reports very low accuracy (about 11%) on Windows. The same code works fine with 2.2.1 on Windows. You can run your sample code to reproduce the issue. The code below calculates the accuracy:
net.eval();
int correct = 0;
int n = 0;
for (ExampleIterator it = dataLoader.begin(); !it.equals(dataLoader.end()); it = it.increment()) {
    Example batch = it.access();
    var output = net.forward(batch.data());
    var prediction = output.argmax(new LongOptional(1), false);
    correct += prediction.eq(batch.target()).sum().item_int();
    n += batch.target().size(0);
    // Explicitly free native memory
    batch.close();
}
System.out.format("Training accuracy = %.2f%%\n", 100.0 * correct / n);
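For reference, the metric computed above is simply correct predictions over total examples, with the prediction taken as the argmax of each output row. A minimal plain-Java sketch of the same computation, without the native tensors (the numbers here are made up for illustration):

```java
public class Accuracy {
    // Index of the largest logit in one output row (the predicted class).
    static int argmax(double[] logits) {
        int best = 0;
        for (int i = 1; i < logits.length; i++) {
            if (logits[i] > logits[best]) best = i;
        }
        return best;
    }

    public static void main(String[] args) {
        // Fake model outputs (3 examples, 3 classes) and their true labels.
        double[][] outputs = {{0.1, 2.0, 0.3}, {1.5, 0.2, 0.1}, {0.0, 0.1, 3.0}};
        int[] targets = {1, 0, 1};
        int correct = 0;
        for (int i = 0; i < outputs.length; i++) {
            if (argmax(outputs[i]) == targets[i]) correct++;
        }
        System.out.format("Training accuracy = %.2f%%\n", 100.0 * correct / targets.length);
    }
}
```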
Is it the same with 2.3.0-1.5.11-SNAPSHOT?
2.3.0 doesn't work on Windows: it fails to load jnitorch.dll. I think it is the same issue as #1500.
That's strange. The missing library has been added to JavaCPP, not to the PyTorch presets.
Anyway, I see that convergence is still abnormal on Windows with 2.3.0, just like with 2.2.2. It was good on 2.2.1.
I'll try to find out what's happening.
No idea why PyTorch 2.2.2 would find libomp140 but PyTorch 2.3.0 would not.
Anyway, there is obviously a problem with this library. The sample MNIST code gives sensible results only if we set OMP_NUM_THREADS to 1.
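Since OMP_NUM_THREADS must be in the environment before jnitorch.dll is loaded, it cannot be set from inside the running JVM; one option is a small launcher that re-runs the training class with the variable set. A sketch (the SimpleMNIST class name is an assumption based on the presets sample; adapt to your entry point):

```java
import java.util.Map;

public class OmpWorkaround {
    public static void main(String[] args) throws Exception {
        // Re-launch the training program with OMP_NUM_THREADS=1 in its environment.
        ProcessBuilder pb = new ProcessBuilder(
                "java", "-cp", System.getProperty("java.class.path"), "SimpleMNIST");
        Map<String, String> env = pb.environment();
        env.put("OMP_NUM_THREADS", "1");  // must be set before the native library loads
        System.out.println(env.get("OMP_NUM_THREADS"));
        // pb.inheritIO().start().waitFor();  // uncomment to actually launch training
    }
}
```

Setting the variable in the shell before starting the JVM (e.g. `set OMP_NUM_THREADS=1` on Windows) achieves the same thing without a launcher.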
I see that the PyTorch team recommends the Intel version (iomp), and that's what they ship with the official libtorch.
So I think we must tweak the Windows build to link with iomp instead of what CMake finds by default.
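One way to steer that, sketched below and untested, is to override the stock FindOpenMP hint variables before `find_package(OpenMP)` runs so CMake resolves Intel's libiomp5md; the compiler flag and library path here are placeholders, not the actual presets build configuration:

```cmake
# Untested sketch: force CMake's FindOpenMP toward Intel's iomp on Windows.
# The flag and the path to libiomp5md.lib are assumptions to be adapted.
set(OpenMP_CXX_FLAGS "/openmp")
set(OpenMP_CXX_LIB_NAMES "libiomp5md")
set(OpenMP_libiomp5md_LIBRARY "C:/path/to/libiomp5md.lib")
find_package(OpenMP REQUIRED)
```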
So, after many experiments and code investigations, it turns out that when the GitHub runner was upgraded to a new version of Visual Studio (about 2 months ago, when we merged PyTorch 2.2.2), the Windows build of libtorch linked against both the legacy vcomp and the newer (SIMD-compatible) libomp libraries. This is the reason for the wrong computation results.
I included a fix in PR #1510 that consists of removing, on Windows, PyTorch's FindOpenMP.cmake adaptation and using the stock CMake version instead. As a result, the binary links only against the legacy vcomp. This works, but probably doesn't give the best performance.
The official build uses MKL, which includes OpenMP. We could do the same (linking dynamically instead of statically), but this would require adding a dependency on MKL, even for people using the GPU only. Also, PyTorch uses a 2022 version of MKL; I'm not sure it would work with the 2024 version in the current MKL presets.
I also noticed that OpenBLAS is not detected by the PyTorch build. I don't know if, or how, it is supposed to be found during the build.
Thanks for the hard work!