test_mpiapi failing with mpich (on 32-bit arches)
drew-parsons opened this issue · 8 comments
mpi4py 4.0.0 is failing with mpich 4.2.0 on armel and armhf.
Test logs: armel, armhf.
i386 is also failing.
The error (from armel) is
172s testPickle (test_win.TestWinCreateWorld.testPickle) ... ok
172s testPyProps (test_win.TestWinCreateWorld.testPyProps) ... ok
172s testConstructor (test_win.TestWinNull.testConstructor) ... ok
172s testGetName (test_win.TestWinNull.testGetName) ... ok
172s
172s ======================================================================
172s FAIL: testLargeCountSymbols (test_mpiapi.TestMPIAPI.testLargeCountSymbols)
172s ----------------------------------------------------------------------
172s Traceback (most recent call last):
172s File "/tmp/autopkgtest-lxc.qit_k75g/downtmp/build.kYr/src/test/test_mpiapi.py", line 130, in testLargeCountSymbols
172s self.assertIn(sym, mpi_symbols)
172s AssertionError: 'MPI_Op_create' not found in set()
172s
172s ======================================================================
172s FAIL: testSymbolCoverage (test_mpiapi.TestMPIAPI.testSymbolCoverage)
172s ----------------------------------------------------------------------
172s Traceback (most recent call last):
172s File "/tmp/autopkgtest-lxc.qit_k75g/downtmp/build.kYr/src/test/test_mpiapi.py", line 144, in testSymbolCoverage
172s self.assertTrue(mod_symbols)
172s AssertionError: set() is not true
172s
172s ----------------------------------------------------------------------
172s Ran 2081 tests in 124.964s
172s
172s FAILED (failures=2, skipped=95)
The test log on i386 is a bit more chaotic. I think it's also failing in test_msgspec:
10030s testNdim (test_msgspec.TestMessageDLPackCPUBuf.testNdim) ... [proxy:0@ci-256-b6cdd533] Sending upstream hdr.cmd = CMD_STDERR
10030s [proxy:0@ci-256-b6cdd533] Sending upstream hdr.cmd = CMD_STDERR
10030s FAIL
...
10031s testTypestrNone (test_msgspec.TestMessageCAIBuf.testTypestrNone) ... [proxy:0@ci-256-b6cdd533] Sending upstream hdr.cmd = CMD_STDERR
10031s [proxy:0@ci-256-b6cdd533] Sending upstream hdr.cmd = CMD_STDERR
10031s [proxy:0@ci-256-b6cdd533] Sending upstream hdr.cmd = CMD_STDERR
10031s [proxy:0@ci-256-b6cdd533] Sending upstream hdr.cmd = CMD_STDERR
10031s FAIL
...
10031s testShapeType (test_msgspec.TestMessageCAIBuf.testShapeType) ... FAIL
AssertionError: 'MPI_Op_create' not found in set()
That one is weird. Can you run nm $(python3 -m mpi4py --prefix)/MPI.*.so | grep MPI_Op_create to check whether the extension module references the MPI_Op_create symbol? If possible, also check the MPI library for these symbols. I'm not really sure what's going on.
In any case, this particular test case is a development sanity check to make sure everything in mpi4py uses the new large-count routines from MPI-4 if available. You could safely skip this test. I'm still wondering what's going on, though. Maybe nm -Pu produces different output from what the mpi4py test is expecting?
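For context, here is a rough sketch of what such an nm-based check amounts to (an approximation, not the actual mpi4py test; the function name is mine): parse the POSIX-format undefined-symbol listing and keep the MPI_* names.

import shutil
import subprocess as sp

def undefined_mpi_symbols(mod_file):
    """Collect the undefined MPI_* symbols referenced by an extension module.

    `nm -Pu` prints one undefined symbol per line in POSIX format
    ("name U ..."), so keep the first field of every MPI_* line.
    """
    nm = shutil.which('nm')
    out = sp.check_output([nm, '-Pu', mod_file]).decode()
    return {
        line.split()[0]
        for line in out.splitlines()
        if line.startswith('MPI_')
    }

On a stripped .so, the listing is empty and the result is an empty set, which is exactly the "not found in set()" failure shown above.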
Not by that method.
armel:
$ nm $(python3 -m mpi4py --prefix)/MPI.*.so | grep MPI_Op_create
nm: /usr/lib/python3/dist-packages/mpi4py/MPI.cpython-312-arm-linux-gnueabi.so: no symbols
$ nm -Pu $(python3 -m mpi4py --prefix)/MPI.*.so | grep MPI_Op_create
nm: /usr/lib/python3/dist-packages/mpi4py/MPI.cpython-312-arm-linux-gnueabi.so: no symbols
But it shows up as a dynamic symbol
$ nm -D $(python3 -m mpi4py --prefix)/MPI.*.so | grep MPI_Op_create
U MPI_Op_create
U MPI_Op_create_c
$ nm -D -Pu $(python3 -m mpi4py --prefix)/MPI.*.so | grep MPI_Op_create
MPI_Op_create U
MPI_Op_create_c U
In the mpich library,
$ nm -D /usr/lib/arm-linux-gnueabi/libmpich.so.12 | grep MPI_Op_create
00118068 W MPI_Op_create
001184b0 W MPI_Op_create_c
00118068 T PMPI_Op_create
001184b0 T PMPI_Op_create_c
I see... the binary is stripped. Is -D strictly required? If that is the case, then there is the fix: you need a patch to add -D.
diff --git a/test/test_mpiapi.py b/test/test_mpiapi.py
index c553d071..5a66411d 100644
--- a/test/test_mpiapi.py
+++ b/test/test_mpiapi.py
@@ -106,7 +106,7 @@ class TestMPIAPI(unittest.TestCase):
     def get_mod_symbols(self):
         nm = shutil.which('nm')
-        cmd = [nm, '-Pu', mod_file]
+        cmd = [nm, '-DPu', mod_file]
         out = sp.check_output(cmd, close_fds=False)
         nm_output = out.decode()
PS: I cannot push that fix just yet; -D will not work on macOS.
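One possible portable variant (a sketch of the idea only, not the fix that eventually landed upstream): try -Pu first and fall back to -DPu only when the regular symbol table turns out to be empty, e.g. a stripped .so on Linux.

import shutil
import subprocess as sp

def nm_undefined(mod_file):
    """Get nm's undefined-symbol listing, falling back to the dynamic
    symbol table when the regular symbol table has been stripped.
    """
    nm = shutil.which('nm')
    res = sp.run([nm, '-Pu', mod_file], stdout=sp.PIPE, stderr=sp.DEVNULL)
    out = res.stdout.decode()
    if not out.strip():
        # Stripped binary: read the dynamic symbol table instead. Keep -D
        # as a fallback only, since (as noted above) it does not work on macOS.
        res = sp.run([nm, '-DPu', mod_file], stdout=sp.PIPE, stderr=sp.DEVNULL)
        out = res.stdout.decode()
    return out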
That fixes the arm problem.
i386 still has a (different) problem.
https://ci.debian.net/packages/m/mpi4py/testing/i386/51642023/
Timeout in test_io.TestIOViewWorld.testVector
9804s testVector (test_io.TestIOViewWorld.testVector) ... [proxy:0@ci-258-e18d407d] Sending upstream hdr.cmd = CMD_STDERR
9804s [proxy:0@ci-258-e18d407d] Sending upstream hdr.cmd = CMD_STDERR
9804s ok
9804s testVector (test_io.TestIOViewWorld.testVector) ... [proxy:0@ci-258-e18d407d] Sending upstream hdr.cmd = CMD_STDERR
9804s ok
9804s [proxy:0@ci-258-e18d407d] Sending upstream hdr.cmd = CMD_STDERR
10037s testVector (test_io.TestIOViewWorld.testVector) ... autopkgtest [05:59:59]: ERROR: timed out on command "su -s /bin/bash debci -c set -e; exec /tmp/autopkgtest-lxc.63jngc1h/downtmp/wrapper.sh --artifacts=/tmp/autopkgtest-lxc.63jngc1h/downtmp/command1-artifacts --chdir=/tmp/autopkgtest-lxc.63jngc1h/downtmp/build.gWj/src --env=DEB_BUILD_OPTIONS=parallel=2 --env=DEBIAN_FRONTEND=noninteractive --env=LANG=C.UTF-8 --unset-env=LANGUAGE --unset-env=LC_ADDRESS --unset-env=LC_ALL --unset-env=LC_COLLATE --unset-env=LC_CTYPE --unset-env=LC_IDENTIFICATION --unset-env=LC_MEASUREMENT --unset-env=LC_MESSAGES --unset-env=LC_MONETARY --unset-env=LC_NAME --unset-env=LC_NUMERIC --unset-env=LC_PAPER --unset-env=LC_TELEPHONE --unset-env=LC_TIME --script-pid-file=/tmp/autopkgtest_script_pid --source-profile --stderr=/tmp/autopkgtest-lxc.63jngc1h/downtmp/command1-stderr --stdout=/tmp/autopkgtest-lxc.63jngc1h/downtmp/command1-stdout --tmp=/tmp/autopkgtest-lxc.63jngc1h/downtmp/autopkgtest_tmp -- bash -ec 'for pyver in `py3versions -vs`; do OMPI_MCA_rmaps_base_oversubscribe=yes GITHUB_ACTIONS=true MPI4PY_TEST_SPAWN=false mpiexec -v -n 5 python$pyver test/runtests.py --verbose; done'" (kind: test)
10037s autopkgtest [05:59:59]: test command1: -----------------------]
10037s command1 FAIL timed out
It's hitting 10000 seconds, so I suspect testVector itself is not failing. The tests may have just reached the time limit permitted on the CI server (evidently set to 10000s, around 2 hr 45 min).
I guess it's oversubscribing on i386. The tests are configured to use 5 processes, and arm32 finishes in less than 5 minutes. You've advised previously that oversubscribing is particularly slow with mpich. What do you recommend for i386 with mpich?
I'm not sure this is really about oversubscription; it looks like some other problem that leads to deadlock. Note the deadlocking test is related to MPI I/O, and 5 processes writing small data collectively to disk should not be a big deal. Of course, all complaints and suspicions related to oversubscription should be directed to the MPICH project.
Other than skipping tests, I have no further recommendations; I have not used i386 in years. Maybe special-case MPICH and do not run with 5 processes, but just a max of 3 or 4, like done here.
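A rough illustration of that idea (the helper name and thresholds are mine, not anything from mpi4py or the Debian packaging): derive the mpiexec process count from the CPUs the machine actually exposes instead of hard-coding 5.

import os

def pick_np(requested=5, minimum=2):
    """Choose an mpiexec process count that avoids heavy oversubscription.

    Hypothetical helper: cap the requested count at the number of CPUs,
    but keep at least `minimum` ranks so multi-process tests still
    exercise real communication.
    """
    ncpus = os.cpu_count() or 1
    return max(minimum, min(requested, ncpus))

# e.g. on a 2-CPU CI worker this yields 2; on a 4-core machine it yields 4
print(pick_np())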
Apparently the i386 test machine only provides 2 processing units.
https://ci.debian.net/data/autopkgtest/testing/i386/m/mpi4py/51761138/log.gz
The tests are passing with 3 processes. They still take an hour to run, so perhaps I should drop the count further to avoid oversubscribing at all.
Anyway, the 32-bit arches are now passing their mpich tests, so I'll close this issue.