TypeError race condition in parallel test runner
Closed this issue · 5 comments
Ugh, the PR #25099 surfaces what looks to me like a Python internal multiprocessing bug.
On this run http://clbri.com:8010/api/v2/logs/36311/raw_inline, the error is:
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "/usr/lib/python3.12/unittest/case.py", line 58, in testPartExecutor
yield
File "/usr/lib/python3.12/unittest/case.py", line 634, in run
self._callTestMethod(testMethod)
File "/usr/lib/python3.12/unittest/case.py", line 589, in _callTestMethod
if method() is not None:
^^^^^^^^
File "/home/clb/buildbot/h12dsi-linux-mint22/emscripten_linux_x64/build/emscripten/main/test/common.py", line 986, in resulting_test
return func(self, *args)
^^^^^^^^^^^^^^^^^
File "/home/clb/buildbot/h12dsi-linux-mint22/emscripten_linux_x64/build/emscripten/main/test/common.py", line 784, in metafunc
self.require_wasm_eh()
File "/home/clb/buildbot/h12dsi-linux-mint22/emscripten_linux_x64/build/emscripten/main/test/common.py", line 1186, in require_wasm_eh
self.skipTest('test requires node v24 or d8 (and EMTEST_SKIP_EH is set)')
File "/usr/lib/python3.12/unittest/case.py", line 711, in skipTest
raise SkipTest(reason)
unittest.case.SkipTest: test requires node v24 or d8 (and EMTEST_SKIP_EH is set)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/lib/python3.12/unittest/case.py", line 633, in run
with outcome.testPartExecutor(self):
File "/usr/lib/python3.12/contextlib.py", line 158, in __exit__
self.gen.throw(value)
File "/usr/lib/python3.12/unittest/case.py", line 63, in testPartExecutor
_addSkip(self.result, test_case, str(e))
File "/usr/lib/python3.12/unittest/case.py", line 89, in _addSkip
addSkip(test_case, reason)
File "/home/clb/buildbot/h12dsi-linux-mint22/emscripten_linux_x64/build/emscripten/main/test/parallel_testsuite.py", line 206, in addSkip
print(self.compute_progress(), test, "... skipped '%s'" % reason, file=sys.stderr)
^^^^^^^^^^^^^^^^^^^^^^^
File "/home/clb/buildbot/h12dsi-linux-mint22/emscripten_linux_x64/build/emscripten/main/test/parallel_testsuite.py", line 185, in compute_progress
with self.lock:
File "/usr/lib/python3.12/multiprocessing/managers.py", line 1055, in __enter__
return self._callmethod('acquire')
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/multiprocessing/managers.py", line 820, in _callmethod
conn.send((self._id, methodname, args, kwds))
File "/usr/lib/python3.12/multiprocessing/connection.py", line 206, in send
self._send_bytes(_ForkingPickler.dumps(obj))
File "/usr/lib/python3.12/multiprocessing/connection.py", line 427, in _send_bytes
self._send(header + buf)
File "/usr/lib/python3.12/multiprocessing/connection.py", line 384, in _send
n = write(self._handle, buf)
^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: 'NoneType' object cannot be interpreted as an integer
I.e. a test is being skipped as usual, but then the progress bar is about to be updated, and for that, self.compute_progress() is invoked.
Inside that function the code tries to acquire the multiprocessing lock with "with self.lock:",
but Python behaves as if that multiprocessing lock does not even exist anymore (the connection's self._handle is None by the time write() is called).
I can't fathom how the lock could disappear like that. The error occurred about 32% of the way through the run, so each process should definitely have the lock (or I'd imagine it would have failed earlier already).
Also, to my understanding, the parallel test loop should safely join all the multiprocessing processes together before the lock can go out of scope.
I am trying with main...juj:emscripten:mute_python_multiprocessing_lock_error to diagnose how often this error occurs in the runner.
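For illustration only, a diagnostic of that kind could look roughly like the sketch below. It is not the contents of the branch: guarded_progress and its arguments are hypothetical names, and the assumption is simply that the TypeError raised while acquiring the lock is logged and swallowed so the run can continue.

```python
import sys


def guarded_progress(lock, compute):
    """Return compute() while holding lock, or '' if the lock cannot be acquired.

    Hypothetical helper (not from parallel_testsuite.py): 'lock' is any object
    with acquire()/release(), 'compute' is a zero-argument callable that
    builds the progress string.
    """
    try:
        lock.acquire()
    except TypeError as e:
        # The failure mode from the traceback above: the proxy's connection
        # handle is None, so sending the 'acquire' request raises TypeError.
        print(f'multiprocessing lock error (ignored): {e}', file=sys.stderr)
        return ''
    try:
        return compute()
    finally:
        lock.release()
```

Only the acquire step is guarded, so a TypeError raised inside compute() itself would still propagate instead of being hidden.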
Bisected Python versions to find that we got lucky: the bug is already fixed upstream. Results per Python version:
Python 3.12.3 bug
Python 3.12.7 bug
Python 3.12.8 ok
Python 3.12.9 ok
Python 3.12.11 ok
Python 3.13.0 bug
Python 3.13.1 ok
Python 3.13.2 ok
Python 3.13.4 ok
Python 3.13.7 ok
So we can disable the progress bar for the affected Python versions (and I see that the issue also reproduces with --failfast).
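A minimal sketch of such a version gate, based only on the bisection results above. The function name has_pool_race_fix is hypothetical, and treating every release older than 3.12.8 as affected is an assumption: only 3.12.x and 3.13.x were actually tested here.

```python
import sys


def has_pool_race_fix(version_info=sys.version_info):
    """Return True if this Python is expected to contain the upstream fix.

    Per the bisection: fixed in 3.12.8+ on the 3.12 branch and in 3.13.1+.
    Assumption: anything older than 3.12.8 is treated as affected, although
    versions before 3.12 were not tested.
    """
    v = tuple(version_info[:3])
    if v >= (3, 13, 1):
        return True
    # 3.12.8 up to (but not including) 3.13.0, which regressed to "bug".
    return (3, 12, 8) <= v < (3, 13, 0)


# Hypothetical use: only show the progress bar when the fix is present.
show_progress = has_pool_race_fix()
```

The same check could guard --failfast, since that option reproduces the issue as well.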
My Linux box had 3.12.3 where it reproduced. Updated it to 3.13.3 to match Windows and macOS versions.
I recall there is a mention of Python 3.8 in https://github.com/emscripten-core/emsdk/blob/main/README.md#linux as the minimum Python version. I wonder if any CI is testing that? (Google/CircleCI?) I'd bet that 3.8 won't be able to run Emscripten without problems.
The upstream issue is python/cpython#71936.
And sure enough, both ChangeLogs for Python 3.12.8 and Python 3.13.1 mention this line:
- gh-71936: Fix a race condition in multiprocessing.pool.Pool.
Python 3.8 is the current minimum Python version required to run Emscripten itself. Bumping that version, especially with a huge jump to something like 3.13, is probably not worth the benefit here.
But requiring a higher version for the test suite is conceivable? Although maybe not ideal, because we would need two separate Python versions for our normal test runs.
The Python 3.8 requirement of Emscripten is based on the version in Ubuntu/Focal (20.04 LTS), which is used for all our testing in CircleCI (https://launchpad.net/ubuntu/focal/+package/python3).