emscripten-core/emscripten

TypeError race condition in parallel test runner

Closed this issue · 5 comments

juj commented

Ugh, the PR #25099 surfaces what looks to me like a Python internal multiprocessing bug.

On this run, http://clbri.com:8010/api/v2/logs/36311/raw_inline, the following error occurs:

multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/usr/lib/python3.12/unittest/case.py", line 58, in testPartExecutor
    yield
  File "/usr/lib/python3.12/unittest/case.py", line 634, in run
    self._callTestMethod(testMethod)
  File "/usr/lib/python3.12/unittest/case.py", line 589, in _callTestMethod
    if method() is not None:
       ^^^^^^^^
  File "/home/clb/buildbot/h12dsi-linux-mint22/emscripten_linux_x64/build/emscripten/main/test/common.py", line 986, in resulting_test
    return func(self, *args)
           ^^^^^^^^^^^^^^^^^
  File "/home/clb/buildbot/h12dsi-linux-mint22/emscripten_linux_x64/build/emscripten/main/test/common.py", line 784, in metafunc
    self.require_wasm_eh()
  File "/home/clb/buildbot/h12dsi-linux-mint22/emscripten_linux_x64/build/emscripten/main/test/common.py", line 1186, in require_wasm_eh
    self.skipTest('test requires node v24 or d8 (and EMTEST_SKIP_EH is set)')
  File "/usr/lib/python3.12/unittest/case.py", line 711, in skipTest
    raise SkipTest(reason)
unittest.case.SkipTest: test requires node v24 or d8 (and EMTEST_SKIP_EH is set)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.12/unittest/case.py", line 633, in run
    with outcome.testPartExecutor(self):
  File "/usr/lib/python3.12/contextlib.py", line 158, in __exit__
    self.gen.throw(value)
  File "/usr/lib/python3.12/unittest/case.py", line 63, in testPartExecutor
    _addSkip(self.result, test_case, str(e))
  File "/usr/lib/python3.12/unittest/case.py", line 89, in _addSkip
    addSkip(test_case, reason)
  File "/home/clb/buildbot/h12dsi-linux-mint22/emscripten_linux_x64/build/emscripten/main/test/parallel_testsuite.py", line 206, in addSkip
    print(self.compute_progress(), test, "... skipped '%s'" % reason, file=sys.stderr)
          ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/clb/buildbot/h12dsi-linux-mint22/emscripten_linux_x64/build/emscripten/main/test/parallel_testsuite.py", line 185, in compute_progress
    with self.lock:
  File "/usr/lib/python3.12/multiprocessing/managers.py", line 1055, in __enter__
    return self._callmethod('acquire')
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/multiprocessing/managers.py", line 820, in _callmethod
    conn.send((self._id, methodname, args, kwds))
  File "/usr/lib/python3.12/multiprocessing/connection.py", line 206, in send
    self._send_bytes(_ForkingPickler.dumps(obj))
  File "/usr/lib/python3.12/multiprocessing/connection.py", line 427, in _send_bytes
    self._send(header + buf)
  File "/usr/lib/python3.12/multiprocessing/connection.py", line 384, in _send
    n = write(self._handle, buf)
        ^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: 'NoneType' object cannot be interpreted as an integer

That is, a test is being skipped as usual, and then self.compute_progress() is invoked to update the progress bar.

Inside that function the code tries to acquire the multiprocessing lock using "with self.lock:", but Python behaves as if that lock no longer exists: the connection handle it tries to write to is None.

I can't fathom how the lock could disappear like that. The error occurred about 32% of the way through the run, so each process should definitely have the lock (otherwise I'd expect it to have failed earlier).

And also to my understanding, the parallel test loop should be the safe method for joining all multiprocessing processes together, before the lock can go out of scope.

I am trying with main...juj:emscripten:mute_python_multiprocessing_lock_error to diagnose how often this error occurs in the runner.
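For illustration, here is a minimal sketch of one way to mute the lock error and count how often it happens. It is not the contents of that branch: ProgressTracker, lock_errors and the '[??%]' placeholder are hypothetical names, and the real runner shares its counters across processes, which this single-process sketch does not attempt.

```python
import threading


class ProgressTracker:
    """Hypothetical stand-in for the runner's progress bookkeeping."""

    def __init__(self, lock, total):
        self.lock = lock        # in the real runner this is a manager lock proxy
        self.total = total      # number of tests in the run
        self.done = 0           # tests finished so far
        self.lock_errors = 0    # how many times acquiring the lock failed

    def compute_progress(self):
        # Guard only the acquire call, so that a TypeError raised by the
        # bookkeeping below is not silently swallowed.
        try:
            self.lock.acquire()
        except TypeError:
            # Assumed failure mode: the TypeError from the traceback above.
            self.lock_errors += 1
            return '[??%]'
        try:
            self.done += 1
            return '[%d%%]' % (self.done * 100 // self.total)
        finally:
            self.lock.release()


class BrokenLock:
    """Test double that fails the same way the traceback does."""

    def acquire(self):
        raise TypeError("'NoneType' object cannot be interpreted as an integer")

    def release(self):
        pass
```

With a working lock, compute_progress() returns the usual percentage prefix. With BrokenLock it returns the placeholder and increments lock_errors, which could be reported at the end of the run.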

juj commented

Bisected Python versions to find that we got lucky. The bug is already fixed upstream. It reproduces in the following Python versions:

Python 3.12.3 bug
Python 3.12.7 bug
Python 3.12.8 ok
Python 3.12.9 ok
Python 3.12.11 ok

Python 3.13.0 bug
Python 3.13.1 ok
Python 3.13.2 ok
Python 3.13.4 ok
Python 3.13.7 ok

so we can disable the progress bar (and I see that the issue also reproduces for --failfast) for those Python versions.
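A minimal sketch of such a version gate, based only on the bisection results above. The function name has_manager_lock_race is hypothetical, and treating every release older than 3.12.8 as affected is an assumption, since only 3.12.3, 3.12.7 and 3.13.0 were actually observed to fail.

```python
import sys


def has_manager_lock_race(version=None):
    """Return True if this Python is assumed to have the lock race.

    Hypothetical helper. Per the bisection, 3.12.8+ and 3.13.1+ are fine.
    Assumption: everything older than 3.12.8 is treated as affected, even
    though only 3.12.3 and 3.12.7 were tested on that branch.
    """
    v = tuple((version or sys.version_info)[:3])
    if v[:2] == (3, 13):
        # On the 3.13 branch only 3.13.0 reproduced the bug.
        return v < (3, 13, 1)
    # 3.14 and later compare greater than (3, 12, 8) and so count as fixed.
    return v < (3, 12, 8)


# Example use: decide whether the progress bar is safe to enable.
show_progress_bar = not has_manager_lock_race()
```

The same check could gate --failfast, since the issue reproduces there as well.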

juj commented

My Linux box had 3.12.3 where it reproduced. Updated it to 3.13.3 to match Windows and macOS versions.

I recall that https://github.com/emscripten-core/emsdk/blob/main/README.md#linux mentions Python 3.8 as the minimum Python version. I wonder if any CI is testing that (Google/CircleCI?). I'd bet that 3.8 won't be able to run Emscripten without problems.

juj commented

The upstream issue is python/cpython#71936.

And sure enough, both ChangeLogs for Python 3.12.8 and Python 3.13.1 mention this line:

Python 3.8 is the current minimum Python version required to run emscripten itself. Bumping that version, especially a huge jump to something like 3.13, is probably not worth the benefit here.

But requiring a higher version for the test suite is conceivable?... although maybe not ideal because we would need two separate python versions for our normal test runs.

The python3.8 requirement of emscripten is based on the version in Ubuntu/Focal (20.04 LTS), which is used for all our testing in CircleCI (https://launchpad.net/ubuntu/focal/+package/python3).