RobotLocomotion/drake

xcode 16 + python + multiple shared libraries + dynamic_cast ==> fail

What happened?

On macos/xcode 16:

$ bazel test //bindings/pydrake/systems:py/custom_test

fails with a std::bad_cast exception. The following tests fail similarly:

  • //examples/acrobot:py/spong_sim_lib_py_test
  • //examples/acrobot:py/spong_sim_main_py_test
  • //bindings/pydrake/examples:py/acrobot_test

A full CI build log: https://drake-jenkins.csail.mit.edu/view/Mac%20Sequoia%20Unprovisioned/job/mac-arm-sequoia-unprovisioned-clang-bazel-experimental-release/13/consoleFull

Version

master circa 1.35

What operating system are you using?

macOS 14 (Sonoma)

What installation option are you using?

compiled from source code using Bazel

Relevant log output

No response

On my dev branch, with extra instrumentation, we can see that there are two addresses that contain the same type descriptor:

ricopoyner@TRI-X9DWTVD9TR drake % bazel test //bindings/pydrake/systems:py/custom_test
INFO: Analyzed target //bindings/pydrake/systems:py/custom_test (1 packages loaded, 16 targets configured).
INFO: From Linking bindings/pydrake/systems/test/test_util.cpython-312-darwin.so:
ld: warning: duplicate -rpath '/opt/homebrew/Cellar/fmt/11.0.2/lib' ignored
FAIL: //bindings/pydrake/systems:py/custom_test (see /private/var/tmp/_bazel_ricopoyner/27b47a6d9b400570878eb2115555e985/execroot/drake/bazel-out/darwin_arm64-opt/testlogs/bindings/pydrake/systems/py/custom_test/test.log)
INFO: From Testing //bindings/pydrake/systems:py/custom_test:
==================== Test output for //bindings/pydrake/systems:py/custom_test:

Running tests...
----------------------------------------------------------------------
....E.............
======================================================================
ERROR [0.004s]: test_all_leaf_system_overrides (custom_test.TestCustom.test_all_leaf_system_overrides)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/private/var/tmp/_bazel_ricopoyner/27b47a6d9b400570878eb2115555e985/sandbox/darwin-sandbox/194/execroot/drake/bazel-out/darwin_arm64-opt/bin/bindings/pydrake/systems/py/custom_test.runfiles/drake/bindings/pydrake/systems/test/custom_test.py", line 584, in test_all_leaf_system_overrides
    results = call_leaf_system_overrides(system)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: is_dynamic_castable<drake::systems::LeafEventCollection<drake::systems::PublishEvent<double>>@0x109bd91c0>(drake::systems::EventCollection<drake::systems::PublishEvent<double>>* ptr) failed because ptr is of dynamic type drake::systems::LeafEventCollection<drake::systems::PublishEvent<double>>@0x1039d09e0.

----------------------------------------------------------------------
Ran 18 tests in 0.030s

FAILED (errors=1)

Generating XML reports...
================================================================================
INFO: Found 1 test target...
Target //bindings/pydrake/systems:py/custom_test up-to-date:
  bazel-bin/bindings/pydrake/systems/py/custom_test
INFO: Elapsed time: 1.899s, Critical Path: 1.54s
INFO: 4 processes: 2 internal, 2 darwin-sandbox.
INFO: Build completed, 1 test FAILED, 4 total actions
//bindings/pydrake/systems:py/custom_test                                FAILED in 0.7s
  /private/var/tmp/_bazel_ricopoyner/27b47a6d9b400570878eb2115555e985/execroot/drake/bazel-out/darwin_arm64-opt/testlogs/bindings/pydrake/systems/py/custom_test/test.log

Executed 1 out of 1 test: 1 fails locally.

They are from two shared libraries: libdrake.so and bindings/pydrake/systems/test/test_util.cpython-312-darwin.so. This situation is no different than before, but with xcode 15 the tests passed. I believe that older implementations of dynamic_cast would use (or fall back to) type string comparison if the addresses did not match. This appears to be no longer the case.
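To make the suspected difference concrete, here is a minimal sketch of the two comparison strategies described above. It only illustrates the idea and is not how any particular C++ runtime implements dynamic_cast; the helper names (SameTypeByAddress, SameTypeByName, IsDerivedByName) and the Base/Derived types are hypothetical and exist only for this example.

```cpp
#include <cstring>
#include <typeinfo>

// Hypothetical helpers for illustration only; not Drake or runtime API.

// Strict comparison: two descriptors match only if they are the same object.
// When two shared libraries each carry their own copy of a type's
// descriptor, this returns false even though the type is the same.
inline bool SameTypeByAddress(const std::type_info& a,
                              const std::type_info& b) {
  return &a == &b;
}

// Lenient comparison: fall back to comparing the type-name strings when the
// addresses differ, so duplicate descriptors for one type still match.
inline bool SameTypeByName(const std::type_info& a,
                           const std::type_info& b) {
  return &a == &b || std::strcmp(a.name(), b.name()) == 0;
}

struct Base {
  virtual ~Base() = default;
};
struct Derived : Base {};

// Exact-type check using the lenient comparison. typeid on a reference to a
// polymorphic type yields the dynamic type of the object.
inline bool IsDerivedByName(const Base& object) {
  return SameTypeByName(typeid(object), typeid(Derived));
}
```

Within a single binary both strategies agree. They can only disagree when the same type has more than one descriptor in the process, which is the situation the two addresses in the error message above point to.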

I've tried a lot of voodoo recommended by the interwebs (RTLD_GLOBAL, clang type_visibility attribute, ld -flat_namespace, etc.) to no avail. I suspect our choices boil down to:

  • avoid/replace/reimplement dynamic_cast
  • re-architect .so linking to avoid duplicate symbols
  • something else?

Along the lines of #22205, I'm working to identify and patch the relatively few dynamic_cast invocations that actually cause failures in the current xcode 16 build. I'll turn up with a PR when things are passing.

Nope. Nah. Never mind. Removing dynamic_casts is neither correct nor sustainable.

I did some more reading of llvm-project changes. It turns out we probably instead want --copt=-fno-assume-unique-vtables.