shap/shap

Segmentation Fault on MacOS with pytorch > 2.2.0

connortann opened this issue · 4 comments

EDIT: relevant issue on pytorch: pytorch/pytorch#121101

The test suite recently began failing on MacOS.

Example failing run:

https://github.com/shap/shap/actions/runs/8021717954/job/21914432162

Fatal Python error: Segmentation fault

Thread 0x000070000ca96000 (most recent call first):
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/multiprocessing/pool.py", line 579 in _handle_results
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/threading.py", line 982 in run
  File Fatal Python error: Segmentation fault

"/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/threading.py", line 1045 in _bootstrap_inner
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/threading.py", line 1002 in _bootstrap

When the Python version is pinned to 3.11.7, we seemingly get a different error, this time relating to lightgbm:

/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/ctypes/__init__.py:376: in __init__
    self._handle = _dlopen(self._name, mode)
E   OSError: dlopen(/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/lightgbm/lib/lib_lightgbm.so, 6): Library not loaded: /usr/local/opt/libomp/lib/libomp.dylib
E     Referenced from: /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/lightgbm/lib/lib_lightgbm.so
E     Reason: image not found
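
To narrow down the `dlopen` failure above, a quick check on the runner could look like the sketch below. It only tests whether the `libomp.dylib` path from the error message exists and whether `ctypes.util.find_library` can locate an OpenMP runtime. `libomp_status` is a hypothetical helper name, and `find_library` only approximates the search that `dlopen` performs, so treat the result as a hint rather than a diagnosis.

```python
import ctypes.util
from pathlib import Path

# Path copied from the OSError above; lightgbm's shared library references it directly.
LIBOMP_PATH = Path("/usr/local/opt/libomp/lib/libomp.dylib")


def libomp_status(expected: Path = LIBOMP_PATH) -> dict:
    """Hypothetical helper: report whether the OpenMP runtime looks installed."""
    return {
        # True if the exact file named in the error message is present.
        "expected_path_exists": expected.exists(),
        # Library name/path found by ctypes' own search, or None if nothing was found.
        "found_by_loader_search": ctypes.util.find_library("omp"),
    }


if __name__ == "__main__":
    print(libomp_status())
```

If `expected_path_exists` is false on the macOS runner, the OpenMP runtime is most likely missing from that location, which would match the "image not found" reason in the error.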

Related:

This may be unrelated, but on the topic of MacOS issues, I noticed a failure in the MacOS job for the latest proposed release on conda-forge: conda-forge/shap-feedstock#76

Here are a couple of notes:
The last successful run of the MacOS pipeline on master was this: https://github.com/shap/shap/actions/runs/7972563240/job/21765144274.
I debugged the MacOS pipeline using https://github.com/mxschmitt/action-tmate and found that the segmentation faults happen in the pytorch tests, on the lines where the model is called on data. The successful run used torch 2.2.0 rather than 2.2.1. I will check whether pinning the version fixes it.

#3518 fixed the original issue with the tests by pinning pytorch; let's keep this issue open until the full test suite passes with the latest pytorch.
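
For anyone who needs a local guard while the pin is in place, here is a minimal sketch of a check for the affected combination (MacOS with torch newer than 2.2.0). This is not what #3518 does; `parse_release` and `is_affected` are hypothetical names, and the version parsing is deliberately simplistic (it reads only the leading numeric release).

```python
from __future__ import annotations

import platform
import re


def parse_release(version: str) -> tuple[int, ...]:
    """Return the leading numeric release, e.g. '2.2.1+cpu' -> (2, 2, 1)."""
    match = re.match(r"\d+(\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def is_affected(torch_version: str, system: str | None = None) -> bool:
    """Hypothetical check: True on MacOS ('Darwin') with torch newer than 2.2.0."""
    system = platform.system() if system is None else system
    return system == "Darwin" and parse_release(torch_version) > (2, 2, 0)
```

One possible use is as the condition of a `pytest.mark.skipif` marker on the pytorch tests, e.g. `pytest.mark.skipif(is_affected(torch.__version__), reason="segfault on MacOS, see pytorch/pytorch#121101")`, so that the rest of the suite can still run against the latest release.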

The pytorch issue is documented in: pytorch/pytorch#121101.