microsoft/BitBLAS

Segmentation fault when integrated with Ray


BitBLAS throws a segmentation fault when used in an environment that runs Ray. Could this be related to the order in which bitblas is loaded? Thank you very much in advance!

*** SIGSEGV received at time=1726233628 on cpu 111 ***
PC: @     0x7f7e3d482e42  (unknown)  (unknown)
    @     0x7f7e3d31e520  (unknown)  (unknown)
[2024-09-13 13:20:28,314 E 120351 121131] logging.cc:440: *** SIGSEGV received at time=1726233628 on cpu 111 ***
[2024-09-13 13:20:28,315 E 120351 121131] logging.cc:440: PC: @     0x7f7e3d482e42  (unknown)  (unknown)
[2024-09-13 13:20:28,315 E 120351 121131] logging.cc:440:     @     0x7f7e3d31e520  (unknown)  (unknown)
Fatal Python error: Segmentation fault

Stack (most recent call first):
  File "/root/.cache/pypoetry/virtualenvs/aana-XDlPP_xZ-py3.10/lib/python3.10/site-packages/bitblas/3rdparty/tvm/python/tvm/_ffi/_ctypes/packed_func.py", line 252 in __init_handle_by_constructor__
  File "/root/.cache/pypoetry/virtualenvs/aana-XDlPP_xZ-py3.10/lib/python3.10/site-packages/bitblas/3rdparty/tvm/python/tvm/_ffi/_ctypes/object.py", line 145 in __init_handle_by_constructor__
  File "/root/.cache/pypoetry/virtualenvs/aana-XDlPP_xZ-py3.10/lib/python3.10/site-packages/bitblas/3rdparty/tvm/python/tvm/runtime/object.py", line 101 in __setstate__
  File "/usr/lib/python3.10/copy.py", line 273 in _reconstruct
  File "/usr/lib/python3.10/copy.py", line 172 in deepcopy
  File "/root/.cache/pypoetry/virtualenvs/aana-XDlPP_xZ-py3.10/lib/python3.10/site-packages/bitblas/ops/operator.py", line 141 in apply_default_schedule
  File "/root/.cache/pypoetry/virtualenvs/aana-XDlPP_xZ-py3.10/lib/python3.10/site-packages/bitblas/ops/operator.py", line 158 in _build_default_module
  File "/root/.cache/pypoetry/virtualenvs/aana-XDlPP_xZ-py3.10/lib/python3.10/site-packages/bitblas/ops/general_matmul/__init__.py", line 257 in dispatch_tir
  File "/root/.cache/pypoetry/virtualenvs/aana-XDlPP_xZ-py3.10/lib/python3.10/site-packages/bitblas/ops/general_matmul/__init__.py", line 243 in __init__
  File "/root/.cache/pypoetry/virtualenvs/aana-XDlPP_xZ-py3.10/lib/python3.10/site-packages/hqq/backends/bitblas.py", line 109 in __init__
  File "/root/.cache/pypoetry/virtualenvs/aana-XDlPP_xZ-py3.10/lib/python3.10/site-packages/hqq/backends/bitblas.py", line 190 in patch_hqq_to_bitblas
  File "/root/.cache/pypoetry/virtualenvs/aana-XDlPP_xZ-py3.10/lib/python3.10/site-packages/hqq/models/base.py", line 154 in patch_linearlayers
  File "/root/.cache/pypoetry/virtualenvs/aana-XDlPP_xZ-py3.10/lib/python3.10/site-packages/hqq/utils/patching.py", line 25 in patch_linearlayers
  File "/root/.cache/pypoetry/virtualenvs/aana-XDlPP_xZ-py3.10/lib/python3.10/site-packages/hqq/utils/patching.py", line 113 in prepare_for_inference
  File "/root/aana_sdk/aana/deployments/hqq_deployment.py", line 141 in apply_config
  File "/root/aana_sdk/aana/deployments/base_deployment.py", line 23 in reconfigure
  File "/root/.cache/pypoetry/virtualenvs/aana-XDlPP_xZ-py3.10/lib/python3.10/site-packages/ray/serve/_private/replica.py", line 872 in _call_func_or_gen
  File "/root/.cache/pypoetry/virtualenvs/aana-XDlPP_xZ-py3.10/lib/python3.10/site-packages/ray/serve/_private/replica.py", line 959 in call_reconfigure
  File "/root/.cache/pypoetry/virtualenvs/aana-XDlPP_xZ-py3.10/lib/python3.10/site-packages/ray/serve/_private/replica.py", line 795 in _run_user_code_event_loop
  File "/usr/lib/python3.10/threading.py", line 953 in run
  File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
  File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap

Hi @mobicham, thanks for reporting! Would you mind providing a script so we can reproduce this?

Sorry for the delay, @LeiWang1999. We fixed the issue by importing bitblas before anything else.
Is there a reason why the import order matters for bitblas?
Thank you!
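
For anyone applying the same workaround, a small startup check can flag when the intended order has been broken. This is only a sketch: the helper name `modules_loaded_too_early` is hypothetical, and the list of "other" modules (`tvm`, `hqq`) is a guess at what might conflict, not something confirmed in this thread.

```python
import sys


def modules_loaded_too_early(must_be_first, others):
    """Return the names in `others` that are already imported
    while `must_be_first` has not been imported yet."""
    if must_be_first in sys.modules:
        # The module we care about is already loaded; nothing to report.
        return []
    return [name for name in others if name in sys.modules]


# Hypothetical usage at the very top of the deployment module,
# before the real `import bitblas` line. The module list is an assumption.
early = modules_loaded_too_early("bitblas", ["tvm", "hqq"])
if early:
    print(f"warning: imported before bitblas: {early}", file=sys.stderr)
```

The check only reads `sys.modules`, so it does not import anything itself and cannot change the load order.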

@mobicham That's interesting. I ran into a similar problem when working with mlc, for example:

import tvm  # upstream

relax_mod = relax_transform(relax_mod)

import welder
relax_mod = welder.tune(relax_mod)
# something bad happened

The problem was that importing welder also imports its own version of TVM, which then invokes load_dlls (for example, to load libcutlass). This ends up overwriting the upstream cutlass lib and leads to bugs.

I guess there may be a similar reason behind these two cases.
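
One way to check whether two copies of the same native library ended up in a single process is to look at the process's memory mappings. The sketch below is Linux-only and assumes that a conflict would show up as the same library file name mapped from two different paths; the function names are hypothetical and this has not been run against the setup in this issue.

```python
import os
from collections import defaultdict


def find_duplicate_libs(maps_text):
    """Group mapped shared objects by file name and return the names
    that are mapped from more than one path."""
    paths_by_name = defaultdict(set)
    for line in maps_text.splitlines():
        # Format: address perms offset dev inode pathname
        fields = line.split(None, 5)
        if len(fields) < 6:
            continue  # anonymous mapping without a path
        path = fields[5].strip()
        name = os.path.basename(path)
        if ".so" in name:
            paths_by_name[name].add(path)
    return {n: sorted(p) for n, p in paths_by_name.items() if len(p) > 1}


def loaded_duplicate_libs():
    """Inspect the current process (Linux only)."""
    with open("/proc/self/maps") as f:
        return find_duplicate_libs(f.read())
```

Calling `loaded_duplicate_libs()` after all imports have run, and looking for entries such as a TVM or cutlass library listed under two paths, would be a first hint that two builds are competing.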

Thanks! Yeah, for the moment the trick is to experiment with different import orders and pick the one that doesn't throw an error.
Closing this issue, thank you again!
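
The trial-and-error over import orders can be automated by running each order in a fresh interpreter, so that a crash in one attempt does not take down the others. This is a minimal sketch with a hypothetical helper name. Since the segfault above happened while an operator was being built rather than at import time, a real check would likely need to pass a small reproducing snippet through the `body` argument.

```python
import itertools
import subprocess
import sys


def try_import_orders(modules, body="", timeout=300):
    """Import `modules` in every possible order, each in its own
    subprocess, then run `body`. Returns {order: returncode}.
    On POSIX, a negative code means the child died from a signal
    (for example -11 for SIGSEGV)."""
    results = {}
    for order in itertools.permutations(modules):
        code = "\n".join(f"import {name}" for name in order)
        if body:
            code += "\n" + body
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            timeout=timeout,
        )
        results[order] = proc.returncode
    return results
```

For example, `try_import_orders(["bitblas", "ray", "hqq"])` would try all six orders. The number of runs grows factorially with the number of modules, so the list should stay short.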