Segmentation fault when integrated with Ray
Closed this issue · 4 comments
BitBLAS throws a segmentation fault when used in an environment that also runs Ray. Could this be related to the order in which bitblas is loaded?
Thank you very much in advance!
*** SIGSEGV received at time=1726233628 on cpu 111 ***
PC: @ 0x7f7e3d482e42 (unknown) (unknown)
@ 0x7f7e3d31e520 (unknown) (unknown)
[2024-09-13 13:20:28,314 E 120351 121131] logging.cc:440: *** SIGSEGV received at time=1726233628 on cpu 111 ***
[2024-09-13 13:20:28,315 E 120351 121131] logging.cc:440: PC: @ 0x7f7e3d482e42 (unknown) (unknown)
[2024-09-13 13:20:28,315 E 120351 121131] logging.cc:440: @ 0x7f7e3d31e520 (unknown) (unknown)
Fatal Python error: Segmentation fault
Stack (most recent call first):
File "/root/.cache/pypoetry/virtualenvs/aana-XDlPP_xZ-py3.10/lib/python3.10/site-packages/bitblas/3rdparty/tvm/python/tvm/_ffi/_ctypes/packed_func.py", line 252 in __init_handle_by_constructor__
File "/root/.cache/pypoetry/virtualenvs/aana-XDlPP_xZ-py3.10/lib/python3.10/site-packages/bitblas/3rdparty/tvm/python/tvm/_ffi/_ctypes/object.py", line 145 in __init_handle_by_constructor__
File "/root/.cache/pypoetry/virtualenvs/aana-XDlPP_xZ-py3.10/lib/python3.10/site-packages/bitblas/3rdparty/tvm/python/tvm/runtime/object.py", line 101 in __setstate__
File "/usr/lib/python3.10/copy.py", line 273 in _reconstruct
File "/usr/lib/python3.10/copy.py", line 172 in deepcopy
File "/root/.cache/pypoetry/virtualenvs/aana-XDlPP_xZ-py3.10/lib/python3.10/site-packages/bitblas/ops/operator.py", line 141 in apply_default_schedule
File "/root/.cache/pypoetry/virtualenvs/aana-XDlPP_xZ-py3.10/lib/python3.10/site-packages/bitblas/ops/operator.py", line 158 in _build_default_module
File "/root/.cache/pypoetry/virtualenvs/aana-XDlPP_xZ-py3.10/lib/python3.10/site-packages/bitblas/ops/general_matmul/__init__.py", line 257 in dispatch_tir
File "/root/.cache/pypoetry/virtualenvs/aana-XDlPP_xZ-py3.10/lib/python3.10/site-packages/bitblas/ops/general_matmul/__init__.py", line 243 in __init__
File "/root/.cache/pypoetry/virtualenvs/aana-XDlPP_xZ-py3.10/lib/python3.10/site-packages/hqq/backends/bitblas.py", line 109 in __init__
File "/root/.cache/pypoetry/virtualenvs/aana-XDlPP_xZ-py3.10/lib/python3.10/site-packages/hqq/backends/bitblas.py", line 190 in patch_hqq_to_bitblas
File "/root/.cache/pypoetry/virtualenvs/aana-XDlPP_xZ-py3.10/lib/python3.10/site-packages/hqq/models/base.py", line 154 in patch_linearlayers
File "/root/.cache/pypoetry/virtualenvs/aana-XDlPP_xZ-py3.10/lib/python3.10/site-packages/hqq/utils/patching.py", line 25 in patch_linearlayers
File "/root/.cache/pypoetry/virtualenvs/aana-XDlPP_xZ-py3.10/lib/python3.10/site-packages/hqq/utils/patching.py", line 113 in prepare_for_inference
File "/root/aana_sdk/aana/deployments/hqq_deployment.py", line 141 in apply_config
File "/root/aana_sdk/aana/deployments/base_deployment.py", line 23 in reconfigure
File "/root/.cache/pypoetry/virtualenvs/aana-XDlPP_xZ-py3.10/lib/python3.10/site-packages/ray/serve/_private/replica.py", line 872 in _call_func_or_gen
File "/root/.cache/pypoetry/virtualenvs/aana-XDlPP_xZ-py3.10/lib/python3.10/site-packages/ray/serve/_private/replica.py", line 959 in call_reconfigure
File "/root/.cache/pypoetry/virtualenvs/aana-XDlPP_xZ-py3.10/lib/python3.10/site-packages/ray/serve/_private/replica.py", line 795 in _run_user_code_event_loop
File "/usr/lib/python3.10/threading.py", line 953 in run
File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap
Hi @mobicham, thanks for reporting! Would you mind providing a script so that we can reproduce this?
Sorry for the delay, @LeiWang1999. We fixed the issue by importing bitblas before anything else.
Is there a reason why the import order matters for bitblas?
Thank you!
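For readers landing here, the workaround above can be sketched as follows. This is only an illustration: the helper name `import_in_order` and the list `PRELOAD_ORDER` are hypothetical, and the thread does not say exactly where in the Ray deployment the import was placed, so putting it at the very top of the deployment module is an assumption.

```python
import importlib


def import_in_order(names):
    """Import modules in the given order and return the names that loaded.

    Modules that are not installed are skipped so the sketch runs anywhere;
    in a real deployment you would probably want the ImportError to surface.
    """
    loaded = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            continue
        loaded.append(name)
    return loaded


# Assumption: bitblas goes first, ahead of the other libraries the
# deployment pulls in (hqq appears in the stack trace above).
PRELOAD_ORDER = ["bitblas", "hqq"]

if __name__ == "__main__":
    print("imported, in order:", import_in_order(PRELOAD_ORDER))
```

A plain `import bitblas` as the first line of the module has the same effect; the helper only makes the intended order explicit and easy to change.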
@mobicham That's interesting. I ran into similar problems when working with mlc, for example:
import tvm # upstream
relax_mod = relax_transform(relax_mod)
import welder
relax_mod = welder.tune(relax_mod)
# something bad happened
The problem was that importing welder also imports its own bundled version of TVM, which then invokes load_dlls (for example, to load libcutlass). This ends up overwriting the upstream cutlass library and leads to bugs.
I guess there may be a similar rationale behind these two cases.
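One way to check whether something like this is happening is to look for the same shared library being mapped into the process from two different paths. The sketch below is a Linux-only diagnostic that parses `/proc/self/maps`; the function names and the default keywords (`tvm`, `cutlass`) are assumptions chosen for illustration, and finding a duplicate would only be a hint, not proof that it causes the crash in this issue.

```python
import os
from collections import defaultdict


def find_duplicate_libs(maps_text, keywords=("tvm", "cutlass")):
    """Return {library file name: sorted paths} for libraries mapped from
    more than one path, given the text of a /proc/<pid>/maps file."""
    paths_by_name = defaultdict(set)
    for line in maps_text.splitlines():
        # Fields: address, perms, offset, dev, inode, pathname (optional).
        fields = line.split(None, 5)
        if len(fields) < 6:
            continue  # anonymous mapping, no backing file
        path = fields[5].strip()
        name = os.path.basename(path)
        if ".so" in name and any(k in name for k in keywords):
            paths_by_name[name].add(path)
    return {n: sorted(p) for n, p in paths_by_name.items() if len(p) > 1}


def loaded_duplicates(keywords=("tvm", "cutlass")):
    """Inspect the current process (Linux only)."""
    with open("/proc/self/maps") as f:
        return find_duplicate_libs(f.read(), keywords)
```

Calling `loaded_duplicates()` after all imports have run, for example just before the operator is built, would show whether two copies of a TVM or cutlass library are loaded side by side.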
Thanks! Yeah, for the moment the trick is to experiment with different import orders and pick one that doesn't throw an error.
Closing this issue, thank you again!
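The trial-and-error approach described above could be automated roughly like this: run each import order in a fresh interpreter and record the exit code. The function name `try_import_orders` and the `check` parameter are hypothetical. Note that the crash in this issue happened while building an operator, not at import time, so `check` would need to contain a statement that actually exercises the failing path.

```python
import itertools
import subprocess
import sys


def try_import_orders(modules, check=""):
    """Run every permutation of `modules` in its own Python process.

    Returns {order tuple: return code}. On POSIX, a negative return code
    means the child was killed by a signal (-11 corresponds to SIGSEGV).
    """
    results = {}
    for order in itertools.permutations(modules):
        # Build a small script: the imports in this order, then the check.
        code = "\n".join([f"import {m}" for m in order] + [check])
        proc = subprocess.run([sys.executable, "-c", code], capture_output=True)
        results[order] = proc.returncode
    return results


if __name__ == "__main__":
    for order, rc in try_import_orders(["json", "os"]).items():
        print(order, "->", rc)
```

Each order runs in a separate process because a segmentation fault would otherwise take down the script doing the testing, and because already-imported modules cannot be cleanly re-imported in a different order.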