HPUGraph destructor issue when installing dill
yafshar opened this issue · 6 comments
System Info
platform: Linux
tgi_version: v1.2.1
habana-torch-dataloader 1.14.0.493
habana-torch-plugin 1.14.0.493
lightning-habana 1.3.0
optimum-habana 1.10.4
dill 0.3.7
Information
- Docker
- The CLI directly
Tasks
- An officially supported command
- My own modifications
Reproduction
- pip install dill
- python -c "import torch; import habana_frameworks.torch as ht; g = ht.hpu.HPUGraph();"
Expected behavior
No error
============================= HABANA PT BRIDGE CONFIGURATION ===========================
PT_HPU_LAZY_MODE = 1
PT_RECIPE_CACHE_PATH =
PT_CACHE_FOLDER_DELETE = 0
PT_HPU_RECIPE_CACHE_CONFIG =
PT_HPU_MAX_COMPOUND_OP_SIZE = 9223372036854775807
PT_HPU_LAZY_ACC_PAR_MODE = 1
PT_HPU_ENABLE_REFINE_DYNAMIC_SHAPES = 0
---------------------------: System Configuration :---------------------------
Num CPU Cores : 160
CPU RAM : 1056375264 KB
------------------------------------------------------------------------------
fails with
Exception ignored in: <function HPUGraph.__del__ at 0x7f2e140c3d90>
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/hpu/graphs.py", line 104, in __del__
File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/hpu/graphs.py", line 101, in reset
TypeError: 'NoneType' object is not callable
The issue has been root-caused to the dill package. dill runs import __main__ as _main_module at module level (in the module's global namespace), which interferes with HPU graph teardown: the destructor is overridden, and the destroy function from hpu_C.so that the graph destructor calls ends up unreferenced and undefined by the time the destructor runs.
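A 'NoneType' object is not callable error raised from __del__ is typical of a destructor that looks up a module-level name after the interpreter has started clearing module globals at shutdown. Whether graphs.py fails in exactly this way is an assumption; the sketch below only illustrates a general defensive pattern using weakref.finalize, which captures the cleanup callable and its arguments at construction time. The names Graph and _destroy are hypothetical stand-ins, not the Habana API.

```python
import weakref


def _destroy(handle, log):
    """Hypothetical stand-in for a native cleanup routine."""
    log.append(handle)


class Graph:
    """Hypothetical resource wrapper; not the real HPUGraph."""

    def __init__(self, handle, log):
        self.handle = handle
        # The callback and its arguments are stored now, so cleanup does not
        # need to resolve a module global later, when it may already be None.
        self._finalizer = weakref.finalize(self, _destroy, handle, log)

    def reset(self):
        # Calling a finalizer runs the callback at most once; repeat calls
        # are no-ops.
        self._finalizer()


released = []
g = Graph(1, released)
g.reset()
g.reset()  # second call does nothing
del g      # finalizer already ran, so nothing is released twice
```

Because the finalizer is registered with the interpreter's exit handling, cleanup also runs at shutdown if reset() was never called explicitly.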
I created two patches dill-0.3.7.patch & dill-0.3.8.patch, to resolve the issue for dill-0.3.7 and dill-0.3.8, respectively.
The patches have been tested in various scenarios. They pass the dill unit tests completely and, to the best of my knowledge, introduce no new issues.
You can also test them as follows:
>> git clone -b dill-0.3.7 https://github.com/uqfoundation/dill.git
>> cd dill
>> wget https://github.com/huggingface/tgi-gaudi/files/15044823/dill-0.3.7.patch
>> git apply dill-0.3.7.patch
>> python -m pip install .
>> python -c "import torch; import habana_frameworks.torch as ht; g = ht.hpu.HPUGraph();"
============================= HABANA PT BRIDGE CONFIGURATION ===========================
PT_HPU_LAZY_MODE = 1
PT_RECIPE_CACHE_PATH =
PT_CACHE_FOLDER_DELETE = 0
PT_HPU_RECIPE_CACHE_CONFIG =
PT_HPU_MAX_COMPOUND_OP_SIZE = 9223372036854775807
PT_HPU_LAZY_ACC_PAR_MODE = 1
PT_HPU_ENABLE_REFINE_DYNAMIC_SHAPES = 0
---------------------------: System Configuration :---------------------------
Num CPU Cores : 160
CPU RAM : 1056375264 KB
------------------------------------------------------------------------------
The same applies to dill v0.3.8 with the provided patch:
>> git clone -b 0.3.8 https://github.com/uqfoundation/dill.git
>> cd dill
>> wget https://github.com/huggingface/tgi-gaudi/files/15044824/dill-0.3.8.patch
>> git apply dill-0.3.8.patch
>> python -m pip install .
>> python -c "import torch; import habana_frameworks.torch as ht; g = ht.hpu.HPUGraph();"
============================= HABANA PT BRIDGE CONFIGURATION ===========================
PT_HPU_LAZY_MODE = 1
PT_RECIPE_CACHE_PATH =
PT_CACHE_FOLDER_DELETE = 0
PT_HPU_RECIPE_CACHE_CONFIG =
PT_HPU_MAX_COMPOUND_OP_SIZE = 9223372036854775807
PT_HPU_LAZY_ACC_PAR_MODE = 1
PT_HPU_ENABLE_REFINE_DYNAMIC_SHAPES = 0
---------------------------: System Configuration :---------------------------
Num CPU Cores : 160
CPU RAM : 1056375264 KB
------------------------------------------------------------------------------
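The manual check above (run the one-liner, then look for an "Exception ignored" message) could be automated along these lines. This is a minimal sketch: the helper name exits_cleanly is hypothetical, and it assumes the destructor error is reported on stderr as in the traceback earlier in this issue.

```python
import subprocess
import sys


def exits_cleanly(code: str) -> bool:
    """Run `code` in a fresh interpreter and report whether it exited with
    status 0 and without any "Exception ignored" message on stderr."""
    result = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
    )
    return result.returncode == 0 and "Exception ignored" not in result.stderr


# The reproduction command from this issue; it needs a Gaudi environment.
REPRO = "import torch; import habana_frameworks.torch as ht; g = ht.hpu.HPUGraph();"
```

On a machine with the Habana stack installed, exits_cleanly(REPRO) is expected to be False with unpatched dill and True after applying the patch. Note that it also returns False when the command fails for any other reason, such as a missing package.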
@yafshar So the solution would be to do monkey patching with these patches in TGI?
Unfortunately, yes. We could probably produce a cleaner fix for dill and upstream it to the public repo, but we still have this issue on 1.14.0 and 1.15.0. Since the issue does not occur on 1.16.0, I think the patch is a workaround for now.
Okay, I see. Would you like to open a PR to add this patch?