huggingface/tgi-gaudi

HPUGraph destructor issue when installing dill

yafshar opened this issue · 6 comments

System Info

platform: Linux
tgi_version: v1.2.1

habana-torch-dataloader                1.14.0.493
habana-torch-plugin                    1.14.0.493
lightning-habana                       1.3.0
optimum-habana                         1.10.4
dill                                   0.3.7

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

  • pip install dill
  • python -c "import torch; import habana_frameworks.torch as ht; g = ht.hpu.HPUGraph();"

Expected behavior

No error

============================= HABANA PT BRIDGE CONFIGURATION ===========================
PT_HPU_LAZY_MODE = 1
PT_RECIPE_CACHE_PATH =
PT_CACHE_FOLDER_DELETE = 0
PT_HPU_RECIPE_CACHE_CONFIG =
PT_HPU_MAX_COMPOUND_OP_SIZE = 9223372036854775807
PT_HPU_LAZY_ACC_PAR_MODE = 1
PT_HPU_ENABLE_REFINE_DYNAMIC_SHAPES = 0
---------------------------: System Configuration :---------------------------
Num CPU Cores : 160
CPU RAM       : 1056375264 KB
------------------------------------------------------------------------------

fails with

 Exception ignored in: <function HPUGraph.__del__ at 0x7f2e140c3d90>
 Traceback (most recent call last):
   File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/hpu/graphs.py", line 104, in __del__
   File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/hpu/graphs.py", line 101, in reset
 TypeError: 'NoneType' object is not callable

The issue has been root-caused to the dill package. dill executes import __main__ as _main_module at module scope, and this interferes with HPU graph teardown: by the time HPUGraph.__del__ runs at interpreter shutdown, the destroy function from hpu_C.so that the destructor calls has already been unreferenced and is undefined (None), hence the TypeError above.
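
To make the mechanism concrete, here is a minimal sketch in plain Python with no Habana APIs. Whether it actually errors depends on the interpreter version and import order, which is exactly what dill's module-level import perturbs; the names Graph and destroy are illustrative, not from the Habana bridge.

# Sketch of the failure mode: at interpreter shutdown, module globals can
# be cleared before remaining objects are finalized, so a __del__ that
# calls a module-level function may find None instead of the function.

def destroy(handle):
    print("destroying", handle)

class Graph:
    def __init__(self):
        self.handle = object()

    def __del__(self):
        # If the module-level 'destroy' has already been cleared when this
        # runs during teardown, the call raises
        # "TypeError: 'NoneType' object is not callable", which Python
        # reports as "Exception ignored in: <function Graph.__del__ ...>".
        destroy(self.handle)

# Keeping the instance alive until shutdown forces __del__ to run during
# interpreter teardown, where global clearing can bite.
g = Graph()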

I created two patches, dill-0.3.7.patch and dill-0.3.8.patch, to resolve the issue for dill 0.3.7 and dill 0.3.8, respectively.
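
Both patches follow the same idea; the sketch below shows the shape of the fix, not the literal patch contents, and the helper _get_main_module is a hypothetical name used here for illustration.

import sys

# Instead of a module-level "import __main__ as _main_module", resolve
# __main__ lazily so importing dill does not capture a reference whose
# lifetime tangles with interpreter startup/teardown ordering.
def _get_main_module():
    return sys.modules["__main__"]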

The patches have been tested in various scenarios; they pass dill's unit tests completely, and to the best of my knowledge they do not introduce any issues.

You can also test it yourself as follows:

>> git clone -b dill-0.3.7 https://github.com/uqfoundation/dill.git
>> cd dill
>> wget https://github.com/huggingface/tgi-gaudi/files/15044823/dill-0.3.7.patch
>> git apply dill-0.3.7.patch
>> python -m pip install .
>> python -c "import torch; import habana_frameworks.torch as ht; g = ht.hpu.HPUGraph();"
============================= HABANA PT BRIDGE CONFIGURATION ===========================
PT_HPU_LAZY_MODE = 1
PT_RECIPE_CACHE_PATH =
PT_CACHE_FOLDER_DELETE = 0
PT_HPU_RECIPE_CACHE_CONFIG =
PT_HPU_MAX_COMPOUND_OP_SIZE = 9223372036854775807
PT_HPU_LAZY_ACC_PAR_MODE = 1
PT_HPU_ENABLE_REFINE_DYNAMIC_SHAPES = 0
---------------------------: System Configuration :---------------------------
Num CPU Cores : 160
CPU RAM       : 1056375264 KB
------------------------------------------------------------------------------

The same procedure works for dill v0.3.8 with the corresponding patch:

>> git clone -b 0.3.8 https://github.com/uqfoundation/dill.git
>> cd dill
>> wget https://github.com/huggingface/tgi-gaudi/files/15044824/dill-0.3.8.patch
>> git apply dill-0.3.8.patch
>> python -m pip install .
>> python -c "import torch; import habana_frameworks.torch as ht; g = ht.hpu.HPUGraph();"
(This prints the same HABANA PT BRIDGE CONFIGURATION and System Configuration banner as above, again with no HPUGraph.__del__ exception.)

@yafshar So the solution would be to do monkey patching with these patches in TGI?

Unfortunately, yes. We can probably do a better job for dill and upstream it to the public repo, but we still have this issue on 1.14.0 & 1.15.0. Since the issue is gone in 1.16.0, I think the patch is an acceptable workaround for now.
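
For illustration, a minimal sketch of what a TGI-side runtime guard could look like; this is hypothetical (the actual fix is the dill patch above), and _orig_del / _safe_del are names introduced here.

# Hypothetical guard: wrap HPUGraph.__del__ so a destructor whose module
# globals were already cleared at shutdown does not surface
# "'NoneType' object is not callable".
import habana_frameworks.torch as ht

_orig_del = ht.hpu.HPUGraph.__del__

def _safe_del(self):
    try:
        _orig_del(self)
    except TypeError:
        # destroy() from hpu_C.so may already be unreferenced at shutdown.
        pass

ht.hpu.HPUGraph.__del__ = _safe_del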

Okay I see. Would you like to open a PR to add this patch?

@regisss, I will open a PR for the patch

@yafshar The patch shouldn't be needed anymore with SynapseAI v1.16, I'll keep you updated when it is released.