Update the base image from 1.14 to 1.15
yafshar opened this issue · 4 comments
yafshar commented
Currently the base image is
vault.habana.ai/gaudi-docker/1.14.0/ubuntu22.04/habanalabs/pytorch-installer-2.1.1:latest
To support the new Mixtral-8x7B model and other variants, there is a need to upgrade to 1.15.
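A minimal sketch of the corresponding Dockerfile change, assuming the 1.15.0 image follows the same naming scheme (the PyTorch version in the tag, 2.2.0 here, is an assumption and should be verified against the Habana vault listing):

```dockerfile
# Current base image (SynapseAI 1.14.0, PyTorch 2.1.1):
# FROM vault.habana.ai/gaudi-docker/1.14.0/ubuntu22.04/habanalabs/pytorch-installer-2.1.1:latest

# Assumed 1.15.0 equivalent; the bundled PyTorch version in the tag is a guess
FROM vault.habana.ai/gaudi-docker/1.15.0/ubuntu22.04/habanalabs/pytorch-installer-2.2.0:latest
```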
Only upgrading the Python components (optimum-habana -> 1.11.0 and transformers -> 4.38.2, roughly as sketched below) causes other issues: warmup then fails with the traceback that follows.
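A rough sketch of that partial upgrade, assuming it is applied on top of the existing 1.14.0 base image:

```bash
# Partial upgrade only -- base image still 1.14.0; this is the combination that fails below
pip install optimum-habana==1.11.0 transformers==4.38.2
```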
2024-04-10T21:55:13.283860Z DEBUG text_generation_launcher: File "/usr/local/lib/python3.10/dist-packages/text_generation_server/server.py", line 68, in Warmup
2024-04-10T21:55:13.283861Z DEBUG text_generation_launcher: self.model.warmup(batches)
2024-04-10T21:55:13.283862Z DEBUG text_generation_launcher: File "/usr/local/lib/python3.10/dist-packages/text_generation_server/models/causal_lm.py", line 1080, in warmup
2024-04-10T21:55:13.283863Z DEBUG text_generation_launcher: _, prefill_batch = self.generate_token([batches.pop(0)])
2024-04-10T21:55:13.283864Z DEBUG text_generation_launcher: File "/usr/lib/python3.10/contextlib.py", line 79, in inner
2024-04-10T21:55:13.283865Z DEBUG text_generation_launcher: return func(*args, **kwds)
2024-04-10T21:55:13.283866Z DEBUG text_generation_launcher: File "/usr/local/lib/python3.10/dist-packages/text_generation_server/models/causal_lm.py", line 918, in generate_token
2024-04-10T21:55:13.283868Z DEBUG text_generation_launcher: batch.logits, batch.past = self.forward(
2024-04-10T21:55:13.283869Z DEBUG text_generation_launcher: File "/usr/local/lib/python3.10/dist-packages/text_generation_server/models/causal_lm.py", line 818, in forward
2024-04-10T21:55:13.283870Z DEBUG text_generation_launcher: outputs = self.model.forward(**kwargs)
2024-04-10T21:55:13.283872Z DEBUG text_generation_launcher: File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/hpu/graphs.py", line 661, in forward
2024-04-10T21:55:13.283873Z DEBUG text_generation_launcher: return wrapped_hpugraph_forward(cache, stream, orig_fwd, args, kwargs, disable_tensor_cache, asynchronous, dry_run, max_graphs)
2024-04-10T21:55:13.283874Z DEBUG text_generation_launcher: File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/hpu/graphs.py", line 585, in wrapped_hpugraph_forward
2024-04-10T21:55:13.283875Z DEBUG text_generation_launcher: cached.graph.replayV3(input_tensor_list, cached.asynchronous)
2024-04-10T21:55:13.283876Z DEBUG text_generation_launcher: File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/hpu/graphs.py", line 71, in replayV3
2024-04-10T21:55:13.283877Z DEBUG text_generation_launcher: _hpu_C.replayV3(self.hpu_graph, tlistI, asynchronous)
2024-04-10T21:55:13.283878Z DEBUG text_generation_launcher: RuntimeError: [Rank:0] FATAL ERROR :: MODULE:PT_BRIDGE Exception in Launch thread...
2024-04-10T21:55:13.283880Z DEBUG text_generation_launcher: Check $HABANA_LOGS/ for details[Rank:0] FATAL ERROR :: MODULE:PT_LAZY Error, ValidateSyncInputTensors tensor_data is empty. Tensorid:41707 QueueStatus:ThreadPool m_tasks size: 1 irValue:id_110063_hpu__input
2024-04-10T21:55:13.283883Z DEBUG text_generation_launcher: [Rank:0] Habana exception raised from ValidateSyncInputTensors at hpu_lazy_tensors.cpp:875
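For context, the traceback above is hit during server warmup right after launch. A rough launch command looks like the sketch below (the image name, model id, and flags are illustrative assumptions based on the tgi-gaudi README, not the exact command that produced this log):

```bash
# Hypothetical launch; the local image name "tgi_gaudi" and all flags are assumptions
docker run -p 8080:80 \
  --runtime=habana \
  -e HABANA_VISIBLE_DEVICES=all \
  -e HF_TOKEN=<your_hf_token> \
  --cap-add=sys_nice \
  --ipc=host \
  tgi_gaudi \
  --model-id mistralai/Mixtral-8x7B-Instruct-v0.1 \
  --max-input-length 1024 \
  --max-total-tokens 2048
```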
- @regisss or others, any help or hints here to enable support for Mixtral-8x7B would be appreciated!
- Is there any plan or another PR for upgrading to 1.15.0?
- I am working on an upgrade, but I might need some help.
yafshar commented
@kdamaszk, can you please share the work? I can also help with extra testing or with adding any missing features. I would rather not open a PR myself and then have to discard it later.
yafshar commented