intel-analytics/ipex-llm

Feature request: Support fp16 with self-speculative decoding on XPU in ipex_llm.serving.fastchat.ipex_llm_worker

brosenfi opened this issue · 5 comments

I am requesting support for fp16 inference with self-speculative decoding on XPU in the fastchat ipex LLM worker module (ipex_llm.serving.fastchat.ipex_llm_worker) - it does not appear this is currently supported.

Currently, launching ipex_llm.serving.fastchat.ipex_llm_worker with --low-bit "fp16" results in the following error:

/usr/local/lib/python3.11/dist-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: ''If you don't plan on using image functionality from torchvision.io, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have libjpeg or libpng installed before building torchvision from source?
warn(
2024-04-29 07:20:32,788 - INFO - intel_extension_for_pytorch auto imported
2024-04-29 07:20:33 | INFO | model_worker | Loading the model ['neural-chat-7b-v3-3'] on worker 4e2d5da6, worker type: BigDLLLM worker...
2024-04-29 07:20:33 | INFO | model_worker | Using low bit format: fp16, device: xpu
2024-04-29 07:20:33 | ERROR | ipex_llm.utils.common.log4Error |

****Usage Error
Please use torch_dtype=torch.float16 when setting load_in_low_bit='fp16'.
2024-04-29 07:20:33 | ERROR | ipex_llm.utils.common.log4Error |

***Call Stack
2024-04-29 07:20:33 | ERROR | stderr | Traceback (most recent call last):
2024-04-29 07:20:33 | ERROR | stderr | File "", line 198, in _run_module_as_main
2024-04-29 07:20:33 | ERROR | stderr | File "", line 88, in _run_code
2024-04-29 07:20:33 | ERROR | stderr | File "/usr/local/lib/python3.11/dist-packages/ipex_llm/serving/fastchat/ipex_llm_worker.py", line 326, in
2024-04-29 07:20:33 | ERROR | stderr | worker = BigDLLLMWorker(
2024-04-29 07:20:33 | ERROR | stderr | ^^^^^^^^^^^^^^^
2024-04-29 07:20:33 | ERROR | stderr | File "/usr/local/lib/python3.11/dist-packages/ipex_llm/serving/fastchat/ipex_llm_worker.py", line 88, in init
2024-04-29 07:20:33 | ERROR | stderr | self.model, self.tokenizer = load_model(
2024-04-29 07:20:33 | ERROR | stderr | ^^^^^^^^^^^
2024-04-29 07:20:33 | ERROR | stderr | File "/usr/local/lib/python3.11/dist-packages/ipex_llm/transformers/loader.py", line 67, in load_model
2024-04-29 07:20:33 | ERROR | stderr | model = model_cls.from_pretrained(model_path, **model_kwargs)
2024-04-29 07:20:33 | ERROR | stderr | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-04-29 07:20:33 | ERROR | stderr | File "/usr/local/lib/python3.11/dist-packages/ipex_llm/transformers/model.py", line 294, in from_pretrained
2024-04-29 07:20:33 | ERROR | stderr | invalidInputError(
2024-04-29 07:20:33 | ERROR | stderr | File "/usr/local/lib/python3.11/dist-packages/ipex_llm/utils/common/log4Error.py", line 32, in invalidInputError
2024-04-29 07:20:33 | ERROR | stderr | raise RuntimeError(errMsg)
2024-04-29 07:20:33 | ERROR | stderr | RuntimeError: Please use torch_dtype=torch.float16 when setting load_in_low_bit='fp16'.
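
For context, the check that raises this error requires torch_dtype=torch.float16 whenever load_in_low_bit='fp16' is set. A minimal sketch of a compliant from_pretrained call (the model path is a placeholder, and trust_remote_code is included only as a typical setting, not taken from the log above):

```python
import torch
from ipex_llm.transformers import AutoModelForCausalLM

# Sketch only: load_in_low_bit="fp16" must be paired with torch_dtype=torch.float16;
# the fastchat worker's loader passes torch_dtype="auto" instead, which triggers the error.
model = AutoModelForCausalLM.from_pretrained(
    "path/to/neural-chat-7b-v3-3",   # placeholder model path
    load_in_low_bit="fp16",
    torch_dtype=torch.float16,
    trust_remote_code=True,          # assumed typical setting
)
```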

gc-fu commented

Hi, I am working on reproducing this issue.

Just to provide a bit more information @gc-fu: here, the worker passes torch_dtype as "auto", but the fp16 example with self-speculative decoding shows that torch_dtype should be set to torch.float16. There are also other parameters in that example that aren't provided when launching via the ipex_llm_worker - specifically "speculative" and "optimize_model". This is why I marked this as a feature request rather than a bug; I assumed this mode just isn't supported yet for the ipex_llm_worker module (it would be nice if it were, though). A rough sketch of the difference follows.
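
For comparison, here is roughly how the fp16 self-speculative decoding example loads the model, based only on the parameters named above (the model path is a placeholder and trust_remote_code is an assumed typical setting, not a verbatim copy of the example):

```python
import torch
from ipex_llm.transformers import AutoModelForCausalLM

# Sketch based on the parameters mentioned above. The extra arguments relative to the
# worker's loader are torch_dtype=torch.float16, speculative=True, and optimize_model=True.
model = AutoModelForCausalLM.from_pretrained(
    "path/to/neural-chat-7b-v3-3",   # placeholder model path
    load_in_low_bit="fp16",
    torch_dtype=torch.float16,
    optimize_model=True,
    speculative=True,
    trust_remote_code=True,          # assumed typical setting
)
model = model.to("xpu")              # device taken from the worker log above
```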

gc-fu commented

Hi, this issue actually contains two parts:

  1. A bug caused by using low-bit fp16 in ipex_llm_worker.
  2. A feature request: support ipex_llm_worker with speculative decoding.

The first part has been fixed by PR #10907.

The second part will be supported by @hzjane.

@brosenfi
Self-speculative decoding with the fastchat worker will be supported in this PR.
However, the speculative example only supports running on Intel Max GPUs due to memory usage limitations. You can try it on a Max GPU or on CPU later.

Thank you @gc-fu