Feature request: Support fp16 with self-speculative decoding on XPU in ipex_llm.serving.fastchat.ipex_llm_worker
brosenfi opened this issue · 5 comments
I am requesting support for fp16 inference with self-speculative decoding on XPU in the fastchat ipex_llm_worker module - it does not appear to be supported currently.
Currently, trying to use --low-bit "fp16" with ipex_llm.serving.fastchat.ipex_llm_worker results in the error shown below.
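For reference, I launch the worker roughly like this (the model path is a placeholder, and the exact spelling of flags other than --low-bit may differ slightly):

```
python -m ipex_llm.serving.fastchat.ipex_llm_worker --model-path <path-to-neural-chat-7b-v3-3> --low-bit "fp16" --device xpu
```

This produces: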
/usr/local/lib/python3.11/dist-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: '' If you don't plan on using image functionality from torchvision.io, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have libjpeg or libpng installed before building torchvision from source?
warn(
2024-04-29 07:20:32,788 - INFO - intel_extension_for_pytorch auto imported
2024-04-29 07:20:33 | INFO | model_worker | Loading the model ['neural-chat-7b-v3-3'] on worker 4e2d5da6, worker type: BigDLLLM worker...
2024-04-29 07:20:33 | INFO | model_worker | Using low bit format: fp16, device: xpu
2024-04-29 07:20:33 | ERROR | ipex_llm.utils.common.log4Error |****Usage Error
Please use torch_dtype=torch.float16 when setting load_in_low_bit='fp16'.
2024-04-29 07:20:33 | ERROR | ipex_llm.utils.common.log4Error |***Call Stack
2024-04-29 07:20:33 | ERROR | stderr | Traceback (most recent call last):
2024-04-29 07:20:33 | ERROR | stderr | File "", line 198, in _run_module_as_main
2024-04-29 07:20:33 | ERROR | stderr | File "", line 88, in _run_code
2024-04-29 07:20:33 | ERROR | stderr | File "/usr/local/lib/python3.11/dist-packages/ipex_llm/serving/fastchat/ipex_llm_worker.py", line 326, in <module>
2024-04-29 07:20:33 | ERROR | stderr | worker = BigDLLLMWorker(
2024-04-29 07:20:33 | ERROR | stderr | ^^^^^^^^^^^^^^^
2024-04-29 07:20:33 | ERROR | stderr | File "/usr/local/lib/python3.11/dist-packages/ipex_llm/serving/fastchat/ipex_llm_worker.py", line 88, in __init__
2024-04-29 07:20:33 | ERROR | stderr | self.model, self.tokenizer = load_model(
2024-04-29 07:20:33 | ERROR | stderr | ^^^^^^^^^^^
2024-04-29 07:20:33 | ERROR | stderr | File "/usr/local/lib/python3.11/dist-packages/ipex_llm/transformers/loader.py", line 67, in load_model
2024-04-29 07:20:33 | ERROR | stderr | model = model_cls.from_pretrained(model_path, **model_kwargs)
2024-04-29 07:20:33 | ERROR | stderr | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-04-29 07:20:33 | ERROR | stderr | File "/usr/local/lib/python3.11/dist-packages/ipex_llm/transformers/model.py", line 294, in from_pretrained
2024-04-29 07:20:33 | ERROR | stderr | invalidInputError(
2024-04-29 07:20:33 | ERROR | stderr | File "/usr/local/lib/python3.11/dist-packages/ipex_llm/utils/common/log4Error.py", line 32, in invalidInputError
2024-04-29 07:20:33 | ERROR | stderr | raise RuntimeError(errMsg)
2024-04-29 07:20:33 | ERROR | stderr | RuntimeError: Please use torch_dtype=torch.float16 when setting load_in_low_bit='fp16'.
Hi, I am working on reproducing this issue.
Just to provide a bit more information @gc-fu - here the worker passes torch_dtype as "auto", but the fp16 self-speculative decoding example shows that torch_dtype should be set to torch.float16. There are also other parameters in that example that aren't provided when launching via ipex_llm_worker - specifically "speculative" and "optimize_model". This is why I marked this as a feature request and not a bug: I assumed this mode just isn't supported yet by the ipex_llm_worker module (it would be nice if it were, though).
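For comparison, the fp16 self-speculative decoding example loads the model roughly like this (a sketch based on my reading of the example - the "speculative" and "optimize_model" parameters come from there, and the model path is a placeholder):

```python
import torch
from ipex_llm.transformers import AutoModelForCausalLM

# fp16 + self-speculative decoding: torch_dtype must be torch.float16
# (not "auto") when load_in_low_bit="fp16" -- this is the check that
# raises the RuntimeError in the traceback above.
model = AutoModelForCausalLM.from_pretrained(
    "<path-to-neural-chat-7b-v3-3>",
    load_in_low_bit="fp16",
    torch_dtype=torch.float16,
    optimize_model=True,
    speculative=True,       # enable self-speculative decoding
    trust_remote_code=True,
    use_cache=True,
)
model = model.to("xpu")
```

Exposing something like this from ipex_llm_worker (or at least setting torch_dtype=torch.float16 when --low-bit "fp16" is requested) is essentially what I'm asking for.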