Can inference be run on consumer hardware?
GrahamboJangles opened this issue · 8 comments
AMD? CPU? Single GPU?
Is this all possible via FastChat?
@GrahamboJangles It is already in FastChat. https://github.com/lm-sys/FastChat#longchat
We currently test it on a single A100 GPU and it works pretty well. We are adding more support to make it run more efficiently. Let me know whether it works on your hardware, and we can improve the system support!
@DachengLi1 I have 2 RX6800s, I'm guessing that they are not yet supported?
Regarding the RX series, please see the discussion here. The inference is backed by FastChat, and it seems people have gotten AMD cards working. Can you run the following (there is no --load-8bit support yet):
python3 -m fastchat.serve.cli --model-path lmsys/longchat-7b-16k
and let me know if it works for you? Also feel free to submit an issue in FastChat regarding this.
@DachengLi1 thank you for your help and quick responses.
I ran that command and this was the output:
python -m fastchat.serve.cli --model-path longchat-7b-16k
Traceback (most recent call last):
File "E:\Coding_and_Scripting\Pyhon3.10.4\lib\site-packages\transformers\utils\import_utils.py", line 1146, in _get_module
return importlib.import_module("." + module_name, self.__name__)
File "E:\Coding_and_Scripting\Pyhon3.10.4\lib\importlib\__init__.py", line 126, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
File "<frozen importlib._bootstrap_external>", line 883, in exec_module
File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
File "E:\Coding_and_Scripting\Pyhon3.10.4\lib\site-packages\transformers\models\llama\modeling_llama.py", line 31, in <module>
from ...modeling_utils import PreTrainedModel
File "E:\Coding_and_Scripting\Pyhon3.10.4\lib\site-packages\transformers\modeling_utils.py", line 83, in <module>
from accelerate import __version__ as accelerate_version
File "E:\Coding_and_Scripting\Pyhon3.10.4\lib\site-packages\accelerate\__init__.py", line 7, in <module>
from .accelerator import Accelerator
File "E:\Coding_and_Scripting\Pyhon3.10.4\lib\site-packages\accelerate\accelerator.py", line 33, in <module>
from .tracking import LOGGER_TYPE_TO_CLASS, GeneralTracker, filter_trackers
File "E:\Coding_and_Scripting\Pyhon3.10.4\lib\site-packages\accelerate\tracking.py", line 45, in <module>
import wandb
File "E:\Coding_and_Scripting\Pyhon3.10.4\lib\site-packages\wandb\__init__.py", line 26, in <module>
from wandb import sdk as wandb_sdk
File "E:\Coding_and_Scripting\Pyhon3.10.4\lib\site-packages\wandb\sdk\__init__.py", line 5, in <module>
from . import wandb_helper as helper # noqa: F401
File "E:\Coding_and_Scripting\Pyhon3.10.4\lib\site-packages\wandb\sdk\wandb_helper.py", line 6, in <module>
from .lib import config_util
File "E:\Coding_and_Scripting\Pyhon3.10.4\lib\site-packages\wandb\sdk\lib\config_util.py", line 7, in <module>
from wandb.util import load_yaml
File "E:\Coding_and_Scripting\Pyhon3.10.4\lib\site-packages\wandb\util.py", line 52, in <module>
import sentry_sdk # type: ignore
File "E:\Coding_and_Scripting\Pyhon3.10.4\lib\site-packages\sentry_sdk\__init__.py", line 1, in <module>
from sentry_sdk.hub import Hub, init
File "E:\Coding_and_Scripting\Pyhon3.10.4\lib\site-packages\sentry_sdk\hub.py", line 8, in <module>
from sentry_sdk.scope import Scope
File "E:\Coding_and_Scripting\Pyhon3.10.4\lib\site-packages\sentry_sdk\scope.py", line 7, in <module>
from sentry_sdk.utils import logger, capture_internal_exceptions
File "E:\Coding_and_Scripting\Pyhon3.10.4\lib\site-packages\sentry_sdk\utils.py", line 887, in <module>
HAS_REAL_CONTEXTVARS, ContextVar = _get_contextvars()
File "E:\Coding_and_Scripting\Pyhon3.10.4\lib\site-packages\sentry_sdk\utils.py", line 857, in _get_contextvars
if not _is_contextvars_broken():
File "E:\Coding_and_Scripting\Pyhon3.10.4\lib\site-packages\sentry_sdk\utils.py", line 791, in _is_contextvars_broken
import gevent # type: ignore
File "E:\Coding_and_Scripting\Pyhon3.10.4\lib\site-packages\gevent\__init__.py", line 86, in <module>
from gevent._hub_local import get_hub
File "E:\Coding_and_Scripting\Pyhon3.10.4\lib\site-packages\gevent\_hub_local.py", line 101, in <module>
import_c_accel(globals(), 'gevent.__hub_local')
File "E:\Coding_and_Scripting\Pyhon3.10.4\lib\site-packages\gevent\_util.py", line 148, in import_c_accel
mod = importlib.import_module(cname)
File "E:\Coding_and_Scripting\Pyhon3.10.4\lib\importlib\__init__.py", line 126, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
File "src\\gevent\\_hub_local.py", line 1, in init gevent._gevent_c_hub_local
ValueError: greenlet.greenlet size changed, may indicate binary incompatibility. Expected 160 from C header, got 40 from PyObject
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "E:\Coding_and_Scripting\Pyhon3.10.4\lib\runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "E:\Coding_and_Scripting\Pyhon3.10.4\lib\runpy.py", line 86, in _run_code
exec(code, run_globals)
File "E:\Coding_and_Scripting\Pyhon3.10.4\lib\site-packages\fastchat\serve\cli.py", line 26, in <module>
from fastchat.model.model_adapter import add_model_args
File "E:\Coding_and_Scripting\Pyhon3.10.4\lib\site-packages\fastchat\model\__init__.py", line 1, in <module>
from fastchat.model.model_adapter import (
File "E:\Coding_and_Scripting\Pyhon3.10.4\lib\site-packages\fastchat\model\model_adapter.py", line 16, in <module>
from transformers import (
File "<frozen importlib._bootstrap>", line 1075, in _handle_fromlist
File "E:\Coding_and_Scripting\Pyhon3.10.4\lib\site-packages\transformers\utils\import_utils.py", line 1137, in __getattr__
value = getattr(module, name)
File "E:\Coding_and_Scripting\Pyhon3.10.4\lib\site-packages\transformers\utils\import_utils.py", line 1136, in __getattr__
module = self._get_module(self._class_to_module[name])
File "E:\Coding_and_Scripting\Pyhon3.10.4\lib\site-packages\transformers\utils\import_utils.py", line 1148, in _get_module
raise RuntimeError(
RuntimeError: Failed to import transformers.models.llama.modeling_llama because of the following error (look up to see its traceback):
greenlet.greenlet size changed, may indicate binary incompatibility. Expected 160 from C header, got 40 from PyObject
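(Not from the thread, just a suggestion: the "greenlet.greenlet size changed" ValueError usually means the installed gevent wheel was compiled against a different greenlet ABI than the greenlet version currently installed. A common remedy is to reinstall the two packages together so pip resolves a matching pair; whether this fixes your particular environment is not guaranteed.)

```shell
# Reinstall greenlet and gevent together so their compiled extensions
# are built/fetched against the same greenlet ABI. --no-cache-dir avoids
# reusing a stale wheel from a previous install.
pip install --force-reinstall --no-cache-dir greenlet gevent
```

If that does not help, uninstalling wandb (which pulls in sentry_sdk, which probes gevent at import time) also sidesteps the crash, since FastChat's CLI does not need wandb for inference.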
@GrahamboJangles Thanks for trying it out! Can you submit this to FastChat system? I will also ask the FastChat team to look into it there.
@DachengLi1 Absolutely! Thanks again for your help.
@DachengLi1 I was trying to run inference with longchat-7b-16k on an A100 machine with a 40GB GPU. I get a CUDA out-of-memory error because the memory was not sufficient. The input texts, read from a parquet file, were around 9k tokens each. Can you share the upcoming roadmap for efficiency gains, and any ETA, so that I can run inference with fewer resources?
@sejalchopra97 For now you can run 9k tokens with FlashAttention support (but that does not support the KV cache, so it will be slow). We just got a member working on this on the vLLM side; once she has it done, we will update here. @LiuXiaoxuanPKU, let me know if you have any suggestions!
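(Not from the thread, a rough back-of-the-envelope for why a 9k-token prompt can exhaust 40GB without FlashAttention. The shapes assumed below are the standard LLaMA-7B config, 32 layers, 32 heads, hidden size 4096, fp16 everywhere; treat the numbers as order-of-magnitude only.)

```python
# Rough memory estimate for LLaMA-7B at a 9k-token prompt (fp16 = 2 bytes/elem).
# Assumed LLaMA-7B shapes: 32 layers, 32 heads, hidden size 4096.
layers, heads, hidden = 32, 32, 4096
seq, bytes_fp16 = 9000, 2

weights_gb = 7e9 * bytes_fp16 / 1e9                         # ~14 GB of parameters
kv_cache_gb = 2 * layers * hidden * seq * bytes_fp16 / 1e9  # K and V for every token
attn_scores_gb = heads * seq * seq * bytes_fp16 / 1e9       # one layer's full score matrix

print(f"weights      ~{weights_gb:.1f} GB")
print(f"kv cache     ~{kv_cache_gb:.1f} GB")
print(f"attn scores  ~{attn_scores_gb:.1f} GB per layer (naive attention)")
```

The weights plus KV cache alone fit in 40GB, but naive attention materializes the full seq x seq score matrix (~5GB per layer here, often with an extra fp32 softmax copy), and a few layers' worth of these intermediates pushes past the budget. FlashAttention avoids the OOM precisely because it never materializes that matrix.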