PygmalionAI/aphrodite-engine

[Installation]: Installing from source does not work. undefined symbol: _ZN3c104cuda14ExchangeDeviceEa

Nero10578 opened this issue · 8 comments

Your current environment

The output of `python env.py`

PyTorch version: 2.2.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A
OS: Ubuntu 22.04.4 LTS (x86_64)
GCC version: (conda-forge gcc 11.3.0-19) 11.3.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.35
Python version: 3.11.9 | packaged by conda-forge | (main, Apr 19 2024, 18:36:13) [GCC 12.3.0] (64-bit runtime)
Python platform: Linux-5.10.16.3-microsoft-standard-WSL2-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 12.1.105
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA TITAN X (Pascal)
Nvidia driver version: 552.22
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 39 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Vendor ID: GenuineIntel
Model name: Intel(R) Core(TM) i7-5775C CPU @ 3.30GHz
CPU family: 6
Model: 71
Thread(s) per core: 2
Core(s) per socket: 4
Socket(s): 1
Stepping: 1
BogoMIPS: 6599.99
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology cpuid pni pclmulqdq vmx ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi ept vpid ept_ad fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx smap xsaveopt md_clear flush_l1d arch_capabilities
Virtualization: VT-x
Hypervisor vendor: Microsoft
Virtualization type: full
L1d cache: 128 KiB (4 instances)
L1i cache: 128 KiB (4 instances)
L2 cache: 1 MiB (4 instances)
L3 cache: 6 MiB (1 instance)
L4 cache: 512 MiB (4 instances)
Vulnerability Itlb multihit: KVM: Mitigation: VMX disabled
Vulnerability L1tf: Mitigation; PTE Inversion; VMX conditional cache flushes, SMT vulnerable
Vulnerability Mds: Mitigation; Clear CPU buffers; SMT Host state unknown
Vulnerability Meltdown: Mitigation; PTI
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Full generic retpoline, IBPB conditional, IBRS_FW, STIBP conditional, RSB filling
Vulnerability Srbds: Unknown: Dependent on hypervisor status
Vulnerability Tsx async abort: Mitigation; Clear CPU buffers; SMT Host state unknown
Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] torch==2.2.0
[pip3] triton==2.2.0
[conda] blas 2.116 mkl conda-forge
[conda] blas-devel 3.9.0 16_linux64_mkl conda-forge
[conda] libblas 3.9.0 16_linux64_mkl conda-forge
[conda] libcblas 3.9.0 16_linux64_mkl conda-forge
[conda] liblapack 3.9.0 16_linux64_mkl conda-forge
[conda] liblapacke 3.9.0 16_linux64_mkl conda-forge
[conda] mkl 2022.1.0 h84fe81f_915 conda-forge
[conda] mkl-devel 2022.1.0 ha770c72_916 conda-forge
[conda] mkl-include 2022.1.0 h84fe81f_915 conda-forge
[conda] numpy 1.26.4 pypi_0 pypi
[conda] pytorch-cuda 12.1 ha16c6d3_5 pytorch
[conda] pytorch-mutex 1.0 cuda pytorch
[conda] torch 2.2.0 pypi_0 pypi
[conda] torchtriton 2.2.0 py311 pytorch
ROCM Version: Could not collect
Aphrodite Version: 0.5.2
Aphrodite Build Flags:
CUDA Archs: Not Set; ROCm: Disabled

How did you install Aphrodite?

The installation using ./update-runtime.sh seems to run just fine:

 Building editable for aphrodite-engine (pyproject.toml) ... done
  Created wheel for aphrodite-engine: filename=aphrodite_engine-0.5.2-0.editable-cp311-cp311-linux_x86_64.whl size=18296 sha256=92551a3b020e409298f76e2c7c6dd70b9ef58f5341875beb6783038e16995f3c
  Stored in directory: /tmp/pip-ephem-wheel-cache-47pacvfu/wheels/cd/bd/20/f28732262b5ba5c76598e5772372ef904d247cf5045e9a1949
Successfully built aphrodite-engine
Installing collected packages: aphrodite-engine

Successfully installed aphrodite-engine-0.5.2
Remote version of pip: 24.0
Local version of pip:  24.0
Was pip installed by pip? False
Removed build tracker: '/tmp/pip-build-tracker-rj27ycun'

However, when I try to run aphrodite-engine I keep getting an error that looks CUDA-related to me. I don't get it, since my system works just fine with a fresh miniconda environment and the quick pip install aphrodite-engine. EDIT: This install method also used to work just fine when I used an RTX 30 series GPU in the same system. Now it doesn't work at all on any GPU.

Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/home/owen/aphrodite-engine/aphrodite/endpoints/openai/api_server.py", line 33, in <module>
    from aphrodite.endpoints.openai.serving_chat import OpenAIServingChat
  File "/home/owen/aphrodite-engine/aphrodite/endpoints/openai/serving_chat.py", line 16, in <module>
    from aphrodite.modeling.outlines_decoding import (
  File "/home/owen/aphrodite-engine/aphrodite/modeling/__init__.py", line 2, in <module>
    from aphrodite.modeling.loader import get_model
  File "/home/owen/aphrodite-engine/aphrodite/modeling/loader.py", line 13, in <module>
    from aphrodite.modeling.hf_downloader import (
  File "/home/owen/aphrodite-engine/aphrodite/modeling/hf_downloader.py", line 21, in <module>
    from aphrodite.modeling.layers.quantization import (get_quantization_config,
  File "/home/owen/aphrodite-engine/aphrodite/modeling/layers/quantization/__init__.py", line 4, in <module>
    from aphrodite.modeling.layers.quantization.aqlm import AQLMConfig
  File "/home/owen/aphrodite-engine/aphrodite/modeling/layers/quantization/aqlm.py", line 12, in <module>
    from aphrodite._C import ops
ImportError: /home/owen/aphrodite-engine/aphrodite/_C.cpython-311-x86_64-linux-gnu.so: undefined symbol: _ZN3c104cuda14ExchangeDeviceEa
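
The missing symbol in the ImportError above is a mangled C++ name. As a minimal sketch (not a full Itanium ABI demangler, and `demangle_simple` is a hypothetical helper, not part of aphrodite or PyTorch), the following decodes simple nested names like this one. It is an assumption, not something the traceback proves, that the symbol is missing because `_C` was compiled against a different PyTorch build than the one loaded at runtime.

```python
# Minimal sketch: handles only plain nested names with builtin parameter types,
# e.g. _ZN3c104cuda14ExchangeDeviceEa. No templates, pointers, or substitutions.

# Subset of the Itanium C++ ABI builtin type codes.
BUILTIN_TYPES = {
    "v": "void", "b": "bool", "c": "char", "a": "signed char",
    "h": "unsigned char", "s": "short", "t": "unsigned short",
    "i": "int", "j": "unsigned int", "l": "long", "m": "unsigned long",
    "x": "long long", "y": "unsigned long long", "f": "float", "d": "double",
}


def demangle_simple(symbol: str) -> str:
    """Decode a symbol of the form _ZN<len><name>...E<builtin type codes>."""
    if not symbol.startswith("_ZN"):
        raise ValueError("only _ZN...E nested names are handled")
    pos = 3
    parts = []
    # Each name component is written as <decimal length><identifier>.
    while pos < len(symbol) and symbol[pos] != "E":
        start = pos
        while pos < len(symbol) and symbol[pos].isdigit():
            pos += 1
        if pos == start:
            raise ValueError("expected a length prefix at offset %d" % start)
        length = int(symbol[start:pos])
        parts.append(symbol[pos:pos + length])
        pos += length
    if pos >= len(symbol):
        raise ValueError("missing terminating 'E'")
    pos += 1  # skip the 'E' that closes the nested name
    params = []
    for code in symbol[pos:]:
        if code not in BUILTIN_TYPES:
            raise ValueError("unsupported type code: %r" % code)
        params.append(BUILTIN_TYPES[code])
    if params == ["void"]:
        params = []  # a lone 'v' means an empty parameter list
    return "::".join(parts) + "(" + ", ".join(params) + ")"


if __name__ == "__main__":
    print(demangle_simple("_ZN3c104cuda14ExchangeDeviceEa"))
```

The trailing `a` is the code for `signed char`, so the extension is looking for a `c10::cuda::ExchangeDevice` overload that takes an 8-bit device index. A library that exports the function with a different parameter type would produce a different mangled name, which would show up as exactly this kind of undefined symbol.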

Seems like just the latest commit is broken. Older commit works fine.

Edit: Nevermind

I've been getting the same error.

I got it to work by adding flash-attn and chardet, and removing the versions for xformers and torch, in the requirements.txt file. The issue now is that it says it can't find flash-attn and is using xformers.

Weird, does this only happen with Pascal GPUs? I can't reproduce this on Ampere, and I don't have any pascal gpus. Can you try #454?

I only noticed this issue when I reinstalled aphrodite after changing to a GTX Titan X Pascal 12GB. But now that I've tried reinstalling it on my main PC with an RTX 4090, I am having the same issue.

I can install it from the release wheel 0.5.2 just fine if I install cuda toolkit first in the python conda environment I manually created.

It's not installing properly using ./update-runtime.sh or pip install -e . either. It was working just last week, and now even the previous commits, including the commit that release 0.5.2 was made from, do not work. All I can think of is that this must have something to do with one of the dependencies that does not have a version set.
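
One way to test the unpinned-dependency theory is to list which requirement lines lack an exact `==` pin. This is a rough sketch under stated assumptions: `unpinned_requirements` is a hypothetical helper, the sample lines are made up for illustration and are not the project's actual requirements.txt, and the regex ignores URL and VCS requirements.

```python
import re

# A name (optionally with extras) followed by an exact "==" pin, e.g. "torch==2.2.0".
PINNED = re.compile(r"^[A-Za-z0-9._-]+(\[[^\]]*\])?\s*==\s*[^\s;]+")


def unpinned_requirements(lines):
    """Return requirement lines that are not pinned to an exact version."""
    result = []
    for raw in lines:
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or line.startswith("-"):  # skip blanks and pip options like -r
            continue
        if not PINNED.match(line):
            result.append(line)
    return result


if __name__ == "__main__":
    # Illustrative input only; substitute the lines of the real requirements file.
    sample = ["torch==2.2.0", "xformers", "ray>=2.9", "# a comment", ""]
    print(unpinned_requirements(sample))
```

Anything this reports can resolve to a newer release on a fresh install, which would explain a build that worked last week and fails today without any change in the repository.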

Issues I found when trying to run aphrodite after building from source:
I noticed that I could fix the undefined symbol: _ZN3c104cuda14ExchangeDeviceEa error by installing the latest pytorch, but then it complains that xformers was built for pytorch 2.2.0 not the latest version 2.3.0.

So I reinstalled xformers using pip3 install -U xformers --index-url https://download.pytorch.org/whl/cu121 which got it to run further but then it errors out and says that I need to install 'chardet'.

After installing chardet, it runs properly but then says flash-attn was not found and falls back to xformers. If I install it with pip install flash-attn, it still says flash-attn is not found. I'm not sure what the problem is here.
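
The chardet traceback in this thread shows requests/compat.py trying `import chardet` first and then `import charset_normalizer as chardet`. A minimal sketch of that try-then-fall-back pattern follows; `import_first_available` is a hypothetical helper written for illustration, not code from requests.

```python
import importlib


def import_first_available(module_names):
    """Return the first module in module_names that imports without error."""
    errors = {}
    for name in module_names:
        try:
            return importlib.import_module(name)
        except ImportError as exc:  # ModuleNotFoundError is a subclass of ImportError
            errors[name] = str(exc)
    raise ImportError("none of %s could be imported: %s" % (list(module_names), errors))


if __name__ == "__main__":
    # The first name is deliberately bogus, so the stdlib json module is returned.
    module = import_first_available(["no_such_module_xyz_123", "json"])
    print(module.__name__)
```

Read this way, installing chardet probably did not repair charset_normalizer. It more likely made the first import succeed, so the broken fallback was never reached.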

Steps:

  1. Install from source latest commit 205c8e4106a9fc0cdc45102ea11b0eed80a807aa with ./update-runtime.sh
  2. Try to run aphrodite, which results in:
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/home/owen/aphrodite-engine/aphrodite/endpoints/openai/api_server.py", line 33, in <module>
    from aphrodite.endpoints.openai.serving_chat import OpenAIServingChat
  File "/home/owen/aphrodite-engine/aphrodite/endpoints/openai/serving_chat.py", line 16, in <module>
    from aphrodite.modeling.outlines_decoding import (
  File "/home/owen/aphrodite-engine/aphrodite/modeling/__init__.py", line 2, in <module>
    from aphrodite.modeling.loader import get_model
  File "/home/owen/aphrodite-engine/aphrodite/modeling/loader.py", line 13, in <module>
    from aphrodite.modeling.hf_downloader import (
  File "/home/owen/aphrodite-engine/aphrodite/modeling/hf_downloader.py", line 21, in <module>
    from aphrodite.modeling.layers.quantization import (get_quantization_config,
  File "/home/owen/aphrodite-engine/aphrodite/modeling/layers/quantization/__init__.py", line 4, in <module>
    from aphrodite.modeling.layers.quantization.aqlm import AQLMConfig
  File "/home/owen/aphrodite-engine/aphrodite/modeling/layers/quantization/aqlm.py", line 12, in <module>
    from aphrodite._C import ops
ImportError: /home/owen/aphrodite-engine/aphrodite/_C.cpython-311-x86_64-linux-gnu.so: undefined symbol: _ZN3c104cuda14ExchangeDeviceEa
  1. Install pytorch with ./runtime.sh conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia
  2. Try to run aphrodite:
Traceback (most recent call last):
  File "/home/owen/aphrodite-engine/conda/envs/aphrodite-runtime/lib/python3.11/site-packages/requests/compat.py", line 11, in <module>
    import chardet
ModuleNotFoundError: No module named 'chardet'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<frozen runpy>", line 189, in _run_module_as_main
  File "<frozen runpy>", line 112, in _get_module_details
  File "/home/owen/aphrodite-engine/aphrodite/__init__.py", line 1, in <module>
    from aphrodite.engine.args_tools import AsyncEngineArgs, EngineArgs
  File "/home/owen/aphrodite-engine/aphrodite/engine/args_tools.py", line 6, in <module>
    from aphrodite.common.config import (
  File "/home/owen/aphrodite-engine/aphrodite/common/config.py", line 9, in <module>
    from transformers import PretrainedConfig
  File "/home/owen/aphrodite-engine/conda/envs/aphrodite-runtime/lib/python3.11/site-packages/transformers/__init__.py", line 26, in <module>
    from . import dependency_versions_check
  File "/home/owen/aphrodite-engine/conda/envs/aphrodite-runtime/lib/python3.11/site-packages/transformers/dependency_versions_check.py", line 16, in <module>
    from .utils.versions import require_version, require_version_core
  File "/home/owen/aphrodite-engine/conda/envs/aphrodite-runtime/lib/python3.11/site-packages/transformers/utils/__init__.py", line 18, in <module>
    from huggingface_hub import get_full_repo_name  # for backward compatibility
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/owen/aphrodite-engine/conda/envs/aphrodite-runtime/lib/python3.11/site-packages/huggingface_hub/__init__.py", line 503, in __getattr__
    submod = importlib.import_module(submod_path)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/owen/aphrodite-engine/conda/envs/aphrodite-runtime/lib/python3.11/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/owen/aphrodite-engine/conda/envs/aphrodite-runtime/lib/python3.11/site-packages/huggingface_hub/hf_api.py", line 45, in <module>
    import requests
  File "/home/owen/aphrodite-engine/conda/envs/aphrodite-runtime/lib/python3.11/site-packages/requests/__init__.py", line 45, in <module>
    from .exceptions import RequestsDependencyWarning
  File "/home/owen/aphrodite-engine/conda/envs/aphrodite-runtime/lib/python3.11/site-packages/requests/exceptions.py", line 9, in <module>
    from .compat import JSONDecodeError as CompatJSONDecodeError
  File "/home/owen/aphrodite-engine/conda/envs/aphrodite-runtime/lib/python3.11/site-packages/requests/compat.py", line 13, in <module>
    import charset_normalizer as chardet
  File "/home/owen/aphrodite-engine/conda/envs/aphrodite-runtime/lib/python3.11/site-packages/charset_normalizer/__init__.py", line 23, in <module>
    from charset_normalizer.api import from_fp, from_path, from_bytes, normalize
  File "/home/owen/aphrodite-engine/conda/envs/aphrodite-runtime/lib/python3.11/site-packages/charset_normalizer/api.py", line 10, in <module>
    from charset_normalizer.md import mess_ratio
  File "charset_normalizer/md.py", line 5, in <module>
ImportError: cannot import name 'COMMON_SAFE_ASCII_CHARACTERS' from 'charset_normalizer.constant' (/home/owen/aphrodite-engine/conda/envs/aphrodite-runtime/lib/python3.11/site-packages/charset_normalizer/constant.py)
  1. Install chardet with ./runtime.sh pip install chardet
  2. Try to run aphrodite:
WARNING[XFORMERS]: xFormers can't load C++/CUDA extensions. xFormers was built for:
    PyTorch 2.2.0+cu121 with CUDA 1201 (you have 2.3.0)
    Python  3.11.7 (you have 3.11.9)
  Please reinstall xformers (see https://github.com/facebookresearch/xformers#installing-xformers)
  Memory-efficient attention, SwiGLU, sparse and more won't be available.
  Set XFORMERS_MORE_DETAILS=1 for more details
INFO:     Using fp8_e5m2 data type to store kv cache. It reduces the GPU memory footprint and boosts the performance.
But it may cause slight accuracy drop. Currently we only support fp8 without scaling factors and use e5m2 as a default
format.
2024-05-07 05:00:55,532 INFO worker.py:1749 -- Started a local Ray instance.
INFO:     Initializing the Aphrodite Engine (v0.5.2) with the following config:
INFO:     Model = '/home/owen/models/Awanllm-Llama-3-8B-Instruct-DPO-v0.1'
INFO:     DataType = torch.bfloat16
INFO:     Model Load Format = auto
INFO:     Number of GPUs = 1
INFO:     Disable Custom All-Reduce = False
INFO:     Quantization Format = None
INFO:     Context Length = 8192
INFO:     Enforce Eager Mode = True
INFO:     KV Cache Data Type = fp8_e5m2
INFO:     Device = cuda
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO:     flash_attn is not found. Using xformers backend.
INFO:     Model weights loaded. Memory usage: 14.96 GiB x 1 = 14.96 GiB
[rank0]: Traceback (most recent call last):
[rank0]:   File "<frozen runpy>", line 198, in _run_module_as_main
[rank0]:   File "<frozen runpy>", line 88, in _run_code
[rank0]:   File "/home/owen/aphrodite-engine/aphrodite/endpoints/openai/api_server.py", line 619, in <module>
[rank0]:     engine = AsyncAphrodite.from_engine_args(engine_args)
[rank0]:              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/owen/aphrodite-engine/aphrodite/engine/async_aphrodite.py", line 342, in from_engine_args
[rank0]:     engine = cls(parallel_config.worker_use_ray,
[rank0]:              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/owen/aphrodite-engine/aphrodite/engine/async_aphrodite.py", line 313, in __init__
[rank0]:     self.engine = self._init_engine(*args, **kwargs)
[rank0]:                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/owen/aphrodite-engine/aphrodite/engine/async_aphrodite.py", line 413, in _init_engine
[rank0]:     return engine_class(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/owen/aphrodite-engine/aphrodite/engine/aphrodite_engine.py", line 111, in __init__
[rank0]:     self.model_executor = executor_class(model_config, cache_config,
[rank0]:                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/owen/aphrodite-engine/aphrodite/executor/ray_gpu_executor.py", line 71, in __init__
[rank0]:     self._init_cache()
[rank0]:   File "/home/owen/aphrodite-engine/aphrodite/executor/ray_gpu_executor.py", line 237, in _init_cache
[rank0]:     num_blocks = self._run_workers(
[rank0]:                  ^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/owen/aphrodite-engine/aphrodite/executor/ray_gpu_executor.py", line 341, in _run_workers
[rank0]:     driver_worker_output = getattr(self.driver_worker,
[rank0]:                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/owen/aphrodite-engine/conda/envs/aphrodite-runtime/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/owen/aphrodite-engine/aphrodite/task_handler/worker.py", line 132, in profile_num_available_blocks
[rank0]:     self.model_runner.profile_run()
[rank0]:   File "/home/owen/aphrodite-engine/conda/envs/aphrodite-runtime/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/owen/aphrodite-engine/aphrodite/task_handler/model_runner.py", line 765, in profile_run
[rank0]:     self.execute_model(seqs, kv_caches)
[rank0]:   File "/home/owen/aphrodite-engine/conda/envs/aphrodite-runtime/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/owen/aphrodite-engine/aphrodite/task_handler/model_runner.py", line 700, in execute_model
[rank0]:     hidden_states = model_executable(
[rank0]:                     ^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/owen/aphrodite-engine/conda/envs/aphrodite-runtime/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/owen/aphrodite-engine/conda/envs/aphrodite-runtime/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/owen/aphrodite-engine/aphrodite/modeling/models/llama.py", line 426, in forward
[rank0]:     hidden_states = self.model(input_ids, positions, kv_caches,
[rank0]:                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/owen/aphrodite-engine/conda/envs/aphrodite-runtime/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/owen/aphrodite-engine/conda/envs/aphrodite-runtime/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/owen/aphrodite-engine/aphrodite/modeling/models/llama.py", line 351, in forward
[rank0]:     hidden_states, residual = layer(
[rank0]:                               ^^^^^^
[rank0]:   File "/home/owen/aphrodite-engine/conda/envs/aphrodite-runtime/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/owen/aphrodite-engine/conda/envs/aphrodite-runtime/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/owen/aphrodite-engine/aphrodite/modeling/models/llama.py", line 298, in forward
[rank0]:     hidden_states = self.self_attn(
[rank0]:                     ^^^^^^^^^^^^^^^
[rank0]:   File "/home/owen/aphrodite-engine/conda/envs/aphrodite-runtime/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/owen/aphrodite-engine/conda/envs/aphrodite-runtime/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/owen/aphrodite-engine/aphrodite/modeling/models/llama.py", line 228, in forward
[rank0]:     attn_output = self.attn(
[rank0]:                   ^^^^^^^^^^
[rank0]:   File "/home/owen/aphrodite-engine/conda/envs/aphrodite-runtime/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/owen/aphrodite-engine/conda/envs/aphrodite-runtime/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/owen/aphrodite-engine/aphrodite/modeling/layers/attention/__init__.py", line 67, in forward
[rank0]:     return self.backend.forward(query, key, value, key_cache, value_cache,
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/owen/aphrodite-engine/aphrodite/modeling/layers/attention/backends/xformers.py", line 144, in forward
[rank0]:     output = self._run_memory_efficient_xformer_forward(
[rank0]:              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/owen/aphrodite-engine/aphrodite/modeling/layers/attention/backends/xformers.py", line 213, in _run_memory_efficient_xformer_forward
[rank0]:     out = xops.memory_efficient_attention_forward(
[rank0]:           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/owen/aphrodite-engine/conda/envs/aphrodite-runtime/lib/python3.11/site-packages/xformers/ops/fmha/__init__.py", line 244, in memory_efficient_attention_forward
[rank0]:     return _memory_efficient_attention_forward(
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/owen/aphrodite-engine/conda/envs/aphrodite-runtime/lib/python3.11/site-packages/xformers/ops/fmha/__init__.py", line 337, in _memory_efficient_attention_forward
[rank0]:     op = _dispatch_fw(inp, False)
[rank0]:          ^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/owen/aphrodite-engine/conda/envs/aphrodite-runtime/lib/python3.11/site-packages/xformers/ops/fmha/dispatch.py", line 120, in _dispatch_fw
[rank0]:     return _run_priority_list(
[rank0]:            ^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/owen/aphrodite-engine/conda/envs/aphrodite-runtime/lib/python3.11/site-packages/xformers/ops/fmha/dispatch.py", line 63, in _run_priority_list
[rank0]:     raise NotImplementedError(msg)
[rank0]: NotImplementedError: No operator found for `memory_efficient_attention_forward` with inputs:
[rank0]:      query       : shape=(1, 8192, 8, 4, 128) (torch.bfloat16)
[rank0]:      key         : shape=(1, 8192, 8, 4, 128) (torch.bfloat16)
[rank0]:      value       : shape=(1, 8192, 8, 4, 128) (torch.bfloat16)
[rank0]:      attn_bias   : <class 'xformers.ops.fmha.attn_bias.BlockDiagonalCausalMask'>
[rank0]:      p           : 0.0
[rank0]: `flshattF@0.0.0` is not supported because:
[rank0]:     xFormers wasn't build with CUDA support
[rank0]:     operator wasn't built - see `python -m xformers.info` for more info
[rank0]: `tritonflashattF` is not supported because:
[rank0]:     xFormers wasn't build with CUDA support
[rank0]:     attn_bias type is <class 'xformers.ops.fmha.attn_bias.BlockDiagonalCausalMask'>
[rank0]:     operator wasn't built - see `python -m xformers.info` for more info
[rank0]:     operator does not support BMGHK format
[rank0]:     triton is not available
[rank0]:     Only work on pre-MLIR triton for now
[rank0]: `cutlassF` is not supported because:
[rank0]:     xFormers wasn't build with CUDA support
[rank0]:     operator wasn't built - see `python -m xformers.info` for more info
[rank0]: `smallkF` is not supported because:
[rank0]:     max(query.shape[-1] != value.shape[-1]) > 32
[rank0]:     xFormers wasn't build with CUDA support
[rank0]:     dtype=torch.bfloat16 (supported: {torch.float32})
[rank0]:     attn_bias type is <class 'xformers.ops.fmha.attn_bias.BlockDiagonalCausalMask'>
[rank0]:     has custom scale
[rank0]:     operator wasn't built - see `python -m xformers.info` for more info
[rank0]:     operator does not support BMGHK format
[rank0]:     unsupported embed per head: 128
  1. Reinstall xformers with ./runtime.sh pip3 install -U xformers --index-url https://download.pytorch.org/whl/cu121
  2. Aphrodite runs but says 'INFO: flash_attn is not found. Using xformers backend.'
  3. Reinstall flash-attn with ./runtime.sh pip install flash-attn
  4. Makes no difference it still says flash-attn not found.
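
When a package is installed but the program still reports it as missing, it can help to check what the interpreter that runs aphrodite actually sees. The sketch below is an assumption-laden aid, not part of aphrodite: `describe_packages` is a hypothetical helper, and the pairing of pip names with import names (for example flash-attn with flash_attn) is taken from the messages quoted in the steps above.

```python
import importlib.util
from importlib import metadata


def describe_packages(pairs):
    """Map each distribution name to (installed version or None, importable flag).

    pairs is an iterable of (distribution_name, top_level_module_name).
    """
    report = {}
    for dist_name, module_name in pairs:
        try:
            version = metadata.version(dist_name)
        except metadata.PackageNotFoundError:
            version = None  # pip metadata for this distribution is not visible
        importable = importlib.util.find_spec(module_name) is not None
        report[dist_name] = (version, importable)
    return report


if __name__ == "__main__":
    checks = [
        ("torch", "torch"),
        ("xformers", "xformers"),
        ("flash-attn", "flash_attn"),
        ("chardet", "chardet"),
    ]
    for name, (version, importable) in describe_packages(checks).items():
        print(name, version, "importable" if importable else "not importable")
```

Run it inside the same environment that launches aphrodite (the steps above use ./runtime.sh for that). A version of None next to a package that was just installed suggests pip installed it into a different environment.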

This should be fixed with the latest release.

Can confirm, works great now! Awesome work! Thank you!

Although the performance on my GTX Titan X Pascal 12GB somehow got halved. Was there a kernel change for running GGUF that switched from FP32 to FP16 or something? I ask because the non-P100 Pascal cards have crappy FP16 performance.

EDIT: Never mind, I rebooted my machine and somehow it's fast again lol