Unknown quantization method: flute
ValueError: Unknown quantization method: flute. Must be one of ['aqlm', 'awq', 'deepspeedfp', 'tpu_int8', 'fp8', 'fbgemm_fp8', 'modelopt', 'marlin', 'gguf', 'gptq_marlin_24', 'gptq_marlin', 'awq_marlin', 'gptq', 'compressed-tensors', 'bitsandbytes', 'qqq', 'experts_int8', 'neuron_quant'].
Hi, thanks for trying it out!
Do you mind describing what integration you are using (vLLM I assume) and the command you used?
Sorry for the late reply. Here are the details for the issue:
The bash command:
python -m flute.integrations.vllm vllm.entrypoints.openai.api_server --model ./gemma-2-27b-it-flute --quantisation flute
Here is the detailed ERROR message:
/home/miniconda/envs/quant/lib/python3.11/site-packages/flute/ops.py:22: FutureWarning: torch.library.impl_abstract was renamed to torch.library.register_fake. Please use that instead; we will remove torch.library.impl_abstract in a future version of PyTorch.
@torch.library.impl_abstract("flute::qgemm_simple_80")
/home/miniconda/envs/quant/lib/python3.11/site-packages/flute/ops.py:45: FutureWarning: torch.library.impl_abstract was renamed to torch.library.register_fake. Please use that instead; we will remove torch.library.impl_abstract in a future version of PyTorch.
@torch.library.impl_abstract("flute::qgemm_simple_86")
/home/miniconda/envs/quant/lib/python3.11/site-packages/flute/ops.py:68: FutureWarning: torch.library.impl_abstract was renamed to torch.library.register_fake. Please use that instead; we will remove torch.library.impl_abstract in a future version of PyTorch.
@torch.library.impl_abstract("flute::qgemm_simple_89")
/home/miniconda/envs/quant/lib/python3.11/site-packages/flute/ops.py:91: FutureWarning: torch.library.impl_abstract was renamed to torch.library.register_fake. Please use that instead; we will remove torch.library.impl_abstract in a future version of PyTorch.
@torch.library.impl_abstract("flute::qgemm_raw_simple_80")
/home/miniconda/envs/quant/lib/python3.11/site-packages/flute/ops.py:107: FutureWarning: torch.library.impl_abstract was renamed to torch.library.register_fake. Please use that instead; we will remove torch.library.impl_abstract in a future version of PyTorch.
@torch.library.impl_abstract("flute::qgemm_raw_simple_86")
/home/miniconda/envs/quant/lib/python3.11/site-packages/flute/ops.py:123: FutureWarning: torch.library.impl_abstract was renamed to torch.library.register_fake. Please use that instead; we will remove torch.library.impl_abstract in a future version of PyTorch.
@torch.library.impl_abstract("flute::qgemm_raw_simple_89")
[FLUTE]: Using A100 with CC=(8, 0)
/home/miniconda/envs/quant/lib/python3.11/site-packages/flute/__init__.py:91: FutureWarning: You are using torch.load with weights_only=False (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for weights_only will be flipped to True. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via torch.serialization.add_safe_globals. We recommend you start setting weights_only=True for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
TEMPLATE_CONFIGS = torch.load(TEMPLATE_CONFIGS_PATH)
[FLUTE]: Template configs loaded from /home/miniconda/envs/quant/lib/python3.11/site-packages/flute/data/qgemm_kernel_raw_generated_configs.pth
/home/miniconda/envs/quant/lib/python3.11/site-packages/flute/__init__.py:107: FutureWarning: You are using torch.load with weights_only=False (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for weights_only will be flipped to True. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via torch.serialization.add_safe_globals. We recommend you start setting weights_only=True for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
TEMPLATE_TUNED_WITH_M_CONFIGS = torch.load(TEMPLATE_TUNED_WITH_M_CONFIGS_PATH)
[FLUTE]: Template (tuned, with M) configs loaded from /home/miniconda/envs/quant/lib/python3.11/site-packages/flute/data/qgemm_kernel_raw_tuned_configs.pth
/home/miniconda/envs/quant/lib/python3.11/site-packages/flute/__init__.py:114: FutureWarning: You are using torch.load with weights_only=False (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for weights_only will be flipped to True. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via torch.serialization.add_safe_globals. We recommend you start setting weights_only=True for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
TEMPLATE_TUNED_WITHOUT_M_CONFIGS = torch.load(TEMPLATE_TUNED_WITHOUT_M_CONFIGS_PATH)
[FLUTE]: Template (tuned, without M) configs loaded from /home/miniconda/envs/quant/lib/python3.11/site-packages/flute/data/qgemm_kernel_raw_tuned_configs.no-M.pth
INFO 09-23 11:00:23 api_server.py:495] vLLM API server version 0.6.1.post2
INFO 09-23 11:00:23 api_server.py:496] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_auto_tool_choice=False, tool_call_parser=None, model='./gemma-2-27b-it-flute', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', config_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization='flute', rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, override_neuron_config=None, disable_log_requests=False, max_log_len=None)
WARNING 09-23 11:00:23 utils.py:727] Gemma 2 uses sliding window attention for every odd layer, which is currently not supported by vLLM. Disabling sliding window and capping the max length to the sliding window size (4096).
WARNING 09-23 11:00:23 config.py:335] flute quantization is not fully optimized yet. The speed can be slower than non-quantized models.
INFO 09-23 11:00:23 api_server.py:162] Multiprocessing frontend to use ipc:///tmp/0ce86f01-fad2-4691-b301-3f3a9754c9f1 for RPC Path.
INFO 09-23 11:00:23 api_server.py:178] Started engine process with PID 6032
WARNING 09-23 11:00:27 utils.py:727] Gemma 2 uses sliding window attention for every odd layer, which is currently not supported by vLLM. Disabling sliding window and capping the max length to the sliding window size (4096).
Process SpawnProcess-1:
Traceback (most recent call last):
File "/home/miniconda/envs/quant/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/home/miniconda/envs/quant/lib/python3.11/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/home/miniconda/envs/quant/lib/python3.11/site-packages/vllm/entrypoints/openai/rpc/server.py", line 236, in run_rpc_server
server = AsyncEngineRPCServer(async_engine_args, usage_context, rpc_path)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/miniconda/envs/quant/lib/python3.11/site-packages/vllm/entrypoints/openai/rpc/server.py", line 34, in init
self.engine = AsyncLLMEngine.from_engine_args(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/miniconda/envs/quant/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 568, in from_engine_args
engine_config = engine_args.create_engine_config()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/miniconda/envs/quant/lib/python3.11/site-packages/vllm/engine/arg_utils.py", line 844, in create_engine_config
model_config = self.create_model_config()
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/miniconda/envs/quant/lib/python3.11/site-packages/vllm/engine/arg_utils.py", line 782, in create_model_config
return ModelConfig(
^^^^^^^^^^^^
File "/home/miniconda/envs/quant/lib/python3.11/site-packages/vllm/config.py", line 243, in init
self._verify_quantization()
File "/home/miniconda/envs/quant/lib/python3.11/site-packages/vllm/config.py", line 321, in _verify_quantization
raise ValueError(
ValueError: Unknown quantization method: flute. Must be one of ['aqlm', 'awq', 'deepspeedfp', 'tpu_int8', 'fp8', 'fbgemm_fp8', 'modelopt', 'marlin', 'gguf', 'gptq_marlin_24', 'gptq_marlin', 'awq_marlin', 'gptq', 'compressed-tensors', 'bitsandbytes', 'qqq', 'experts_int8', 'neuron_quant'].
ERROR 09-23 11:00:28 api_server.py:188] RPCServer process died before responding to readiness probe
Here's my Python environment info:
Package Version
accelerate 0.34.2
aiohappyeyeballs 2.4.0
aiohttp 3.10.5
aiosignal 1.3.1
annotated-types 0.7.0
anyio 4.5.0
attrs 24.2.0
bitsandbytes 0.42.0
certifi 2024.8.30
charset-normalizer 3.3.2
click 8.1.7
cloudpickle 3.0.0
contourpy 1.3.0
cycler 0.12.1
datasets 3.0.0
dill 0.3.8
diskcache 5.6.3
distro 1.9.0
einops 0.8.0
fastapi 0.115.0
filelock 3.16.1
flute-kernel 0.0.7
fonttools 4.53.1
frozenlist 1.4.1
fsspec 2024.6.1
gguf 0.9.1
h11 0.14.0
httpcore 1.0.5
httptools 0.6.1
httpx 0.27.2
huggingface-hub 0.25.0
idna 3.10
importlib_metadata 8.5.0
intel_extension_for_pytorch 2.4.0
interegular 0.3.3
jaxtyping 0.2.34
Jinja2 3.1.4
jiter 0.5.0
jsonschema 4.23.0
jsonschema-specifications 2023.12.1
kiwisolver 1.4.7
lark 1.2.2
llvmlite 0.43.0
lm-format-enforcer 0.10.6
MarkupSafe 2.1.5
matplotlib 3.9.2
mistral_common 1.4.2
mpmath 1.3.0
msgpack 1.1.0
msgspec 0.18.6
multidict 6.1.0
multiprocess 0.70.16
nest-asyncio 1.6.0
networkx 3.3
ninja 1.11.1.1
numba 0.60.0
numpy 1.26.4
nvidia-cublas-cu12 12.1.3.1
nvidia-cuda-cupti-cu12 12.1.105
nvidia-cuda-nvrtc-cu12 12.1.105
nvidia-cuda-runtime-cu12 12.1.105
nvidia-cudnn-cu12 9.1.0.70
nvidia-cufft-cu12 11.0.2.54
nvidia-curand-cu12 10.3.2.106
nvidia-cusolver-cu12 11.4.5.107
nvidia-cusparse-cu12 12.1.0.106
nvidia-ml-py 12.560.30
nvidia-nccl-cu12 2.20.5
nvidia-nvjitlink-cu12 12.6.68
nvidia-nvtx-cu12 12.1.105
openai 1.46.0
opencv-python-headless 4.10.0.84
outlines 0.0.46
packaging 24.1
pandas 2.2.2
partial-json-parser 0.2.1.1.post4
pillow 10.4.0
pip 24.2
prometheus_client 0.20.0
prometheus-fastapi-instrumentator 7.0.0
protobuf 5.28.2
psutil 6.0.0
py-cpuinfo 9.0.0
pyairports 2.1.1
pyarrow 17.0.0
pycountry 24.6.1
pydantic 2.9.2
pydantic_core 2.23.4
pyparsing 3.1.4
python-dateutil 2.9.0.post0
python-dotenv 1.0.1
pytz 2024.2
PyYAML 6.0.2
pyzmq 26.2.0
ray 2.36.0
referencing 0.35.1
regex 2024.9.11
requests 2.32.3
rpds-py 0.20.0
safetensors 0.4.5
scipy 1.14.1
sentencepiece 0.2.0
setuptools 75.1.0
six 1.16.0
sniffio 1.3.1
starlette 0.38.5
sympy 1.13.3
tiktoken 0.7.0
tokenizers 0.19.1
torch 2.4.0
torchvision 0.19.0
tqdm 4.66.5
transformers 4.44.2
triton 3.0.0
typeguard 2.13.3
typing_extensions 4.12.2
tzdata 2024.1
urllib3 2.2.3
uvicorn 0.30.6
uvloop 0.20.0
vllm 0.6.1.post2
vllm-flash-attn 2.6.1
watchfiles 0.24.0
websockets 13.0.1
wheel 0.44.0
xformers 0.0.27.post2
xxhash 3.5.0
yarl 1.11.1
zipp 3.20.2
zstandard 0.23.0
Thank you for the information!
Would it be possible to use --quantization instead of --quantisation? (Note the "z" instead of the "s".)
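For reference, the corrected command would then look something like this (same local model path as above; only the flag spelling changes):
python -m flute.integrations.vllm vllm.entrypoints.openai.api_server --model ./gemma-2-27b-it-flute --quantization flute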
Your suggestion worked! Thank you so much for your prompt response.
Great, glad this helped!