q80 and f16 models fail with Critical error: Unsupported ...
While conversion with `convert-hf.py` seems to work, models in `q80` and `f16` formats cannot be loaded. Here are the combinations I tried with Llama-3.3-70B-Instruct:
| quant | buffer-float-type | error |
|---|---|---|
| `q80` | `q40` | Critical error: Unsupported op quant: F_32/F_UNK/F_Q40 |
| `q80` | `q80` | Critical error: Unsupported CPU op code: MATMUL, quant: Q80_Q80_F32, op name: block_matmul_q |
| `q80` | `f16` | Critical error: Unsupported op quant: F_32/F_UNK/F_16 |
| `q80` | `f32` | Critical error: Unsupported op quant: F_32/F_Q80/F_32 |
| `f16` | `q40` | Critical error: Unsupported op quant: F_32/F_UNK/F_Q40 |
| `f16` | `q80` | Critical error: Unsupported op quant: F_Q80/F_16/F_32 |
| `f16` | `f16` | Critical error: Unsupported op quant: F_32/F_UNK/F_16 |
| `f16` | `f32` | Critical error: Unsupported op quant: F_32/F_16/F_32 |
I'm mostly interested in `q80` models with `f16` or higher precision for synchronization. With llama.cpp, 8-bit quantization usually yields very high performance (only slightly slower than 4-bit) without the sometimes obvious model degradation that comes with 4-bit quantization.

Am I doing something wrong, or is support currently missing?
Hello @lemmi,

currently DL supports:

- `f32` weights & `f32` buffer float type
- `q40` weights & `q80` buffer float type
`f32` is rather useless for the most part: very large model sizes and worse performance, since it's very clearly memory-bound at that point.
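A back-of-envelope estimate of the weight footprint for a 70B-parameter model illustrates the gap; the ~8.5 and ~4.5 bits-per-weight figures for the quantized formats are assumptions for block formats that store a per-block scale, not exact numbers for DL's on-disk layout:

```python
# Rough weight-storage estimate for a ~70B-parameter model.
# Bits-per-weight for q80/q40 are assumed values: block quantization
# stores a small per-block scale on top of the 8-bit/4-bit payload.
PARAMS = 70e9

formats = {
    "f32": 32.0,  # 4 bytes per weight
    "f16": 16.0,  # 2 bytes per weight
    "q80": 8.5,   # assumed: 8 bits + per-block scale
    "q40": 4.5,   # assumed: 4 bits + per-block scale
}

for name, bits in formats.items():
    gib = PARAMS * bits / 8 / 1024**3
    print(f"{name}: ~{gib:.0f} GiB of weights")
```

Under these assumptions, `f32` weights come out around 260 GiB versus roughly 70 GiB for `q80` and under 40 GiB for `q40`, so every generated token has to stream several times more data from memory.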
I wasn't able to get `q40` to work with `f32` buffer float type. I get

> 🚨 Critical error: This version supports only Q40 weights with Q80 sync type

with all the models I currently have available.
`q40` with `f32` could also be interesting. Does this skip an extra quantization step? I'm not concerned with the extra bandwidth, since I'm using 2×10 Gbps per node.

Are there plans to get at least 8-bit precision working (again)?
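For context on the "extra quantization step": with a `q80` buffer float type, each node's f32 activations are block-quantized to 8-bit values plus a per-block scale before a sync and dequantized on the receiving side, whereas an `f32` buffer skips that round trip at the cost of a larger payload. The sketch below is a generic Q8_0-style illustration of that round trip; the block size of 32 and the layout are assumptions, not distributed-llama's actual code:

```python
import numpy as np

BLOCK = 32  # assumed block size, as in common Q8_0-style formats

def quantize_q80(x: np.ndarray):
    """Quantize an f32 vector into int8 blocks with one f32 scale per block."""
    blocks = x.reshape(-1, BLOCK)
    scale = np.abs(blocks).max(axis=1, keepdims=True) / 127.0
    scale[scale == 0] = 1.0                       # avoid division by zero
    q = np.round(blocks / scale).astype(np.int8)  # 1 byte per value
    return q, scale.astype(np.float32)            # + 4 bytes per 32 values

def dequantize_q80(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Reconstruct an approximate f32 vector from the quantized blocks."""
    return (q.astype(np.float32) * scale).reshape(-1)

# Round-trip an activation buffer the way a q80 sync step would.
activations = np.random.randn(4096).astype(np.float32)
q, s = quantize_q80(activations)
restored = dequantize_q80(q, s)

print("max abs error :", np.abs(activations - restored).max())
print("payload bytes :", q.nbytes + s.nbytes, "vs f32:", activations.nbytes)
```

Under this assumed layout the sync payload shrinks to roughly 28% of the f32 size, which is the saving an `f32` sync type would give up in exchange for skipping the quantize/dequantize work.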
> I wasn't able to get q40 to work with f32 buffer float type.
You're right. I updated the previous comment. In the previous version of DL, `q40` weights and the `f32` buffer were supported. The current version does not have all the ops needed to support this mode, so for now DL supports only the two combinations above.
> Are there plans to get at least 8-bit precision working (again)?
Yes, but it's hard to say when. Currently the priority is Vulkan support.
Alright, thanks for clearing things up. In the meantime, I think it's worth pointing this out in https://github.com/b4rtaz/distributed-llama?tab=readme-ov-file#-known-limitations. I can make a PR later for that.
The README file has been updated, so for now, I'm closing this issue.