b4rtaz/distributed-llama

q80 and f16 models fail with Critical error: Unsupported ...


While conversion with convert-hf.py seems to work, models in q80 and f16 formats cannot be loaded. Here are the combinations I tried with Llama-3.3-70B-Instruct:

| quant | buffer-float-type | error |
| --- | --- | --- |
| q80 | q40 | Critical error: Unsupported op quant: F_32/F_UNK/F_Q40 |
| q80 | q80 | Critical error: Unsupported CPU op code: MATMUL, quant: Q80_Q80_F32, op name: block_matmul_q |
| q80 | f16 | Critical error: Unsupported op quant: F_32/F_UNK/F_16 |
| q80 | f32 | Critical error: Unsupported op quant: F_32/F_Q80/F_32 |
| f16 | q40 | Critical error: Unsupported op quant: F_32/F_UNK/F_Q40 |
| f16 | q80 | Critical error: Unsupported op quant: F_Q80/F_16/F_32 |
| f16 | f16 | Critical error: Unsupported op quant: F_32/F_UNK/F_16 |
| f16 | f32 | Critical error: Unsupported op quant: F_32/F_16/F_32 |

I'm mostly interested in q80 models with f16 or higher precision for synchronization. With llama.cpp, 8-bit quantization usually yields very high performance (only slightly slower than 4-bit) without the sometimes obvious model degradation that 4-bit quantization can cause.

Am I doing something wrong, or is support currently missing?

Hello @lemmi,

currently DL supports:

  • f32 weights & f32 buffer float type
  • q40 weights & q80 buffer float type
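To illustrate why the other combinations fail at load time (this is only a sketch of the idea with hypothetical names, not DL's actual source): op kernels exist only for specific weight/buffer quant pairs, and loading aborts when no kernel is registered for the requested pair.

```python
# Sketch only (hypothetical names, not distributed-llama's real code):
# kernels exist only for certain (weights type, buffer float type) pairs,
# so any other pair hits an "Unsupported op" error during model load.
SUPPORTED_PAIRS = {
    ("f32", "f32"),  # f32 weights + f32 buffer
    ("q40", "q80"),  # q40 weights + q80 buffer
}

def assert_supported(weights_type: str, buffer_type: str) -> None:
    if (weights_type, buffer_type) not in SUPPORTED_PAIRS:
        raise RuntimeError(
            f"Unsupported op quant: weights={weights_type}, buffer={buffer_type}")

assert_supported("q40", "q80")  # ok
assert_supported("q80", "f16")  # raises, like the errors in the table above
```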

f32 is rather useless for the most part: very large model sizes and worse performance, since inference is clearly memory-bound at that point.
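For rough scale, here's my own back-of-the-envelope estimate (the bits-per-weight values for q80/q40 are assumptions based on GGUF-like block layouts, so DL's exact sizes may differ):

```python
# Back-of-the-envelope weight sizes for a 70B-parameter model.
# The q80/q40 bits-per-weight values are assumptions (GGUF-like block layouts).
params = 70e9
for name, bits_per_weight in [("f32", 32), ("f16", 16), ("q80", 8.5), ("q40", 4.5)]:
    print(f"{name}: ~{params * bits_per_weight / 8 / 1e9:.0f} GB")
# f32: ~280 GB, f16: ~140 GB, q80: ~74 GB, q40: ~39 GB
```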

I wasn't able to get q40 to work with the f32 buffer float type. I get
🚨 Critical error: This version supports only Q40 weights with Q80 sync type
with all the models I currently have available.

q40 with f32 could also be interesting. Does this skip an extra quantization step? I'm not concerned about the extra bandwidth, since I'm using 2x10Gbps per node.
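To make the question concrete, my understanding of what a q80 sync buffer implies is roughly the round trip below (a sketch of Q8_0-style block quantization, not DL's actual implementation); an f32 buffer would presumably skip this and send raw floats at roughly 3.5-4x the bandwidth:

```python
import numpy as np

BLOCK_SIZE = 32  # assumed Q8_0-style layout: one scale per 32 values + 32 int8s

def quantize_q80(x: np.ndarray):
    """Quantize f32 activations to per-block int8 + f32 scale before sending."""
    blocks = x.reshape(-1, BLOCK_SIZE)
    scales = np.abs(blocks).max(axis=1) / 127.0
    scales[scales == 0.0] = 1.0  # avoid division by zero for all-zero blocks
    q = np.round(blocks / scales[:, None]).astype(np.int8)
    return scales.astype(np.float32), q

def dequantize_q80(scales: np.ndarray, q: np.ndarray) -> np.ndarray:
    """Reverse step on the receiving node."""
    return (q.astype(np.float32) * scales[:, None]).reshape(-1)

x = np.random.randn(4096).astype(np.float32)
scales, q = quantize_q80(x)
print(np.abs(dequantize_q80(scales, q) - x).max())  # small per-block error
```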

Are there plans to get at least 8-bit precision working (again)?

I wasn't able to get q40 to work with f32 buffer float type.

You're right, I've updated the previous comment. The previous version of DL supported q40 weights with an f32 buffer, but the current version is missing some of the ops needed for that mode. So for now, DL supports only the two combinations listed above.

Are there plans to get at least 8-bit precision working (again)?

Yes, but it's hard to say when. Currently the priority is Vulkan support.

Alright, thanks for clearing things up. In the meantime, I think it's worth pointing this out in https://github.com/b4rtaz/distributed-llama?tab=readme-ov-file#-known-limitations. I can make a PR later for that.

The README file has been updated, so for now, I'm closing this issue.