b4rtaz/distributed-llama

q80 and f16 models fail with Critical error: Unsupported ...


While conversion with convert-hf.py seems to work, models in q80 and f16 formats cannot be loaded. Here are the combinations I tried with Llama-3.3-70B-Instruct:

| quant | buffer-float-type | error |
| --- | --- | --- |
| q80 | q40 | Critical error: Unsupported op quant: F_32/F_UNK/F_Q40 |
| q80 | q80 | Critical error: Unsupported CPU op code: MATMUL, quant: Q80_Q80_F32, op name: block_matmul_q |
| q80 | f16 | Critical error: Unsupported op quant: F_32/F_UNK/F_16 |
| q80 | f32 | Critical error: Unsupported op quant: F_32/F_Q80/F_32 |
| f16 | q40 | Critical error: Unsupported op quant: F_32/F_UNK/F_Q40 |
| f16 | q80 | Critical error: Unsupported op quant: F_Q80/F_16/F_32 |
| f16 | f16 | Critical error: Unsupported op quant: F_32/F_UNK/F_16 |
| f16 | f32 | Critical error: Unsupported op quant: F_32/F_16/F_32 |

I'm mostly interested in q80 models with f16 or higher precision for synchronization. With llama.cpp, 8-bit quantization usually yields very high performance (only slightly slower than 4-bit) without the sometimes obvious model degradation that 4-bit quantization can cause.

Am I doing something wrong, or is support currently missing?

Hello @lemmi,

currently DL supports:

  • f32 weights & f32 buffer float type
  • q40 weights & q80 buffer float type
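To illustrate why the other combinations fail at load time (this is only a sketch of the idea with hypothetical names, not DL's actual source): op kernels exist only for specific weight/buffer quant pairs, and loading aborts when no kernel is registered for the requested pair.

```python
# Sketch only (hypothetical names, not distributed-llama's real code):
# kernels exist only for certain (weights type, buffer float type) pairs,
# so any other pair hits an "Unsupported op" error during model load.
SUPPORTED_PAIRS = {
    ("f32", "f32"),  # f32 weights + f32 buffer
    ("q40", "q80"),  # q40 weights + q80 buffer
}

def assert_supported(weights_type: str, buffer_type: str) -> None:
    if (weights_type, buffer_type) not in SUPPORTED_PAIRS:
        raise RuntimeError(
            f"Unsupported op quant: weights={weights_type}, buffer={buffer_type}")

assert_supported("q40", "q80")  # ok
assert_supported("q80", "f16")  # raises, like the errors in the table above
```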

f32 is rather useless for the most part: very large model sizes and worse performance, since inference is clearly memory-bound at that point.
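For rough scale, here's my own back-of-the-envelope estimate (the bits-per-weight values for q80/q40 are assumptions based on GGUF-like block layouts, so DL's exact sizes may differ):

```python
# Back-of-the-envelope weight sizes for a 70B-parameter model.
# The q80/q40 bits-per-weight values are assumptions (GGUF-like block layouts).
params = 70e9
for name, bits_per_weight in [("f32", 32), ("f16", 16), ("q80", 8.5), ("q40", 4.5)]:
    print(f"{name}: ~{params * bits_per_weight / 8 / 1e9:.0f} GB")
# f32: ~280 GB, f16: ~140 GB, q80: ~74 GB, q40: ~39 GB
```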

I wasn't able to get q40 to work with the f32 buffer float type. I get
🚨 Critical error: This version supports only Q40 weights with Q80 sync type
with all the models I currently have available.

q40 with f32 could also be interesting. Does this skip an extra quantization step? I'm not concerned about the extra bandwidth, since I'm using 2x10Gbps per node.
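To make the question concrete, my understanding of what a q80 sync buffer implies is roughly the round trip below (a sketch of Q8_0-style block quantization, not DL's actual implementation); an f32 buffer would presumably skip this and send raw floats at roughly 3.5-4x the bandwidth:

```python
import numpy as np

BLOCK_SIZE = 32  # assumed Q8_0-style layout: one scale per 32 values + 32 int8s

def quantize_q80(x: np.ndarray):
    """Quantize f32 activations to per-block int8 + f32 scale before sending."""
    blocks = x.reshape(-1, BLOCK_SIZE)
    scales = np.abs(blocks).max(axis=1) / 127.0
    scales[scales == 0.0] = 1.0  # avoid division by zero for all-zero blocks
    q = np.round(blocks / scales[:, None]).astype(np.int8)
    return scales.astype(np.float32), q

def dequantize_q80(scales: np.ndarray, q: np.ndarray) -> np.ndarray:
    """Reverse step on the receiving node."""
    return (q.astype(np.float32) * scales[:, None]).reshape(-1)

x = np.random.randn(4096).astype(np.float32)
scales, q = quantize_q80(x)
print(np.abs(dequantize_q80(scales, q) - x).max())  # small per-block error
```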

Are there plans to get at least 8-bit precision working (again)?

I wasn't able to get q40 to work with f32 buffer float type.

You're right, I've updated the previous comment. The previous version of DL supported q40 weights with an f32 buffer, but the current version is missing some of the ops needed for that mode. So for now, DL supports only the two combinations listed above.

Are there plans to get at least 8-bit precision working (again)?

Yes, but it's hard to say when. Currently the priority is Vulkan support.

Alright, thanks for clearing things up. In the meantime, I think it's worth pointing this out in https://github.com/b4rtaz/distributed-llama?tab=readme-ov-file#-known-limitations. I can make a PR later for that.

The README file has been updated, so for now, I'm closing this issue.