ggerganov/llama.cpp

bf16 problem

Zibri opened this issue · 1 comment

Zibri commented

I have a model that was converted from the original weights to bf16...
now I want to run some quantization tests on it, but quantize says:

cannot dequantize/convert tensor type bf16

I don't understand why, since bf16 and f16 are not that different...

If you first upcast to fp32 it will probably work, since bf16 is just fp32 with the lower 16 precision bits truncated:

https://en.m.wikipedia.org/wiki/Bfloat16_floating-point_format

Upcasting just sets those extra bits to zero, and then quantize should work on the fp32 model (I think).
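
For anyone who wants to try it, here is a minimal standalone sketch of that bit-level upcast (the standard bf16 → fp32 widening; this is not code from llama.cpp's quantize tool, just an illustration of the idea):

```c
#include <stdint.h>
#include <string.h>
#include <stdio.h>

// bf16 keeps the sign, the 8 exponent bits, and the top 7 mantissa bits
// of an fp32 value. To upcast, place the 16 bf16 bits in the high half
// of a 32-bit word; the 16 mantissa bits that bf16 dropped become zeros.
static float bf16_to_fp32(uint16_t bf16_bits) {
    uint32_t fp32_bits = (uint32_t) bf16_bits << 16;
    float out;
    memcpy(&out, &fp32_bits, sizeof(out)); // bit-exact reinterpretation
    return out;
}

int main(void) {
    // 0x3FC0 is bf16 for 1.5: sign 0, exponent 0x7F, mantissa 0b1000000
    printf("%f\n", bf16_to_fp32(0x3FC0)); // prints 1.500000
    return 0;
}
```

Because the dropped bits are simply filled with zeros, the upcast is exact: every bf16 value maps to the fp32 value it was truncated from, so quantizing the resulting fp32 model loses nothing relative to the bf16 one.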