'nf4' compute datatype?

Question

dorsa-zeinali opened this issue 2 months ago · 1 comment

Feature request

In the quantization procedure for QLoRA, there is the 'nf4' storage datatype and the compute datatype (bfloat16 in the paper, which is the original datatype; please refer to the image). The stored values are then dequantized to the compute datatype for inference or for calculating the backward pass. When I tried using int8 as the compute datatype, matrix multiplication threw an error because it is not supported for this datatype. I have not tried inference with qint8(). Is it possible to make 'nf4' an available computation datatype, and have the relevant functions handle it?
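To make the storage/compute split concrete, here is a minimal pure-Python sketch of the flow described above: weights are stored as 4-bit codes plus one scale per block, then dequantized before any arithmetic. The helper names (`quantize_block`, `dequantize_block`, `linear`) are hypothetical and not part of any library, the codebook is a rounded approximation of the NF4 levels, and plain Python floats stand in for the bf16 compute datatype.

```python
# Illustrative 16-level codebook: rounded approximations of the NF4 levels.
# Real implementations pack two 4-bit codes per byte; here codes are plain ints.
NF4_LEVELS = [-1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0,
              0.0796, 0.1609, 0.2461, 0.3379, 0.4407, 0.5626, 0.7230, 1.0]


def quantize_block(weights):
    """Storage step: map each weight to the index of the nearest codebook level."""
    absmax = max(abs(w) for w in weights) or 1.0  # per-block scale; avoid dividing by 0
    codes = [
        min(range(16), key=lambda i: abs(NF4_LEVELS[i] - w / absmax))
        for w in weights
    ]
    return codes, absmax


def dequantize_block(codes, absmax):
    """Compute step, part 1: expand the 4-bit codes back to floats."""
    return [NF4_LEVELS[c] * absmax for c in codes]


def linear(x, codes, absmax):
    """Compute step, part 2: the dot product runs on dequantized floats, not on codes."""
    w = dequantize_block(codes, absmax)
    return sum(xi * wi for xi, wi in zip(x, w))


weights = [0.5, -0.25, 0.0, 0.1]
codes, scale = quantize_block(weights)
restored = dequantize_block(codes, scale)
y = linear([1.0, 1.0, 1.0, 1.0], codes, scale)
```

In this sketch the codes are only indices into a non-uniform lookup table, so multiplying them directly would not give a meaningful product. That is one reason the arithmetic is done after dequantization.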

Motivation

Dequantizing values to perform calculations, and storing the results and updates in full precision (even though in QLoRA only a small set of adapter weights is updated), is still inefficient and often infeasible, especially for hardware on edge devices. Research towards performing calculations accurately with the weights still in 4 bits would be a desirable improvement.

Your contribution

I can try to submit a PR for this. I would just need some guidance in the right direction to help me get started.

Answer 1 · 2024-08-23T15:13:00.000Z

Hi,
For nf4 quantization we only support computation with fp32, fp16, or bf16. We also do not quantize the activations.

Can you clarify what you mean by edge devices and what the goal is?