w3c/machine-learning-workshop

OperandType of gemm / matmul return

Closed this issue · 4 comments

kpu commented

The spec says gemm returns "an Operand" (and the same thing for matmul).

If both arguments are tensor-quant8-asymm, what is the OperandType of the return? I can see use cases for tensor-int32, which is how it will actually be generated by existing hardware; tensor-quant8-asymm, for a fully quantized model; or even tensor-float32, for people who have only partly quantized their model.
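
For concreteness, here is a minimal sketch (TypeScript, function name hypothetical, not spec text) of why the raw result is naturally int32 on existing hardware:

```ts
// Illustrative only: why existing hardware naturally produces an int32
// result from an int8 x int8 matmul. Each product fits in 16 bits, but
// summing over the reduction dimension quickly exceeds the int8/int16
// range, so accumulation happens in 32 bits.
function dotInt8(a: Int8Array, b: Int8Array): number {
  let acc = 0;
  for (let i = 0; i < a.length; i++) {
    acc += a[i] * b[i]; // each product lies in [-16256, 16384]
  }
  return acc | 0; // 32-bit accumulator, before any requantization
}
```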

This matters because the spec doesn't appear to have, e.g., a requantization operator to convert int32 to int8, and in any case one would need the ability to set the output scaling factor, which is typically determined by running the model in advance to measure an appropriate value.
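
A rough sketch of that missing requantization step (illustrative TypeScript, not a proposed API; all parameter names are assumptions, and the accumulator is assumed to already have zero-point corrections folded in):

```ts
// Hypothetical requantization of an int32 matmul accumulator back to
// 8-bit asymmetric quantization, following the affine scheme
// real = scale * (q - zeroPoint). Illustrative only.
function requantize(
  acc: Int32Array,      // raw int32 accumulator from the quantized matmul
  scaleA: number,       // scale of input A
  scaleB: number,       // scale of input B
  scaleOut: number,     // output scale, chosen in advance via calibration
  zeroPointOut: number  // output zero point
): Uint8Array {
  const multiplier = (scaleA * scaleB) / scaleOut;
  const out = new Uint8Array(acc.length);
  for (let i = 0; i < acc.length; i++) {
    const q = Math.round(acc[i] * multiplier) + zeroPointOut;
    out[i] = Math.min(255, Math.max(0, q)); // clamp to the uint8 range
  }
  return out;
}
```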

Thanks for your comment. To ensure this detailed spec feedback is addressed appropriately, I've transferred the issue to the WebNN API specification repo where the API design work happens:
webmachinelearning/webnn#84

@kpu this issue has previously been discussed in webmachinelearning/webnn#44. I will be refactoring the quantization-related data out of the OperandDescriptor type as we incorporate aspects of the quantization work into the operator API.

kpu commented

@wchao1115 The issue you referenced, webmachinelearning/webnn#44, is about how the quantization scaling factor and zero point should be included in OperandDescriptor.

As the title of this issue says, this is about the OperandType of the return value from matmul. Should multiplying int8 by int8 return float32, int32, or int8 with an associated scaling factor?

This has nothing to do with how the scaling factor is encoded in OperandDescriptor (or with your suggestion that it not be).
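
To make the distinction concrete, a hedged sketch (TypeScript; the scale/zeroPoint field names mirror what webmachinelearning/webnn#44 discusses and are not settled):

```ts
// webmachinelearning/webnn#44: where do quantization parameters live?
// One option is on the operand descriptor itself (names illustrative).
interface QuantizedOperandDescriptor {
  type: 'tensor-quant8-asymm';
  dimensions: number[];
  scale: number;
  zeroPoint: number;
}

// This issue asks a different question: given two operands described as
// above, what OperandType does the matmul result carry?
type MatmulResultType =
  | 'tensor-int32'         // raw accumulator, as hardware produces it
  | 'tensor-quant8-asymm'  // requantized, needs an output scale/zeroPoint
  | 'tensor-float32';      // dequantized, for partly quantized models
```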

@kpu you are right that they are not the same issue. I only meant to point out that the issue around how to properly support quantization is not fully resolved, and that #44 is related to that whole conversation. I didn't mean to suggest that they are the same issue.