kssteven418/I-BERT

Can the CPU be used for inference?

luoling1993 opened this issue · 1 comment

Excellent work!

Can I-BERT run on the CPU at inference time?
And how much faster is it than the baseline?

Thanks for your interest!
I should first mention that this PyTorch implementation of I-BERT only searches for the integer parameters (i.e., it performs quantization-aware training) that minimize accuracy degradation relative to the full-precision counterpart.
As far as I know, PyTorch does not support true integer operations (apart from its own quantization library, whose functionality is quite limited), so the current PyTorch implementation does not achieve any latency reduction on real hardware by itself.
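To illustrate why, here is a minimal sketch (not I-BERT's actual code) of the simulated "fake" quantization that quantization-aware training relies on; the `fake_quantize` helper and the tensor shapes below are hypothetical:

```python
import torch

def fake_quantize(w: torch.Tensor, num_bits: int = 8) -> torch.Tensor:
    """Simulate symmetric integer quantization in floating point.

    The weight is rounded to an integer grid and immediately scaled
    back, so downstream matmuls still run in fp32 -- hence no speedup.
    """
    qmax = 2 ** (num_bits - 1) - 1                      # e.g. 127 for int8
    scale = w.abs().max() / qmax                        # symmetric scale
    w_int = torch.round(w / scale).clamp(-qmax, qmax)   # snap to int grid
    return w_int * scale                                # dequantize to fp32

w = torch.randn(768, 768)
w_q = fake_quantize(w)
# Both operands are still fp32, so this matmul uses the fp32 units:
y = torch.randn(1, 768) @ w_q
```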
To deploy I-BERT on a GPU or CPU and obtain a real speedup, you additionally need to export the integer parameters obtained from this implementation, along with the model architecture, to a framework that supports deployment on integer processing units; TVM and TensorRT are two such examples.
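For example, a common route is to export the trained model to ONNX and hand the graph to such a framework. The sketch below assumes a generic PyTorch module; `model`, the vocabulary size, and the sequence length are placeholders, and whether a given checkpoint exports cleanly depends on its implementation:

```python
import torch

# `model` stands for a trained I-BERT checkpoint with its learned
# integer parameters; the token-id shapes are placeholders.
model.eval()
dummy_input = torch.randint(0, 30000, (1, 128))  # (batch, seq_len) ids
torch.onnx.export(
    model,
    (dummy_input,),
    "ibert.onnx",
    input_names=["input_ids"],
    output_names=["logits"],
    opset_version=13,
)
```

The resulting `ibert.onnx` graph can then be built into an INT8 engine (e.g., with TensorRT's `trtexec --onnx=ibert.onnx --int8`) or compiled with TVM, where the integer kernels actually run on integer hardware.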

Hope this answers your question!