xuduo18 opened this issue 2 years ago · 1 comment
I have also tried the 8-bit option, but it made no difference.
Token generation is slow and CPU usage is high (>80%). GPU usage rises too but always stays below 20%, so the workload seems to be CPU-bound rather than GPU-bound.
Does inference run on the GPU by default?
This appears to be a problem with int8. In our tests, it is indeed slower than fp16. We will investigate.
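For anyone who wants to reproduce the int8 vs fp16 comparison on their own machine, a minimal timing sketch might look like the following. It is not this project's benchmark: `tokens_per_second`, `compare`, and the `fake_*` functions are hypothetical names introduced here, and the sketch assumes you can wrap each configuration in a callable that takes a token budget and returns the number of tokens it actually produced.

```python
import time
from typing import Callable, Dict


def tokens_per_second(generate: Callable[[int], int], n_tokens: int, repeats: int = 3) -> float:
    """Return the best observed throughput of `generate`, in tokens per second.

    `generate(n)` is a hypothetical callable (an assumption of this sketch):
    it should produce up to `n` tokens and return how many it produced.
    """
    best = 0.0
    for _ in range(repeats):
        start = time.perf_counter()
        produced = generate(n_tokens)
        elapsed = time.perf_counter() - start
        if elapsed > 0:
            best = max(best, produced / elapsed)
    return best


def compare(configs: Dict[str, Callable[[int], int]], n_tokens: int = 64) -> Dict[str, float]:
    """Measure every named configuration with the same token budget."""
    return {name: tokens_per_second(fn, n_tokens) for name, fn in configs.items()}


if __name__ == "__main__":
    # Stand-ins that only sleep; replace them with wrappers around real
    # fp16 and int8 generation calls.
    def fake_fp16(n: int) -> int:
        time.sleep(0.001 * n)
        return n

    def fake_int8(n: int) -> int:
        time.sleep(0.003 * n)
        return n

    for name, tps in compare({"fp16": fake_fp16, "int8": fake_int8}, n_tokens=20).items():
        print(f"{name}: {tps:.1f} tokens/s")
```

Taking the best of several runs reduces the effect of one-off warm-up costs. Watching CPU and GPU utilization while each configuration runs should show whether the slower one is the one that keeps the CPU busy, as described above.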