Inference super slow
SinanAkkoyun opened this issue · 4 comments
Hello, I only get maybe one token/second, whereas I get 30 tokens/second with the default PyTorch implementation (running on an H100).
I guess you can try inference on the GPU after making a few modifications to the code:

1. In llama/memory_pool.py, create the session with the CUDA provider:
   self.sess = ort.InferenceSession(onnxfile, providers=['CUDAExecutionProvider'])
2. Find all the files that `import onnxruntime` and add `import torch` before it.
3. Also remember to uninstall onnxruntime and install onnxruntime-gpu instead.
Note: it takes 34 GB of GPU memory for me to load the model, but inference is fast.
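Roughly, the change looks like this. This is only a sketch: the class wrapper and the provider check are mine for illustration, and only the InferenceSession line comes from memory_pool.py itself:

```python
# llama/memory_pool.py -- sketch of the suggested change (class layout is illustrative).
import torch            # import torch before onnxruntime so its CUDA libraries are loaded
import onnxruntime as ort


class MemoryPool:
    def __init__(self, onnxfile: str):
        # Sanity check: onnxruntime-gpu should expose the CUDA provider.
        assert 'CUDAExecutionProvider' in ort.get_available_providers(), \
            "install onnxruntime-gpu, not onnxruntime"
        # Run the session on the GPU instead of the default CPU provider.
        self.sess = ort.InferenceSession(
            onnxfile,
            providers=['CUDAExecutionProvider'],
        )
```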
I am struggling to get it to run, did you already make it run? Could you please tell me how many tokens/second you get out of the 7b or 13b model? Thank you so much!
I ran the 7B model downloaded from the link given in the repo. About 0.2 tokens/s on CPU and 20 tokens/s on GPU.
Every 1B parameters needs about 4 GB of memory in float32 format. It is really hard to run inference quickly on a single CPU.
If you want performance on a mobile/laptop CPU, try the InferLLM repo: https://github.com/MegEngine/InferLLM
For converting the model to run on NPU/DSP, use llama.onnx.
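For reference, a quick back-of-the-envelope for those memory figures, counting weights only (the KV cache, activations, and runtime buffers come on top, which is presumably why loading takes more in practice). The helper function is just for illustration:

```python
# Rough weight-memory estimate: parameters * bytes per parameter, weights only.
def weight_memory_gib(n_params_billion: float, bytes_per_param: int = 4) -> float:
    """float32 = 4 bytes/param, float16 = 2, int8 = 1."""
    return n_params_billion * 1e9 * bytes_per_param / 1024**3


print(f"1B fp32: {weight_memory_gib(1):.1f} GiB")     # ~3.7 GiB -> the '4 GB' above
print(f"7B fp32: {weight_memory_gib(7):.1f} GiB")     # ~26 GiB
print(f"7B fp16: {weight_memory_gib(7, 2):.1f} GiB")  # ~13 GiB
```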