tpoisonooo/llama.onnx

Inference super slow

SinanAkkoyun opened this issue · 4 comments

Hello, I only get maybe one token/second, whereas I get 30 tokens/second with the default PyTorch implementation (running on an H100).

I guess you can try inference on the GPU after making some modifications to the code:

llama/memory_pool.py:        self.sess = ort.InferenceSession(onnxfile, providers=['CUDAExecutionProvider'])

Find all the files that import onnxruntime and add import torch before it.
Also remember to uninstall onnxruntime and install onnxruntime-gpu instead.
Note: it takes 34 GB of GPU memory for me to load the model, but inference is fast.
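
For reference, a minimal sketch of the modified session setup, assuming onnxruntime-gpu is installed; the filename below is only a placeholder for whatever llama/memory_pool.py actually loads:

```python
import torch            # imported before onnxruntime, as suggested above, so CUDA libraries resolve
import onnxruntime as ort

def load_session(onnxfile: str) -> ort.InferenceSession:
    # GPU first, with a CPU fallback in case CUDA is unavailable
    providers = ['CUDAExecutionProvider', 'CPUExecutionProvider']
    return ort.InferenceSession(onnxfile, providers=providers)

sess = load_session('decoder.onnx')      # placeholder filename
print(sess.get_providers())              # should list CUDAExecutionProvider with the GPU build
```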

I am struggling to get it to run. Did you already get it running? Could you please tell me how many tokens/second you get out of the 7B or 13B model? Thank you so much!

I ran the 7B model downloaded from the link given in the repo. About 0.2 tokens/s on CPU and 20 on GPU.

1B parameters need 4 GB of memory in float32 format. It is really hard to run inference quickly on a single CPU.
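
For context, the 4 GB figure is just the parameter count times 4 bytes per float32 value; a quick back-of-the-envelope check:

```python
# Weights-only estimate; activations and the KV cache need additional memory.
params = 1_000_000_000            # ~1B parameters
bytes_per_param = 4               # float32
print(f"{params * bytes_per_param / 1024**3:.1f} GiB")  # ~3.7 GiB, i.e. roughly 4 GB
```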

If you want performance on a mobile/laptop CPU, try the InferLLM repo: https://github.com/MegEngine/InferLLM
For model conversion to NPU/DSP, use llama.onnx.