Inference super slow
SinanAkkoyun opened this issue · 4 comments
Hello, I only get maybe one token/second, whereas I get 30 tokens/second with the default PyTorch implementation (running on an H100).
I guess you can try inference on the GPU after making a few modifications to the code:

1. In llama/memory_pool.py, create the session with the CUDA provider:
   self.sess = ort.InferenceSession(onnxfile, providers=['CUDAExecutionProvider'])
2. Find all the files that `import onnxruntime` and add `import torch` before it.
3. Also remember to uninstall onnxruntime and install onnxruntime-gpu instead.
Note: it takes 34 GB of GPU memory for me to load the model, but inference is fast.
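Roughly, the change looks like this. This is only a sketch: the class wrapper and the provider check are mine for illustration, and only the InferenceSession line comes from memory_pool.py itself:

```python
# llama/memory_pool.py -- sketch of the suggested change (class layout is illustrative).
import torch            # import torch before onnxruntime so its CUDA libraries are loaded
import onnxruntime as ort


class MemoryPool:
    def __init__(self, onnxfile: str):
        # Sanity check: onnxruntime-gpu should expose the CUDA provider.
        assert 'CUDAExecutionProvider' in ort.get_available_providers(), \
            "install onnxruntime-gpu, not onnxruntime"
        # Run the session on the GPU instead of the default CPU provider.
        self.sess = ort.InferenceSession(
            onnxfile,
            providers=['CUDAExecutionProvider'],
        )
```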
I am struggling to get it to run, did you already make it run? Could you please tell me how many tokens/second you get out of the 7b or 13b model? Thank you so much!
I ran the 7B model downloaded from the link given in the repo. About 0.2 tokens/s on CPU and 20 tokens/s on GPU.
Every 1B parameters needs about 4 GB of memory in float32 format. It is really hard to run inference quickly on a single CPU.
If you want performance on a mobile/laptop CPU, try the InferLLM repo: https://github.com/MegEngine/InferLLM
For converting the model to run on NPU/DSP, use llama.onnx.
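For reference, a quick back-of-the-envelope for those memory figures, counting weights only (the KV cache, activations, and runtime buffers come on top, which is presumably why loading takes more in practice). The helper function is just for illustration:

```python
# Rough weight-memory estimate: parameters * bytes per parameter, weights only.
def weight_memory_gib(n_params_billion: float, bytes_per_param: int = 4) -> float:
    """float32 = 4 bytes/param, float16 = 2, int8 = 1."""
    return n_params_billion * 1e9 * bytes_per_param / 1024**3


print(f"1B fp32: {weight_memory_gib(1):.1f} GiB")     # ~3.7 GiB -> the '4 GB' above
print(f"7B fp32: {weight_memory_gib(7):.1f} GiB")     # ~26 GiB
print(f"7B fp16: {weight_memory_gib(7, 2):.1f} GiB")  # ~13 GiB
```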