baileytec-labs/llama-on-lambda

Use ARM64 to increase performance

Opened this issue · 6 comments

I did some experiments and it looks like arm64 has better performance.

x86_64: >300ms/token
arm64: ~250ms/token
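
Roughly speaking, per-token latency here just means total generation time divided by the number of tokens produced. A minimal sketch of that measurement with llama-cpp-python (the model path and prompt are placeholders, not the exact setup used for the numbers above):

```python
import time

from llama_cpp import Llama  # pip install llama-cpp-python

# Placeholder model path; any Q4_0-quantized 7B model shows the same trend.
llm = Llama(model_path="./models/7B/ggml-model-q4_0.bin")

start = time.time()
out = llm("Q: Name the planets in the solar system. A:", max_tokens=128)
elapsed = time.time() - start

n_tokens = out["usage"]["completion_tokens"]
print(f"{1000 * elapsed / n_tokens:.1f} ms/token over {n_tokens} tokens")
```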

This is awesome. On top of that, ARM Lambda functions are billed at a lower rate. If it's cheaper per GB-second and burns fewer GB-seconds to get the same job done, that's a massive win. I'll update the repo with this information! Thanks!
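
For a rough sense of the cost side, a back-of-the-envelope sketch; the per-GB-second rates below are approximate us-east-1 Lambda prices and the memory size is just an example, so treat the absolute dollar figures loosely:

```python
# Back-of-the-envelope Lambda cost per 1,000 generated tokens, ignoring the
# per-request charge and prompt processing. Prices are approximate us-east-1
# rates per GB-second and may have changed; 8 GB of memory is just an example.
MEM_GB = 8.0
PRICE_X86_GB_S = 0.0000166667
PRICE_ARM_GB_S = 0.0000133334
TOKENS = 1000

def generation_cost(ms_per_token: float, price_per_gb_s: float) -> float:
    duration_s = ms_per_token * TOKENS / 1000.0
    return duration_s * MEM_GB * price_per_gb_s

print(f"x86_64 @ 300 ms/token: ${generation_cost(300, PRICE_X86_GB_S):.4f} per 1k tokens")
print(f"arm64  @ 250 ms/token: ${generation_cost(250, PRICE_ARM_GB_S):.4f} per 1k tokens")
```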

Hi there, I'm interested in exploring llama.cpp on ARM. It looks like we'd have to update the base images in the containers, and then...? I think I might be missing something here, as the actual llama.cpp build also needs to be switched over to arm64 somehow.

Any help? The x86 version runs at well over 300 ms per token for me, almost 10x as slow, which is pretty unusable. I can report back here if we make some progress and it turns into something useful in the end.
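
In case it helps, the switch is roughly two pieces: point the Dockerfile at an arm64 (or multi-arch) Lambda base image so llama.cpp gets compiled for arm64 inside the image build, and tell CDK to build and run the function as arm64. A sketch of the CDK side, assuming a CDK v2 Python stack with a DockerImageFunction; the construct name, directory, memory, and timeout are illustrative, not the repo's actual values:

```python
from aws_cdk import Duration, Stack, aws_lambda as _lambda
from aws_cdk.aws_ecr_assets import Platform
from constructs import Construct

class LlamaOnLambdaStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        # Build the container image for arm64 and run the function on Graviton.
        _lambda.DockerImageFunction(
            self, "LlamaFunction",
            code=_lambda.DockerImageCode.from_image_asset(
                "./docker",                     # hypothetical Dockerfile directory
                platform=Platform.LINUX_ARM64,  # cross-builds the image for arm64
            ),
            architecture=_lambda.Architecture.ARM_64,
            memory_size=8192,
            timeout=Duration.minutes(5),
        )
```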

I can get about 150ms/token generation speed with a 7B Q4_0 quantized model on x86_64 using the Golang bindings with AVX2 enabled.

The latest llama.cpp seems to have issues on Graviton processors. It won't even compile.

Oh, that is strange. I'll have to take a look at that. Is that a particular GitHub project? I've heard rumors about the Golang bindings for a little while.

I've been wondering if it's partially because I'm compiling it on my M1 Mac as part of the Docker build process. A lot of my local builds have been... painfully slow for some inane/bizarre reason :'((((

@tysam-code Check out the llama-cpp-python container and definitions/cdk code in this repo
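
For anyone else following along: as I understand it, that container is essentially a Lambda handler wrapping llama-cpp-python. A stripped-down sketch of the general shape (not the repo's actual code; the paths and event fields are made up):

```python
# Illustrative handler only; the repo's real container handles model download,
# prompt formatting, streaming, etc. Paths and event fields are placeholders.
import json

from llama_cpp import Llama

# Load once at import time so warm invocations reuse the model in memory.
llm = Llama(model_path="/opt/model/ggml-model-q4_0.bin")

def handler(event, context):
    body = json.loads(event.get("body") or "{}")
    out = llm(body.get("prompt", ""), max_tokens=int(body.get("max_tokens", 256)))
    return {
        "statusCode": 200,
        "body": json.dumps({"text": out["choices"][0]["text"]}),
    }
```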

Quantized models absolutely fly on Graviton, as per my recent research.

Okay, this looks pretty promising, and the table is highly appreciated! Good work, many thanks. :)

I found what may be a small error in the code for the GPU models. They seem to be slower than they should be, and I believe it's because llama-cpp-python builds its binaries against whatever machine is installing them, which means that on the GPU instances it's likely just using the attached CPU cores (usually reserved for dataloading, preprocessing, and other misc tasks).

As a rough outline, a single 2070 apparently ran a non-quantized 7B model in ~15 seconds or so (https://kubito.dev/posts/llama-nvidia-3070-ti-benchmarks/), and the 2070 has fewer tensor-core TFLOPS than the T4 does. Thumb-in-the-wind, I'd expect a T4 to do 512 tokens in maybe 12 seconds or less, though I'm still calibrating my hardware-based expectations.

I don't know a good solution for this, beyond potentially pulling precompiled binaries and swapping them in for whatever llama-cpp-python is doing under the hood. That's assuming it's even that modular (and I hope it is! ;P If it's just 'dumbly' calling the compiled binary under the hood, that's probably the best-case scenario for us here. :)))) )
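
One possibly simpler route than swapping in precompiled binaries: force llama-cpp-python to compile with cuBLAS inside the GPU image build, and then offload layers explicitly when loading the model. A sketch, assuming the build flags current at the time of writing (check the llama-cpp-python docs, since these change):

```python
# The key is that llama-cpp-python must be compiled with cuBLAS in the GPU
# image itself, not against whatever machine happens to run `pip install`.
# At the time of writing that build step looks roughly like (assumption):
#   CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python
# Then the model has to be told to offload layers, or it stays on the CPU:
from llama_cpp import Llama

llm = Llama(
    model_path="/opt/model/ggml-model-q4_0.bin",  # placeholder path
    n_gpu_layers=32,  # offload transformer layers to the T4; 0 = CPU only
)
```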