blog-optimize-arm64-llama.cpp

Pushing Graviton instances to their limits

Introduction

One day, I encountered an article.

Run a Large Language model (LLM) chatbot on Arm servers - (learn.arm.com)

The article seems to be a good introduction for users, but I immediately noticed two different optimization strategies:

  1. Build with the optimal SIMD flags
  2. Try using a (supposedly) optimal linear algebra library - ArmPL

Baseline

Before getting deep into it, let's establish a baseline for comparison. For convenience's sake, let's refer to this benchmark thread - ggerganov/llama.cpp#4167. Thus we will use commit 8e672efe and the same c7g.xlarge instance the Arm article suggested.
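A rough sketch of how the baseline build looks - the model file name is an assumption (the benchmark thread runs 7B models at Q4_0), so substitute whatever model you downloaded:

```shell
# Baseline build sketch: pin the commit from the benchmark thread,
# then build with defaults (NEON only, no SVE, no external BLAS).
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
git checkout 8e672efe
make -j"$(nproc)"

# llama-bench reports prompt-processing and text-generation speed in tokens/s.
# The model path below is a placeholder.
./llama-bench -m models/llama-7b-q4_0.gguf
```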

Hardware support

Graviton3 supports a hardware-level instruction set extension called SVE (Scalable Vector Extension). It enables software to process multiple data elements at once, a technique called SIMD (wikipedia.org). The Intel equivalents would be AVX2 and AVX-512. If you have ever heard of the ARM-based supercomputer Fugaku (wikipedia.org), SVE could be one reason it held the #1 spot on the TOP500 list for two years.
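You can quickly verify whether the kernel reports SVE on a given instance - on aarch64 Linux, supported extensions show up in the Features line of /proc/cpuinfo:

```shell
# On Graviton3 (c7g) this should print "SVE available";
# on Graviton2 (c6g) the "sve" flag is absent.
if grep -q -w sve /proc/cpuinfo; then
    echo "SVE available"
else
    echo "SVE not reported"
fi
```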

Looking into the default compilation options in CMakeLists.txt and the Makefile, they only care about the Raspberry Pi series, not the powerful Graviton-series instances. Fortunately, they do include a compilation flag to utilize NEON instructions, which can be considered a predecessor of SVE.
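A sketch of a build that enables SVE, assuming GCC: `-mcpu=neoverse-v1` names Graviton3's core and turns on SVE code generation (a more generic spelling would be `-march=armv8.4-a+sve`). Passing it through the CMake flag variables is one way to do it without editing the build files:

```shell
# Hedged sketch: tell the compiler about the Neoverse V1 core so it
# can emit SVE instructions. These are standard GCC AArch64 options.
cmake -B build \
  -DCMAKE_C_FLAGS="-mcpu=neoverse-v1" \
  -DCMAKE_CXX_FLAGS="-mcpu=neoverse-v1"
cmake --build build -j"$(nproc)"
```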

ArmPL

Besides hardware-level support, llama.cpp can use a BLAS library instead of its own math implementations. BLAS stands for Basic Linear Algebra Subprograms - the name explains itself.
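At this commit, llama.cpp's CMake build exposes a `LLAMA_BLAS` option plus a vendor selector that is passed through to CMake's FindBLAS module. A sketch using OpenBLAS on a Debian/Ubuntu system:

```shell
# Build against an external BLAS instead of llama.cpp's own matmul code.
sudo apt-get install -y libopenblas-dev pkg-config
cmake -B build -DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=OpenBLAS
cmake --build build -j"$(nproc)"
```

Note that BLAS only accelerates the large matrix multiplications (mainly prompt processing); token generation still mostly runs llama.cpp's own kernels.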

So, what are the options for BLAS?

For x86_64 (or amd64) processors, Intel oneMKL (intel.com) is the rule of thumb. It is compatible with most processors, including AMD CPUs such as the Ryzen and Epyc series. As circumstantial evidence, here is the BLAS library PyTorch searches for first: https://github.com/pytorch/pytorch/blob/v2.3.0/cmake/Modules/FindBLAS.cmake#L96-L106

But Intel oneMKL is not compatible with ARM processors. It has been a while since Intel last made ARM processors, having discontinued its StrongARM and XScale lines.

AMD has its own BLAS implementation (and more) within the AOCL suite. It is obviously not compatible with the ARM architecture either.

So what other BLAS libraries could be used?

Here's OpenBLAS. It is actively maintained and available in virtually every OS package manager. BLIS could be an alternative if you are looking for something new. Its performance benchmark (Performance.md) is worth reading: it shows that Intel oneMKL is the absolute best for Intel CPUs. If so, ARM's own math library could be the best for ARM CPUs, right?

Besides the open-source endeavors, ARM does have its own compilers (developer.arm.com) and a math library suite, ArmPL.

So it could be worthwhile to try compiling llama.cpp with ArmPL as the BLAS library for the best performance.
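A sketch of what that build might look like. The install prefix below is a placeholder (ArmPL's actual directory name encodes its version and compiler), and `Arm_mp` is CMake FindBLAS's vendor value for the multithreaded ArmPL variant (supported since CMake 3.18):

```shell
# Hedged sketch: point CMake's FindBLAS at an ArmPL installation.
export ARMPL_DIR=/opt/arm/armpl            # placeholder path; adjust to your install
export LD_LIBRARY_PATH="$ARMPL_DIR/lib:$LD_LIBRARY_PATH"

cmake -B build \
  -DLLAMA_BLAS=ON \
  -DLLAMA_BLAS_VENDOR=Arm_mp \
  -DCMAKE_PREFIX_PATH="$ARMPL_DIR"
cmake --build build -j"$(nproc)"
```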