karpathy/llama2.c

Optimized code for matmul() runs ~3.5× faster (on Mac M1 Max with ARM NEON) ... and even more...

agershun opened this issue · 4 comments

I rewrote the most critical function, matmul(), and it runs about 3.5-4× faster than the original (on a Mac with ARM M1 Max). Maybe it will help someone:

#include <arm_neon.h>

void matmul(float* xout, float* x, float* w, int n, int d) {
    // W (d,n) @ x (n,) -> xout (d,)
    // assumes n is a multiple of 4
    for (int i = 0; i < d; i++) {
        float32x4_t val = vdupq_n_f32(0.0f);

        // process 4 floats per iteration: val += x[j..j+3] * w[i*n+j .. i*n+j+3]
        for (int j = 0; j < n; j += 4) {
            val = vaddq_f32(val, vmulq_f32(vld1q_f32(&x[j]), vld1q_f32(&w[i * n + j])));
        }

        // horizontal sum of the 4 lanes
        xout[i] = vaddvq_f32(val);
    }
}
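
NEON also has a fused multiply-accumulate intrinsic, vfmaq_f32, which folds the multiply and add into one instruction. A small variant using it (a sketch under the same n % 4 == 0 assumption, not benchmarked here):

#include <arm_neon.h>

// same kernel, but with fused multiply-accumulate
void matmul_fma(float* xout, float* x, float* w, int n, int d) {
    for (int i = 0; i < d; i++) {
        float32x4_t val = vdupq_n_f32(0.0f);
        for (int j = 0; j < n; j += 4) {
            // val += x[j..j+3] * w[i*n+j .. i*n+j+3] in a single instruction
            val = vfmaq_f32(val, vld1q_f32(&x[j]), vld1q_f32(&w[i * n + j]));
        }
        xout[i] = vaddvq_f32(val);
    }
}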

Probably a similar approach is possible for x86-64 with SSE (processing 4 floats) or AVX2 (processing 8 floats).
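
For example, an AVX2 version might look roughly like this (untested sketch; assumes n is a multiple of 8 and that FMA is available, i.e. compile with -mavx2 -mfma):

#include <immintrin.h>

void matmul_avx2(float* xout, float* x, float* w, int n, int d) {
    for (int i = 0; i < d; i++) {
        __m256 acc = _mm256_setzero_ps();
        for (int j = 0; j < n; j += 8) {
            // acc += x[j..j+7] * w[i*n+j .. i*n+j+7]
            acc = _mm256_fmadd_ps(_mm256_loadu_ps(&x[j]), _mm256_loadu_ps(&w[i * n + j]), acc);
        }
        // horizontal sum of the 8 lanes
        __m128 lo = _mm256_castps256_ps128(acc);
        __m128 hi = _mm256_extractf128_ps(acc, 1);
        __m128 sum = _mm_add_ps(lo, hi);
        sum = _mm_hadd_ps(sum, sum);
        sum = _mm_hadd_ps(sum, sum);
        xout[i] = _mm_cvtss_f32(sum);
    }
}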

I turned off OpenMP.

The following code for Mac M1 Max, from issue "Use cblas for matrix multiplication" #182, works even slightly faster (4.9× on the 15M network), but on larger networks the NEON method (above) and the CBLAS version (below) give the same performance, about 4× the original matmul():

#include <Accelerate/Accelerate.h>
void matmul(float* xout, float* x, float* w, int n, int d) {
    // W (d,n) @ x (n,) -> xout (d,)
    cblas_sgemv(CblasRowMajor, CblasNoTrans, d, n, 1.0f, w, n, x, 1, 0.0f, xout, 1);
}
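
For reference, on macOS this version has to be linked against the Accelerate framework; something along these lines should work (exact flags may vary with your setup):

clang -O3 run.c -framework Accelerate -o run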

I tested the new function with llama2_7b.bin (a 25 GB file) and it does not give such acceleration (I still get about 1 token per 30 seconds) because of heavy disk usage. Probably, if it were possible to fit the model in 32 GB of memory and not map it from disk, it would run significantly faster.

I rewrote the initialization function to load llama2_7b.bin entirely into memory, then closed all other programs. Now Llama-2 7B runs at about 1 token per 3 seconds on a Mac M1 Max with 32 GB. I will create a repository for the modified code.
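
The change itself is simple; a minimal sketch of the idea (the function name is illustrative, not the actual run.c API): read the whole checkpoint into heap memory instead of mmap'ing the file.

#include <stdio.h>
#include <stdlib.h>

// read the entire checkpoint file into RAM; caller parses the Config header and weights from it
float* load_checkpoint_into_ram(const char* path, long* size_out) {
    FILE* f = fopen(path, "rb");
    if (!f) { fprintf(stderr, "cannot open %s\n", path); exit(1); }
    fseek(f, 0, SEEK_END);
    long size = ftell(f);
    rewind(f);
    float* data = malloc(size);                  // needs enough free RAM for the whole model
    if (!data) { fprintf(stderr, "malloc of %ld bytes failed\n", size); exit(1); }
    if (fread(data, 1, size, f) != (size_t)size) { fprintf(stderr, "read failed\n"); exit(1); }
    fclose(f);
    *size_out = size;
    return data;
}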