runomp on Mac M1 Max is slower than runfast
tairov opened this issue · 10 comments
Recently I did extensive benchmarks of llama2.c ports. I found that the C version in runfast mode (single-threaded) runs faster than the runomp build (multi-threaded):
make runomp CC=/opt/homebrew/opt/llvm/bin/clang; OMP_NUM_THREADS=5 ./run ../models/stories15M.bin -t 0.0 -n 256
...
achieved tok/s: 529.976019
VS
make runfast; ./run ../models/stories15M.bin -t 0.0 -n 256
...
achieved tok/s: 657.738095
Does anyone have insights into why this might be happening?
I recently incorporated multithreading into my Zig port of this project and made some relevant findings. Essentially, the overhead associated with initializing and terminating multiple threads per matrix-vector multiplication can compromise efficiency with smaller 'tinystory' models.
Specifically, with a single M1 Pro performance core, I am able to achieve up to 724.59 tok/s on the 15M model. However, employing 5 threads for the multiplications drops the performance down to 225.971 tok/s. Although the OMP implementation is likely to be more sophisticated, and possibly reuses threads, it appears that it faces similar challenges.
In comparison, applying multithreading on the Llama 2 7B model nearly doubles performance, as the vectors in this case are significantly larger. Consequently, the overhead of thread spawning becomes negligible.
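For context, the hot loop in question is the matrix-vector multiply. Roughly, the OpenMP path in llama2.c looks like this (a simplified sketch from memory; details may differ from the actual run.c):

```c
// W (d,n) @ x (n,) -> xout (d,)
// Each call opens an OpenMP parallel-for over the d output rows.
// For the 15M model d is only a few hundred, so the implicit fork/join
// and the barrier at the end of every call eat into the saved compute.
void matmul(float* xout, float* x, float* w, int n, int d) {
    int i;
    #pragma omp parallel for private(i)
    for (i = 0; i < d; i++) {
        float val = 0.0f;
        for (int j = 0; j < n; j++) {
            val += w[i * n + j] * x[j];
        }
        xout[i] = val;
    }
}
```

Even if the OpenMP runtime keeps its worker threads alive between calls, every call still pays for waking the workers and synchronizing at the end of the loop.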
Hi @clebert, thanks for your comment.
Do you mean 724 tok/s was achieved in single-threaded mode? If so, that's amazing!
I was thinking it somehow spins up threads in the background.
Yes, in single-threaded mode. But that was my best measured run; normally it fluctuates between 680 and 700 tokens per second. I don't know why the variance is so big.
@clebert do you know which Zig features contributed the most to the overall performance? Alignment, SIMD?
According to the extensive benchmark, other llama2 implementations fluctuate as well. That's why it's better to run them in multiple rounds.
The use of @Vector (SIMD) had the biggest effect; without SIMD, you couldn't get anywhere near these results. Aligning the vectors to the cache line, on the other hand, did not have the effect I had hoped for; if anything, it was hardly measurable, although I only measured manually and not very systematically.
I forgot to mention one important optimization: @setFloatMode(.Optimized). It has about the same effect as setting -ffast-math in the C version.
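To illustrate what that buys: strict IEEE semantics forbid reordering a floating-point sum, which blocks SIMD on the reduction inside the dot product. A minimal standalone sketch (hypothetical example, not code from run.c):

```c
// Without -ffast-math the compiler must accumulate `sum` strictly in
// order, which serializes the loop. With -ffast-math (or Zig's
// @setFloatMode(.Optimized)) it may split the sum into independent
// partial sums and keep them in SIMD lanes.
float dot(const float* a, const float* b, int n) {
    float sum = 0.0f;
    for (int i = 0; i < n; i++) {
        sum += a[i] * b[i];
    }
    return sum;
}
```

Comparing `clang -O3 -S dot.c` with `clang -O3 -ffast-math -S dot.c` shows the difference in the generated loop.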
@tairov I have conducted extensive benchmarks with my improved Zig implementation, using an Apple M2 Pro equipped with 12 cores and an Apple M1 Pro equipped with 10 cores.
In summary: the 15M model is fastest in single-threaded mode, while the 42M and 110M models are both fastest with 7 extra threads on the M2 Pro and with 5 extra threads on the M1 Pro.
I noticed that for your benchmarks you seem to have opted for 5 threads on an Apple M1 Max, which, from a CPU perspective, is identical to the M1 Pro 8/2. Have you found that using 5 threads also increases performance with other implementations such as C++ and Mojo?
Hey @clebert, I really appreciate you taking the time to improve llama2.zig. I think the ziglang community & maintainers might get valuable insights from it.
Just curious, what does workers = 0 mean?
Yes, I've gotten the best results with 5 threads for the cpp, mojo & c implementations. Frankly speaking, I haven't used multiple rounds to determine the best thread count; I just ran inference a few times in the CLI.
And thank you for sharing results for different worker counts. I believe it will help me improve my benchmarking methodology as well. Now I'll try reproducing comparisons among the leading llama2 implementations while varying the thread count.
> Hey @clebert, I really appreciate you taking the time to improve llama2.zig. I think the ziglang community & maintainers might get valuable insights from it.

Thank you!

> Just curious, what does workers = 0 mean?
If the count of workers is set to zero, all computations will be performed single-threaded within the main thread. Beginning with a single worker, the matrix-vector multiplication is distributed into additional threads. It is subsequently divided into sections, or chunks of rows, equivalent to the number of workers. Therefore, the performance with a single worker โ or extra thread โ is expected to be the poorest. This is because there are no gains, only the additional overhead of synchronization.
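To make the chunking concrete, here is a rough C sketch of that scheme (the names and the pthread plumbing are illustrative, not code from llama2.zig): the output rows are split into one chunk per worker, each worker runs its chunk on its own thread, and the main thread only waits for them.

```c
#include <pthread.h>

typedef struct {
    const float *w, *x;        // weights (d x n) and input vector (n)
    float *out;                // output vector (d)
    int n, row_begin, row_end; // chunk of output rows [row_begin, row_end)
} Chunk;

static void *run_chunk(void *arg) {
    Chunk *c = (Chunk *)arg;
    for (int i = c->row_begin; i < c->row_end; i++) {
        float val = 0.0f;
        for (int j = 0; j < c->n; j++) val += c->w[i * c->n + j] * c->x[j];
        c->out[i] = val;
    }
    return NULL;
}

void matmul_workers(float *out, const float *x, const float *w,
                    int n, int d, int workers) {
    if (workers == 0) {                      // everything stays on the main thread
        Chunk all = {w, x, out, n, 0, d};
        run_chunk(&all);
        return;
    }
    pthread_t tid[16];                       // assumes workers <= 16
    Chunk chunk[16];
    int rows = (d + workers - 1) / workers;  // rows per chunk, rounded up
    for (int p = 0; p < workers; p++) {
        int begin = p * rows;
        int end = begin + rows < d ? begin + rows : d;
        chunk[p] = (Chunk){w, x, out, n, begin, end};
        pthread_create(&tid[p], NULL, run_chunk, &chunk[p]);
    }
    for (int p = 0; p < workers; p++) pthread_join(tid[p], NULL);
}
```

With workers == 1 the single worker does all the rows while the main thread just blocks in pthread_join, so you pay the synchronization cost without splitting any work, which is the worst case described above. (Creating threads per call is deliberately naive here; a pool of long-lived workers changes the constants but not the chunking.)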
> Yes, I've gotten the best results with 5 threads for the cpp, mojo & c implementations. Frankly speaking, I haven't used multiple rounds to determine the best thread count; I just ran inference a few times in the CLI.
As my tests have shown, both with the 42M model and with the 110M model, the version with 5 additional threads (aka workers) is the fastest on the M1. I'm curious whether it's the same for Mojo; then this seems to be the sweet spot, although the type of parallelization is certainly completely different...
I don't think OpenMP is creating and destroying threads often; gdb shows several threads created at start and none created after that.
If you run a bigger model, like Llama 2 7B, you will benefit from parallelizing the matmul with OpenMP. The per-call work just isn't big enough to outweigh the threading overhead until you get to a big model.