Welcome to mmperf

mmperf is a single-core GEMM benchmark. This repository benchmarks hand-tuned matrix-multiply (SGEMM) libraries and code-generation stacks on a single thread on one CPU core. The focus is on machine learning workloads, so FP32 or smaller data types and irregular matrix sizes. The goal is to expose high-performance atomic kernels that can then be used to build highly efficient higher-level implementations spanning multiple cores or distributed across systems, where efficient atomic kernels are asynchronously scheduled with overlapping communication (inter-chip, within a system, or across systems).
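
As a rough illustration of what is measured, the sketch below times a naive single-threaded FP32 matmul and reports GFLOP/s (a GEMM performs 2*M*N*K floating-point operations). The shape, kernel, and timing loop are illustrative only and are not the actual mmperf harness.

```cpp
// Minimal single-core SGEMM timing sketch (illustrative, not the mmperf harness).
#include <chrono>
#include <cstdio>
#include <vector>

// Naive reference SGEMM: C = A * B, row-major, FP32, one thread.
static void sgemm_naive(int M, int N, int K,
                        const float* A, const float* B, float* C) {
  for (int i = 0; i < M; ++i)
    for (int j = 0; j < N; ++j) {
      float acc = 0.0f;
      for (int k = 0; k < K; ++k) acc += A[i * K + k] * B[k * N + j];
      C[i * N + j] = acc;
    }
}

int main() {
  const int M = 384, N = 384, K = 512;  // placeholder for an irregular ML shape
  std::vector<float> A(M * K, 1.0f), B(K * N, 1.0f), C(M * N, 0.0f);

  auto t0 = std::chrono::steady_clock::now();
  sgemm_naive(M, N, K, A.data(), B.data(), C.data());
  auto t1 = std::chrono::steady_clock::now();

  double secs = std::chrono::duration<double>(t1 - t0).count();
  double gflops = 2.0 * M * N * K / secs / 1e9;  // 2*M*N*K FLOPs per GEMM
  std::printf("%dx%dx%d: %.2f GFLOP/s\n", M, N, K, gflops);
  return 0;
}
```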

Engineered Libraries:

  • Intel MKL
  • OpenBLAS
  • RUY
  • Accelerate
  • BLIS
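
Of the libraries above, MKL, OpenBLAS, BLIS, and Accelerate can all be driven through the standard CBLAS interface (RUY has its own C++ API), so the same cblas_sgemm call can be timed while only the linked library changes. A minimal sketch, assuming a CBLAS header and library are available (the header name varies by vendor):

```cpp
// Hedged sketch of an SGEMM call through CBLAS; link against OpenBLAS, MKL,
// BLIS, or Accelerate to benchmark that library's hand-tuned kernel.
#include <cblas.h>   // vendor-specific header in practice (e.g. mkl.h for MKL)
#include <vector>

int main() {
  const int M = 256, N = 64, K = 512;  // placeholder shape
  std::vector<float> A(M * K, 1.0f), B(K * N, 1.0f), C(M * N, 0.0f);

  // C = 1.0 * A * B + 0.0 * C, row-major, no transposes.
  cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, M, N, K,
              /*alpha=*/1.0f, A.data(), /*lda=*/K,
              B.data(), /*ldb=*/N,
              /*beta=*/0.0f, C.data(), /*ldc=*/N);
  return 0;
}
```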

Compiler / Codegen kernels:

  • MLIR
  • Halide
  • TVM
  • Nod.AI
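
One way a harness can compare generated kernels against the hand-tuned libraries is to have each codegen backend emit a fixed-shape matmul with a plain C ABI and time every backend through the same function-pointer type. The sketch below is hypothetical: the symbol name, signature, and stub body are illustrative and not mmperf's actual interface.

```cpp
// Hypothetical glue: a codegen backend (MLIR, Halide, TVM, ...) is assumed to
// emit a C-ABI matmul for one fixed (M, N, K) shape; the harness then times
// each backend through the same function-pointer type. Stubbed so it compiles.
#include <cstdio>
#include <vector>

// Shared kernel signature: C = A * B for a shape fixed at code-generation time.
using MatmulFn = void (*)(const float* A, const float* B, float* C);

// Stand-in for a backend-generated kernel (a real backend would emit tuned code).
extern "C" void generated_matmul_128x128x128(const float* A, const float* B,
                                             float* C) {
  for (int i = 0; i < 128; ++i)
    for (int j = 0; j < 128; ++j) {
      float acc = 0.0f;
      for (int k = 0; k < 128; ++k) acc += A[i * 128 + k] * B[k * 128 + j];
      C[i * 128 + j] = acc;
    }
}

int main() {
  std::vector<float> A(128 * 128, 1.0f), B(128 * 128, 1.0f), C(128 * 128, 0.0f);
  MatmulFn kernel = generated_matmul_128x128x128;  // swap in any backend here
  kernel(A.data(), B.data(), C.data());
  std::printf("C[0] = %.1f\n", C[0]);  // expect 128.0 for all-ones inputs
  return 0;
}
```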

Results

Results are reported for the following configurations:

  • Nvidia A100 (cuBLAS vs SHARK)
  • Intel Alder Lake 12900K (AVX2)
  • Intel Xeon Skylake (iMac Pro, AVX-512)
  • Xeon Cascade Lake (GCP C2 instance, AVX-512)
  • Xeon Cascade Lake, codegen: TVM, Halide, MLIR (GCP C2 instance, AVX-512)
  • AMD Ryzen 5950X (Zen 3, compared to AMD's BLIS and OpenBLAS for ResNet-50 sizes)
  • Intel Xeon E-2276M Coffee Lake (ThinkPad P53, AVX2)
  • Apple M1 (NEON, no AMX2); note: the 8GB Mac Mini runs roughly 25% slower than the 16GB version on other tests.

Code

For more details see mmperf on GitHub.

Support or Contact

mmperf aims to be a collaborative effort, though it is primarily developed by nod.ai. If you can get better performance or want to add a new backend, please submit a PR.