
OpenBLAS is suspiciously slow (wrt. BLIS/MKL on AMD)

OpenBLAS is suspiciously slow in numpy (an order of magnitude slower than both BLIS and MKL, on an AMD 3950X!).


  • Create an MKL environment: conda create -n mkl numpy mkl
  • Create a BLIS environment: conda create -n blis numpy blis nomkl
  • Create an OpenBLAS environment: conda create -n openblas numpy openblas nomkl
  • Start a jupyter notebook/lab (in each environment, separately): $ OMP_NUM_THREADS=1 BLIS_NUM_THREADS=1 MKL_NUM_THREADS=1 jupyter lab
  • Run the following code to get timings:
import numpy as np
sizes = (1, 2, 3, 4, 32, 64, 127, 128, 129, 1023, 1024, 1025, 4096, 4096*2-1, 4096*2, 4096*2+1)
best_times = np.zeros(len(sizes))
for i, s in enumerate(sizes):
    arr = np.random.rand(s, s)
    arrT = np.random.rand(s, s)
    t = %timeit -o arr @ arrT
    best_times[i] = t.best  # best (minimum) time reported by %timeit -o
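The `%timeit` magic used above only works inside IPython/Jupyter. For anyone reproducing this from a plain Python script, a rough stand-alone equivalent can be sketched with the standard library's `timeit` module. `best_of` is a hypothetical helper name, not part of NumPy or IPython, and its repeat count is an arbitrary choice rather than what `%timeit` would pick automatically.

```python
import timeit


def best_of(func, repeat=7, number=1):
    """Return the best (minimum) per-call wall time of func, in seconds.

    func is called `number` times per measurement, and the measurement
    is repeated `repeat` times; the smallest result is kept.
    """
    timer = timeit.Timer(func)
    totals = timer.repeat(repeat=repeat, number=number)
    return min(totals) / number
```

Inside the benchmark loop this would be used as `best_times[i] = best_of(lambda: arr @ arrT)`.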

I checked in top that CPU usage never exceeded 100.0% in any of the three cases, from the start of the benchmark until the very end.
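When the benchmark is run as a script rather than from a notebook started with the environment variables above, the same single-thread limit can be requested from Python. This is only a sketch: the variables must be set before numpy is first imported, `pin_blas_threads` is a hypothetical helper name, and `OPENBLAS_NUM_THREADS` is an addition that does not appear in the original launch command.

```python
import os

# The first three names come from the launch command in the report.
# OPENBLAS_NUM_THREADS is an extra variable added here as an assumption.
THREAD_VARS = (
    "OMP_NUM_THREADS",
    "BLIS_NUM_THREADS",
    "MKL_NUM_THREADS",
    "OPENBLAS_NUM_THREADS",
)


def pin_blas_threads(n=1):
    """Set the thread-count variables and return what was set."""
    for var in THREAD_VARS:
        os.environ[var] = str(n)
    return {var: os.environ[var] for var in THREAD_VARS}


# Call this before `import numpy` so the BLAS library sees the values at load time.
settings = pin_blas_threads(1)
print(settings)
```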



The last point (size 4096*2+1) takes around 25 s with both MKL and BLIS; it takes 3 min 30 s with OpenBLAS. Last time I did something similar, OpenBLAS was on par with MKL. Again, I insist: CPU usage was capped at 100% in all cases, so there is no underlying multithreading here.
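To put those timings on a common scale, they can be converted to an approximate throughput. The sketch below assumes the usual estimate of about 2·n³ floating-point operations for a dense n×n matrix product; `matmul_gflops` is a hypothetical helper, and the two timings are the rounded figures quoted above, not new measurements.

```python
def matmul_gflops(n, seconds):
    """Approximate GFLOP/s for an n x n @ n x n product taking `seconds`."""
    flops = 2.0 * n ** 3  # roughly n^3 multiplications plus n^3 additions
    return flops / seconds / 1e9


n = 4096 * 2 + 1                    # largest size in the benchmark
fast = matmul_gflops(n, 25.0)       # MKL / BLIS, about 25 s
slow = matmul_gflops(n, 210.0)      # OpenBLAS environment, about 3 min 30 s
print(f"fast: {fast:.1f} GFLOP/s, slow: {slow:.1f} GFLOP/s, ratio: {fast / slow:.1f}x")
```

Under these assumptions the fast runs come out at roughly 44 GFLOP/s on one core and the slow run at roughly 5 GFLOP/s, a factor of 8.4.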

Conda environment

Environment (conda list):
$ conda list
openblas                  0.3.17          pthreads_h4748800_0    conda-forge

Full list here:

This is not the correct way. Please see our docs on how to switch the BLAS implementation.

What are you talking about?

The point is not how to switch implementations in the most comfortable way (feel free to use whichever method you prefer to switch).

The point is about this OpenBLAS being much slower than BLIS, which is not how things used to be.

> The point is not how to switch implementations in the most comfortable way

I didn't say whether it was comfortable or not. I said it's not correct, which means it's wrong. The conda list output you showed contains the following:

libblas                   3.9.0           5_h92ddd45_netlib    conda-forge
libcblas                  3.9.0           5_h92ddd45_netlib    conda-forge

which means that you are not using OpenBLAS but netlib's reference BLAS/LAPACK, which is slow. You have both netlib and OpenBLAS installed, but numpy is using the netlib one.
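The check described here can be done mechanically on the text of `conda list`. The sketch below assumes, based only on the rows quoted in this thread, that the BLAS variant is the last underscore-separated token of the build string of `libblas`/`libcblas`; `blas_variant` is a hypothetical helper, not a conda API, and the sample text reuses the lines shown above.

```python
def blas_variant(conda_list_text):
    """Map libblas/libcblas/liblapack to the variant named in their build string."""
    variants = {}
    for line in conda_list_text.splitlines():
        fields = line.split()
        # Expected columns: name, version, build, channel
        if len(fields) >= 3 and fields[0] in ("libblas", "libcblas", "liblapack"):
            build = fields[2]
            # Assumption: the variant is the text after the last underscore
            variants[fields[0]] = build.rsplit("_", 1)[-1]
    return variants


sample = """\
openblas                  0.3.17          pthreads_h4748800_0    conda-forge
libblas                   3.9.0           5_h92ddd45_netlib    conda-forge
libcblas                  3.9.0           5_h92ddd45_netlib    conda-forge
"""

print(blas_variant(sample))  # {'libblas': 'netlib', 'libcblas': 'netlib'}
```

Note that the `openblas` row is ignored on purpose: having that package installed says nothing about which library `libblas` actually points to.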

Please use the recommended way to switch the BLAS implementation, and you'll get an environment where numpy uses OpenBLAS.

Why can't openblas require/pull the correct libblas?

Well, I suppose this at least resolves this specific bug report, though it sounds like libblas builds should be made to conflict with a mismatching BLAS implementation.