eddelbuettel/mkl4deb

benchmarkme fails

eduardszoecs opened this issue · 16 comments

I can't get this to run in a container. benchmarkmefails :(

sessionInfo(); library("benchmarkme"); benchmark_std(runs = 1)

R version 3.4.4 (2018-03-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.3 LTS

Matrix products: default
BLAS/LAPACK: /opt/intel/compilers_and_libraries_2018.2.199/linux/mkl/lib/intel64_lin/libmkl_rt.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
[1] compiler_3.4.4
# Programming benchmarks (5 tests):
	3,500,000 Fibonacci numbers calculation (vector calc): 0.867 (sec).
	Grand common divisors of 1,000,000 pairs (recursion): 1.38 (sec).
	Creation of a 3500x3500 Hilbert matrix (matrix calc): 0.66 (sec).
	Creation of a 3000x3000 Toeplitz matrix (loops): 15.4 (sec).
	Escoufier's method on a 60x60 matrix (mixed): 1.61 (sec).
# Matrix calculation benchmarks (5 tests):
	Creation, transp., deformation of a 5000x5000 matrix: 1.15 (sec).
	2500x2500 normal distributed random matrix ^1000: 0.659 (sec).
	Sorting of 7,000,000 random values: 0.892 (sec).
	2500x2500 cross-product matrix (b = a' * a): 0.992 (sec).
Error in solve(crossprod(a), crossprod(a, b)) : 
  the leading minor of order 1421 is not positive definite
Calls: benchmark_std ... bm_matrix_cal_lm -> system.time -> solve -> solve -> .Call
Timing stopped at: 1.064 0.032 0.553

It's a ubuntu 16.04 container and here's the mkl part that I use:

# install mkl
RUN wget https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS-2019.PUB -P /tmp \
    && apt-key add /tmp/GPG-PUB-KEY-INTEL-SW-PRODUCTS-2019.PUB \
    && sh -c 'echo deb https://apt.repos.intel.com/mkl all main > /etc/apt/sources.list.d/intel-mkl.list' \
    && apt-get update \
    && apt-get install -y intel-mkl-64bit-2018.2-046 \
    && update-alternatives --install /usr/lib/x86_64-linux-gnu/libblas.so     \
                    libblas.so-x86_64-linux-gnu /opt/intel/mkl/lib/intel64/libmkl_rt.so 50 \
    && update-alternatives --install /usr/lib/x86_64-linux-gnu/libblas.so.3   \
                    libblas.so.3-x86_64-linux-gnu /opt/intel/mkl/lib/intel64/libmkl_rt.so 50 \
    && update-alternatives --install /usr/lib/x86_64-linux-gnu/liblapack.so   \
                    liblapack.so-x86_64-linux-gnu /opt/intel/mkl/lib/intel64/libmkl_rt.so 50 \
    && update-alternatives --install /usr/lib/x86_64-linux-gnu/liblapack.so.3 \
                    liblapack.so.3-x86_64-linux-gnu /opt/intel/mkl/lib/intel64/libmkl_rt.so 50 \
    && echo "/opt/intel/lib/intel64"     >  /etc/ld.so.conf.d/mkl.conf \
    && echo "/opt/intel/mkl/lib/intel64" >> /etc/ld.so.conf.d/mkl.conf \
    && ldconfig

Any ideas /hints? Can you share the Dockerfile you used?

openblas works fine:

sessionInfo(); library("benchmarkme"); benchmark_std(runs = 1)

R version 3.4.4 (2018-03-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.3 LTS

Matrix products: default
BLAS: /usr/lib/openblas-base/libblas.so.3
LAPACK: /usr/lib/libopenblasp-r0.2.18.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
[1] compiler_3.4.4
# Programming benchmarks (5 tests):
	3,500,000 Fibonacci numbers calculation (vector calc): 0.874 (sec).
	Grand common divisors of 1,000,000 pairs (recursion): 1.31 (sec).
	Creation of a 3500x3500 Hilbert matrix (matrix calc): 0.23 (sec).
	Creation of a 3000x3000 Toeplitz matrix (loops): 16.9 (sec).
	Escoufier's method on a 60x60 matrix (mixed): 1.28 (sec).
# Matrix calculation benchmarks (5 tests):
	Creation, transp., deformation of a 5000x5000 matrix: 1.31 (sec).
	2500x2500 normal distributed random matrix ^1000: 0.769 (sec).
	Sorting of 7,000,000 random values: 0.918 (sec).
	2500x2500 cross-product matrix (b = a' * a): 0.558 (sec).
	Linear regr. over a 3000x3000 matrix (c = a \ b'): 0.396 (sec).
# Matrix function benchmarks (5 tests):
	Cholesky decomposition of a 3000x3000 matrix: 0.773 (sec).
	Determinant of a 2500x2500 random matrix: 1.03 (sec).
	Eigenvalues of a 640x640 random matrix: 0.865 (sec).
	FFT over 2,500,000 random values: 0.536 (sec).
	Inverse of a 1600x1600 random matrix: 0.81 (sec).
     user system elapsed          test test_group cores
1   0.856  0.016   0.874           fib       prog     0
2   1.220  0.088   1.309           gcd       prog     0
3   0.176  0.056   0.230       hilbert       prog     0
4  16.824  0.032  16.947      toeplitz       prog     0
5   1.236  0.040   1.278     escoufier       prog     0
6   1.232  0.048   1.314         manip matrix_cal     0
7   0.768  0.000   0.769         power matrix_cal     0
8   0.892  0.024   0.918          sort matrix_cal     0
9   0.952  0.536   0.558 cross_product matrix_cal     0
10  0.848  0.468   0.396            lm matrix_cal     0
11  0.812  0.784   0.773      cholesky matrix_fun     0
12  1.516  0.384   1.027   determinant matrix_fun     0
13  0.936  1.608   0.865         eigen matrix_fun     0
14  0.536  0.000   0.536           fft matrix_fun     0
15  1.264  0.732   0.810       inverse matrix_fun     0

It's a good idea to try benchmarkme on this! I did nothing more than you -- fire up the script (I posted). If there is an issue, it is probably an MKL issue.

And I seem to get the same problem:

> benchmark_std(runs=1)
# Programming benchmarks (5 tests):
        3,500,000 Fibonacci numbers calculation (vector calc): 0.437 (sec).
        Grand common divisors of 1,000,000 pairs (recursion): 0.645 (sec).
        Creation of a 3500x3500 Hilbert matrix (matrix calc): 0.122 (sec).
        Creation of a 3000x3000 Toeplitz matrix (loops): 7.09 (sec).
        Escoufier's method on a 60x60 matrix (mixed): 0.559 (sec).
# Matrix calculation benchmarks (5 tests):
        Creation, transp., deformation of a 5000x5000 matrix: 0.435 (sec).
        2500x2500 normal distributed random matrix ^1000: 0.36 (sec).
        Sorting of 7,000,000 random values: 0.53 (sec).
        2500x2500 cross-product matrix (b = a' * a): 0.329 (sec).
Error in solve(crossprod(a), crossprod(a, b)) : 
  the leading minor of order 1404 is not positive definite
Timing stopped at: 1.037 0.009 0.175

@csgillespie Any ideas? That benchmark does not even seem influenced by set.seed().

Poking @csgillespie as first version of previous post had a typo with a # where a @ was needed...

And I did notice that we can switch to RcppZiggurat if installed (which I did). I then get

        2500x2500 cross-product matrix (b = a' * a): 0.324 (sec).
Error in solve(crossprod(a), crossprod(a, b)) : 
  the leading minor of order 1566 is not positive definite
Timing stopped at: 1.166 0.028 0.199
>

so it looks like the MKL is very picky. That is not really an issue for this repo though. The script to install the MKL works...

"picky" is in the mind of the beholder. I can't remember the details on the distribution of the eigenvalues of a matrix constructed in this way but it is extremely likely that a 2500 x 2500 cross-product matrix will be computationally singular. That's why solving such very large systems without any regularization is difficult.

Thanks so much for piping in. I was thinking about bugging you :) The actual routine appears to be
this function using a dgeMatrix from your Matrix package. What is weird that this is probably "old" code from the benchmark package originally put together by Simon. I am surprised this only bubbles up with MKL though/

Hmm, either a problem with the benchmark or mkl. What do Intel folks say (e.g. @emfomenk)? Are alternative benchmarks available ?

As Doug said, a 2000x2000 matrix crossproduct may well be singular.

Hi everyone,

Not sure I know what exactly you call from MKL, but it might happen we have a bug :) But if this is only matrix-matrix multiplication I really doubt the bug is in MKL itself...
Few things to check:

  • @eddelbuettel, what is threading model for MKL? MKL supports sequential, openmp and tbb runtimes. If OpenMP is used than there are 2 runtimes: GNU (libgomp.so) and Intel (libiomp5.so). You should not mix two runtimes in one application. If benchmarkme uses GNU OpenMP, you really want to either set MKL_THREADING_LAYER=GNU, to make mkl_rt pick gnu threading or do LD_PRELOAD=libiomp5.so (LD_LIBRARY_PATH should contain the path with libiomp5, that lives somewhere in /opt/intel). In former case libiomp5 would cover both runtimes.

  • you can set MKL_VERBOSE=1 to see what functions are called from MKL. that might help at least to understand what are the shapes and sizes you are using. Not sure that would immediately give me an idea what goes wrong.

Anyway, standalone reproducer would be super helpful for me to check on my side.

Thanks for coming over here, @emfomenk.

This repo 'merely' contains a script adding MKL to a .deb-based Debian or Ubuntu system. You can see in the rather short script what we do for ldconf: not much.

The particular failing function from the benchmarkme package is here -- the solve() after the two crossprod goes funny. Adding MKL_VERBOSE to R session I have for this (in Docker) made no difference.

ldd shows nothing pertaining to OpenMP but it may well dispatched by libmkl_rt.so:

root@c9f8062fbd93:~# ldd /opt/intel/mkl/lib/intel64/libmkl_rt.so
        linux-vdso.so.1 (0x00007ffc625f8000)
        libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f9876403000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f9876049000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f9876c84000)
root@c9f8062fbd93:~# 

benchmarkme itself is just a set of R functions; R by default does not turn OpenMP on (but can).

Setting MKL_THREADING_LAYER=GNU before calling R made the difference:

root@c9f8062fbd93:~# MKL_THREADING_LAYER=GNU R

R version 3.4.4 (2018-03-15) -- "Someone to Lean On"              
Copyright (C) 2018 The R Foundation for Statistical Computing   
Platform: x86_64-pc-linux-gnu (64-bit)                         
 
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions. 
Type 'license()' or 'licence()' for distribution details.    

  Natural language support but running in an English locale    

R is a collaborative project with many contributors.    
Type 'contributors()' for more information and      
'citation()' on how to cite R or R packages in publications. 

Type 'demo()' for some demos, 'help()' for on-line help, or  
'help.start()' for an HTML browser interface to help.      
Type 'q()' to quit R. 

> library(benchmarkme)
See https://jumpingrivers.shinyapps.io/benchmarkme/ for a Shiny
interface to the benchmark data.
> benchmark_std(runs=1)
# Programming benchmarks (5 tests):
        3,500,000 Fibonacci numbers calculation (vector calc): 0.426 (sec).
        Grand common divisors of 1,000,000 pairs (recursion): 0.616 (sec).
        Creation of a 3500x3500 Hilbert matrix (matrix calc): 0.125 (sec).
        Creation of a 3000x3000 Toeplitz matrix (loops): 7.28 (sec).
        Escoufier's method on a 60x60 matrix (mixed): 0.572 (sec).
# Matrix calculation benchmarks (5 tests):
        Creation, transp., deformation of a 5000x5000 matrix: 0.43 (sec).
        2500x2500 normal distributed random matrix ^1000: 0.367 (sec).
        Sorting of 7,000,000 random values: 0.535 (sec).
        2500x2500 cross-product matrix (b = a' * a): 0.078 (sec).
        Linear regr. over a 3000x3000 matrix (c = a \ b'): 0.101 (sec).
# Matrix function benchmarks (5 tests):
        Cholesky decomposition of a 3000x3000 matrix: 0.113 (sec).
        Determinant of a 2500x2500 random matrix: 0.084 (sec).
        Eigenvalues of a 640x640 random matrix: 0.186 (sec).
        FFT over 2,500,000 random values: 0.199 (sec).
        Inverse of a 1600x1600 random matrix: 0.072 (sec).
    user system elapsed          test test_group cores
1  0.411  0.016   0.426           fib       prog     0
2  0.584  0.032   0.616           gcd       prog     0
3  0.097  0.028   0.125       hilbert       prog     0
4  7.275  0.008   7.283      toeplitz       prog     0
5  3.292  0.039   0.572     escoufier       prog     0
6  0.406  0.024   0.430         manip matrix_cal     0
7  0.367  0.000   0.367         power matrix_cal     0
8  0.515  0.020   0.535          sort matrix_cal     0
9  0.354  0.056   0.078 cross_product matrix_cal     0
10 0.353  0.024   0.101            lm matrix_cal     0
11 0.316  0.016   0.113      cholesky matrix_fun     0
12 0.308  0.040   0.084   determinant matrix_fun     0
13 1.075  0.004   0.186         eigen matrix_fun     0
14 0.191  0.008   0.199           fft matrix_fun     0
15 0.402  0.004   0.072       inverse matrix_fun     0
>

I will ask @csgillespie to add that environment variable, or to document it. I will document it here too.

Thanks again!

I know I'm late to the party but two comments to wrap up:

  • As Dirk mentioned, I just used Simon's benchmarks (http://r.research.att.com/benchmarks/R-benchmark-25.R)

  • Be wary of comparing with historical benchmarks. Basically, the different versions of R uses the compiler package in different ways. I intend to update the package soon to make this clear.

No problem :)

Hmm... It is really strange MKL_VERBOSE doesn't trigger any verbose output. Currently all the functions from BLAS, LAPACK, and DFT should dump the parameters to stdout when MKL_VERBOSE environment variable is set. Really weird.

Regarding ldd. MKL supports 3 linking modes:

  1. static (libmkl_*.a)
  2. dynamic (libmkl_rt.so), rt stands for load dependencies at RunTime
  3. explicit dynamic (libmkl_*.so, except for libmkl_rt.so)

libmkl_rt.so loads required MKL layers at runtime. Depending on environment variables it might load C LP64/C ILP64/GNU Fortran LP64/GNU Fortran ILP64 interface, Intel OpenMP/GNU OpenMP/Intel TBB/sequential threading, and core library (architecture specific). For more information see this page.

If no environment variable is set the default configuration is C LP64 + Intel OpenMP threading layer. So at run-time MKL will load libiomp5.so and libmkl_intel_thread.so. If your application or its dependencies use GNU OpenMP (e.g. some parts of it is built with gcc and -fopenmp flags) then Linux dynamic linker will load GNU OpenMP RT as well. So at that moment your application will have 2 OpenMP run-times. That typically leads to numerical errors or even crashes.

Splendid explanation. I am the one building this R version for Debian (and Ubuntu) and we definitely have that enable in this build so we surely need the env var to not load it again:

root@11acb27fced5:~# grep -i openmp /etc/R/Makeconf 
DYLIB_LDFLAGS = -shared -fopenmp# $(CFLAGS) $(CPICFLAGS)
MAIN_LDFLAGS = -Wl,--export-dynamic -fopenmp
SHLIB_OPENMP_CFLAGS = -fopenmp
SHLIB_OPENMP_CXXFLAGS = -fopenmp
SHLIB_OPENMP_FCFLAGS = -fopenmp
SHLIB_OPENMP_FFLAGS = -fopenmp
root@11acb27fced5:~# 

So to recap MKL_THREADING_LAYER=GNU should protect us, correct?

So to recap MKL_THREADING_LAYER=GNU should protect us, correct?

Yes, as long as the whole application uses either GNU OpenMP RT or no OpenMP RT at all.

I can confirm that MKL_THREADING_LAYER=GNU works for me. Thanks for you help! The speedup compared to openblas is mainly visible in eigenvalues & fft.

Thanks to @Edild for raising this, and to @emfomenk for the excellent follow-up. I added this to both README.md and the actual script.sh.

Hi,

Thanks for the fix and work you do!
I have a minor comment to the commit though.
Please see here.