eddelbuettel/mkl4deb

MKL threading

kkm000 opened this issue · 3 comments

First of all, thank you for the update-alternatives trick! I had no idea that libmkl_rt.so can stand for all four libraries. It's just WOW! A fantastic trick!

I'm a little worried about your unconditional recommendation to use MKL_THREADING_LAYER=GNU. I've been doing numerical stuff with MKL for many years, and (while I normally went with TBB threading, as it seems to have the least lock contention) the results are mixed. Sometimes it's a big gain on one architecture and a slowdown on another. My most surprising moment was when I ran the same program on an MS Server (2012 R2, IIRC) and on a Hyper-V VM on the same host, with the same OS, the same number of CPUs and a lot of RAM (probably less than the host, but close). I got a big gain on the VM, but an unexpected drop in speed on the host (yes, not the other way round!).

I suggest adding a paragraph saying that users should also try MKL_THREADING_LAYER=SEQUENTIAL and see which setting is actually better for their problem. It's really both problem-dependent and hardware-dependent. The LINPACK benchmark always benefits from threading, in my experience, so users should test performance on the real code for their task. My impression is that MKL does not always make the best choice as to whether to shard the computation between threads.
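A minimal sketch of such a comparison, assuming Python 3 is at hand: it runs the same command once per threading layer with MKL_THREADING_LAYER set in the child's environment and keeps the best wall-clock time. The helper names (`env_for_layer`, `time_command`) are made up for illustration and are not part of mkl4deb or MKL. Which layer values actually work depends on the MKL build and the runtime libraries installed.

```python
import os
import subprocess
import sys
import time

# Values to try for MKL_THREADING_LAYER; a layer whose runtime library is
# missing on this machine will simply show up as a failed run.
LAYERS = ("INTEL", "GNU", "TBB", "SEQUENTIAL")


def env_for_layer(layer, base_env=None):
    """Return a copy of the environment with MKL_THREADING_LAYER set."""
    env = dict(os.environ if base_env is None else base_env)
    env["MKL_THREADING_LAYER"] = layer
    return env


def time_command(cmd, layers=LAYERS, repeats=3):
    """Run cmd under each layer and return {layer: best seconds or None}.

    A layer maps to None if any run exits with a non-zero status.
    """
    results = {}
    for layer in layers:
        env = env_for_layer(layer)
        best = None
        for _ in range(repeats):
            start = time.perf_counter()
            proc = subprocess.run(
                cmd,
                env=env,
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
            )
            elapsed = time.perf_counter() - start
            if proc.returncode != 0:
                best = None
                break
            best = elapsed if best is None else min(best, elapsed)
        results[layer] = best
    return results


# Example call with a hypothetical workload script:
#   for layer, secs in time_command(["Rscript", "my_model.R"]).items():
#       print(layer, "failed" if secs is None else f"{secs:.3f} s")
```

The timings include process start-up, so this only makes sense for workloads that run long enough for the BLAS/LAPACK part to dominate.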

FWIW, the default for MKL_THREADING_LAYER is intel.

The details now escape me :-/ but IIRC this goes back to a tip I got from an Intel engineer from one of their labs -- it may have been in the context of RcppParallel (which uses TBB). I don't recall now - I seem to remember that it was related to using the GNU toolchain (as common in Linux) along with the MKL (rather than Intel's compiler, IIRC).

Edit: Maybe it was this thread: #2. And more here: c6d8d84. I suggest you talk more to Intel.

I think what the script is doing is exactly right, provided that the users are familiar with the peculiarities of MKL threading: it changes the default from the multithreaded iomp5 implementation to the GOMP-based one, which is also multithreaded. On the other hand, it may have a surprising effect on performance if someone is using it simply as an alternative to generic single-threaded BLAS libraries. MKL_THREADING_LAYER=sequential (no automatic multithreading) may be a safer default for the general user for two reasons: (1) it does not load any OMP libraries into the process, so there are no possible conflicts with GCC -fopenmp like the one in #2, and (2) automatic multithreading with MKL is not necessarily always the best choice.
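To check point (1) on a given setup, one option is to look at which threading runtimes are actually mapped into the process. The sketch below is Linux-only and assumes the libraries show up in /proc/self/maps under their usual file names (libiomp5, libgomp, libtbb). The function names are hypothetical.

```python
import re

# File-name fragments for the runtimes mentioned in this thread.
RUNTIME_PATTERN = re.compile(r"lib(iomp5|gomp|tbb)[^/\s]*\.so[^/\s]*")


def threading_runtimes(maps_text):
    """Return sorted, de-duplicated runtime file names found in a maps dump."""
    found = set()
    for line in maps_text.splitlines():
        match = RUNTIME_PATTERN.search(line)
        if match:
            found.add(match.group(0))
    return sorted(found)


def runtimes_in_this_process():
    """Linux only: inspect the memory map of the current process."""
    with open("/proc/self/maps") as fh:
        return threading_runtimes(fh.read())
```

Call `runtimes_in_this_process()` only after the process has done at least one BLAS operation, since the threading library may not be loaded before MKL is first used.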

It's possible that I'm confused, not understanding the intended audience for this script. I was just too impressed with your update-alternatives idea not to read the whole script, and thought maybe sharing this tidbit would be helpful. Again, sorry if you feel I'm just wasting your time.

@kkm000 Thanks for the information. On my machine, MKL_THREADING_LAYER=sequential is faster than the default, which in turn is faster than MKL_THREADING_LAYER=GNU.