IntelLabs/SpMP

SpMV BW changes from run to run

yubai0827 opened this issue · 16 comments

The following is the result from 7 runs:

========== Run #1 ============
m = 1000005 nnz = 3105536 3.105520 bytes = 53266512.000000
original bandwidth 987649
SpMV BW 14.72 gflops 126.22 gbps
MKL SpMV BW 5.22 gflops 44.76 gbps
MKL inspector-executor SpMV BW 15.73 gflops 134.91 gbps
========== Run #2 ============
m = 1000005 nnz = 3105536 3.105520 bytes = 53266512.000000
original bandwidth 987649
SpMV BW 9.83 gflops 84.28 gbps
MKL SpMV BW 5.77 gflops 49.50 gbps
MKL inspector-executor SpMV BW 15.26 gflops 130.88 gbps
========== Run #3 ============
m = 1000005 nnz = 3105536 3.105520 bytes = 53266512.000000
original bandwidth 987649
SpMV BW 10.12 gflops 86.76 gbps
MKL SpMV BW 5.24 gflops 44.95 gbps
MKL inspector-executor SpMV BW 15.33 gflops 131.50 gbps
========== Run #4 ============
m = 1000005 nnz = 3105536 3.105520 bytes = 53266512.000000
original bandwidth 987649
SpMV BW 19.35 gflops 165.99 gbps
MKL SpMV BW 4.83 gflops 41.45 gbps
MKL inspector-executor SpMV BW 13.83 gflops 118.65 gbps
========== Run #5 ============
m = 1000005 nnz = 3105536 3.105520 bytes = 53266512.000000
original bandwidth 987649
SpMV BW 19.23 gflops 164.88 gbps
MKL SpMV BW 5.47 gflops 46.89 gbps
MKL inspector-executor SpMV BW 14.64 gflops 125.59 gbps
========== Run #6 ============
m = 1000005 nnz = 3105536 3.105520 bytes = 53266512.000000
original bandwidth 987649
SpMV BW 10.02 gflops 85.93 gbps
MKL SpMV BW 3.66 gflops 31.35 gbps
MKL inspector-executor SpMV BW 10.10 gflops 86.60 gbps
========== Run #7 ============
m = 1000005 nnz = 3105536 3.105520 bytes = 53266512.000000
original bandwidth 987649
SpMV BW 19.28 gflops 165.37 gbps
MKL SpMV BW 5.83 gflops 49.97 gbps
MKL inspector-executor SpMV BW 15.30 gflops 131.19 gbps

This is running: test/reordering_test

The matrix is webbase-1M:
https://sparse.tamu.edu/Williams/webbase-1M

Is it normal for the bandwidth to change dramatically from run to run? Sometimes the MKL BW is higher and sometimes lower. Is this expected? If not, am I missing something?

Many thanks,
Yu Bai

Can you tell me the specification of the machine and the compiler you used? Did you specify thread affinity as in the comments of the tests? Single-thread performance should be more stable. Since Skylake and Cascade Lake now have more cache capacity with a non-inclusive LLC, you may want to increase LLC_CAPACITY defined in SpMP/test.hpp and consider using the clflush instruction instead of just updating a large array (see https://github.com/pytorch/FBGEMM/blob/master/bench/BenchUtils.h#L42 as an example).

Just curious: why isn't the SpMV performance after reordering printed?

First, thank you!

machine.log
Above is the log of "cat /proc/cpuinfo".

make.log
Above is the make log.

No, I don't explicitly specify the thread count. "reordering_test matrix" is the command.

Do you mean that a small LLC capacity might introduce performance variance from run to run? The current LLC_CAPACITY definition is:
static const size_t LLC_CAPACITY = 32 * 1024 * 1024;
How large would be more appropriate?

I commented out the remaining code, so nothing is printed after reordering.

"A->multiplyWithVector(y, x);" is where the matrix-vector multiplication happens, isn't it?

Thanks again.

I'd start with single-thread runs with OMP_NUM_THREADS=1, and please also specify thread affinity to make it more consistent. Then you can increase the number of threads, but I'd stay within a single socket. Please try 4x the current LLC_CAPACITY, but I think the number of threads and affinity have the biggest impact.

Thank you. How do I specify thread affinity? I noticed in the comments:

OMP_NUM_THREADS=18 KMP_AFFINITY=granularity=fine,compact,1 test/reordering_test web-Google.mtx

Can you please confirm that the matrix-vector multiplication actually happens in "A->multiplyWithVector(y, x);"? If we only want to study the matrix-vector multiplication, can we safely comment out the rest of the reordering_test code? Thank you!

KMP_AFFINITY=granularity=fine,compact,1 is a reasonable setup to use (one thread per physical core).
For a single-thread run, use OMP_NUM_THREADS=1; you can then increase up to the number of physical cores in a socket. For more details, please Google KMP_AFFINITY, and I'm sure experts inside Intel know much more about this than me :)

Yes, that's where SpMV actually happens. But in the later code, SpMV is executed again after reordering, and that can give you better performance depending on the sparsity pattern (MKL's inspector-executor can also do reordering, so for a fair comparison you'd want to compare with the performance after reordering).

Many thanks. I tried running after reordering and got a seg fault with OMP_NUM_THREADS=1, as follows:

OMP_NUM_THREADS=1 KMP_AFFINITY=granularity=fine,compact,1 reordering_test webbase-1M.bin
m = 1000005 nnz = 3105536 3.105520 bytes = 53266512.000000
original bandwidth 987649
SpMV BW 0.92 gflops 7.86 gbps
MKL SpMV BW 0.86 gflops 7.42 gbps
MKL inspector-executor SpMV BW 1.24 gflops 10.63 gbps

BFS reordering
Constructing permutation takes 0.038738 (0.32 gbps)
0 missed
0 duplicated
Permute takes 0.021677 (2.46 gbps)
Permuted bandwidth 964586
SpMV BW 1.87 gflops 16.03 gbps

RCM reordering w/o source selection heuristic
Segmentation fault

===================
BFS reordering does help performance: 1.24 gflops to 1.87 gflops. RCM reordering causes a seg fault.

Ignoring the segfault, are you seeing more consistent performance across runs with a single thread?

I think the segfault is because BFS/RCM reordering only works for symmetric matrices. If you build with the DBG=yes option, you will see an assertion like the following. Sorry about the poor error handling.
reordering_test: reordering/RCM.cpp:976: void SpMP::CSR::getBFSPermutation(int*, int*): Assertion `isSymmetric(false)' failed.

Actually, reordering_test does try to make the input matrix symmetric, as you can see from https://github.com/IntelLabs/SpMP/blob/master/test/reordering_test.cpp#L88, but apparently forceSymmetric is ignored when loading from a *.bin file. For reordering_test, please use a *.mtx file.

I added 2 more commits. You should now see an error message when you try to run reordering_test with *.bin files.

Yes, with a single OMP thread, the performance (gflops) is much lower and stable, though not perfectly so. Many thanks.

Can you please confirm my understanding: if an asymmetric .bin matrix (like webbase-1M.bin) is used with reordering_test, forceSymmetric is ignored, but the matrix-vector multiplication is still done correctly, just perhaps not at the ideal performance level? I mean before you made the last two commits. Appreciate your help.

Yes, even with *.bin inputs, the matrix-vector multiplication before reordering is still done correctly.

This is great, thanks again.

I'm curious: if OMP_NUM_THREADS is not explicitly defined, what is it? Does it depend on the workload? Does it change from run to run? Thanks.

By default, it's typically the number of logical cores available to the process, unless you limit the process's access, for example via numactl. BTW, for general questions not related to SpMP, please ask on other forums or ask the Intel OpenMP team.

I got it, thank you!