MIC SpMV "CSR Vector" performance is very poor compared to "CSR Scalar"
rothpc opened this issue · 0 comments
In an attempt to provide implementations of SpMV comparable to those used in the CUDA and OpenCL versions, the SpMV "CSR Vector" operations have been implemented using OpenMP nested parallelism. The outer loop is parallelized using a conventional "omp parallel" directive, and the inner loop using an "omp parallel for" directive plus a reduction clause. The number of threads used for the inner and outer loops is specified using a num_threads clause, and dynamic thread count management is turned off. This is intended to mimic the CUDA/OpenCL version's use of a reduction that fits within a single warp.
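For reference, the structure described above can be sketched roughly as follows. This is a minimal illustration, not the actual code from the repository: the function name, the argument names, and the CSR array layout (row_ptr, col_idx, val) are assumptions, as is the use of omp_set_max_active_levels to enable nesting and of an "omp for" work-sharing construct inside the outer region.

```c
#include <assert.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical "CSR Vector" SpMV (y = A*x) using nested OpenMP parallelism.
 * All names here are illustrative assumptions, not the project's API. */
void spmv_csr_vector(int n_rows, const int *row_ptr, const int *col_idx,
                     const double *val, const double *x, double *y,
                     int outer_threads, int inner_threads)
{
    /* num_threads requires positive values. */
    assert(outer_threads > 0 && inner_threads > 0);

#ifdef _OPENMP
    omp_set_dynamic(0);            /* turn off dynamic thread count management */
    omp_set_max_active_levels(2);  /* assumption: allow two active nested levels */
#endif

    /* Outer level: rows are divided among the outer team. */
    #pragma omp parallel num_threads(outer_threads)
    {
        #pragma omp for
        for (int row = 0; row < n_rows; ++row) {
            double sum = 0.0;

            /* Inner level: each row's non-zeros are reduced by a small team,
             * analogous to the warp-sized reduction in the CUDA/OpenCL code. */
            #pragma omp parallel for reduction(+:sum) num_threads(inner_threads)
            for (int j = row_ptr[row]; j < row_ptr[row + 1]; ++j) {
                sum += val[j] * x[col_idx[j]];
            }

            y[row] = sum;
        }
    }
}
```

Note that in this form a new inner parallel region is entered once per row, so the fork/join cost is paid n_rows times. If the compiler is invoked without OpenMP support, the pragmas are ignored and the function runs serially with the same result.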
However, the performance of the "CSR Vector" version is very poor compared to the "CSR Scalar" version, which simply parallelizes the outer loop. The performance changes with the number of inner and outer threads. The number of inner loop threads must be small, because the number of non-zeros in each row is relatively small (probably too small to overcome the OpenMP overhead), even with size 4 problems. Surprisingly, performance is also better with a small number of outer loop threads, which raises the question of whether we are implementing the nested parallelism correctly.
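For comparison, here is a rough sketch of the "CSR Scalar" structure, together with a small check that could help answer the nested parallelism question. Both functions are illustrative assumptions (names, arguments, and CSR layout are hypothetical, not taken from the repository). The check reports how many threads the inner team actually received; a result of 1 would suggest that the inner regions are being serialized, for example because nesting is not enabled in the runtime.

```c
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical "CSR Scalar" SpMV (y = A*x): only the outer (row) loop
 * is parallelized; each thread computes whole rows serially. */
void spmv_csr_scalar(int n_rows, const int *row_ptr, const int *col_idx,
                     const double *val, const double *x, double *y)
{
    #pragma omp parallel for
    for (int row = 0; row < n_rows; ++row) {
        double sum = 0.0;
        for (int j = row_ptr[row]; j < row_ptr[row + 1]; ++j) {
            sum += val[j] * x[col_idx[j]];
        }
        y[row] = sum;
    }
}

/* Hypothetical diagnostic: returns the size of the inner team as seen by
 * inner thread 0 of outer thread 0. Returns 1 when the inner region is
 * serialized or when the code is built without OpenMP. */
int observed_inner_team_size(int outer_threads, int inner_threads)
{
    int inner_size = 1;
#ifdef _OPENMP
    omp_set_dynamic(0);
    omp_set_max_active_levels(2);  /* assumption: two nested levels wanted */

    #pragma omp parallel num_threads(outer_threads)
    {
        #pragma omp parallel num_threads(inner_threads)
        {
            /* Exactly one thread overall satisfies both conditions,
             * so this write is not a data race. */
            if (omp_get_thread_num() == 0 &&
                omp_get_ancestor_thread_num(1) == 0) {
                inner_size = omp_get_num_threads();
            }
        }
    }
#else
    (void)outer_threads;
    (void)inner_threads;
#endif
    return inner_size;
}
```

Calling observed_inner_team_size with the same thread counts used for the "CSR Vector" runs, and comparing the result against the requested inner thread count, would be one way to confirm whether the nested teams are created as intended on the MIC.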