ddemidov/vexcl

Support for Block SpMV

daaugusto opened this issue · 12 comments

Hi, first of all, thanks for the development and maintenance of this great library!
I would like to know if there is any plan to support block sparse matrices and block SpMV with constant-size blocks. This would speed up the kernel by reducing the indexing overhead for this class of matrices.
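To illustrate what I mean by indexing overhead, here is a rough, purely illustrative sketch (plain C++, not VexCL API) of scalar CSR storage versus fixed-size block storage: with 4x4 blocks, one column index covers 16 scalar nonzeros, so the index arrays shrink accordingly and the per-block product can be unrolled.

#include <array>
#include <vector>

// Scalar CSR: one column index per scalar nonzero.
struct csr {
    std::vector<int>    ptr, col;
    std::vector<double> val;
};

// Block (BSR-style) storage with fixed 4x4 blocks: one column index per block,
// i.e. per 16 scalar nonzeros, so the indexing overhead drops by that factor.
struct bsr4 {
    std::vector<int>                     ptr, col;
    std::vector<std::array<double, 16>>  val; // dense 4x4 block values
};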

Thank you.

It is possible, but is not directly provided by the library.

There is an example of a block-valued spmv in the vexcl backend for amgcl here:
https://github.com/ddemidov/amgcl/blob/master/amgcl/backend/vexcl_static_matrix.hpp

The code extends vexcl sparse matrix operations to support amgcl::static_matrix<T,N,N> as the value type for vexcl matrices.
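Roughly, what the extension enables is something like the sketch below. This is my illustration, not code from the header: the block value types also have to be registered with VexCL (struct declarations and the unrolled block products), which is exactly what vexcl_static_matrix.hpp takes care of and which is omitted here; the function and array names are placeholders.

#include <vector>
#include <amgcl/backend/vexcl.hpp>
#include <amgcl/backend/vexcl_static_matrix.hpp>

typedef amgcl::static_matrix<double, 4, 4> mat_type; // 4x4 value blocks
typedef amgcl::static_matrix<double, 4, 1> vec_type; // block-valued vector entries

// Block-valued SpMV: ptr/col/val describe the CSR structure of the blocks,
// n is the number of block rows/columns.
void block_spmv(const std::vector<vex::backend::command_queue> &q, size_t n,
        const std::vector<int> &ptr, const std::vector<int> &col,
        const std::vector<mat_type> &val,
        const vex::vector<vec_type> &x, vex::vector<vec_type> &y)
{
    vex::sparse::matrix<mat_type, int, int> A(q, n, n, ptr, col, val);
    y = A * x; // block-valued SpMV, one kernel
}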

Thank you for your prompt reply, I will try to follow it. Do you have an estimate of the performance gain over the scalar version for SpMV?

It should be faster, but not dramatically so. Here is a quick experiment using amgcl to solve an elasticity problem that has a 4x4 block structure.

In the first case the matrix is treated as a scalar one, and pointwise aggregation coarsening (block_size=4) is employed:

./solver_vexcl_cl -B -A problems/block4/A.bin -f problems/block4/b.bin -p \
  solver.{type=lgmres,tol=1e-4} \
  precond.relax.type=damped_jacobi \
  precond.coarsening.type=aggregation \
  precond.coarsening.aggr.eps_strong=0 \
  precond.coarsening.aggr.block_size=4
1. Tesla K40c (NVIDIA CUDA)

Solver
======
Type:             LGMRES(30,3)
Unknowns:         746944
Memory footprint: 216.57 M

Preconditioner
==============
Number of levels:    4
Operator complexity: 1.28
Grid complexity:     1.26
Memory footprint:    890.87 M

level     unknowns       nonzeros      memory
---------------------------------------------
    0       746944       41921568    697.99 M (78.34%)
    1       173784       10569056    174.68 M (19.75%)
    2        19776         972896     16.37 M ( 1.82%)
    3         1712          51360      1.83 M ( 0.10%)

Iterations: 43
Error:      8.50538e-05

[Profile:       4.458 s] (100.00%)
[ self:         0.200 s] (  4.48%)
[  reading:     0.550 s] ( 12.33%)
[  setup:       2.289 s] ( 51.34%)
[  solve:       1.420 s] ( 31.85%)

In the second case the matrix is treated as block-valued with 4x4 blocks (-b4):

./solver_vexcl_cl -B -A problems/block4/A.bin -f problems/block4/b.bin -p \
  solver.{type=lgmres,tol=1e-4} \
  precond.relax.type=damped_jacobi \
  precond.coarsening.type=aggregation \
  precond.coarsening.aggr.eps_strong=0 \
  -b4
1. Tesla K40c (NVIDIA CUDA)

Solver
======
Type:             LGMRES(30,3)
Unknowns:         186736
Memory footprint: 216.57 M

Preconditioner
==============
Number of levels:    3
Operator complexity: 1.05
Grid complexity:     1.05
Memory footprint:    451.76 M

level     unknowns       nonzeros      memory
---------------------------------------------
    0       186736        2620098    431.07 M (95.63%)
    1         8469         116745     19.28 M ( 4.26%)
    2          397           2977      1.41 M ( 0.11%)

Iterations: 47
Error:      9.0758e-05

[Profile:       3.016 s] (100.00%)
[ self:         0.208 s] (  6.89%)
[  reading:     0.557 s] ( 18.45%)
[  setup:       1.072 s] ( 35.56%)
[  solve:       1.179 s] ( 39.10%)

As you can see, the solution requires a similar number of iterations in both cases (each iteration consists of a number of spmv operations), but the block-valued version is about 30% faster (the 'solve' line in the profile). The method is also constructed ('setup') much faster in the block-valued case, but that part runs on the CPU side.

Hi Denis, with the help of AMGCL I was able to add support for block matrices in my application, thanks! Now I'm trying to inline a sequence of operations, which I can do using scalar matrices but not with block matrices:

This sequence (scalar):

   vex::vector<double> tmp( DeviceManager::defaultQueueAsVector(), n, 0, vex::backend::MEM_READ_WRITE );
   vex::vector<double> u( DeviceManager::defaultQueueAsVector(), n, 0, vex::backend::MEM_READ_WRITE );
   tmp = Z * x;
   u = D * tmp;
   y = W * u;

is inlined as:

   vex::vector<double> tmp( DeviceManager::defaultQueueAsVector(), n, 0, vex::backend::MEM_READ_WRITE );
   tmp = (D * vex::make_inline(Z * x));
   y = W * tmp;

where Z and W are of type vex::SpMat<double, int, cl_long>, and x, y and D are vex::vector<double>; n is the number of rows of the scalar matrix. (I wish I could inline it even further, but I wasn't able to; still, the current form already delivers about a 60% speedup.)

However, with the block version, if I try to inline that sequence I get the error:

error: no matching function for call to ‘make_inline(std::enable_if<true, vex::sparse::matrix_vector_product<vex::sparse::ell<amgcl::static_matrix<double, 16, 16>, int, long int>, vex::vector<amgcl::static_matrix<double, 16, 1> > > >::type)’
                tmp = (D * vex::make_inline(Z * x));
                           ~~~~~~~~~~~~~~~~~^~~~~~

where:

Z and W are of type vex::sparse::ell<value_type_matrix, int, cl_long>
D is vex::vector<value_type_matrix>
x and y are vex::vector<value_type_vector>
tmp is declared as vex::vector<value_type_vector> tmp( DeviceManager::defaultQueueAsVector(), n, 0, vex::backend::MEM_READ_WRITE );
u is declared as vex::vector<value_type_vector> u( DeviceManager::defaultQueueAsVector(), n, 0, vex::backend::MEM_READ_WRITE );

value_type_vector = amgcl::static_matrix<double,BS,1>; // BS is the block size
value_type_matrix = amgcl::static_matrix<double,BS,BS>;

Also, in order to perform D * tmp with block matrices I have to do:

amgcl::backend::vmul( 1.0, D, tmp, 0.0, u );

Do you think it is possible to inline the block matrix version? If so, do you have any suggestion on how to accomplish that?

Thank you very much.

I think you should use vex::sparse::matrix instead of vex::SpMat to make everything inline (it does not require a make_inline call; operations with vex::sparse::matrix are inlined by default).
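For the scalar pipeline above, that would look roughly like the sketch below. The function and array names (apply, z_ptr, w_ptr, ...) and the constructor arguments (queues, dimensions, CSR arrays) are assumed for illustration.

#include <vector>
#include <vexcl/vexcl.hpp>
#include <vexcl/sparse/matrix.hpp>

// Each assignment below is generated as a single fused kernel; no
// vex::make_inline is needed with vex::sparse::matrix.
void apply(const std::vector<vex::backend::command_queue> &q, size_t n,
        const std::vector<int> &z_ptr, const std::vector<int> &z_col, const std::vector<double> &z_val,
        const std::vector<int> &w_ptr, const std::vector<int> &w_col, const std::vector<double> &w_val,
        const vex::vector<double> &D, const vex::vector<double> &x, vex::vector<double> &y)
{
    vex::sparse::matrix<double, int, int> Z(q, n, n, z_ptr, z_col, z_val);
    vex::sparse::matrix<double, int, int> W(q, n, n, w_ptr, w_col, w_val);

    vex::vector<double> tmp(q, n);

    tmp = D * (Z * x); // elementwise scaling of the inlined SpMV, one kernel
    y   = W * tmp;     // second kernel
}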

The block matrices that are implemented in amgcl with amgcl/backend/vexcl_static_matrix.hpp do support inlining. For example, the residual operation is completely inlined here:

https://github.com/ddemidov/amgcl/blob/d6afcf7236801fc1726639b4fbad9f66a36ea61b/amgcl/backend/vexcl.hpp#L313
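In the block-valued setting, that residual is essentially the following (a sketch with assumed typedefs, mirroring the linked code; the block types must already be registered with VexCL, which the header does):

#include <amgcl/backend/vexcl.hpp>
#include <amgcl/backend/vexcl_static_matrix.hpp>

typedef amgcl::static_matrix<double, 4, 4> mat_type;
typedef amgcl::static_matrix<double, 4, 1> vec_type;

// The whole right-hand side, including the block-valued SpMV, is
// generated and launched as a single kernel.
void residual(const vex::sparse::matrix<mat_type, int, int> &A,
        const vex::vector<vec_type> &f,
        const vex::vector<vec_type> &x,
        vex::vector<vec_type> &r)
{
    r = f - A * x;
}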

Hi Denis, I converted all the declarations to vex::sparse::matrix but I'm still getting an error with the block matrices. I have this:

std::unique_ptr<vex::sparse::matrix<value_type_matrix, int, int>> m_Z; // value_type_matrix = amgcl::static_matrix<double,3,3>
std::unique_ptr<vex::sparse::matrix<value_type_matrix, int, int>> m_W; // value_type_matrix = amgcl::static_matrix<double,3,3>
std::unique_ptr<vex::vector<value_type_matrix>> m_D; // Diagonal "matrix": one 3x3 block per row; value_type_vector = amgcl::static_matrix<double,3,1> (used below)

Then, I can compile the following:

 tmp = *m_D * (*m_Z * x); // tmp, x and y are vex::vector<value_type_vector>
 y = *m_W * tmp;

But when I run it I receive this error:

typedef struct { double data[3][3]; } amgcl_matrix_double_3x3;
typedef struct { double data[3][1]; } amgcl_matrix_double_3x1;

kernel void vexcl_vector_kernel
(
  ulong n,
  global amgcl_matrix_double_3x1 * prm_1,
  global amgcl_matrix_double_3x3 * prm_2,
  ulong prm_3_ell_width,
  ulong prm_3_ell_pitch,
  global int * prm_3_ell_col,
  global double * prm_3_ell_val,
  global int * prm_3_csr_ptr,
  global int * prm_3_csr_col,
  global double * prm_3_csr_val,
  global amgcl_matrix_double_3x1 * prm_3_x_1
)
{
  for(ulong idx = get_global_id(0); idx < n; idx += get_global_size(0))
  {
    amgcl_matrix_double_3x1 prm_3_sum;
    prm_3_sum.data[0][0] = 0;
    prm_3_sum.data[1][0] = 0;
    prm_3_sum.data[2][0] = 0;
    {
      ulong prm_3_ell_size = prm_3_ell_width * prm_3_ell_pitch;
      for(size_t j = 0; j < prm_3_ell_width; ++j)
      {
        int nnz_idx = idx + j * prm_3_ell_pitch;
        int c = prm_3_ell_col[nnz_idx];
        if (c != (int)(-1))
        {
          int idx = c;
          {
            amgcl_matrix_double_3x1 v = prm_3_x_1[idx];
            prm_3_sum.data[0][0] += prm_3_ell_val[0 * prm_3_ell_size + nnz_idx] * v.data[0][0] + prm_3_ell_val[1 * prm_3_ell_size + nnz_idx] * v.data[1][0] + prm_3_ell_val[2 * prm_3_ell_size + nnz_idx] * v.data[2][0];                    
            prm_3_sum.data[1][0] += prm_3_ell_val[3 * prm_3_ell_size + nnz_idx] * v.data[0][0] + prm_3_ell_val[4 * prm_3_ell_size + nnz_idx] * v.data[1][0] + prm_3_ell_val[5 * prm_3_ell_size + nnz_idx] * v.data[2][0];                    
            prm_3_sum.data[2][0] += prm_3_ell_val[6 * prm_3_ell_size + nnz_idx] * v.data[0][0] + prm_3_ell_val[7 * prm_3_ell_size + nnz_idx] * v.data[1][0] + prm_3_ell_val[8 * prm_3_ell_size + nnz_idx] * v.data[2][0];                    
          }
        } else break;
      }
      if (prm_3_csr_ptr)
      {
        ulong prm_3_csr_size = prm_3_csr_ptr[n];
        int csr_beg = prm_3_csr_ptr[idx];
        int csr_end = prm_3_csr_ptr[idx+1];
        for(int j = csr_beg; j < csr_end; ++j)
        {
          int idx = prm_3_csr_col[j];
          {
            amgcl_matrix_double_3x1 v = prm_3_x_1[idx];
            prm_3_sum.data[0][0] += prm_3_csr_val[0 * prm_3_csr_size + j] * v.data[0][0] + prm_3_csr_val[1 * prm_3_csr_size + j] * v.data[1][0] + prm_3_csr_val[2 * prm_3_csr_size + j] * v.data[2][0];                                      
            prm_3_sum.data[1][0] += prm_3_csr_val[3 * prm_3_csr_size + j] * v.data[0][0] + prm_3_csr_val[4 * prm_3_csr_size + j] * v.data[1][0] + prm_3_csr_val[5 * prm_3_csr_size + j] * v.data[2][0];                                      
            prm_3_sum.data[2][0] += prm_3_csr_val[6 * prm_3_csr_size + j] * v.data[0][0] + prm_3_csr_val[7 * prm_3_csr_size + j] * v.data[1][0] + prm_3_csr_val[8 * prm_3_csr_size + j] * v.data[2][0];                                      
          }
        }
      }
    }
    prm_1[idx] = ( prm_2[idx] * prm_3_sum );
  }
}
<kernel>:94:31: error: invalid operands to binary expression ('amgcl_matrix_double_3x3' and 'amgcl_matrix_double_3x1')
    prm_1[idx] = ( prm_2[idx] * prm_3_sum );
                   ~~~~~~~~~~ ^ ~~~~~~~~~

No problem running that with the scalar version, though.

The kernel looks like it is almost working for you. The problem is that OpenCL does not know how to multiply an amgcl_matrix_double_3x3 by an amgcl_matrix_double_3x1. Since OpenCL is C-based, you cannot overload arithmetic operators, so you will need to provide a custom function that does the multiplication. The resulting expression should look like

 tmp = mul(*m_D, (*m_Z * x));

An example of such a function, implemented in a generic way, can be found here:

https://github.com/ddemidov/amgcl/blob/461a66ce6d197a3816218bf94ffd114a367c1ef1/amgcl/backend/vexcl_static_matrix.hpp#L733-L765

This is used in

https://github.com/ddemidov/amgcl/blob/461a66ce6d197a3816218bf94ffd114a367c1ef1/amgcl/backend/vexcl_static_matrix.hpp#L851-L869

If you only need the 3x3 case, the function definition may be as simple as

VEX_FUNCTION(value_type_vector, mul, (value_type_matrix, a)(value_type_vector, x),
    // The function body is OpenCL C, so the struct members are accessed
    // through .data, as in the generated kernel above.
    amgcl_matrix_double_3x1 r;
    r.data[0][0] = a.data[0][0] * x.data[0][0] + a.data[0][1] * x.data[1][0] + a.data[0][2] * x.data[2][0];
    r.data[1][0] = a.data[1][0] * x.data[0][0] + a.data[1][1] * x.data[1][0] + a.data[1][2] * x.data[2][0];
    r.data[2][0] = a.data[2][0] * x.data[0][0] + a.data[2][1] * x.data[1][0] + a.data[2][2] * x.data[2][0];
    return r;
);

Hi Denis, thank you for your answer. Following the proposed approach, would the following work as well (gets rid of tmp)?

   y = *m_W * mul(*m_D,  (*m_Z * x));

And, if so, would it be more efficient than doing the calculation in two steps:

   tmp = mul(*m_D, (*m_Z * x));
   y = *m_W * tmp;

?

Thanks again.

m_W is a sparse matrix, right? This should work, but I think it would be less efficient because it would involve a lot of duplicated and badly cached memory reads. But it never hurts to check ;).

Yes, it is a sparse block matrix. I was hoping VexCL would magically fuse them into a single kernel... ;-) I'll check which is best. Thank you.

VexCL will fuse them into a single kernel, but I am afraid the kernel will be quite inefficient.
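Roughly speaking, with no stored intermediate the inner product presumably has to be re-evaluated for every nonzero of W that references it, which is where the duplicated reads come from. A scalar, CPU-side analogue (illustration only, not the actual generated code):

#include <vector>

void fused_one_liner(
        const std::vector<int> &W_ptr, const std::vector<int> &W_col, const std::vector<double> &W_val,
        const std::vector<int> &Z_ptr, const std::vector<int> &Z_col, const std::vector<double> &Z_val,
        const std::vector<double> &D, const std::vector<double> &x, std::vector<double> &y)
{
    const int n = static_cast<int>(y.size());
    for (int i = 0; i < n; ++i) {
        double sum = 0;
        for (int k = W_ptr[i]; k < W_ptr[i + 1]; ++k) {
            int j = W_col[k];
            // The inner (D * (Z * x))[j] is recomputed for each nonzero of W:
            double zx = 0;
            for (int l = Z_ptr[j]; l < Z_ptr[j + 1]; ++l)
                zx += Z_val[l] * x[Z_col[l]];
            sum += W_val[k] * (D[j] * zx);
        }
        y[i] = sum;
    }
}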

Hi Denis, I ended up using vex_mul directly instead of defining a custom VexCL function:

tmp = amgcl::backend::vex_mul<double, double, BS>().apply( *m_D, ( *m_Z * x ) );
y = *m_W * tmp;

which leads to a 1.2x speedup over doing three separate operations (tmp = *m_Z * x; u = *m_D * tmp; y = *m_W * u). Also, your prediction was correct: the following "one-liner" works but is more than 10 times slower than the "two-line" version:

y = *m_W * amgcl::backend::vex_mul<double, double, BS>().apply( *m_D, ( *m_Z * x ) );

Thank you.