ddemidov/amgcl

Performance with AMGCL

kai-lan opened this issue · 1 comments

I would like to tune AMGCL for better performance. For context, I am trying to solve a large symmetric Poisson system with 3189488 unknowns: https://drive.google.com/drive/folders/1_igHqFW3HDUReTsp7Ve6bba0zuZbpbSl?usp=sharing.

I used the same code as the Poisson3D tutorial, but changed the solver from bicgstab to cg. Here are the benchmark results:

  • for the built-in backend:
Solver
======
Type:             CG
Unknowns:         3189488
Memory footprint: 97.34 M

Preconditioner
==============
Number of levels:    4
Operator complexity: 1.61
Grid complexity:     1.12
Memory footprint:    1.09 G

level     unknowns       nonzeros      memory
---------------------------------------------
    0      3189488       22125328    835.16 M (62.13%)
    1       372675       11877413    248.19 M (33.35%)
    2        22244        1545008     27.62 M ( 4.34%)
    3         1032          66324      2.89 M ( 0.19%)

Iters: 9
Error: 7.57968e-05

[poisson3Db:     7.403 s] (100.00%)
[  read:         6.383 s] ( 86.22%)
[  setup:        0.749 s] ( 10.12%)
[  solve:        0.264 s] (  3.56%)
  • for the CUDA backend:
NVIDIA RTX 6000 Ada Generation
Matrix ../../test_data/A.mtx: 3189488x3189488
RHS ../../test_data/b.mtx: 3189488x1
Solver
======
Type:             CG
Unknowns:         3189488
Memory footprint: 97.34 M

Preconditioner
==============
Number of levels:    4
Operator complexity: 1.61
Grid complexity:     1.12
Memory footprint:    849.05 M

level     unknowns       nonzeros      memory
---------------------------------------------
    0      3189488       22125328    637.82 M (62.13%)
    1       372675       11877413    187.52 M (33.35%)
    2        22244        1545008     20.80 M ( 4.34%)
    3         1032          66324      2.90 M ( 0.19%)

Iters: 9
Error: 7.57968e-05

[poisson3Db:     7.428 s] (100.00%)
[ self:          0.062 s] (  0.83%)
[  read:         6.308 s] ( 84.93%)
[  setup:        1.024 s] ( 13.78%)
[  solve:        0.034 s] (  0.46%)
  1. Considering only the setup and solve times, why is the CUDA version slower than the built-in one?
  2. Either way, it takes about 1 s to set up and solve this system, whereas the CUDA CG from CuPy (https://cupy.dev/) takes only half a second. Is there a way to tune AMGCL to close that gap?

The setup step in amgcl is always performed on the CPU. When a GPU is used, there is an additional overhead of moving the constructed hierarchy to the GPU memory. So the setup on a GPU always takes more time than a setup on the CPU.
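
As a rough illustration using the timings posted above, the sketch below adds up setup and solve time for both backends. It assumes (this is not stated in the issue) that the hierarchy is built once, then reused for several right-hand sides, and that each solve costs the same as the one measured. The names `Timing`, `total`, and `break_even` are hypothetical helpers, not part of AMGCL.

```cpp
#include <cassert>
#include <cmath>

// Setup and per-solve times in seconds, copied from the profiles in this issue.
struct Timing {
    double setup;
    double solve;
};

const Timing builtin_run = {0.749, 0.264};
const Timing cuda_run    = {1.024, 0.034};

// Total cost of one setup followed by nsolves solves.
// Assumption: the hierarchy is reused and every solve costs the same.
double total(const Timing& t, int nsolves) {
    return t.setup + nsolves * t.solve;
}

// Smallest number of solves (up to max_solves) for which b is cheaper
// than a, or -1 if there is none in that range.
int break_even(const Timing& a, const Timing& b, int max_solves) {
    for (int k = 1; k <= max_solves; ++k) {
        if (total(b, k) < total(a, k)) return k;
    }
    return -1;
}
```

With these numbers a single solve costs about 1.013 s on the built-in backend and 1.058 s with CUDA, so the extra 0.275 s of setup outweighs the 0.230 s saved in the solve. Under the reuse assumption, `break_even(builtin_run, cuda_run, 100)` shows the CUDA backend coming out ahead from the second solve onward.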

You could try to solve the system using a simple single-level preconditioner (for example, CG+SPAI0, using amgcl::relaxation::as_preconditioner<Backend, Relaxation>). The solution step should be more expensive than with AMG, but the setup would be much cheaper, so you could win overall.
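
To make the idea concrete, here is a self-contained sketch of CG with a SPAI-0 preconditioner, written from scratch rather than with AMGCL, so it only illustrates why the setup is cheap: SPAI-0 is a diagonal matrix with entries m_i = a_ii / sum_j a_ij^2, computed in one pass over the matrix. The `Csr` type, the helper functions, and the 1D Poisson test matrix are all assumptions made for the example. In AMGCL itself the corresponding solver type would be roughly `amgcl::make_solver<amgcl::relaxation::as_preconditioner<Backend, amgcl::relaxation::spai0>, amgcl::solver::cg<Backend>>`.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Minimal CSR matrix (hypothetical helper type, not an AMGCL class).
struct Csr {
    std::size_t n;
    std::vector<std::size_t> ptr, col;
    std::vector<double> val;
};

// 1D Poisson stencil [-1, 2, -1]; only a small stand-in test problem.
Csr poisson1d(std::size_t n) {
    Csr A;
    A.n = n;
    A.ptr.push_back(0);
    for (std::size_t i = 0; i < n; ++i) {
        if (i > 0)     { A.col.push_back(i - 1); A.val.push_back(-1.0); }
        A.col.push_back(i); A.val.push_back(2.0);
        if (i + 1 < n) { A.col.push_back(i + 1); A.val.push_back(-1.0); }
        A.ptr.push_back(A.col.size());
    }
    return A;
}

// SPAI-0 setup: m[i] = a_ii / (sum of squares of row i). One pass, no hierarchy.
std::vector<double> spai0(const Csr& A) {
    std::vector<double> m(A.n, 0.0);
    for (std::size_t i = 0; i < A.n; ++i) {
        double diag = 0.0, sumsq = 0.0;
        for (std::size_t j = A.ptr[i]; j < A.ptr[i + 1]; ++j) {
            if (A.col[j] == i) diag = A.val[j];
            sumsq += A.val[j] * A.val[j];
        }
        m[i] = diag / sumsq;
    }
    return m;
}

// y = A * x
void spmv(const Csr& A, const std::vector<double>& x, std::vector<double>& y) {
    for (std::size_t i = 0; i < A.n; ++i) {
        double s = 0.0;
        for (std::size_t j = A.ptr[i]; j < A.ptr[i + 1]; ++j)
            s += A.val[j] * x[A.col[j]];
        y[i] = s;
    }
}

double dot(const std::vector<double>& a, const std::vector<double>& b) {
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

// Preconditioned CG with diagonal preconditioner m; x holds the initial guess.
// Returns the number of iterations performed (maxiter if not converged).
int pcg(const Csr& A, const std::vector<double>& m, const std::vector<double>& b,
        std::vector<double>& x, double tol, int maxiter) {
    const std::size_t n = A.n;
    std::vector<double> r(n), z(n), p(n), q(n);
    spmv(A, x, q);
    for (std::size_t i = 0; i < n; ++i) { r[i] = b[i] - q[i]; z[i] = m[i] * r[i]; }
    p = z;
    double rz = dot(r, z);
    double bnorm = std::sqrt(dot(b, b));
    if (bnorm == 0.0) bnorm = 1.0;
    for (int it = 0; it < maxiter; ++it) {
        if (std::sqrt(dot(r, r)) <= tol * bnorm) return it;
        spmv(A, p, q);
        const double alpha = rz / dot(p, q);
        for (std::size_t i = 0; i < n; ++i) {
            x[i] += alpha * p[i];
            r[i] -= alpha * q[i];
            z[i] = m[i] * r[i];
        }
        const double rz_new = dot(r, z);
        const double beta = rz_new / rz;
        for (std::size_t i = 0; i < n; ++i) p[i] = z[i] + beta * p[i];
        rz = rz_new;
    }
    return maxiter;
}
```

A typical call would build `m = spai0(A)` once and then run `pcg(A, m, b, x, 1e-8, 1000)` with `x` initialized to zero. Whether the cheaper setup outweighs the larger iteration count on the 3189488-unknown system from this issue is something only a measurement can tell.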