Performance with AMGCL
kai-lan opened this issue · 1 comments
I would like to tune AMGCL for better performance. For context, I am trying to solve a large symmetric Poisson system with 3189488 unknowns: https://drive.google.com/drive/folders/1_igHqFW3HDUReTsp7Ve6bba0zuZbpbSl?usp=sharing.
I used the same code as the Poisson3D tutorial, but I changed the solver from bicgstab to cg. The benchmark results are below:
- for the built-in backend:
Solver
======
Type: CG
Unknowns: 3189488
Memory footprint: 97.34 M
Preconditioner
==============
Number of levels: 4
Operator complexity: 1.61
Grid complexity: 1.12
Memory footprint: 1.09 G
level     unknowns       nonzeros      memory
---------------------------------------------
    0      3189488       22125328    835.16 M (62.13%)
    1       372675       11877413    248.19 M (33.35%)
    2        22244        1545008     27.62 M ( 4.34%)
    3         1032          66324      2.89 M ( 0.19%)
Iters: 9
Error: 7.57968e-05
[poisson3Db: 7.403 s] (100.00%)
[ read: 6.383 s] ( 86.22%)
[ setup: 0.749 s] ( 10.12%)
[ solve: 0.264 s] ( 3.56%)
- for the CUDA backend:
NVIDIA RTX 6000 Ada Generation
Matrix ../../test_data/A.mtx: 3189488x3189488
RHS ../../test_data/b.mtx: 3189488x1
Solver
======
Type: CG
Unknowns: 3189488
Memory footprint: 97.34 M
Preconditioner
==============
Number of levels: 4
Operator complexity: 1.61
Grid complexity: 1.12
Memory footprint: 849.05 M
level     unknowns       nonzeros      memory
---------------------------------------------
    0      3189488       22125328    637.82 M (62.13%)
    1       372675       11877413    187.52 M (33.35%)
    2        22244        1545008     20.80 M ( 4.34%)
    3         1032          66324      2.90 M ( 0.19%)
Iters: 9
Error: 7.57968e-05
[poisson3Db: 7.428 s] (100.00%)
[ self: 0.062 s] ( 0.83%)
[ read: 6.308 s] ( 84.93%)
[ setup: 1.024 s] ( 13.78%)
[ solve: 0.034 s] ( 0.46%)
- Considering only setup and solve time, why is the CUDA version slower than the built-in one?
- Either way, setup plus solve takes about 1 s for this system, whereas CG on the GPU with CuPy (https://cupy.dev/) takes only half a second. Can AMGCL be tuned to close this gap?
The setup step in AMGCL is always performed on the CPU. When a GPU is used, there is the additional overhead of moving the constructed hierarchy to GPU memory, so setup for a GPU backend always takes longer than setup for the CPU backend.
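To make the trade-off concrete, here is a small sketch that adds up the setup and solve times reported in the two profiles above. The names `Timing`, `total`, `break_even`, `builtin_run` and `cuda_run` are made up for this illustration. The break-even estimate assumes the same hierarchy is reused for several right-hand sides, which the original post does not say is the case.

```cpp
#include <cassert>
#include <cmath>

// Timings in seconds, copied from the two profiles in this thread.
struct Timing {
    double setup;  // one-time hierarchy construction (plus GPU transfer for CUDA)
    double solve;  // one preconditioned CG solve
};

const Timing builtin_run{0.749, 0.264};  // built-in backend
const Timing cuda_run{1.024, 0.034};     // CUDA backend

// Total cost of one setup followed by nsolves solves with the same matrix.
// Assumption: the setup is done once and reused for every solve.
double total(const Timing& t, int nsolves) {
    return t.setup + nsolves * t.solve;
}

// Smallest number of solves for which `gpu` is strictly cheaper than `cpu`.
// Returns -1 if the extra setup cost can never be recovered.
int break_even(const Timing& cpu, const Timing& gpu) {
    if (total(gpu, 1) < total(cpu, 1)) return 1;
    if (gpu.solve >= cpu.solve) return -1;
    // Solve gpu.setup + k * gpu.solve < cpu.setup + k * cpu.solve for k.
    const double k = (gpu.setup - cpu.setup) / (cpu.solve - gpu.solve);
    return static_cast<int>(std::floor(k)) + 1;
}
```

With these numbers, a single setup plus solve costs 1.013 s on the built-in backend and 1.058 s on CUDA, so the 0.275 s of extra setup outweighs the faster solve. Under the reuse assumption, `break_even(builtin_run, cuda_run)` gives 2, meaning the CUDA backend would come out ahead from the second solve onward.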
You could try to solve the system using a simple single-level preconditioner (for example, CG+SPAI0, using amgcl::relaxation::as_preconditioner<Backend, Relaxation>). The solution step should be more expensive than with AMG, but the setup would be much cheaper, so you could win overall.
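For illustration, here is a self-contained sketch of what CG with a SPAI-0 preconditioner does. It is not AMGCL code: the `Csr` struct and the functions `poisson1d`, `spmv`, `spai0`, `dot` and `pcg` are hypothetical helpers written for this example, and a small 1D Poisson matrix stands in for the real system. SPAI-0 is taken here as the diagonal matrix M that minimizes the Frobenius norm of I - MA, which gives m_i = a_ii / sum_j a_ij^2.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Minimal CSR matrix. Hypothetical helper type for this sketch, not an AMGCL type.
struct Csr {
    std::size_t n = 0;
    std::vector<std::size_t> ptr{0};
    std::vector<std::size_t> col;
    std::vector<double> val;
};

// 1D Poisson matrix tridiag(-1, 2, -1), a small stand-in for the real system.
Csr poisson1d(std::size_t n) {
    Csr A;
    A.n = n;
    for (std::size_t i = 0; i < n; ++i) {
        if (i > 0)     { A.col.push_back(i - 1); A.val.push_back(-1.0); }
        A.col.push_back(i); A.val.push_back(2.0);
        if (i + 1 < n) { A.col.push_back(i + 1); A.val.push_back(-1.0); }
        A.ptr.push_back(A.col.size());
    }
    return A;
}

// y = A * x
void spmv(const Csr& A, const std::vector<double>& x, std::vector<double>& y) {
    for (std::size_t i = 0; i < A.n; ++i) {
        double s = 0.0;
        for (std::size_t k = A.ptr[i]; k < A.ptr[i + 1]; ++k)
            s += A.val[k] * x[A.col[k]];
        y[i] = s;
    }
}

// SPAI-0 setup: one pass over the matrix, m_i = a_ii / sum_j a_ij^2.
std::vector<double> spai0(const Csr& A) {
    std::vector<double> m(A.n);
    for (std::size_t i = 0; i < A.n; ++i) {
        double diag = 0.0, norm2 = 0.0;
        for (std::size_t k = A.ptr[i]; k < A.ptr[i + 1]; ++k) {
            norm2 += A.val[k] * A.val[k];
            if (A.col[k] == i) diag = A.val[k];
        }
        assert(norm2 > 0.0);  // every row must have at least one nonzero
        m[i] = diag / norm2;
    }
    return m;
}

double dot(const std::vector<double>& a, const std::vector<double>& b) {
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

// Preconditioned CG with diagonal preconditioner m, starting from x = 0.
// Stops when ||r|| <= tol * ||b||; returns the number of iterations performed.
std::size_t pcg(const Csr& A, const std::vector<double>& m,
                const std::vector<double>& b, std::vector<double>& x,
                double tol, std::size_t maxiter) {
    const std::size_t n = A.n;
    assert(b.size() == n && m.size() == n);
    x.assign(n, 0.0);
    std::vector<double> r(b), z(n), p(n), q(n);
    const double bnorm = std::sqrt(dot(b, b));
    if (bnorm == 0.0) return 0;  // zero right-hand side: x = 0 is the solution
    for (std::size_t i = 0; i < n; ++i) z[i] = m[i] * r[i];
    p = z;
    double rz = dot(r, z);
    for (std::size_t it = 1; it <= maxiter; ++it) {
        spmv(A, p, q);
        const double alpha = rz / dot(p, q);
        for (std::size_t i = 0; i < n; ++i) { x[i] += alpha * p[i]; r[i] -= alpha * q[i]; }
        if (std::sqrt(dot(r, r)) <= tol * bnorm) return it;
        for (std::size_t i = 0; i < n; ++i) z[i] = m[i] * r[i];
        const double rz_new = dot(r, z);
        const double beta = rz_new / rz;
        for (std::size_t i = 0; i < n; ++i) p[i] = z[i] + beta * p[i];
        rz = rz_new;
    }
    return maxiter;
}
```

The point of the sketch is that the setup (`spai0`) is a single sweep over the matrix rows with no coarse levels to build or copy to the GPU, which is why it is so much cheaper than an AMG setup. In AMGCL itself, the equivalent would presumably be to pass `amgcl::relaxation::as_preconditioner<Backend, amgcl::relaxation::spai0>` as the preconditioner argument of `amgcl::make_solver`, in place of `amgcl::amg<...>`, and keep `amgcl::solver::cg<Backend>` as the iterative solver. Whether this is faster overall for the matrix in question would have to be measured.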