ddemidov/amgcl

A consultation about ILU0

PM-Hu opened this issue · 27 comments

PM-Hu commented

Hi Denis

I have tested an elasticity problem using the following solver:

// namely solver1
typedef amgcl::make_solver<
        amgcl::amg<
            PBackend,
            amgcl::coarsening::smoothed_aggregation,
            amgcl::relaxation::ilu0
            >,
        amgcl::solver::cg<SBackend>
        > Solver;
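
For reference, I construct and apply it roughly like this (a sketch; A is my assembled stiffness matrix in CRS form, e.g. std::tie(n, ptr, col, val), rhs and x are std::vector<double>, and the names are just placeholders):

Solver::params prm;
prm.solver.tol = 1e-6;                     // stop at a relative residual of 1e-6

Solver solve(A, prm);                      // setup: builds the AMG hierarchy
size_t iters;
double error;
std::tie(iters, error) = solve(A, rhs, x); // solve K x = F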

solver1 is well suited to my problem; it converges as follows:

AMGIters: 10
AMGError: 4.73559e-07
[  AMGsetup:         0.285 s]
[  AMGsolve:         4.578 s]

Its convergence rate is much better than that of PCG (preconditioned with the default ILU0) in MATLAB:

% namely solver2
[L,U] = ilu(K);
   x  = pcg(K,F,1e-6,1000,L,U);

solver2 converges after 91 iterations, but it only costs 0.56 s (wall time).

It seems solver2 is more efficient than solver1.

Then I chose ILU(0) as the preconditioner for CG in AMGCL using this code:

// namely solver3
typedef amgcl::make_solver<
        amgcl::relaxation::as_preconditioner<
        PBackend,
        amgcl::relaxation::ilu0
        >,
        amgcl::solver::cg<SBackend>
        > PCG_Solver; 

solver3 converges after:

PCGIters: 104
PCGError: 8.77976e-09
[  PCGsetup:         0.055 s]
[  PCGsolve:        23.976 s]

solver3 spends more time than solver2.
I also tested plain (unpreconditioned) CG in AMGCL; it converges after 924 iterations:

CGIters: 924
CGError: 9.21561e-07
[  CGsetup:          0.005 s]
[  CGsolve:          1.373 s]

This is close to CG in MATLAB, which converges after 922 iterations and costs 1.75 s. CG preconditioned with saAMG and spai0 costs 0.397 s and converges after 71 iterations.


  • Am I right about the configuration and the findings?
  • Could ILU(0) be implemented in a more efficient way, and would it be worth doing?

--- Thanks, have a nice day

How do you define SBackend, PBackend, and do you provide near-null-space components to the amgcl constructor?

Also, did you enable full optimization when you compiled amgcl solvers?

PM-Hu commented

The Backend is defined by:

typedef amgcl::backend::builtin<double> SBackend; // the solver backend
typedef amgcl::backend::builtin<double> PBackend; // the preconditioner backend

I have provided the near-null-space components, computed from the node coordinates.
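
Roughly, I compute the rigid body modes from the node coordinates and pass them through the coarsening parameters, something like this (a sketch from memory; coo is my flat coordinate array {x0, y0, x1, y1, ...}, and I am assuming the amgcl::coarsening::rigid_body_modes() helper from <amgcl/coarsening/rigid_body_modes.hpp>):

// 2D problem: 3 rigid body modes (2 translations + 1 rotation)
Solver::params prm;
prm.precond.coarsening.nullspace.cols = amgcl::coarsening::rigid_body_modes(
        2, coo, prm.precond.coarsening.nullspace.B);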

-O3
-fopenmp

are used.

PM-Hu commented

Here is my [.mtx] file and code
matrix.zip

Your parameters look ok to me, but I get much lower timings on my (not quite new) machine from your code (my CPU is an Intel(R) Core(TM) i5-3570K CPU @ 3.40GHz):

[Viscoelastic2D:     4.205 s] (100.00%)
[  AMGsetup:         0.253 s] (  6.01%)
[  AMGsolve:         0.118 s] (  2.82%)
[  CGsetup:          0.004 s] (  0.09%)
[  CGsolve:          1.369 s] ( 32.55%)
[  PCGsetup:         0.062 s] (  1.48%)
[  PCGsolve:         0.351 s] (  8.35%)
[  RSsetup:          0.225 s] (  5.35%)
[  RSsolve:          0.714 s] ( 16.98%)
[  read:             1.108 s] ( 26.34%)

I've made a couple of fixes to your code: https://gist.github.com/ddemidov/76291758a043252a23eec44b211560c5/revisions

First, you don't have to specify the aggregation block size when you are providing nullspace vectors, and second, there seems to be a typo in the parameter definition for the PCG solver. With the fixes, the results are:

[Viscoelastic2D:     4.099 s] (100.00%)
[  AMGsetup:         0.170 s] (  4.16%)
[  AMGsolve:         0.117 s] (  2.86%)
[  CGsetup:          0.004 s] (  0.09%)
[  CGsolve:          1.375 s] ( 33.54%)
[  PCGsetup:         0.062 s] (  1.50%)
[  PCGsolve:         0.307 s] (  7.49%)
[  RSsetup:          0.224 s] (  5.46%)
[  RSsolve:          0.714 s] ( 17.41%)
[  read:             1.126 s] ( 27.47%)

Note that the solution for the AMG solver is about 3 times faster than the solution for the PCG solver, and the PCG solver is more than 4 times faster than the unpreconditioned CG solver. This does not agree with the relative performance of the solvers you showed above, so I would check the compilation flags again. I have included my Makefile with your code on the link above.

Using plain aggregation instead of smoothed aggregation is slightly more expensive in terms of solution time, but takes less time to set up, so it is overall slightly faster:

[Viscoelastic2D:     4.052 s] (100.00%)
[  AMGsetup:         0.099 s] (  2.45%)
[  AMGsolve:         0.128 s] (  3.15%)
[  CGsetup:          0.004 s] (  0.09%)
[  CGsolve:          1.372 s] ( 33.85%)
[  PCGsetup:         0.061 s] (  1.50%)
[  PCGsolve:         0.301 s] (  7.43%)
[  RSsetup:          0.229 s] (  5.66%)
[  RSsolve:          0.744 s] ( 18.36%)
[  read:             1.112 s] ( 27.45%)
PM-Hu commented

Hi
I just deleted the -fopenmp option; then the results are:

[Viscoelastic2D:    24.116 s] (100.00%)
[ self:              0.070 s] (  0.29%)
[  AMGsetup:         0.338 s] (  1.40%)
[  AMGsolve:         0.132 s] (  0.55%)
[  CGsetup:          0.006 s] (  0.02%)
[  CGsolve:          2.009 s] (  8.33%)
[  PCGsetup:         0.030 s] (  0.12%)
[  PCGsolve:         0.513 s] (  2.13%)
[  RSsetup:          0.261 s] (  1.08%)
[  RSsolve:          0.877 s] (  3.64%)
[  read:            19.880 s] ( 82.43%)

I'm not sure if OpenMP is correctly installed; I will check your revised code carefully.

This could mean you have some other process on your machine that occupies one or more CPU cores. This makes the cores oversubscribed when you also run the OpenMP amgcl test, and performance suffers.

Unrelated to the openmp issue: another way to speed up the computation is to use the block-valued backend (amgcl::backend::builtin<amgcl::static_matrix<double, 2, 2>>) without providing the nullspace components. Unfortunately, using the nullspace vectors here is not an option: in 2D there are 3 nullspace components, which makes the matrices at the lower levels of the AMG hierarchy have a 3x3 block structure, incompatible with the 2x2 block structure of the original matrix. The block-valued backend considerably speeds up both the setup and the solution times (a code sketch follows the solver output below):

solver -A MatCoord.mtx -f RHS.mtx solver.type=cg solver.tol=1e-6 precond.relax.type=ilu0 precond.coarsening.aggr.eps_strong=0 -b2
Solver
======
Type:             CG
Unknowns:         16900
Memory footprint: 1.03 M

Preconditioner
==============
Number of levels:    2
Operator complexity: 1.04
Grid complexity:     1.04
Memory footprint:    40.06 M

level     unknowns       nonzeros      memory
---------------------------------------------
    0        16900         414736     37.19 M (96.43%)
    1          676          15376      2.87 M ( 3.57%)

Iterations: 14
Error:      8.3769e-07

[Profile:       1.264 s] (100.00%)
[  reading:     1.129 s] ( 89.31%)
[  setup:       0.042 s] (  3.36%)
[  solve:       0.092 s] (  7.29%)
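
In code, the block-valued setup used for this run corresponds roughly to the following sketch (a sketch only, not the exact code of the solver example; A is the scalar CRS matrix, and the parameter values mirror the command line options above):

#include <amgcl/backend/builtin.hpp>
#include <amgcl/value_type/static_matrix.hpp>
#include <amgcl/adapter/block_matrix.hpp>
#include <amgcl/make_solver.hpp>
#include <amgcl/amg.hpp>
#include <amgcl/coarsening/smoothed_aggregation.hpp>
#include <amgcl/relaxation/ilu0.hpp>
#include <amgcl/solver/cg.hpp>

typedef amgcl::static_matrix<double, 2, 2> dmat_type; // 2x2 block value type
typedef amgcl::backend::builtin<dmat_type> BBackend;  // block-valued backend

typedef amgcl::make_solver<
        amgcl::amg<
            BBackend,
            amgcl::coarsening::smoothed_aggregation,
            amgcl::relaxation::ilu0
            >,
        amgcl::solver::cg<BBackend>
        > BlockSolver;

BlockSolver::params prm;
prm.solver.tol = 1e-6;
prm.precond.coarsening.aggr.eps_strong = 0;

// View the scalar CRS matrix as a matrix of 2x2 blocks:
auto Ab = amgcl::adapter::block_matrix<dmat_type>(A);
BlockSolver solve(Ab, prm);
// The rhs and solution vectors also have to be reinterpreted as
// block-valued before calling solve(rhs, x).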
PM-Hu commented

WOW
Does that mean prm.precond.coarsening.aggr.block_size = 2; is incompatible with the lower levels, which have a 3x3 block structure?

Indeed, I closed the other windows, but it has not improved the solution time.

Does that mean prm.precond.coarsening.aggr.block_size = 2; is incompatible with the lower levels, which have a 3x3 block structure?

Yes, that is part of the reason you should not set the block_size with nullspace components.

Indeed, I closed the other windows, but it has not improved the solution time.

Can you try my Makefile? This should work:

AMGCL_ROOT=/path/to/amgcl make
PM-Hu commented

prm.precond.coarsening.aggr.block_size = 2

The paper [Algebraic Multigrid by Smoothed Aggregation for Second and Fourth Order Elliptic Problems, P. Vaněk, 1996] mentions that:

When there is more than one degree of freedom per node, the coarsening algorithm as described in Section 5.1 is likely to produce aggregates of physically incompatible degrees of freedom, causing deterioration of convergence. This happens, e.g., for elements with derivatives as degrees of freedom, and for systems such as elasticity. The obvious remedy is to treat all degrees of freedom associated with a node as a block and replace the scalar operations on degrees of freedom by their block counterparts for nodes.

I will reread that.

Can you try my Makefile?

I have some trouble with CMake, so I left it aside. I will try this as soon as I can.

Thanks for your help, I have learned a lot about AMG and AMGCL.

I have some trouble with CMake, so I left it aside. I will try this as soon as I can.

The Makefile above does not use CMake; it is simply equivalent to:

g++ -o nullspace nullspace.cpp -I/path/to/amgcl -O3 -fopenmp -DNDEBUG
PM-Hu commented

I have tried mingw32-make Makefile

mingw32-make: *** No rule to make target 'Makefile'.  Stop.

Makefile is:

nullspace: nullspace.cpp
	g++ -o $@ $^ -I$(F:\AMGCL\amgcl-master) -fopenmp -O3 -DNDEBUG

I'm not familiar with it. I will post my results as soon as I can get it working.

Try just running mingw32-make. It does not need an argument, and will use the Makefile in the current folder.

PM-Hu commented

Hi
I have tried mingw32-make with the code, and it gives:

[Viscoelastic2D:    83.957 s] (100.00%)
[  AMGsetup:         0.169 s] (  0.20%)
[  AMGsolve:         5.025 s] (  5.99%)
[  CGsetup:          0.004 s] (  0.00%)
[  CGsolve:          1.292 s] (  1.54%)
[  PCGsetup:         0.053 s] (  0.06%)
[  PCGsolve:        20.561 s] ( 24.49%)
[  RSsetup:          0.234 s] (  0.28%)
[  RSsolve:         49.574 s] ( 59.05%)
[  read:             6.974 s] (  8.31%)

Makefile is:

amgcl_root := F:\AMGCL\amgcl-master
boost_root := C:\Boost\include\boost-1_78

nullspace: nullspace.cpp
		g++ -o $@ $^ -I${amgcl_root} -I${boost_root} -O3 -fopenmp -DNDEBUG

Then I deleted the -fopenmp option, and it gives:

[Viscoelastic2D:    10.824 s] (100.00%)
[ self:              0.063 s] (  0.58%)
[  AMGsetup:         0.194 s] (  1.79%)
[  AMGsolve:         0.140 s] (  1.29%)
[  CGsetup:          0.006 s] (  0.06%)
[  CGsolve:          1.942 s] ( 17.94%)
[  PCGsetup:         0.030 s] (  0.28%)
[  PCGsolve:         0.368 s] (  3.40%)
[  RSsetup:          0.256 s] (  2.37%)
[  RSsolve:          0.889 s] (  8.21%)
[  read:             6.936 s] ( 64.08%)

These results are consistent with my original concern.

I have tested the openmp support using:

#include <iostream>
using namespace std;
int main()
{
    #if _OPENMP 
        cout << " support openmp " << endl;
    #else 
        cout << " not support openmp" << endl;
    #endif 
    return 0;
}

It outputs "support openmp".


I will check the openmp configuration as soon as I can.

Thanks for your patience and kind help.

jrubiogonzalez commented

Hello, maybe this is unrelated, but some time ago we profiled our code with VTune (compiled on Windows with VS) and found that ILU0 spends most of its time idle in the OMP barrier.

[VTune profile screenshot]

Thanks, @jrubiogonzalez. My guess is this means the CPU is undersubscribed: there is not enough parallelism in ILU0 to fill all the cores, or the work is badly balanced. But here it looks like the MinGW build of GCC has some performance issues with OpenMP.

PM-Hu commented

Hi
I have tested the code with different numbers of threads; here is my data:

4 Threads:

[Viscoelastic2D:    36.159 s] (100.00%)
[  AMGsetup:         0.191 s] (  0.53%)
[  AMGsolve:         1.861 s] (  5.15%)
[  CGsetup:          0.004 s] (  0.01%)
[  CGsolve:          1.224 s] (  3.39%)
[  PCGsetup:         0.064 s] (  0.18%)
[  PCGsolve:         7.313 s] ( 20.22%)
[  RSsetup:          0.260 s] (  0.72%)
[  RSsolve:         18.200 s] ( 50.33%)
[  read:             7.007 s] ( 19.38%)

8 Threads:

[Viscoelastic2D:    60.747 s] (100.00%)
[  AMGsetup:         0.177 s] (  0.29%)
[  AMGsolve:         3.460 s] (  5.70%)
[  CGsetup:          0.004 s] (  0.01%)
[  CGsolve:          1.374 s] (  2.26%)
[  PCGsetup:         0.054 s] (  0.09%)
[  PCGsolve:        14.247 s] ( 23.45%)
[  RSsetup:          0.245 s] (  0.40%)
[  RSsolve:         34.135 s] ( 56.19%)
[  read:             7.014 s] ( 11.55%)

12 Threads:

[Viscoelastic2D:    85.765 s] (100.00%)
[  AMGsetup:         0.169 s] (  0.20%)
[  AMGsolve:         5.075 s] (  5.92%)
[  CGsetup:          0.006 s] (  0.01%)
[  CGsolve:          1.386 s] (  1.62%)
[  PCGsetup:         0.052 s] (  0.06%)
[  PCGsolve:        21.298 s] ( 24.83%)
[  RSsetup:          0.232 s] (  0.27%)
[  RSsolve:         50.547 s] ( 58.94%)
[  read:             6.964 s] (  8.12%)

24 Threads:

[Viscoelastic2D:   158.039 s] (100.00%)
[  AMGsetup:         0.174 s] (  0.11%)
[  AMGsolve:         9.831 s] (  6.22%)
[  CGsetup:          0.005 s] (  0.00%)
[  CGsolve:          1.637 s] (  1.04%)
[  PCGsetup:         0.046 s] (  0.03%)
[  PCGsolve:        40.506 s] ( 25.63%)
[  RSsetup:          0.245 s] (  0.16%)
[  RSsolve:         98.540 s] ( 62.35%)
[  read:             7.019 s] (  4.44%)

1 Thread:

[Viscoelastic2D:    11.164 s] (100.00%)
[ self:              0.045 s] (  0.40%)
[  AMGsetup:         0.215 s] (  1.93%)
[  AMGsolve:         0.143 s] (  1.28%)
[  CGsetup:          0.006 s] (  0.05%)
[  CGsolve:          2.060 s] ( 18.45%)
[  PCGsetup:         0.031 s] (  0.28%)
[  PCGsolve:         0.384 s] (  3.44%)
[  RSsetup:          0.265 s] (  2.37%)
[  RSsolve:          0.902 s] (  8.08%)
[  read:             7.113 s] ( 63.71%)

The solution time is roughly proportional to the number of threads: more threads cost more time. I have no idea of the reason yet; I will continue testing.

Can you run the same tests with spai0 instead of ilu0?

PM-Hu commented

Hi
Sorry for the late reply. Here is my test data:

spiIters: 56
spiError: 8.74424e-07

1 Threads:
[  spisetup:         0.181 s] (  2.36%)
[  spisolve:         0.489 s] (  6.38%)

2 Threads:
[  spisetup:         0.147 s] (  1.93%)
[  spisolve:         0.363 s] (  4.77%)

4 Threads:
[  spisetup:         0.134 s] (  1.81%)
[  spisolve:         0.334 s] (  4.50%)

8 Threads:
[  spisetup:         0.130 s] (  1.72%)
[  spisolve:         0.334 s] (  4.43%)

12 Threads:
[  spisetup:         0.127 s] (  1.70%)
[  spisolve:         0.338 s] (  4.52%)

24 Threads:
[  spisetup:         0.126 s] (  1.71%)
[  spisolve:         0.338 s] (  4.58%)

It seems the solution time scales with the number of threads up to about 4, but using more threads does not reduce the solution time further.

This looks better than ILU0. As a workaround, you can disable the OpenMP solution in ILU0 by setting prm.precond.relax.solve.serial = true.

What CPU model do you have? Are there really 24 hardware cores, or do you have HyperThreading enabled? HT usually does not help in compute-heavy tasks.

AMGCL performance is mostly bounded by the memory bandwidth. From your data it looks like your CPU has 2-4 memory channels, and hence the solution does not scale beyond 2-4 threads. Also, you will probably see better scaling with a larger problem.
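
With the typed parameter interface of the AMG solver above, that is just (a short sketch):

Solver::params prm;
prm.precond.relax.solve.serial = true; // do the ILU0 triangular solves serially,
                                       // bypassing the OpenMP path
Solver solve(A, prm);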

PM-Hu commented

Ha 😂, my CPU is an Intel(R) Core(TM) i7-8700 @ 3.20GHz with 6 cores and 12 threads. It looks like it has 2 memory channels.

The tests with 24 threads are meaningless.

Should I change my compiler? The convergence with ilu0 is much better than with spai0 (11 iterations versus 56), which is attractive to me. So far, ilu0 without OpenMP is acceptable for me.

You could try the MS compiler from Visual Studio. I think their Community edition is freely available for download.

PM-Hu commented

The OpenMP option in Visual Studio 2022 works for ilu0; here are the results:

Iters: 12
Error: 8.30579e-09

openmp: OFF
[poisson3Db:     6.170 s] (100.00%)
[ self:          0.008 s] (  0.13%)
[  read:         4.951 s] ( 80.24%)
[  setup:        0.570 s] (  9.24%)
[  solve:        0.641 s] ( 10.39%)

openmp: ON
[poisson3Db:     6.150 s] (100.00%)
[ self:          0.008 s] (  0.14%)
[  read:         5.139 s] ( 83.57%)
[  setup:        0.572 s] (  9.30%)
[  solve:        0.430 s] (  6.99%)

But the solution part of this code:

prof.tic("AMGsolve"); 
std::tie(iters, error) = solve(A, rhs, x);  // bypassed
prof.toc("AMGsolve");

will be ignored in VS 2022 with the same configuration as above:

AMGIters: 0
AMGError: 0

Warning! Profile is incomplete.
[Viscoelastic2D:     3.686 s] (100.00%)
[  AMGsetup:         0.000 s] (  0.00%)
[  AMGsolve:         0.000 s] (  0.00%)
[  read:             3.686 s] (100.00%)

So strange 😂.

Warning! Profile is incomplete.

The warning means that there is a tic() without the corresponding toc(). Check that you did not accidentally delete part of the code.
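
For illustration, this is how it typically happens (a contrived sketch inside main(); skip_solve is a made-up flag):

prof.tic("AMGsolve");
if (skip_solve) return 1;                  // leaving the function here means the matching
                                           // toc() below is never reached...
std::tie(iters, error) = solve(A, rhs, x);
prof.toc("AMGsolve");                      // ...and the profiler reports "Profile is incomplete".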

PM-Hu commented

The same code works well with g++, but in VS 2022 an error occurs:

nullspace.cpp(52): error C4700: uninitialized local variable 'cols'

It is out of the scope of this issue. Thanks for your kind help.