DIPHA/dipha

Strange performance on a cluster managed by Slurm with OpenMPI

afalaize opened this issue · 2 comments

Hi,
First of all: thank you for DIPHA.
We observe unexpected performance when benchmarking the Klein bottle example. We compiled DIPHA with gnu8/openmpi3/mpic++ and use Slurm as our workload manager. Below is the output of a command similar to mpiexec -n $N dipha --benchmark --upper_dim 2 klein_bottle_pointcloud_new_400.txt.distmat.dipha klein_bottle_pointcloud_new_400.txt.distmat.dipha.out, first with N=1 and then with N=28 cores. Compared to the results reported in this publication, the expected decrease in wall time is not reached. Do you have any idea what we might be missing?
Best regards,
Antoine
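
For completeness, the job is submitted with a batch script along the following lines (the job name, resource directives, and module names below are placeholders, not our exact script):

#!/bin/bash
#SBATCH --job-name=dipha-klein
#SBATCH --nodes=1
#SBATCH --ntasks=28
#SBATCH --output=slurm-%j.out

# Module names are placeholders; adjust to the local gnu8/openmpi3 toolchain.
module load gnu8 openmpi3

# Use the task count allocated by Slurm, falling back to 28.
N=${SLURM_NTASKS:-28}

mpiexec -n "$N" dipha --benchmark --upper_dim 2 \
    klein_bottle_pointcloud_new_400.txt.distmat.dipha \
    klein_bottle_pointcloud_new_400.txt.distmat.dipha.out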

With N=1:

# cat slurm-958.out 

Input filename: 
/home/ntaibi/data//distmat/dipha//klein_bottle_pointcloud_new_400.txt.distmat.dipha

upper_dim: 2

Number of processes used: 
1

Detailed information for rank 0:
       time    prior mem     peak mem   bytes recv
       0.0s        42 MB        44 MB         0 MB   complex.load_binary(input_filename, upper_dim);

Number of cells in input: 
10667000
       3.9s        43 MB       369 MB         0 MB   get_filtration_to_cell_map(complex, dualize, filtration_to_cell_map);
       0.9s       125 MB       894 MB       162 MB   get_cell_to_filtration_map(complex.get_num_cells(), filtration_to_cell_map, cell_to_filtration_map);
       6.9s       482 MB      1124 MB       484 MB   generate_unreduced_columns(complex, filtration_to_cell_map, cell_to_filtration_map, cur_dim, dualize, unreduced_columns);
     158.2s       886 MB      1206 MB         0 MB   reduction_kernel(complex.get_num_cells(), unreduced_columns, reduced_columns);
       0.1s       484 MB      1206 MB         2 MB   generate_unreduced_columns(complex, filtration_to_cell_map, cell_to_filtration_map, cur_dim, dualize, unreduced_columns);
       0.0s       484 MB      1206 MB         0 MB   reduction_kernel(complex.get_num_cells(), unreduced_columns, reduced_columns);
       2.7s       403 MB      1206 MB         7 MB   dipha::outputs::save_persistence_diagram(output_filename, complex, filtration_to_cell_map, reduced_columns, dualize, upper_dim);

Overall running time in seconds: 
172.8

Reduction kernel running time in seconds: 
158.2

Overall peak mem in GB of all ranks: 
1.2

Individual peak mem in GB of per rank: 
1.2

Maximal communication traffic (without sorting) in GB between any pair of nodes:
0.6

Total communication traffic (without sorting) in GB between all pairs of nodes:
0.6

And with N=28:

# cat slurm-958.out 

Input filename: 
/home/ntaibi/data//distmat/dipha//klein_bottle_pointcloud_new_400.txt.distmat.dipha

upper_dim: 2

Number of processes used: 
28

Detailed information for rank 0:
       time    prior mem     peak mem   bytes recv
       0.1s        42 MB        45 MB         0 MB   complex.load_binary(input_filename, upper_dim);

Number of cells in input: 
10667000
       0.2s        44 MB        61 MB         0 MB   get_filtration_to_cell_map(complex, dualize, filtration_to_cell_map);
       0.0s        47 MB        68 MB         5 MB   get_cell_to_filtration_map(complex.get_num_cells(), filtration_to_cell_map, cell_to_filtration_map);
       0.6s        60 MB       129 MB       250 MB   generate_unreduced_columns(complex, filtration_to_cell_map, cell_to_filtration_map, cur_dim, dualize, unreduced_columns);
     155.2s       114 MB       218 MB      1395 MB   reduction_kernel(complex.get_num_cells(), unreduced_columns, reduced_columns);
       0.0s       103 MB       218 MB         1 MB   generate_unreduced_columns(complex, filtration_to_cell_map, cell_to_filtration_map, cur_dim, dualize, unreduced_columns);
       0.0s       104 MB       218 MB         0 MB   reduction_kernel(complex.get_num_cells(), unreduced_columns, reduced_columns);
       0.2s       104 MB       218 MB         2 MB   dipha::outputs::save_persistence_diagram(output_filename, complex, filtration_to_cell_map, reduced_columns, dualize, upper_dim);

Overall running time in seconds: 
156.5

Reduction kernel running time in seconds: 
155.2

Overall peak mem in GB of all ranks: 
0.3

Individual peak mem in GB of per rank: 
0.2
0.3
0.2
0.2
0.2
0.2
0.2
0.2
0.2
0.2
0.2
0.2
0.2
0.2
0.2
0.2
0.2
0.1
0.2
0.1
0.2
0.1
0.1
0.1
0.1
0.1
0.1
0.1

Maximal communication traffic (without sorting) in GB between any pair of nodes:
1.6

Total communication traffic (without sorting) in GB between all pairs of nodes:
11.4

Hi, thank you for this hint.
Indeed, we get the expected parallel performance with the --dual option.
Best regards,
Antoine
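
For anyone landing on this issue later: the run that scales as expected is the same benchmark command with --dual added, i.e. something of the form

mpiexec -n 28 dipha --benchmark --dual --upper_dim 2 \
    klein_bottle_pointcloud_new_400.txt.distmat.dipha \
    klein_bottle_pointcloud_new_400.txt.distmat.dipha.out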