Strange performance on a cluster managed by Slurm with OpenMPI
afalaize opened this issue · 2 comments
Hi,
First of all: thank you for DIPHA.
We are seeing unexpected performance when benchmarking the Klein bottle example. We compiled DIPHA with gnu8/openmpi3/mpic++ and use Slurm as our workload manager. Below is the output of a command similar to mpiexec -n $N dipha --benchmark --upper_dim 2 klein_bottle_pointcloud_new_400.txt.distmat.dipha klein_bottle_pointcloud_new_400.txt.distmat.dipha.out
with N=1 and then with N=28 cores. Judging from this publication, we are not getting the expected decrease in wall time. Do you have an idea of what we are missing?
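In case it matters, we submit the job with a batch script roughly like the one below (a minimal sketch, not our exact script: the job name, node/task counts, time limit, and module names are placeholders standing in for our gnu8/openmpi3 setup):

#!/bin/bash
#SBATCH --job-name=dipha-bench   # placeholder job name
#SBATCH --nodes=1                # all ranks on a single node
#SBATCH --ntasks=28              # number of MPI ranks (N)
#SBATCH --time=01:00:00          # placeholder time limit

# placeholder module names for the gnu8/openmpi3 toolchain
module load gnu8 openmpi3

mpiexec -n $SLURM_NTASKS dipha --benchmark --upper_dim 2 \
    klein_bottle_pointcloud_new_400.txt.distmat.dipha \
    klein_bottle_pointcloud_new_400.txt.distmat.dipha.out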
Best regards,
Antoine
With N=1:
# cat slurm-958.out
Input filename:
/home/ntaibi/data//distmat/dipha//klein_bottle_pointcloud_new_400.txt.distmat.dipha
upper_dim: 2
Number of processes used:
1
Detailed information for rank 0:
time prior mem peak mem bytes recv
0.0s 42 MB 44 MB 0 MB complex.load_binary(input_filename, upper_dim);
Number of cells in input:
10667000
3.9s 43 MB 369 MB 0 MB get_filtration_to_cell_map(complex, dualize, filtration_to_cell_map);
0.9s 125 MB 894 MB 162 MB get_cell_to_filtration_map(complex.get_num_cells(), filtration_to_cell_map, cell_to_filtration_map);
6.9s 482 MB 1124 MB 484 MB generate_unreduced_columns(complex, filtration_to_cell_map, cell_to_filtration_map, cur_dim, dualize, unreduced_columns);
158.2s 886 MB 1206 MB 0 MB reduction_kernel(complex.get_num_cells(), unreduced_columns, reduced_columns);
0.1s 484 MB 1206 MB 2 MB generate_unreduced_columns(complex, filtration_to_cell_map, cell_to_filtration_map, cur_dim, dualize, unreduced_columns);
0.0s 484 MB 1206 MB 0 MB reduction_kernel(complex.get_num_cells(), unreduced_columns, reduced_columns);
2.7s 403 MB 1206 MB 7 MB dipha::outputs::save_persistence_diagram(output_filename, complex, filtration_to_cell_map, reduced_columns, dualize, upper_dim);
Overall running time in seconds:
172.8
Reduction kernel running time in seconds:
158.2
Overall peak mem in GB of all ranks:
1.2
Individual peak mem in GB of per rank:
1.2
Maximal communication traffic (without sorting) in GB between any pair of nodes:
0.6
Total communication traffic (without sorting) in GB between all pairs of nodes:
0.6
And with N=28:
# cat slurm-958.out
Input filename:
/home/ntaibi/data//distmat/dipha//klein_bottle_pointcloud_new_400.txt.distmat.dipha
upper_dim: 2
Number of processes used:
28
Detailed information for rank 0:
time prior mem peak mem bytes recv
0.1s 42 MB 45 MB 0 MB complex.load_binary(input_filename, upper_dim);
Number of cells in input:
10667000
0.2s 44 MB 61 MB 0 MB get_filtration_to_cell_map(complex, dualize, filtration_to_cell_map);
0.0s 47 MB 68 MB 5 MB get_cell_to_filtration_map(complex.get_num_cells(), filtration_to_cell_map, cell_to_filtration_map);
0.6s 60 MB 129 MB 250 MB generate_unreduced_columns(complex, filtration_to_cell_map, cell_to_filtration_map, cur_dim, dualize, unreduced_columns);
155.2s 114 MB 218 MB 1395 MB reduction_kernel(complex.get_num_cells(), unreduced_columns, reduced_columns);
0.0s 103 MB 218 MB 1 MB generate_unreduced_columns(complex, filtration_to_cell_map, cell_to_filtration_map, cur_dim, dualize, unreduced_columns);
0.0s 104 MB 218 MB 0 MB reduction_kernel(complex.get_num_cells(), unreduced_columns, reduced_columns);
0.2s 104 MB 218 MB 2 MB dipha::outputs::save_persistence_diagram(output_filename, complex, filtration_to_cell_map, reduced_columns, dualize, upper_dim);
Overall running time in seconds:
156.5
Reduction kernel running time in seconds:
155.2
Overall peak mem in GB of all ranks:
0.3
Individual peak mem in GB of per rank:
0.2
0.3
0.2
0.2
0.2
0.2
0.2
0.2
0.2
0.2
0.2
0.2
0.2
0.2
0.2
0.2
0.2
0.1
0.2
0.1
0.2
0.1
0.1
0.1
0.1
0.1
0.1
0.1
Maximal communication traffic (without sorting) in GB between any pair of nodes:
1.6
Total communication traffic (without sorting) in GB between all pairs of nodes:
11.4
Hi, thank you for this hint.
Indeed, we get the expected parallel performance with the --dual option.
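For future readers, the full command then becomes the following (same input files and rank count as the N=28 run above; the exact placement of --dual among the flags is our choice, any order should work):

mpiexec -n 28 dipha --benchmark --dual --upper_dim 2 \
    klein_bottle_pointcloud_new_400.txt.distmat.dipha \
    klein_bottle_pointcloud_new_400.txt.distmat.dipha.out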
Best regards,
Antoine