DIPHA/dipha

Strange performance on a cluster managed by Slurm with OpenMPI

afalaize opened this issue · 2 comments

Hi,
First of all: thank you for DIPHA.
We observe unexpected performance when benchmarking the Klein bottle example. We compiled DIPHA with gnu8/openmpi3/mpic++ and use Slurm as our workload manager. Below is the output of a command similar to mpiexec -n $N dipha --benchmark --upper_dim 2 klein_bottle_pointcloud_new_400.txt.distmat.dipha klein_bottle_pointcloud_new_400.txt.distmat.dipha.out, first with N=1 and then with N=28 cores. Compared to the results reported in this publication, the expected decrease in wall time is not reached. Do you have any idea what we might be missing?
Best regards,
Antoine
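
For completeness, the job is submitted with a batch script along the following lines (the job name, resource directives, and module names below are placeholders, not our exact script):

#!/bin/bash
#SBATCH --job-name=dipha-klein
#SBATCH --nodes=1
#SBATCH --ntasks=28
#SBATCH --output=slurm-%j.out

# Module names are placeholders; adjust to the local gnu8/openmpi3 toolchain.
module load gnu8 openmpi3

# Use the task count allocated by Slurm, falling back to 28.
N=${SLURM_NTASKS:-28}

mpiexec -n "$N" dipha --benchmark --upper_dim 2 \
    klein_bottle_pointcloud_new_400.txt.distmat.dipha \
    klein_bottle_pointcloud_new_400.txt.distmat.dipha.out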

With N=1:

# cat slurm-958.out 

Input filename: 
/home/ntaibi/data//distmat/dipha//klein_bottle_pointcloud_new_400.txt.distmat.dipha

upper_dim: 2

Number of processes used: 
1

Detailed information for rank 0:
       time    prior mem     peak mem   bytes recv
       0.0s        42 MB        44 MB         0 MB   complex.load_binary(input_filename, upper_dim);

Number of cells in input: 
10667000
       3.9s        43 MB       369 MB         0 MB   get_filtration_to_cell_map(complex, dualize, filtration_to_cell_map);
       0.9s       125 MB       894 MB       162 MB   get_cell_to_filtration_map(complex.get_num_cells(), filtration_to_cell_map, cell_to_filtration_map);
       6.9s       482 MB      1124 MB       484 MB   generate_unreduced_columns(complex, filtration_to_cell_map, cell_to_filtration_map, cur_dim, dualize, unreduced_columns);
     158.2s       886 MB      1206 MB         0 MB   reduction_kernel(complex.get_num_cells(), unreduced_columns, reduced_columns);
       0.1s       484 MB      1206 MB         2 MB   generate_unreduced_columns(complex, filtration_to_cell_map, cell_to_filtration_map, cur_dim, dualize, unreduced_columns);
       0.0s       484 MB      1206 MB         0 MB   reduction_kernel(complex.get_num_cells(), unreduced_columns, reduced_columns);
       2.7s       403 MB      1206 MB         7 MB   dipha::outputs::save_persistence_diagram(output_filename, complex, filtration_to_cell_map, reduced_columns, dualize, upper_dim);

Overall running time in seconds: 
172.8

Reduction kernel running time in seconds: 
158.2

Overall peak mem in GB of all ranks: 
1.2

Individual peak mem in GB of per rank: 
1.2

Maximal communication traffic (without sorting) in GB between any pair of nodes:
0.6

Total communication traffic (without sorting) in GB between all pairs of nodes:
0.6

And with N=28:

# cat slurm-958.out 

Input filename: 
/home/ntaibi/data//distmat/dipha//klein_bottle_pointcloud_new_400.txt.distmat.dipha

upper_dim: 2

Number of processes used: 
28

Detailed information for rank 0:
       time    prior mem     peak mem   bytes recv
       0.1s        42 MB        45 MB         0 MB   complex.load_binary(input_filename, upper_dim);

Number of cells in input: 
10667000
       0.2s        44 MB        61 MB         0 MB   get_filtration_to_cell_map(complex, dualize, filtration_to_cell_map);
       0.0s        47 MB        68 MB         5 MB   get_cell_to_filtration_map(complex.get_num_cells(), filtration_to_cell_map, cell_to_filtration_map);
       0.6s        60 MB       129 MB       250 MB   generate_unreduced_columns(complex, filtration_to_cell_map, cell_to_filtration_map, cur_dim, dualize, unreduced_columns);
     155.2s       114 MB       218 MB      1395 MB   reduction_kernel(complex.get_num_cells(), unreduced_columns, reduced_columns);
       0.0s       103 MB       218 MB         1 MB   generate_unreduced_columns(complex, filtration_to_cell_map, cell_to_filtration_map, cur_dim, dualize, unreduced_columns);
       0.0s       104 MB       218 MB         0 MB   reduction_kernel(complex.get_num_cells(), unreduced_columns, reduced_columns);
       0.2s       104 MB       218 MB         2 MB   dipha::outputs::save_persistence_diagram(output_filename, complex, filtration_to_cell_map, reduced_columns, dualize, upper_dim);

Overall running time in seconds: 
156.5

Reduction kernel running time in seconds: 
155.2

Overall peak mem in GB of all ranks: 
0.3

Individual peak mem in GB of per rank: 
0.2
0.3
0.2
0.2
0.2
0.2
0.2
0.2
0.2
0.2
0.2
0.2
0.2
0.2
0.2
0.2
0.2
0.1
0.2
0.1
0.2
0.1
0.1
0.1
0.1
0.1
0.1
0.1

Maximal communication traffic (without sorting) in GB between any pair of nodes:
1.6

Total communication traffic (without sorting) in GB between all pairs of nodes:
11.4

Hi, thank you for this hint.
Indeed, we get the expected parallel performance with the --dual option.
Best regards,
Antoine
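
For anyone landing on this issue later: the run that scales as expected is the same benchmark command with --dual added, i.e. something of the form

mpiexec -n 28 dipha --benchmark --dual --upper_dim 2 \
    klein_bottle_pointcloud_new_400.txt.distmat.dipha \
    klein_bottle_pointcloud_new_400.txt.distmat.dipha.out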