KarypisLab/ParMETIS

pm_dglpart crashes when partitioning a large graph

Rhett-Ying opened this issue · 6 comments

When running pm_dglpart graph_name 3, the app crashes with the error messages below. By default, only one process is used. I also tried running mpirun -np 3 pm_dglpart graph_name 1, which crashes with a similar error. The graph used here has 40 million nodes and 1 billion edges. What is weird is that it works fine when partitioning into 3 parts across multiple machines: mpirun --hostfile hostfile -np 3 pm_dglpart graph_name 1.

Fatal error in PMPI_Irecv: Invalid count, error stack:
PMPI_Irecv(156): MPI_Irecv(buf=0x7f053dd7d270, count=-1666761083, MPI_LONG_LONG_INT, src=0, tag=2, comm=0x84000001, request=0x56256094e354) failed
PMPI_Irecv(98).: Negative count, value is -1666761083
[unset]: write_line error; fd=-1 buf=:cmd=abort exitcode=940158978 : system msg for write_line failure : Bad file descriptor
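
The negative count is consistent with the 32-bit int count limit of the classic MPI interface: a message length above INT_MAX wraps to a negative value when passed as an int (on typical two's-complement platforms, the logged -1666761083 could come from, e.g., 2628206213 elements wrapping modulo 2^32). A minimal sketch of that truncation, with a purely illustrative element count not taken from the log:

/* Minimal sketch (not from the ParMETIS sources) of how an element count
 * above INT_MAX wraps to a negative int when handed to the classic
 * 32-bit-count MPI_Irecv. The count below is illustrative only. */
#include <limits.h>
#include <stdint.h>
#include <stdio.h>

int main(void) {
    int64_t nelems = 2628206213LL;   /* illustrative count > INT_MAX */
    int wrapped = (int)nelems;       /* classic MPI_Irecv takes an int count */
    printf("64-bit count:  %lld\n", (long long)nelems);
    printf("as 32-bit int: %d (INT_MAX is %d)\n", wrapped, INT_MAX);
    return 0;
}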

The same issue is hit when trying to partition another large graph (90 million nodes, 2.1 billion edges, duplicate edges removed), even across machines, with the command below: mpirun --hostfile hostfile -np 3 pm_dglpart order 1 > mpirun.txt. The output log is attached below:

[000] gnvtxs: 90154684, gnedges: 2187227098, ncon: 2
[002] gnvtxs: 90154684, gnedges: 2187227098, ncon: 2
[001] gnvtxs: 90154684, gnedges: 2187227098, ncon: 2
DistDGL partitioning, ncon: 1, nparts: 3
[90154684 4374454196 30051561 30051562] [75] [ 0.000] [ 0.000]
[52761479 1507894642 17415606 17860974] [75] [ 0.000] [ 0.000]
[31362468 502622234 10278059 10573813] [75] [ 0.000] [ 0.000]
[18800390 179236166 6149654 6449312] [75] [ 0.000] [ 0.000]
[11274515 66957764 3720288 3797784] [75] [ 0.000] [ 0.000]
[6769665 26756718 2223140 2275340] [75] [ 0.000] [ 0.000]
[4074760 11584056 1337096 1376568] [75] [ 0.000] [ 0.000]
[2463706 5468262 807502 838460] [75] [ 0.000] [ 0.000]
[1502923 2849056 489056 509709] [75] [ 0.000] [ 0.000]
[929574 1633872 303728 317006] [75] [ 0.000] [ 0.000]
[588483 1030756 188656 202724] [75] [ 0.000] [ 0.000]
[385642 705874 124422 132633] [75] [ 0.000] [ 0.000]
[264434 516548 83510 92146] [75] [ 0.000] [ 0.000]
[192510 399100 60471 67260] [75] [ 0.000] [ 0.000]
[148927 322372 45867 53129] [75] [ 0.000] [ 0.000]
nvtxs: 148927, cut: 35529, balance: 1.022
nvtxs: 192510, cut: 38218, balance: 1.020
nvtxs: 264434, cut: 36884, balance: 1.020
nvtxs: 385642, cut: 38213, balance: 1.020
nvtxs: 588483, cut: 63618, balance: 1.019
nvtxs: 929574, cut: 181466, balance: 1.019
nvtxs: 1502923, cut: 357107, balance: 1.018
nvtxs: 2463706, cut: 684784, balance: 1.018
nvtxs: 4074760, cut: 1133225, balance: 1.017
nvtxs: 6769665, cut: 1885520, balance: 1.015
nvtxs: 11274515, cut: 3397704, balance: 1.012
nvtxs: 18800390, cut: 6486145, balance: 1.004
nvtxs: 31362468, cut: 11542999, balance: 0.994
nvtxs: 52761479, cut: 15894608, balance: 0.991
nvtxs: 90154684, cut: 22395036, balance: 0.870
Setup: Max: 322.227, Sum: 966.679, Balance: 1.000
Matching: Max: 168.185, Sum: 504.554, Balance: 1.000
Contraction: Max: 509.616, Sum: 1528.846, Balance: 1.000
InitPart: Max: 0.142, Sum: 0.426, Balance: 1.001
Project: Max: 5.706, Sum: 10.191, Balance: 1.680
Initialize: Max: 65.075, Sum: 194.924, Balance: 1.002
K-way: Max: 63.641, Sum: 190.907, Balance: 1.000
Remap: Max: 0.071, Sum: 0.214, Balance: 1.001
Total: Max: 1060.819, Sum: 3182.457, Balance: 1.000
Final 3-way Cut: 22395036 Balance: 0.870

Is there a way you can share the input files so that I can debug the issue?

The files required to reproduce the issue have been sent by email. Please check your mailbox. Thanks.

I pushed some changes to the ParMetis code that take advantage of the "_c" API routines of MPI 4.0 to deal with the 32-bit int count limit of earlier MPI versions. I tested them using MPICH 4.0rc1. Give it a try and see if this fixes the issue.
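
For reference, the MPI 4.0 "_c" variants take an MPI_Count (64-bit on common platforms) instead of an int for the element count. A minimal, illustrative sketch of that call path, assuming an MPI 4.0 library such as MPICH 4.x; the buffer size, tag, and surrounding program are made up for illustration and are not the actual ParMETIS code:

/* Illustrative sketch of an MPI 4.0 large-count receive; not the actual
 * ParMETIS code. Requires an MPI 4.0 implementation (e.g. MPICH 4.x). */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    /* Illustrative count above INT_MAX; posting it for real needs ~24 GB. */
    MPI_Count count = 3000000000LL;
    long long *buf = malloc((size_t)count * sizeof(long long));

    if (buf != NULL) {
        MPI_Request req;
        /* MPI_Irecv_c takes MPI_Count for the element count, so values
         * above INT_MAX no longer wrap to a negative int. */
        MPI_Irecv_c(buf, count, MPI_LONG_LONG_INT, 0, 2, MPI_COMM_WORLD, &req);
        /* A matching large-count send (MPI_Isend_c) and MPI_Wait would go
         * here; cancel the request just to keep this sketch self-contained. */
        MPI_Cancel(&req);
        MPI_Wait(&req, MPI_STATUS_IGNORE);
        free(buf);
    } else {
        fprintf(stderr, "could not allocate the illustrative receive buffer\n");
    }

    MPI_Finalize();
    return 0;
}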

I tried pm_dglpart magA 3 and mpirun -np 10 pm_dglpart magA 1; both work fine. magA is MAG240M (https://ogb.stanford.edu/docs/lsc/mag240m/), and duplicate edges were removed before partitioning.

Part of the stdout:
[000] gnvtxs: 244160499, gnedges: 3454471824, ncon: 4
DistDGL partitioning, ncon: 3, nparts: 3 [i64, r32, MPI 4.0]

Great.