charmplusplus/charm

ckLocal() call on bound array causing charmrun to abort

adrianplenefisch opened this issue · 4 comments

Issue with version 7.0.0 of Charm++.

When running ./charmrun +p 2 ./FoFApp from paratreet, the ckLocal() call at line 528 of src/Partition.h causes a segmentation fault. It is called on an element of a chare array that is bound to another array, and the fault occurs when it reaches the last index. This only happens when run with more than one PE.

When Charm is compiled without "--with-production", the if(n.idx==0) check at line 62 of charm/src/ck-core/init.h evaluates to true and CkAbort("Group ID is zero-- invalid!\n") is called.

To reproduce:

Clone https://github.com/adrianplenefisch/unionfind and https://github.com/adrianplenefisch/paratreet into the same parent directory. When cloning paratreet, use:

git clone --recurse-submodules https://github.com/adrianplenefisch/paratreet

cd into unionfind and checkout fof_test, then modify the first two lines of Makefile.common to give the absolute path to the charm directory and the unionfind directory itself.

Go into the paratreet directory. Checkout add-friends-of-friends. cd into src and modify line 8 of Makefile.common to be the path to the charm directory. Then cd ../examples and run ./make_everything.sh.

An input file can be found here:

https://drive.google.com/drive/folders/1t865aYCTKqeeFOUrOJyEb7d61TLpnzO3?usp=sharing

After copying that into the examples directory, the error can be reproduced by running:

./charmrun +p 2 ./FoFApp -f cube300.000128 -v output_prefix -pbc -px 1 -py 1 -pz 1 -ll 0.00417

lvkale commented

We have been exploring this (and also trying to construct test programs to reproduce it). No success so far. Question: did it work on previous versions of Charm++?

Thank you for looking into this. Unfortunately, I don't think this ever ran on a previous version of Charm++, so I'm not sure.

Update: I tried reproducing the bug on my machine using a multicore build and could not; it only happens on netlrts/mpi builds. It looks like there is a race condition related to the dependencies between the bound array elements; I am still working out whether the race condition is caused by Charm++ code or by the application code.

This turned out to be simply because the proxy variable was uninitialized, not because of any issue within Charm++. However, I've opened a new issue about how the Charm++ abort messages could have been more helpful in diagnosing this bug.

#3775

Thanks again to those who looked into this.
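
For anyone who lands here with the same symptom, the failure mode described above (a local lookup through a proxy that was never assigned) can be illustrated with a small standalone sketch. This is not Charm++ code: MockGroupID, MockElement, MockElementProxy, isInitialized() and readLocalValue() are all hypothetical names made up for illustration. The sketch only assumes what the thread reports, namely that an unassigned ID is zero, and shows one way a lookup could fail with a message that names the real cause.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical stand-in for a group/array ID; 0 models "never assigned".
struct MockGroupID {
    int idx;
    explicit MockGroupID(int i = 0) : idx(i) {}
    bool isZero() const { return idx == 0; }
};

// Hypothetical stand-in for a chare array element.
struct MockElement {
    int value = 0;
};

// Hypothetical stand-in for an element proxy.
// A default-constructed proxy models the uninitialized proxy variable.
class MockElementProxy {
public:
    MockElementProxy() = default;
    MockElementProxy(MockGroupID id, MockElement* local)
        : id_(id), local_(local) {}

    bool isInitialized() const { return !id_.isZero(); }

    // Models a local lookup. An uninitialized proxy fails loudly with a
    // message that points at the cause; an initialized proxy returns the
    // local element, or nullptr when the element is not on this PE.
    MockElement* ckLocal() const {
        if (!isInitialized()) {
            throw std::logic_error(
                "ckLocal() called on a proxy that was never initialized "
                "(ID is zero)");
        }
        return local_;
    }

private:
    MockGroupID id_;
    MockElement* local_ = nullptr;
};

// Caller-side guard: check the proxy before the lookup and the pointer
// after it. Returns -1 when no local value is available.
int readLocalValue(const MockElementProxy& proxy) {
    if (!proxy.isInitialized()) {
        return -1;  // proxy was never assigned
    }
    MockElement* elem = proxy.ckLocal();
    return elem ? elem->value : -1;  // nullptr: element is not local
}
```

In application code the equivalent step would be to make sure the proxy member is assigned before any entry method that dereferences it can run, and to check the pointer returned by the local lookup before using it.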