libAtoms/QUIP

SYSTEM ABORT: Traceback (most recent call last) File "/io/QUIP/src/GAP/gp_predict.f95", line 485 kind unspecified

MES-physics opened this issue · 7 comments

Dear Developers,
What is the usual problem when this error occurs when running gap_fit?

SYSTEM ABORT: Traceback (most recent call last) File "/io/QUIP/src/GAP/gp_predict.f95", line 485 kind unspecified 
gpCoordinates_setParameters: negative value of n_xPrime = -1731515152 

Thanks for advice.

This probably indicates that your descriptor derivatives are so numerous that their count cannot be expressed with a 4-byte integer, so the variable overflows and shows up as negative. A while ago I created a branch to tackle this issue, but the patch is super ugly (you can find it; it's called "long integers" or similar), and the MPI code is supposed to fix this behaviour, although I'm not completely sure that it does (does @albapa know?). You could try 1) running with more CPUs, 2) running with a smaller database, or 3) decreasing your descriptor cutoff (the number of descriptor derivatives depends on the number of neighbours).
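For illustration, here is a minimal sketch of the wraparound (pure arithmetic, not QUIP code; the "true" count of 2563452144 is an assumption, chosen because it is what the reported value would correspond to under a single 32-bit two's-complement wrap):

```python
INT32_MAX = 2**31 - 1  # 2147483647, the largest 4-byte signed integer

# A count larger than INT32_MAX stored in a 4-byte signed integer
# silently wraps to a negative number (two's-complement truncation):
true_count = 2_563_452_144                         # hypothetical derivative count
stored = (true_count + 2**31) % 2**32 - 2**31      # simulate 32-bit truncation
print(stored)                                      # -1731515152, as in the error message

# Going the other way: assuming exactly one wrap, the reported
# negative value implies the intended count was
recovered = stored % 2**32
print(recovered)                                   # 2563452144
```

So the negative n_xPrime in the error message is consistent with a derivative count of roughly 2.6 billion, i.e. just past the 4-byte limit.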

Yes, I agree with Miguel about the symptoms and the suggestions. With regard to point 1), I'd add: don't just run on more CPUs, run on more nodes - this assumes you are already using the MPI-parallelised gap_fit.

OK, thanks. I had been using 4 nodes already, and now tested it on 6, 8, and 9 nodes. The same error happened.
So I ran tests decreasing my cutoffs from 6 to 5.5 to 5.0 to 4.5 to 4.0, and got the same error.
So then the only thing left is the database size: mine is about 107 MB, but I have previously made a GAP potential from about 101 MB with no problem.
Question: is there some consensus on the maximum size of the database to be trained?

This was using 9 nodes, 576 threads:
libAtoms::Hello World: 26/03/2024 13:04:23
libAtoms::Hello World: git version https://github.com/libAtoms/QUIP,704e49f-dirty
libAtoms::Hello World: QUIP_ARCH linux_x86_64_gfortran_openmp
libAtoms::Hello World: compiled on Nov 18 2021 at 19:32:26
libAtoms::Hello World: OpenMP parallelisation with 576 threads
libAtoms::Hello World: OMP_STACKSIZE=1G
libAtoms::Hello World: Random Seed = 47063042
libAtoms::Hello World: global verbosity = 0

It looks like you're not using the MPI code - look at the sheer number of OpenMP threads you're using; I wonder how that is even possible. Are you only using one descriptor type? A more complete output would help to debug this.

Here is the output file. I tried to reinstall the MPI version after not using GAP for a while, but I suppose it wasn't done correctly?
Cours238-1718982.txt

Miguel is right again: this binary is not MPI-compiled. You must choose an appropriate QUIP_ARCH - preferably one that has both openmp and mpi.

I doubt threading will do anything useful - see the performance figures in the recent GAP/mpi-gap-fit papers.

Sorry for the previous notes - I mistakenly didn't set the path to my new MPI installation, so it was picking up the serial version.
Now that I've set the right path, I'm getting a segmentation fault instead; I'll open a new issue for that. Thanks.