beatrixparis/connectivity-modeling-system

Bug when using more than 9 CPU cores

jiho opened this issue · 7 comments

jiho commented

Using mpirun.mpich -np 9 ./cms *** works, but mpirun.mpich -np 10 ./cms *** errors out with the message:

rename Error: No such file or directory

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   EXIT CODE: 1
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================

I strongly suspect this is because when n < 10 the output files are named traj_file_1.nc, traj_file_2.nc, etc., while when n >= 10 they are named traj_file_01.nc, etc., with two digits. At some point the code tries to move traj_file_1.nc from SCRATCH to output, and that file does not exist.
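To make the suspected mismatch concrete, here is a minimal Python sketch. It is not taken from the CMS source: the functions `writer_name`, `mover_name`, and `missing_files` are hypothetical, and the padding rule (pad the file index to the number of digits in the process count) is an assumption based only on the behavior described above.

```python
def writer_name(index, nprocs):
    """Hypothetical writer: zero-pads the index to the digit count of nprocs."""
    width = len(str(nprocs))
    return f"traj_file_{index:0{width}d}.nc"


def mover_name(index):
    """Hypothetical mover: assumes the index is never zero-padded."""
    return f"traj_file_{index}.nc"


def missing_files(nprocs):
    """Names the mover would look for that the writer never created."""
    written = {writer_name(i, nprocs) for i in range(1, nprocs + 1)}
    wanted = [mover_name(i) for i in range(1, nprocs + 1)]
    return [name for name in wanted if name not in written]


if __name__ == "__main__":
    for n in (9, 10):
        print(n, missing_files(n))
```

Under these assumptions, nothing is missing with 9 processes, while with 10 processes the mover looks for traj_file_1.nc through traj_file_9.nc and only traj_file_10.nc matches what was written. If the real code pads differently, the same kind of check still applies with the rule adjusted.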

jiho commented

For info, I have 48525 lines in my release file. n=9 works, n=10 does not (I think, I am running a simulation with n=9 now).

Hi Jean-Olivier,

So, I've just started getting this error as well. I don't think it's really related to the number of cores, since it happens randomly on my runs (I thought it was a Pegasus issue...).

Are you using the version of the code available here, or the private one (with behavior, etc)? I only get this error with the private version, and I was going to check the output routines.

I always run the 1.1b version with 64 cores, and no such issue has popped up. Unless the version you are using has been modified, I don't think the number of cores is the reason.

jiho commented

I'm using the one with oriented swimming (which does not seem to be committed here -- maybe create a branch for this if you don't want it into master). I don't know where to find the version number (and my version is not git-ified so I don't have a revision number either).

I recently downloaded CMS (Jan 2020) and have been working with a Gulf of Mexico model that includes a polygon file and a vertical migration matrix, and I have run into this error. If I run the model with just the polygon file on multiple threads, I get the "rename Error: No such file or directory" error. If I run the same setup on a single thread, still launched through the mpirun command, there is no issue. If I add the vertical migration matrix and run it on a single thread, I again get the "rename Error". In that single-thread case it appears to occur when the model is transferring the data from the SCRATCH file to the output file. I was not able to access the Google group through the many links throughout the GitHub. Is there a solution to this error? Thanks!
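One way to narrow this down after a failed run is to look at how the trajectory files left in SCRATCH are actually named. The following Python sketch groups them by the number of digits in their index. The traj_file_<index>.nc pattern is taken from the first comment in this thread; the function name `group_by_digit_width` and the directory path passed to it are assumptions for illustration only.

```python
import re
import sys
from pathlib import Path

# Pattern assumed from the file names quoted earlier in the thread.
TRAJ_PATTERN = re.compile(r"^traj_file_(\d+)\.nc$")


def group_by_digit_width(scratch_dir):
    """Map index digit count -> sorted list of matching file names."""
    groups = {}
    for path in Path(scratch_dir).iterdir():
        match = TRAJ_PATTERN.match(path.name)
        if match:
            groups.setdefault(len(match.group(1)), []).append(path.name)
    return {width: sorted(names) for width, names in groups.items()}


if __name__ == "__main__":
    # Usage: python check_scratch.py path/to/SCRATCH
    for width, names in sorted(group_by_digit_width(sys.argv[1]).items()):
        print(f"{width}-digit index: {names}")
```

If the files carry a two-digit index while the error refers to a one-digit name (or the reverse), that would support the naming hypothesis. If no trajectory files are present at all, the problem more likely lies in the step that writes them.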

Sorry I don't have a solution to your issue right now (I'll let others answer that) but re:

I was not able to access the Google group through the many links throughout the GitHub. Is there a solution to this error?

You are right, some settings must have changed with the Google group, but it should now be visible again. Thanks for letting us know.

You should now be able to see and search the forum. If you want to post anything, you will need to request to join with a Google account (you don't need to give a reason), and we will enable your access as quickly as possible.

Here is the link again for convenience:
https://groups.google.com/forum/#!forum/connectivity-modeling-system-club

Thanks so much for the quick reply and for getting the google groups visible. I might be able to pull together a test set for the issue if you would like. Just let me know.