Error while parsing the ext files in PsN

Question


gauravgupta26091987 opened this issue 2 years ago · 2 comments

gauravgupta26091987 commented 2 years ago

Hi Team,

We are facing a strange issue: while running jobs, we get the warning message below on the command line.
We have mainly noticed this with long-running NONMEM jobs or long-running PsN procedures with many jobs.

warning : while parsing the ext files the following error was encountered no table column with index 0 at opt/psn/psn_5_3_0/subproblem.pm line 2775.

The job completes without any errors, but the resulting .ext files are missing some output. As an example, we compared the .ext output from two different jobs: the first is from a correct job execution, the second from a job with this issue. The rows after 25 (the ones with -1000000000 or lower in the first column) are missing in the second output.
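For anyone who wants to check their own runs for this symptom, here is a minimal Python sketch. It assumes the usual .ext layout: a `TABLE NO.` header line, an `ITERATION ...` column header, then one row per iteration, with the final estimates stored in a row whose first column is -1000000000. The function names are hypothetical and this is not part of PsN or NONMEM.

```python
from typing import List, Tuple

# ITERATION value of the final-estimates row (assumed standard .ext layout).
FINAL_ESTIMATE_ROW = -1000000000


def iterations_per_table(ext_text: str) -> List[Tuple[str, List[int]]]:
    """Return (table header, list of ITERATION values) for each table."""
    tables: List[Tuple[str, List[int]]] = []
    for line in ext_text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        if stripped.startswith("TABLE NO."):
            tables.append((stripped, []))
        elif stripped.startswith("ITERATION"):
            continue  # column header line, no data
        elif tables:
            try:
                tables[-1][1].append(int(stripped.split()[0]))
            except ValueError:
                pass  # ignore a half-written or otherwise unparsable row
    return tables


def tables_missing_final_estimates(ext_text: str) -> List[str]:
    """List the headers of tables that have no final-estimates row."""
    return [
        header
        for header, iterations in iterations_per_table(ext_text)
        if FINAL_ESTIMATE_ROW not in iterations
    ]
```

Calling `tables_missing_final_estimates(open("run1.ext").read())` (with `run1.ext` as a placeholder file name) returns an empty list for a complete file and the affected table headers for a truncated one.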

These jobs run on AWS ParallelCluster (Slurm scheduler) with FSx for Lustre as shared storage.
One possible reason we can think of is that the compute nodes sometimes lose the connection to FSx for a short while. This could explain why we see the warning and the lack of file updates while the job (which runs in memory on the compute nodes) keeps running and trying to write to the files. We will try to replicate the issue and see whether it is due to a connection issue between the compute nodes and the storage file share (FSx for Lustre).
Still, we would like to understand whether anybody has faced this kind of issue before, whether it is related to PsN/NONMEM, and what other possible reasons there could be.
@rikardn - looking forward to your support, as we are a bit clueless about this issue. :)
Any help would be really appreciated!
Thanks in advance.

Answer 1 · 2023-06-26T13:13:50.000Z

I have seen similar issues before. The first thing I would test is whether the run gives the same results when run using plain nmfe, preferably on a local machine. If you see the same issue there, it is related only to NONMEM.

Answer 2 · 2023-06-27T08:10:11.000Z

@rikardn Hi Rikard,
Thanks for the information, you were absolutely right.
When we executed the job from a local machine, it ran correctly with complete output, so it looks like an issue with the shared storage on AWS ParallelCluster, as that is where we store all the data and program files. We will investigate in that direction.
Thanks