ORNL-Fusion/ips-wrappers

FRAMEWORK ERROR received a failure message from component ss31615_3@generic_driver@2

Closed this issue · 3 comments

dlg0 commented

@ORNL-Fusion/ips-support-team

The IPS is telling me it's having trouble with the following ...

Error executing command:  mpi_nubeam_comp_exec: init

Does that mean it is unable to find the binary? Or something to do with an "init" thing?

I expect the issue is that it is unable to find the binary, but if so, I'd expect a full path to the file it tried to execute, without that trailing "init" piece. I'm off to check whether it is the binary path, but if I'm on the wrong track, any pointers would be great.

Full log file and path shown below ...

greendl1@edison01:/project/projectdirs/atom/users/greendl1/diem_tsc_error3> cat this.log
2015-05-06 11:29:10,617 FRAMEWORK       WARNING  RM: listOfNodes = [('5689', '24'), ('5886', '24'), ('5930', '24'), ('5931', '24'), ('5932', '24'), ('6017', '24'), ('6117', '24'), ('6118', '24'), ('6121', '24'), ('6122', '24'), ('6123', '24'), ('6124', '24'), ('6125', '24'), ('6126', '24'), ('6127', '24')]
2015-05-06 11:29:10,618 FRAMEWORK       WARNING  RM: max_ppn = 24
2015-05-06 11:29:10,618 FRAMEWORK       WARNING  RM: User set accurateNodes to False
2015-05-06 11:29:10,620 FRAMEWORK       WARNING  Using user set procs per node: 24
2015-05-06 11:29:10,620 FRAMEWORK       WARNING  RM: 15 nodes and 24 processors per node
2015-05-06 11:29:35,766 FRAMEWORK       ERROR    received a failure message from component ss31615_3@generic_driver@2 : (Exception('Error executing command:  mpi_nubeam_comp_exec: init ',),)

@dlg0 Actually this is not coming from the IPS, it's coming from the code in the nubeam.py component (I know you don't appreciate the difference, but it DOES matter). The code says
    if (retcode != 0):
        e = 'Error executing command: mpi_nubeam_comp_exec: init '
        print e
        raise Exception(e)

So the call to mpi_nubeam_comp_exec failed in the init() method and that's how the component developer chose to report it.
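
For illustration only, a component could make this kind of failure easier to trace by putting the return code and the log file path into the exception message. The sketch below is not the actual nubeam.py code: the helper name `run_and_check`, its parameters, and the use of `subprocess` in place of the IPS task-launching services are all assumptions.

```python
import subprocess


def run_and_check(cmd, log_path, phase):
    """Run cmd with stdout/stderr captured in log_path; raise with context on failure.

    Hypothetical helper: cmd is a list of arguments, phase is a label such as "init".
    """
    with open(log_path, "w") as log:
        retcode = subprocess.call(cmd, stdout=log, stderr=subprocess.STDOUT)
    if retcode != 0:
        # Name the command, the phase, the return code, and where the output went.
        raise RuntimeError(
            "Error executing command: %s (phase: %s, return code: %d, see log: %s)"
            % (" ".join(cmd), phase, retcode, log_path)
        )
    return retcode
```

With a message of this shape, the top-level log would distinguish "binary not found" (the `subprocess.call` itself raises `FileNotFoundError`) from "binary ran and returned non-zero", and would point at the file holding the executable's output.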

@dlg0 And as for tracking the cause of error, the log file diem_tsc/work/fastran_nb_nubeam_7/log.nubeam (which captures the stdout from the call to the executable) has some useful info

dlg0 commented

@elwasif Excellent, thanks.

Certainly I appreciate that wrapper codes are likely to produce their own errors, or cause things to crash - and that that type of error is almost always going to be the cause of the problem rather than the IPS proper.

However, tracking down the errors is non-intuitive. If you (as a human with vast experience) were able to figure out where the error occurred, perhaps you could teach the IPS to write useful hints to the top-level log file so non-experts would know where to look. Here, for example, it could provide the full path to the nubeam log file (since the framework would presumably know that the nubeam component failed) as opposed to something indicating the generic_driver.
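
A rough sketch of what such a hint could look like, assuming the failing component attaches its name and log path to the exception it raises. `ComponentError` and `format_failure` are hypothetical names for illustration, not part of the IPS API, and whether the framework can actually see this information is itself an assumption.

```python
import logging


class ComponentError(Exception):
    """Hypothetical exception that carries the failing component and its log file."""

    def __init__(self, message, component, log_path=None):
        super().__init__(message)
        self.component = component
        self.log_path = log_path


def format_failure(reporter, exc):
    """Build the top-level log message, appending hints when the exception has them."""
    msg = "received a failure message from component %s : %s" % (reporter, exc)
    component = getattr(exc, "component", None)
    log_path = getattr(exc, "log_path", None)
    if component:
        msg += " | failing component: %s" % component
    if log_path:
        msg += " | see log: %s" % log_path
    return msg


if __name__ == "__main__":
    logging.basicConfig(format="%(asctime)s %(name)-15s %(levelname)-8s %(message)s")
    err = ComponentError(
        "Error executing command: mpi_nubeam_comp_exec: init",
        component="nubeam",
        log_path="diem_tsc/work/fastran_nb_nubeam_7/log.nubeam",
    )
    logging.getLogger("FRAMEWORK").error(
        format_failure("ss31615_3@generic_driver@2", err)
    )
```

Plain exceptions without the extra attributes would still be logged as they are today, so components could adopt the richer exception one at a time.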