ORNL-Fusion/ips-wrappers

FRAMEWORK ERROR Error staging plasma state files to directory

dlg0 opened this issue · 4 comments

dlg0 commented

@ORNL-Fusion/ips-support-team

In the run below I'm trying to do a TSC-based simulation using an initial plasma state created by fastran_init rather than minimal_init_state. The point is to do the TSC / FASTRAN comparison in a truly modular way, i.e., using the same initial plasma state and the same wrappers and binaries.

/project/projectdirs/atom/users/greendl1/diem_tsc_error8

I'm getting the following error, which refers to staging the PRIOR_STATE = ${SIM_NAME}_psp.nc file from the plasma state set for the TSC port.

greendl1@edison12:/project/projectdirs/atom/users/greendl1/diem_tsc_error8> cat this.log 
2015-05-08 17:34:48,906 FRAMEWORK       WARNING  RM: listOfNodes = [('6109', '24'), ('6110', '24'), ('6111', '24'), ('6112', '24'), ('6113', '24'), ('6114', '24'), ('6115', '24'), ('6116', '24'), ('6117', '24'), ('6118', '24'), ('6119', '24'), ('6120', '24'), ('6121', '24'), ('6122', '24'), ('6123', '24')]
2015-05-08 17:34:48,907 FRAMEWORK       WARNING  RM: max_ppn = 24 
2015-05-08 17:34:48,907 FRAMEWORK       WARNING  RM: User set accurateNodes to False
2015-05-08 17:34:48,908 FRAMEWORK       WARNING  Using user set procs per node: 24
2015-05-08 17:34:48,909 FRAMEWORK       WARNING  RM: 15 nodes and 24 processors per node
2015-05-08 17:35:00,831 FRAMEWORK       ERROR    Error staging plasma state files to directory /global/project/projectdirs/atom/users/greendl1/diem_tsc/work/epa__tsc_4
Traceback (most recent call last):
  File "/global/project/projectdirs/atom/atom-install-edison/ips-gnu-sf/bin/dataManager.py", line 69, in stage_plasma_state
    ipsutil.copyFiles(source_dir, plasma_files, target_dir)
  File "/global/project/projectdirs/atom/atom-install-edison/ips-gnu-sf/bin/ipsutil.py", line 36, in copyFiles
    raise Exception('No such file : %s' %(src_file_full))
Exception: No such file : /global/project/projectdirs/atom/users/greendl1/diem_tsc/work/plasma_state/ss31615_3_psp.nc
2015-05-08 17:36:50,231 FRAMEWORK       ERROR    received a failure message from component ss31615_3@basic_time_step_driver@2 : (RuntimeError(u'No such file or directory',),)

and the following in ss31615.log

2015-05-08 17:35:00,874 epa__tsc_4      ERROR    Error staging plasma state files
Traceback (most recent call last):
  File "/global/project/projectdirs/atom/atom-install-edison/ips-gnu-sf/bin/services.py", line 1696, in stage_plasma_state
    retval = self._get_service_response(msg_id, block=True)
  File "/global/project/projectdirs/atom/atom-install-edison/ips-gnu-sf/bin/services.py", line 352, in _get_service_response
    raise response.args[0]
Exception: No such file : /global/project/projectdirs/atom/users/greendl1/diem_tsc/work/plasma_state/ss31615_3_psp.nc

I really don't get that last line, since that file does exist ...

greendl1@edison12:/project/projectdirs/atom/users/greendl1/diem_tsc> ls /global/project/projectdirs/atom/users/greendl1/diem_tsc/work/plasma_state/ss31615_3_psp.nc
/global/project/projectdirs/atom/users/greendl1/diem_tsc/work/plasma_state/ss31615_3_psp.nc
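
(For context, the existence test in ipsutil.copyFiles happens at staging time, so the file existing when I run ls above does not mean it existed when stage_plasma_state() was called. Below is a rough reconstruction from the traceback, not the actual ipsutil source, with the argument handling simplified:)

import os
import shutil

def copyFiles(source_dir, file_list, target_dir):
    # Sketch of the staging copy: for each requested plasma state file,
    # require that it already exists in source_dir, then copy it over.
    for file_name in file_list:
        src_file_full = os.path.join(source_dir, file_name)
        if not os.path.isfile(src_file_full):
            # This is the exception seen in this.log / ss31615.log.
            raise Exception('No such file : %s' % (src_file_full))
        shutil.copy(src_file_full, target_dir)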

I think the init in $IPS_CSWIM_WRAPPER_PATH/bin/epa_tsc_mcmd.py has created this file, i.e., it exists within the work directory for that component ...

greendl1@edison12:/project/projectdirs/atom/users/greendl1/diem_tsc_error8> ls work/epa__tsc_4/
bpave         input.EC                ps_update_state_eq.list    ss31615_3_ps.geq
bpmax         ITER_SJ.mdescr          ps_update_state_init.cdf   ss31615_3_ps.jso
divhis.EC     log.tsc                 ps_update_state_init.list  ss31615_3_ps.nc
eqdskasci.EC  movie.cdf               ps_update_state_pa.cdf     ss31615_3_psn.nc
eqdsk.EC      osprsou.EC              ps_update_state_pa.list    ss31615_3_psp.nc
fort66.EC     output.EC               sprsin.EC                  tsc.cgm.EC
geqdsk.EC     ps_update_state_eq.cdf  sprsou.EC                  wall_data

but I'm not clear yet on how it got there, or which line of $IPS_CSWIM_WRAPPER_PATH/bin/epa_tsc_mcmd.py, i.e. which call to services.stage_plasma_state(), is generating the error, since in the init it has

200     # Copy current, prior and next state over to working directory
201         try:
202             services.stage_plasma_state()
203         except Exception, e:
204             print 'Error in call to services.stage_plasma_state()', e

but I'm confused as to how all the listed plasma state files are even created at that point.

Any tips here would be great.

Oh, and there is also this hint towards the top of the stdout log

--------------------------------------------------------------------------------
Driver init
basic_time_step_driver
--------------------------------------------------------------------------------
[10]
[10]
[10]
[10]
[10]
[PORTS =10 ][
'INIT', 'DRIVER', 'MONITOR', 'EPA', 'RF_EC', 'RF_IC', 'NB']

[10]

[10]

[10]

[10]

[10]

[10]

[SIMULATION_MODE =10] NORMAL


EPA init
[10]
[10]epa.init() called

[10]
[10]
[10]
[10]
[10]
input_file_name =  input.EC
Created TSC_input_file.TSC_input_file
TSC_input_file: acoef(4963) = 0.0 => using internal TSC fast ion source
TSC_input_file: acoef(4965) = 0.0 => using internal TSC lower hybrid heating/current drive
TSC_input_file: check_settings: bootstrap current model  IBOOTST =  3
[10]
[10]
[10]
[Error in call to services.stage_plasma_state()10 ]No such file : /global/project/projectdirs/atom/users/greendl1/diem_tsc/work/plasma_state/ss31615_3_psp.nc

It looks like the simulation just continues along after hitting that error?
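
That would be consistent with the try/except excerpt above: the exception is caught and printed, but never re-raised, so the framework never learns that the init failed. A minimal fail-fast variant, as a sketch only, keeping the Python 2 style of the wrapper (services here is the component's framework services proxy, as in the excerpt):

    # Copy current, prior and next state over to working directory
    try:
        services.stage_plasma_state()
    except Exception, e:
        print 'Error in call to services.stage_plasma_state()', e
        # Re-raise so the failure propagates to the framework instead of the
        # run continuing as if nothing happened.
        raise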

dlg0 commented

The problem here was due to several things.

  1. fastran_init didn't create the PRIOR and NEXT plasma state files, so I added that in a dev branch of ips-fastran.
  2. I needed to specify PLASMA_STATE_FILES in the fastran_init section of the config file so that the files get moved back to the global plasma state work area (see the config sketch after this list).
  3. The simulation just continues merrily along after an error; I think this is because exceptions are caught inside the wrappers and never propagated to the framework.
  4. The IPS error reporting is opaque in general. Again, _the IPS needs to be made more robust to poorly configured config files, wrappers, python paths, etc._
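
As a sketch of the config entries involved in item 2, assuming the usual IPS config conventions (the keys below follow the standard pattern, but the actual diem_tsc config may differ in detail):

# Global plasma state file names; PRIOR_STATE / NEXT_STATE match the
# ss31615_3_psp.nc / ss31615_3_psn.nc files seen in this run.
CURRENT_STATE = ${SIM_NAME}_ps.nc
PRIOR_STATE   = ${SIM_NAME}_psp.nc
NEXT_STATE    = ${SIM_NAME}_psn.nc

# fastran_init component section: if PRIOR_STATE / NEXT_STATE are not listed
# here, they never get copied back to work/plasma_state after init, and the
# first stage_plasma_state() in the EPA component fails with "No such file".
[fastran_init]
    SCRIPT = ...
    PLASMA_STATE_FILES = $CURRENT_STATE $PRIOR_STATE $NEXT_STATE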

Wael Elwasif commented

@bernhold @dlg0 While we can definitely do better in terms of error reporting (amongst other things), I think the core problem is the lack of proper documentation and/or automatic expression/enforcement of what each component wrapper "expects" from other wrappers (in terms of files, plasma state contents, etc.) and what it promises to deliver after each of its externally visible methods is executed (again in terms of files, PS contents, etc.). As things stand now, this knowledge is spread between the configuration file and the component wrapper code itself.

This is the realm of CS contracts that we've always discussed implementing as part of SWIM/AToM. For this to work, we'll have to agree on the mechanism through which this knowledge is communicated to the framework (the framework can only check what it knows about). The problem right now (and maybe even in the future) is how to keep this knowledge current as the code changes: if you generate one more file, or expect one more file to be there before a method is invoked, you have to remember to update the configuration file entry or the call to the framework that encodes this change. History tells us that this kind of consistency is not one of our strong points.
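
Even a very small version of such a check would have caught the failure above at init time. For illustration only (the INPUT_STATE_FILES / OUTPUT_STATE_FILES keys and the check_contract() helper below are hypothetical and do not exist in the current framework):

import os

def check_contract(state_dir, file_names, component, phase):
    # Verify that every declared plasma state file is present in the global
    # plasma state directory; fail with a clear message if any are missing.
    missing = [f for f in file_names
               if not os.path.isfile(os.path.join(state_dir, f))]
    if missing:
        raise RuntimeError('%s %s contract violated, missing: %s'
                           % (component, phase, ', '.join(missing)))

# Framework side, around each externally visible component method:
#   check_contract(state_dir, conf['INPUT_STATE_FILES'], 'epa__tsc_4', 'pre-init')
#   ... invoke the component's init() ...
#   check_contract(state_dir, conf['OUTPUT_STATE_FILES'], 'epa__tsc_4', 'post-init')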

bernhold commented

I suggest first improving the error detection/handling/reporting as far as possible without contracts.

On a parallel path (maybe a new ticket) we can begin discussing how to incorporate contracts or something similar. But that will be a more complex and longer-term activity.


dlg0 commented

I agree with @bernhold. Improving the error handling in the short term is a high priority: tracking down the true source of a failed run is the single most time-consuming part of setting up an IPS run (at least in my experience).

The idea of contracts sounds good too, but we need something done in the short term first.

Certainly the conventions for where information resides also need to be improved. For example, a wrapper ideally would not depend on the particular driver that drives it, or would at least be agnostic to how it is driven, but that is exactly the problem I am having now.