ORNL-Fusion/ips-wrappers

How to have the IPS checkpoint at all timesteps?

dlg0 opened this issue · 21 comments

dlg0 commented

@ORNL-Fusion/ips-support-team

How do I configure the IPS config file to checkpoint all timesteps?

I don't think there is a config parameter that explicitly checkpoints every time step, but you can easily do that if your time steps are evenly spaced.

Set MODE = PHYSTIME_REGULAR and PHYSTIME_INTERVAL = separation between time steps

Also see checkpoint section from Integrated Plasma Simulator (IPS) Documentation Release 2.1, October 18, 2011.

checkpoint_components(comp_id_list, time_stamp, Force=False, Protect=False)
Selectively checkpoint components in comp_id_list based on the configuration section CHECKPOINT. If Force is True, the checkpoint will be taken even if the conditions for taking the checkpoint are not met. If Protect is True, then the data from the checkpoint is protected from clean up. Force and Protect are optional and default to False.

The CHECKPOINT_MODE option determines whether the components' checkpoint methods are invoked.

Possible MODE options are:

WALLTIME_REGULAR: checkpoints are saved upon invocation of the service call checkpoint_components(), when a time interval greater than, or equal to, the value of the configuration parameter WALLTIME_INTERVAL has passed since the last checkpoint. A checkpoint is assumed to have happened (but is not actually stored) when the simulation starts. Calls to checkpoint_components() before WALLTIME_INTERVAL seconds have passed since the last successful checkpoint result in a NOOP.

WALLTIME_EXPLICIT: checkpoints are saved when the simulation wall clock time exceeds one of the (ordered) list of time values (in seconds) specified in the variable WALLTIME_VALUES. Let [t_0, t_1, ..., t_n] be the list of wall clock time values specified in the configuration parameter WALLTIME_VALUES. Then checkpoint(T) = True if T >= t_j, for some j in [0,n] and there is no other time T_1, with T > T_1 >= t_j such that checkpoint(T_1) = True. If the test fails, the call results in a NOOP.

PHYSTIME_REGULAR: checkpoints are saved at regularly spaced “physics time” intervals, specified in the configuration parameter PHYSTIME_INTERVAL. Let PHYSTIME_INTERVAL = PTI, and the physics time stamp argument in the call to checkpoint_components() be pts_i, with i = 0, 1, 2, ... Then checkpoint(pts_i) = True if pts_i >= n * PTI, for some n in 1, 2, 3, ..., and pts_i - pts_prev >= PTI, where checkpoint(pts_prev) = True and pts_prev = max(pts_0, pts_1, ..., pts_{i-1}). If the test fails, the call results in a NOOP.

PHYSTIME_EXPLICIT: checkpoints are saved when the physics time equals or exceeds one of the (ordered) list of physics time values (in seconds) specified in the variable PHYSTIME_VALUES. Let [pt_0, pt_1, ..., pt_n] be the list of physics time values specified in the configuration parameter PHYSTIME_VALUES. Then checkpoint(pt) = True if pt >= pt_j, for some j in [0,n] and there is no other physics time pt_k, with pt > pt_k >= pt_j such that checkpoint(pt_k) = True. If the test fails, the call results in a NOOP.
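As a rough illustration of the two physics-time rules quoted above, here is a minimal Python sketch. It is not the IPS implementation: the function names are made up for this example, and it assumes that the "previous checkpoint" condition is trivially satisfied before any checkpoint has been taken, which the quoted text does not spell out.

```python
def phystime_regular(timestamps, interval):
    """Return the time stamps at which the PHYSTIME_REGULAR rule would save.

    Hypothetical helper for illustration; not part of IPS.
    """
    saved = []
    prev = None  # time stamp of the most recent saved checkpoint
    for pts in timestamps:
        reached_first_multiple = pts >= interval          # pts >= n * PTI for n = 1
        far_enough = prev is None or pts - prev >= interval  # assumption when prev is None
        if reached_first_multiple and far_enough:
            saved.append(pts)
            prev = pts
    return saved


def phystime_explicit(timestamps, values):
    """Return the time stamps at which the PHYSTIME_EXPLICIT rule would save.

    Each value in `values` triggers at most one checkpoint: the first
    time stamp that equals or exceeds it.
    """
    saved = []
    prev = None
    for pt in timestamps:
        if any(v <= pt and (prev is None or v > prev) for v in values):
            saved.append(pt)
            prev = pt
    return saved


steps = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
print(phystime_regular(steps, 1.0))          # [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
print(phystime_regular(steps, 2.0))          # [2.0, 4.0, 6.0]
print(phystime_explicit(steps, [2.5, 5.0]))  # [3.0, 5.0]
```

With PHYSTIME_INTERVAL equal to the step size, every step is saved, which is the workaround suggested earlier. Note that with non-integer steps the `>=` comparison is subject to floating-point rounding, so the interval and the actual step size have to be kept consistent by hand.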

The configuration parameter NUM_CHECKPOINT controls how many checkpoints to keep on disk. Checkpoints are deleted in a FIFO manner, based on their creation time. Possible values of NUM_CHECKPOINT are:

• NUM_CHECKPOINT = n, with n > 0 -> Keep the most recent n checkpoints
• NUM_CHECKPOINT = 0 -> No checkpoints are made/kept (except when Force = True)
• NUM_CHECKPOINT < 0 -> Keep ALL checkpoints
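The retention rule can be sketched in a few lines of Python. This is only an illustration of the FIFO behaviour described above; `retained` is a hypothetical helper, not an IPS function, and it ignores the Force = True exception.

```python
def retained(created, num_checkpoint):
    """Return the checkpoints still on disk under the NUM_CHECKPOINT rule.

    `created` lists checkpoint labels in creation order, oldest first.
    """
    if num_checkpoint < 0:
        return list(created)              # keep ALL checkpoints
    if num_checkpoint == 0:
        return []                         # nothing is kept
    return list(created)[-num_checkpoint:]  # oldest ones are deleted first


print(retained(["t1", "t2", "t3", "t4"], 2))   # ['t3', 't4']
print(retained(["t1", "t2", "t3", "t4"], -1))  # ['t1', 't2', 't3', 't4']
```

So to checkpoint and keep every time step, a negative NUM_CHECKPOINT is needed in addition to a mode that saves at every step.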

Checkpoints are saved in the directory ${SIM_ROOT}/restart

Under [CHECKPOINT]

On Jul 14, 2015, at 11:58 AM, David L Green notifications@github.com
wrote:

@ORNL-Fusion/ips-support-team

How do I configure the IPS config file to checkpoint all timesteps?



dlg0 commented

Yeah, that's what I'd already done. You have to manually ensure consistency between the two (actual and checkpoint) timesteps. I wanted something more like CHECKPOINT_MODE = ALL.

dlg0 commented

Revisiting this: to minimize the amount of manual fiddling and the possible inconsistencies that arise when editing the .config file, I'm requesting the CHECKPOINT_MODE = ALL capability so it just checkpoints every time step, irrespective of what those time steps are.

Thanks,
David.

Done. Note that another way to do this currently is to call the checkpoint_components() method from the driver with the flag Force=True, but that would require a code change.
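For reference, the Force=True route might look roughly like the sketch below. Only the checkpoint_components(comp_id_list, time_stamp, Force, Protect) signature comes from the documentation quoted earlier; `StubServices`, `run_steps`, and the component id are hypothetical stand-ins so the example runs on its own, and a real driver would use its own services object.

```python
class StubServices(object):
    """Hypothetical stand-in for the IPS services object (illustration only)."""

    def __init__(self):
        self.calls = []

    def checkpoint_components(self, comp_id_list, time_stamp,
                              Force=False, Protect=False):
        # A real framework would write the checkpoint; here we only record the call.
        self.calls.append((tuple(comp_id_list), time_stamp, Force))


def run_steps(services, comp_ids, timestamps):
    """Hypothetical driver loop that forces a checkpoint at every time step."""
    for t in timestamps:
        # ... the component step calls for time t would go here ...
        services.checkpoint_components(comp_ids, t, Force=True)


svc = StubServices()
run_steps(svc, ["comp_a"], [0.5, 1.0])
print(svc.calls)
```

Because Force=True bypasses the mode test, this works for unevenly spaced steps too, at the cost of editing the driver.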

dlg0 commented

Sweet, thanks. Have you updated /project/projectdirs/atom/atom-install-edison/ips-gnu-sf ?

Yes

dlg0 commented

Testing ...

dlg0 commented

Hmmm, while I have not implemented the CHECKPOINT_MODE = ALL option in the run, it is failing with an error I have not seen before ...

2015-09-08 08:31:08,206 init__minimal_state_init_1 ERROR    Uncaught Exception in component method.
Traceback (most recent call last):
  File "/global/project/projectdirs/atom/atom-install-edison/ips-gnu-sf/bin/component.py", line 121, in __run__
    retval = method(*args)
  File "/global/project/projectdirs/atom/atom-install-edison/ips-gnu-sf/bin/configurationManager.py", line 207, in initialize
    mpirun_version = self.platform_conf['MPIRUN_VERSION']
  File "/global/project/projectdirs/atom/atom-install-edison/ips-gnu-sf/bin/configobj.py", line 567, in __getitem__
    val = dict.__getitem__(self, key)
KeyError: 'MPIRUN_VERSION'

A copy of run is in /project/projectdirs/atom/users/greendl1/runs/diem_tsc_err_mpirun

dlg0 commented

This may have been due to me running on Hopper when I thought I was on Edison! Re-testing now.

No, the problem is in minimal_state_init.py, lines 110-112. It's checking for SIMULATION_MODE to be either RESTART or NORMAL, while the config file has
SIMULATION_MODE = REGULAR
There is a bare raise statement with no exception caught at that point, causing it to re-raise the most recent exception, which had already been dealt with (it happens to be the lookup of MPIRUN_VERSION).

dlg0 commented

Sweet. Thanks. Perhaps a nice addition would be in such cases to print out what the available options are? Plus I'll add an exception handler to the minimal state unit wrapper.


From: Wael Elwasif notifications@github.com
Date: September 9, 2015 at 1:57:39 PM EDT
To: ORNL-Fusion/ips-atom ips-atom@noreply.github.com
Cc: Green, David L. greendl1@ornl.gov
Subject: Re: [ips-atom] How to have the IPS checkpoint at all timesteps? (#24)

No, the problem is in minimal_state_init.py, lines 110-112. It's checking for SIMULATION_MODE to be either RESTART or NORMAL, while the config file has
SIMULATION_MODE = REGULAR
There is a bare raise statement with no exception caught at that point, causing it to re-raise the most recent exception, which had already been dealt with (it happens to be the lookup of MPIRUN_VERSION).


dlg0 commented

How exactly should this exception handling be rewritten so that the error gets caught and this message is displayed, rather than the MPIRUN_VERSION exception?

Well, no exception has been caught at that point, so there is nothing for a bare raise to re-raise. A new exception needs to be generated. Something like

            if mode not in ['RESTART', 'NORMAL']:
                print 'minimal_state_init: unrecoginzed SIMULATION_MODE: ', mode
                raise Exception("Error in minimal_state_init: unrecoginzed SIMULATION_MODE: %s" % mode)

There is no need for the return statement, since raising the exception aborts execution.

dlg0 commented

OK, so what I used is the following ...

        try:
            mode = services.get_config_param('SIMULATION_MODE')
        except:
            logMsg = 'minimal_state_init: No SIMULATION_MODE variable in config file. Please set NORMAL or RESTART'
            self.services.exception(logMsg)
            raise

        if mode == 'RESTART':
            print 'minimal_state_init: RESTART'
        if mode not in ['RESTART', 'NORMAL']:
            logMsg = 'minimal_state_init: unrecoginzed SIMULATION_MODE: ' + mode
            self.services.exception(logMsg)
            raise ValueError(logMsg)

which seems to work, except that log.framework and log.warning give me confusing / conflicting messages ... specifically, why is there a "No such file or directory" error showing up - that is uber confusing.

Good

greendl1@edison07:/project/projectdirs/atom/users/greendl1/runs/diem_tsc> cat log.framework
2015-09-10 10:09:10,570 FRAMEWORK       WARNING  RM: listOfNodes = [('5720', '24'), ('5721', '24'), ('5724', '24'), ('5729', '24'), ('5730', '24'), ('5731', '24'), ('5732', '24'), ('5738', '24'), ('5739', '24'), ('5743', '24'), ('5757', '24'), ('5758', '24'), ('5759', '24'), ('5767', '24'), ('5768', '24')]
2015-09-10 10:09:10,571 FRAMEWORK       WARNING  RM: max_ppn = 24
2015-09-10 10:09:10,571 FRAMEWORK       WARNING  Using user set procs per node: 24
2015-09-10 10:09:10,572 FRAMEWORK       WARNING  RM: 15 nodes and 24 processors per node
2015-09-10 10:09:11,876 FRAMEWORK       ERROR    received a failure message from component thisSim@minimal_state_init@1 : (ValueError('minimal_state_init: unrecoginzed SIMULATION_MODE: REGULAR',),)

Confusing

greendl1@edison07:/project/projectdirs/atom/users/greendl1/runs/diem_tsc> cat log.warning
2015-09-10 10:09:11,843 init__minimal_state_init_1 ERROR    minimal_state_init: unrecoginzed SIMULATION_MODE: REGULAR
Traceback (most recent call last):
  File "/global/project/projectdirs/atom/atom-install-edison/ips-gnu-sf/bin/component.py", line 88, in __run__
    os.chdir(workdir)
OSError: [Errno 2] No such file or directory: '/global/project/projectdirs/atom/users/greendl1/runs/diem_tsc/work/init__minimal_state_init_1'

@batchelordb This is for you

Not really,
The confusing part comes from the second use of

self.services.exception(logMsg)

when no exception has been caught here. It prints the trace from the most recent exception caught, which happens to be the chdir failure (this one was caught and handled earlier by the framework).

dlg0 commented

Well I put the self.services.exception(logMsg) in there to have a message printed somewhere I can see. The part that is confusing is the No such file or directory error. Why is it there?

As I said, this was the last exception caught (in this case by the framework), so when you made the call to self.services.exception(logMsg), it not only prints the message you provide, but the trace from the most recently caught exception. This call is meant to be called ONLY after you catch an exception to provide more context to the logMsg (hence the name). If you simply want to generate the logMsg, use self.services.error(logMsg) instead.

dlg0 commented

So is this correct?

        try:
            mode = services.get_config_param('SIMULATION_MODE')
        except:
            logMsg = 'minimal_state_init: No SIMULATION_MODE variable in config file. Please set NORMAL or RESTART'
            self.services.exception(logMsg)
            raise

        if mode == 'RESTART':
            print 'minimal_state_init: RESTART'
        if mode not in ['RESTART', 'NORMAL']:
            logMsg = 'minimal_state_init: unrecoginzed SIMULATION_MODE: ' + mode
            self.services.error(logMsg)
            raise ValueError(logMsg)

The error messages are certainly much clearer.

This looks good. The general "rule" in Python is to catch exceptions raised by lower layers, then decide whether to (a) deal with them and "fix" the problem, (b) pass them upwards as is, or (c) raise a different exception that is more meaningful to the caller(s). services.exception() expects to be called immediately after an exception has been caught, and it prints the stack trace to give context to whatever error message you provide. If called outside this context, chaos ensues.
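The same distinction exists in Python's standard logging module, where exception() adds the traceback of the exception being handled and error() logs only the message. The sketch below uses that module as an analogy (it does not reproduce the IPS services API); `read_mode`, `VALID_MODES`, and the dict-based config are assumptions made for the example. It also lists the valid options in the error message, as suggested earlier in the thread.

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("minimal_state_init")

VALID_MODES = ("NORMAL", "RESTART")  # hypothetical constant for this example


def read_mode(config):
    """Return SIMULATION_MODE from a dict-like config, or raise a clear error."""
    try:
        mode = config["SIMULATION_MODE"]
    except KeyError:
        # An exception is being handled here, so a traceback gives useful
        # context. Log it, then pass the exception upwards as is (option b).
        log.exception("No SIMULATION_MODE in config; set one of: %s",
                      ", ".join(VALID_MODES))
        raise
    if mode not in VALID_MODES:
        # No exception is active here, so log a plain error and raise a new,
        # more meaningful exception (option c).
        msg = "unrecognized SIMULATION_MODE %r; valid options are: %s" % (
            mode, ", ".join(VALID_MODES))
        log.error(msg)
        raise ValueError(msg)
    return mode


print(read_mode({"SIMULATION_MODE": "RESTART"}))  # RESTART
```

Catching KeyError specifically, rather than using a bare except, keeps unrelated failures from being reported as a missing parameter. Whether get_config_param raises KeyError is not stated in this thread, so the exception type would need to be checked against the framework.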

dlg0 commented

OK, then I'm testing the actual CHECKPOINT_MODE = ALL now.

dlg0 commented

Seems to be working. Cheers.