ORNL-Fusion/ips-wrappers

Have IPS copy and store binaries - and be able to re-run from them?

Opened this issue · 2 comments

dlg0 commented

@bernhold @elwasif @batchelordb

So an interesting use case occurred today. @murakamim has two FASTRAN runs he is comparing, but they are using different values of a parameter that I think is hardwired into the binary itself. So Masanori requested a feature, @parkjm added that feature, and then Masanori re-ran. It turned out that some subsequent case that Masanori ran now failed, that didn't fail with the original binary.

What would be a good capability here is to re-run the case with the original binary ... which no longer exists. Of course this could be accomplished via appropriate binary versioning etc., but this is physics so that doesn't happen. One alternative that occurred to me was to have the IPS framework copy all the binaries used in the workflow execution to a simulation_binaries directory or the like, the same way all the python files are copied into the simulation_setup directory. And then, have the capability to re-run an existing IPS simulation using the binaries, python files, etc. that were stored within the run directory itself, say by simply setting some config file variable like RERUN_FROM_COPIES=1.
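A minimal sketch of what the copy step could look like — the function name `archive_binaries` and the `simulation_binaries` layout are illustrative, not an existing IPS API:

```python
import os
import shutil

def archive_binaries(binary_paths, run_dir, subdir="simulation_binaries"):
    """Copy each binary used in a workflow into the run directory and
    return a mapping from original path to the stored copy.
    (Hypothetical helper; names are illustrative, not IPS API.)"""
    dest_dir = os.path.join(run_dir, subdir)
    os.makedirs(dest_dir, exist_ok=True)
    stored = {}
    for path in binary_paths:
        dest = os.path.join(dest_dir, os.path.basename(path))
        shutil.copy2(path, dest)  # copy2 preserves mtime and mode bits
        stored[path] = dest
    return stored
```

On re-run, a `RERUN_FROM_COPIES`-style flag could then resolve executables from the returned mapping instead of their original (possibly changed) locations.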

I recall there being something like a replay component. Is its functionality anything like what I've described?

If not, I think it could be quite useful to have the capability to re-run an existing run from only those files (including the binaries, python, and input files) now stored within the run directory, rather than pulling them again from their original locations (which in the above use case have now changed).

Thoughts?

@dlg0

According to the post (ORNL-Fusion/ips-fastran#9 (comment)), @murakamim's problem is not related to a binary change (I have not changed the fastran binary in its public location). Anyway, your suggestion is a good thing to discuss, though I'm not sure it's possible in a practical sense.

David,

This kind of thing comes up often in circles where people think about
reproducibility of scientific results. Doing this in the general case
is extremely challenging. Just a couple of examples of things that
cause problems...

If your executable is dynamically linked to any libraries, in principle
you need to capture those too.
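For dynamically linked ELF executables on Linux, those library dependencies can at least be enumerated with `ldd`; a sketch (Linux-specific, and `ldd` output format varies somewhat by platform):

```python
import subprocess

def parse_ldd_output(text):
    """Parse `ldd` output into {soname: resolved_path}.
    Lines without '=>' (vdso, the loader itself) and unresolved
    'not found' entries are skipped."""
    deps = {}
    for line in text.splitlines():
        line = line.strip()
        if "=>" not in line:
            continue
        name, _, rest = line.partition("=>")
        path = rest.strip().split(" ")[0]
        if path.startswith("/"):
            deps[name.strip()] = path
    return deps

def shared_library_deps(executable):
    """Run `ldd` on an executable and return its shared-library map."""
    out = subprocess.run(["ldd", executable], capture_output=True,
                         text=True, check=True)
    return parse_ldd_output(out.stdout)
```

Archiving the returned paths alongside the executable would capture the full link-time environment, though as noted above this still has a limited shelf life across OS upgrades.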

It is common for OS, compiler, and other upgrades to force recompilation
of executables. Saved executables are likely to have a limited shelf life.

Another point is that having the old binaries merely gives you the
possibility to run some prior version of the code. But, I would argue
this is fundamentally unhelpful. What you really need to know is
WHAT THE DIFFERENCE IS between the two versions. For that, you need
careful versioning of the source code and the ability to associate those
version numbers with the executables that are used in any given
simulation. And if you have good versioning of the source code, then
you DON'T NEED to capture the binaries because you can recreate any
binary you want in a way that you can be confident it will run today
(leaving aside the question of whether you've versioned the third-party
libraries you're relying on too).

So, my conclusion is that you don't want to save the binaries, you
really want to push harder to get the source code properly versioned.

It doesn't have to be that hard. If your code is in a version control
repository, they already provide a perfectly good unique identifier you
can use (though you are welcome to invent your own versioning scheme
too). What you need to do is to get that version info into the
executable, so that you can (for example) print it out at the beginning
of a run. This could (should) also be part of the metadata that would
be captured in the MPO system, and maybe the IPS could make a special
point of gathering version info of everything used in a run, distinct
from the MPO.
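One way to do that capture — a sketch that shells out to `git` at build time and bakes the result into a small module the code can print at startup (file and function names are illustrative; assumes `git` is on PATH):

```python
import subprocess

def git_version(repo_dir):
    """Return the short commit id of the repo at repo_dir, or None if
    git is unavailable or the directory is not a repository."""
    try:
        out = subprocess.run(
            ["git", "-C", repo_dir, "rev-parse", "--short", "HEAD"],
            capture_output=True, text=True, check=True)
        return out.stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        return None

def write_version_module(repo_dir, target="_version.py"):
    """Bake the version into a generated module so a run can report
    exactly which commit it was built from."""
    version = git_version(repo_dir) or "unknown"
    with open(target, "w") as f:
        f.write(f"VERSION = {version!r}\n")
    return version
```

The generated `_version.py` can then be imported and printed at the start of every run, and the same string recorded in the run metadata.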

The other thing that's useful to know is whether what you built is
actually the version from the repo it claims to be, or whether it has
been modified. I would treat this as a binary (yes/no) and not try to
capture the differences from the repository. Real science should only
be done with code that is identical to a repository version. If the code
has been modified, you're in development mode, not science mode. SVN and I
think git provide tools to tell if your working directory differs from
the repo version. This kind of check can be built into the build system,
and the version identifier that goes into the executable gets modified
to give a clear indication that it is derived from a given repo
version rather than being exactly some repo version.
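A sketch of that check, leaning on git's `--dirty` flag to mark a modified working tree; the "science mode" predicate is the binary yes/no decision argued for above (names are illustrative):

```python
import subprocess

def describe_with_dirty(repo_dir):
    """Version string that flags a modified working tree,
    e.g. 'a1b2c3d' for a clean checkout, 'a1b2c3d-dirty' otherwise."""
    out = subprocess.run(
        ["git", "-C", repo_dir, "describe", "--always", "--dirty"],
        capture_output=True, text=True, check=True)
    return out.stdout.strip()

def is_science_mode(version_string):
    """True only when the build exactly matches a repository version;
    a '-dirty' suffix means development mode, not science mode."""
    return not version_string.endswith("-dirty")
```

Embedding `describe_with_dirty()`'s output into the executable at build time gives every run an unambiguous statement of whether it came from an exact repo version.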

It would be good to do this with the physics codes, the wrappers, and
the IPS itself.

David E. Bernholdt | Email: bernholdtde@ornl.gov
Oak Ridge National Laboratory | Phone: +1 865-574-3147
http://www.csm.ornl.gov/~bernhold | Fax: +1 865-576-5491