cta-observatory/pyirf

Handle minimal Provenance information

Opened this issue · 8 comments

The idea is to store a minimal set of information that explains where a given set of IRFs comes from, i.e. provenance information following IVOA standards: the origin of the input DL2 files, the cuts applied, and the optimization target used for these IRFs.

In ctapipe, this is handled by a dedicated provenance module that is called by each component/tool.

But as pyirf is meant to be self-contained, a simpler mechanism might be set up.

Actually, we might want some kind of interface that could use either a simple local provenance module (for users playing around) or the official ctapipe module when running a full official production.

A very easy way (used by the current master) is to just create secondary files (in that case, 1 per particle type with the selected simulated events, and 1 containing a table of the final cuts used to create the IRFs).

Personally I don't think it's that bad to just output a single FITS or HDF5 file containing all the auxiliary/provenance information.

Actually, it should not be forbidden to add HDUs to the final FITS file where the OGADF information will be encoded.

So we could also think of adding an HDU for each additional piece of information we want to deliver (see issue #6).

The provenance should be handled at the pipeline level, e.g. using the ctapipe module, or any other mechanism.
As we decided that pyIRF would be an independent library, called by such a pipeline, I don't think we should be concerned with provenance here.

NB: if provenance information is later added to the OGADF, we will of course follow the evolution of the format.

This provenance also has to cover things like the inputs and outputs. For example: was this DL2 data from EventDisplay or ctapipe? Was it from Prod3 or Prod5? What steps were applied to it? At a minimum we should support the CTA Reference Metadata (the same headers we now put in the DL1 files), but more detail will also be needed.

Another option is to use the LogProv system from Matthieu and Enrique, which has so far been tested with gammapy. It allows one to attach provenance tracking information at the function-call level (so it is nice for user scripts that use the PyIRF system), usually by just adding some decorators.

see https://github.com/mservillat/logprov
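To illustrate what "provenance at the function-call level via decorators" could look like, here is a minimal sketch using only the standard library. It is not the logprov API: `track_provenance`, `PROVENANCE_LOG`, and the toy `apply_gh_cut` step are hypothetical names made up for this example.

```python
import functools
from datetime import datetime, timezone

# Hypothetical in-memory log; a real system would persist this somewhere.
PROVENANCE_LOG = []


def track_provenance(func):
    """Record the name, arguments, result and timing of each call."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        started = datetime.now(timezone.utc).isoformat()
        result = func(*args, **kwargs)
        PROVENANCE_LOG.append(
            {
                "activity": func.__name__,
                "args": [repr(a) for a in args],
                "kwargs": {k: repr(v) for k, v in kwargs.items()},
                "result": repr(result),
                "started": started,
                "ended": datetime.now(timezone.utc).isoformat(),
            }
        )
        return result

    return wrapper


@track_provenance
def apply_gh_cut(scores, threshold):
    """Toy stand-in for an analysis step: keep scores above a threshold."""
    return [s for s in scores if s >= threshold]


selected = apply_gh_cut([0.2, 0.7, 0.9], threshold=0.5)
```

The decorated function itself stays free of any provenance logic, so the same step can be called with or without tracking.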

My point exactly, and pyIRF has no way to know where these files come from and what was done with them prior to DL2, so the provenance should be dealt with at a higher level, no?

I think we are conflating two "provenances" here:

  • the simtel to DL2 provenance (which, as @vuillaut says, is a pipeline matter)
  • the provenance produced by pyirf itself (e.g. the final optimization cuts used to create the IRFs), which should be part of the pyirf output (either together with, or separate from, the "pure" OGADF IRF information)
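For the second kind, a rough sketch of what such a record might hold is shown below. The `IRFProvenance` class, its field names, and the file names and cut values are all hypothetical, chosen only for illustration. JSON is used here as a stand-in serialization; the resulting string could equally be written to a sidecar file or stored in an extra HDU.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class IRFProvenance:
    """Hypothetical container; the field names are illustrative only."""

    input_files: list
    optimization_target: str
    cuts: dict = field(default_factory=dict)


# Made-up example values, not taken from any real production.
record = IRFProvenance(
    input_files=["gamma_dl2.h5", "proton_dl2.h5"],
    optimization_target="sensitivity",
    cuts={"gh_score_min": 0.8, "theta_max_deg": 0.2},
)

# Serialize to a plain string and read it back to check nothing is lost.
serialized = json.dumps(asdict(record), sort_keys=True)
restored = IRFProvenance(**json.loads(serialized))
```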

Indeed, to efficiently build the chain of provenance, ideally, each package has to provide its inputs/outputs and give information on the execution. Each dataset will have a dedicated identifier that is used to make the connection with the previous steps in the chain.
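As a sketch of how dataset identifiers can link the steps of such a chain, the example below derives an identifier from each output's content and lets every record list the identifiers it used. All names (`dataset_id`, `make_record`, `trace`) are assumptions for this example, and the content-hash scheme is just one possible choice of identifier. The `used` / `generated` keys loosely borrow W3C PROV vocabulary.

```python
import hashlib


def dataset_id(content: bytes) -> str:
    """Derive an identifier from the content (truncated SHA-256)."""
    return hashlib.sha256(content).hexdigest()[:16]


def make_record(activity, output, used_ids):
    """Describe one step: what it produced and which datasets it used."""
    return {
        "activity": activity,
        "generated": dataset_id(output),
        "used": list(used_ids),
    }


def trace(record_id, records):
    """Walk back from a dataset id through the steps that produced it."""
    by_output = {r["generated"]: r for r in records}
    chain = []
    pending = [record_id]
    while pending:
        rec = by_output.get(pending.pop())
        if rec is None:
            continue  # input with no recorded origin
        chain.append(rec["activity"])
        pending.extend(rec["used"])
    return chain


# Toy byte strings stand in for real file contents.
dl2 = make_record("pipeline_to_dl2", b"toy DL2 content", used_ids=[])
irf = make_record("compute_irf", b"toy IRF content", used_ids=[dl2["generated"]])

chain = trace(irf["generated"], [dl2, irf])
```

Each package only needs to emit its own record; the connection to earlier steps comes from the shared identifiers.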

In the case of pyIRF, it might be interesting to adapt the logprov Python module (initially part of a dev version of gammapy, in connection with the high-level interface). However, it may not yet be suited to the structure of pyIRF.

@mservillat This is exactly the kind of thing I want to avoid baking into pyirf right now.

We offer small, modular functions that each do one thing, so any user (lstchain, protopipe, future ctapipe tools, or someone else) can choose their own config and provenance system, since no standard, agreed-upon solution exists yet.