Proposal for a one-dimensional, unbinned, spectrum format
cosimoNigro opened this issue · 12 comments
Hello all,
the format currently foresees binned spectral data, the so-called OGIP files, in turn based on NASA's definition.
The OGIP format provides all the input necessary for a classical one-dimensional binned spectral analysis.
Me and @javierrico would like to ask whether there might be interest in adding an unbinned one-dimensional spectral format.
Unbinned one-dimensional likelihoods have been (and are being) used - for example - in dark matter searches, see the likelihood definition in Aleksić et al. (2012) and its later usage in Aleksić et al. (2014). Additionally, any point like analysis (e.g. extragalactic) can profit of this technique.
In my opinion it will not be a big effort to introduce such definition in the format.
Prototypical 1D unbinned spectral data should contain:
-
an ON event list, containing all the events in the source region with their reconstructed energies;
-
an OFF event list, containing all the events in the background region with their reconstructed energies;
-
A column, or a HEADER keyword, with the ratio between the ON and OFF region exposures.
[ alternatively to 1) and 2) a single Event list with reconstructed energies and the ON / OFF columns could be used; ]
-
unbinned effective area as a function of true energy, at the offset of the observation, corrected with the PSF (if full enclosure IRFs are provided);
-
unbinned energy migration, at the offset of the observation (alternatively resolution and bias as a function of true energy).
From the point of view of the science tools, I believe gammapy
and ctools
already produce data with such informations, they are simply not stored.
Such data should be the intermediate products obtained from DL3 right after the signal / background separation in SpectrumExtraction and csphagen, respectively.
They could be dumped in a FITS file before being binned to produce the OGIP files.
Any thoughts?
Thank you.
The definition for binned spectral data that you're referring to doesn't exist any longer in the current version of the specs, as far as I can tell (your link is to v0.1, latest is v0.2).
I don't see a big problem in coming up with format specifications for an unbinned spectral analysis, although I don't quite understand why you'd like to have this intermediate format. You should be able to perform the analysis with the standard DL3 level data, no?
Dear @lmohrmann, thanks for replying and sorry if it took me a while to come back to this.
I don't quite understand why you'd like to have this intermediate format. You should be able to perform the analysis with the standard DL3 level data, no?
Imagine one wants to perform a more sophisticated unbinned likelihood analysis with some methods actually not available in (or yet covered by) gammapy
/ ctools
, among these: dark matter or LIV searches - for example. One can use the science tools to reduce the data (i.e. perform the signal / background separation), and then perform the analysis with his tool of choice.
You might want also to consider a simple separation of processes: in a first step one can reduce a large amount of observations to spectral data and in a second, separate, step load them for the statistical analysis (this is what is currently possible with binned data).
I spoke privately with @adonath about this, and he supports the integration of such formats in gammapy
.
Starting from the MAGIC DL3 files in the joint-crab project, I created some prototypical unbinned spectral data in this repository
https://github.com/cosimoNigro/iact_unbinned_spectrum_format
as I thought, it was easy to do so with gammapy
, all the needed functionalities were already available
https://github.com/cosimoNigro/iact_unbinned_spectrum_format/blob/master/unbinned_spectrum_format.ipynb
the prototypical unbinned data are here
https://github.com/cosimoNigro/iact_unbinned_spectrum_format/tree/master/spectra/unbinned
As for the OGIP files I created a file for the ON and OFF data and one for the effective area and energy migration. The IRFs format are just OGIP files with a very fine binning.
I believe this is the necessary input for a 1D unbinned likelihood analysis.
I'd like to keep iterating with you on this, thanks.
Thank @cosimoNigro, for the prototyping on this issue.
Just a few more thought from my side:
For consistency we decided to only work with binned datasets in Gammapy for now, but long-term we would like to support unbinned analyses in Gammapy as well (side comment: basically because it's a requirement for the CTA science tools, personally I think the use-cases are very limited and in practice a fine energy binning will work just as good...). So far we have not decided how unbinned analyses will be integrated into our API.
The question is how we can support the unbinned case in our data reduction chain already, so that it can be used for @cosimoNigro use cases. For this I see two options:
- We add an optional
SpectrumDataset.events
,SpectrumDatasetOnOff.events
andSpectrumDatasetOnOff.events_off
attribute and extend theSpectrumDataset.write()
methods to also write those. - We implement a new
EventsDataset
andEventsDatasetOnOff
as data containers and add anotherEventsDatasetMaker
.
Currently I'd prefer option 1, because unbinned analyses are not a priority for Gammapy now and option 2 is much more work.
Concerning the data formats:
I basically agree with @lmohrmann, that we don't need any specialised data format here, because the building blocks are already defined:
- We can use the normal event list format for on and off event lists
- We can use fine-binned IRFs in OGIP format
So in summary my proposal would be the following:
- We add an
.events
and.events_off
attribute to theSpectrumDataset
andSpectrumDatasetOnOff
and offer another option to theSpectrumDatasetMaker(keep_events=True)
. - We offer a new format in
SpectrumDataset.write(format="ogip-events")
andSpectrumDatasetOnOff.write(format="ogip-events")
, which in addition writes the event files if present in the dataset.
What do you think @registerrier and @cosimoNigro?
And we should add a short paragraph to https://gamma-astro-data-formats.readthedocs.io/en/latest/spectra/ogip/index.html#, explaining this option of writing the event lists.
Dear @cosimoNigro & @adonath ,
I am not sure I completely follow here.
Is there any difference with fine bins 1D spectra? What do you gain specifically by keeping the events?
Conversely, if you prefer to go with events, then why loosing the spatial IRF information that cannot be stored with OGIP format?
What I understand from the paper you link, is that you get a better sensitivity by using a background model rather than with the wstat
estimate of the background counts per energy bin. In that case indeed you can go to very fine 1D bins or unbinned.
I am not sure another data format is needed then, it is simply that you want to perform a background model fit on your ON and OFF data simultaneously. Right?
Dear all,
This proposal for a format of such type of DL4 data is interesting and deserve indeed some discussion. I would like to replace it into the history of the gadf and its current usage.
We had many discussions on format, mainly on the DL3, a bit on DL4 and DL5+. As everyone knows, the level of advancement in very sparse between these three levels.
Nowadays, one can claim that the DL3 format has some maturity, well described and used by more and more experiments. CTA has adopted it for its latest Data Challenge. This does not mean that this format is frozen, evolutions are always possible.
Concerning the DL4 format, few discussion have been made. It appears also that some Science Tools has different views regarding what should be a DL4, and thus their formats are different. They are interrogation about what should be stored or not, which kind of intermediate products should be kept. The discussions have not led to a 'common DL4 format', even for the most basic case (see the latest post of @registerrier about the OGIP format).
For the DL5 format, I think that the situation is not clear also. The latest publications of the current experiments or CTA do not share or use systematically the same format. It deserves also additional discussions to be more precise and complete all types of products (and Science Cases).
To come back to your proposal about such DL4 product, my feeling is that a F2F discussion with developers of Science Tools (hess, magic, veritas, ctools, gammapy) might help to identify all the issues and to propose formats for different use cases. It is hard to discuss the unbinned 1D spectrum format alone, the binned one needs an agreement also, as for the sky maps format. To be planned for a CTA meeting?
Dear @adonath, @regis and @bkhelifi, thanks for your comments.
Is there any difference with fine bins 1D spectra? What do you gain specifically by keeping the events?
certain science cases demand the knowledge of event-wise observables. For example in search for Lorentz Invariance Violation (LIV) you would study the correlation between the energy and the arrival time of individual photons (see Martinez et al. (2009)). LIV searches are performed mostly for AGN and GRB, so in that case (as for any point-like source) you would accept
losing the spatial IRF information
Also, as @adonath pointed out, the unbinned analyses are a requirements for the CTA science tools.
If you are against the definition of an unbinned
data format per se in this precise moment, then - as Axel suggested - we can just add a paragraph in the OGIP description, explaining how to "keep the events". That is how to store ON and OFF event lists with some quantities of interest, along with fine-binned IRFs. Simply listing energy and timestamps of ON and OFF events would already allow to perform LIV and unbinned (1D) dark matter searches. I would start with these two.
@bkhelifi I understand it would be appropriate to discuss the DL4 and DL5 implementation in person, but I think I am asking for a small addition to the spectral data format that is easy to implement in the actual tools and that can be eventually expanded in the future.
Thanks
Hi,
The CTA requirements for the Science Tools are not official yet. The document is waiting for a validation.
In the last Bologna workshop about the ST requirements, the SUSS workshop end 2018, no requirement was set about unbinned analysis. Just for information....
It does not mean that one should not support it, of course!
Cheers,
B.
Thanks for the clarification @bkhelifi! For some reason I had this in mind incorrectly. I'll study the current status of the requirements once again.
As mentionned by Bruno, SUSS requirements are not yet on Jama as not fully validated.
And in the minutes of the workshop there is no mention of unbinned analysis.
If to be supported, it may be time to ask for additional requirements.
Thanks again @bkhelifi and @cboisson for the clarification. If there is not even a requirement for unbinned analysis in the science tools, then I really think there is no point in talking about special DL4 data formats now. It seems too early.
@cosimoNigro To solve this issue here pragmatically I would propose, that we avoid the data format discussion for now. Instead we still add SpectrumDatasetOnOff.events
and SpectrumDatasetOnOff.events_off
to the Gammapy data structures. This gives you at least the possibility to write them to disk (with events.write()
), using the already defined standard format and starting from this you can do whatever you want...what do you think?
@adonath thanks for your offer to implement it in gammapy
, I can and will take care of that myself.
I believe the addition of what I am proposing to the data formats (and not only to gammapy
) is pertinent; the forum is the place where we might want to clarify some details, for example:
- which event quantities shall we store?
(I think energies and timestamps can be enough for the moment) - which informations shall we add into the headers?
(I think on to off exposure ratios and effective time are mandatory) - what will be the file nomenclature?
(for the binned we usedpha_
andbkg_
for the counts - borrowing from the OGIP ... how do we distinguish binned and unbinned counts? do we keeparf_
andrmf_
for IRFs?)
I'd like to have these informations clarified here, as they pertain to a new kind of data we will make possible to store, besides the OGIP. Other higher-level tools (implementing more complex statistical analysis or astrophysical modelling) might want to refer here for their inputs.
I also agree these specification can go into a keep-events
(or event-wise
) subsection of the binned
section, rather than creating a new unbinned
section.
Now what I want to understand is if there is somebody explicitly against this addition to the data formats, if yes:
Is it because it's not supported by enough science cases?
Is it because it's not expandible in the future?