cta-observatory/protopipe

Comparison with EventDisplay : Modeling

This is a sub-summary issue, part of #85.

This part of the pipeline is more complex and structured than the others because it takes into account both energy and classification.
It will be subdivided later on.

Requirements

  • Image cleaning and parametrization (#87)
  • Direction reconstruction (#88)

Reference documents

  • Prod3b IRF report (link)
  • Benchmarks (benchmarks/DL2/benchmarks_DL2_direction-reconstruction.ipynb)

Known sources of possible divergence and current status:

  • As reported in this issue, "[...] all training of g/h separators and cut optimisation steps are done after the application of the multiplicity cuts [...]". This is configurable, but according to the IRF report, "[...] minimum multiplicity of three (for 30 min and 100 s exposures) or four (for 5 h and 50 h exposures) [...]", so two different analyses will be needed instead of one.
  • ENERGY
    • training and use of LUTs
    • main gamma-selection parameters: mean scaled width and mean scaled length
  • CLASSIFICATION
    • boosted decision trees (using the TMVA implementation)
    • in seven different energy bins (configurable)
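As a rough illustration of two items in the list above, here is a minimal Python sketch of a mean scaled width and of assigning events to seven energy bins. It assumes the common "reduced" definition, i.e. the telescope-averaged value of (width - expected width) / spread, with the expected width and spread read from lookup tables; the exact definition used by EventDisplay is not stated in this thread. The function name, the lookup values and the bin edges are all hypothetical.

```python
import numpy as np


def mean_scaled_value(values, expected, spread):
    """Average over telescopes of (value - expected) / spread.

    `expected` and `spread` stand in for values read from lookup tables
    (hypothetical here; a real LUT is indexed by e.g. image size and
    impact distance).
    """
    values = np.asarray(values, dtype=float)
    expected = np.asarray(expected, dtype=float)
    spread = np.asarray(spread, dtype=float)
    return float(np.mean((values - expected) / spread))


# One event seen by three telescopes (made-up numbers, for illustration only).
widths = [0.12, 0.10, 0.14]
lut_expected = [0.10, 0.10, 0.10]
lut_spread = [0.02, 0.02, 0.02]
mscw = mean_scaled_value(widths, lut_expected, lut_spread)

# Seven energy bins need eight edges; these log-spaced edges (in TeV)
# are an assumption, not the binning used by either pipeline.
energy_edges = np.logspace(-2, 2, 8)
energies = np.array([0.05, 1.0, 50.0])
# np.digitize returns the index of the upper edge, so subtract 1
# to get a zero-based bin index.
bin_index = np.digitize(energies, energy_edges) - 1
```

In a setup like this, one classifier would be trained per value of `bin_index`, with the mean scaled width and length among its input features.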

@GernotMaier: regarding the multiplicity cut, it seems to me that there could be a contradiction between what you said here (at the bottom of the discussion) and what is written in the IRF report (point 1, page 22).

In the report, you say that you keep multiplicity >= 2 throughout the analysis, but in the pyirf issue you said that all model training is done after the application of the multiplicity cuts, which in the context of pyirf (i.e. your cut optimization) is 4.
Do you also use 4 (3 for the 100 s and 30 min exposures) for the training?

The IRF report is from 2017; the cuts and several analysis steps have changed slightly since then. Multiplicity cuts are applied at the training stage.

Whether it is 2, 3, or 4 really depends on the files you are looking at. I usually prepare IRFs for these three cuts, but for most applications I then use the 3-tel cut (which provides the best balance between sensitivity and resolution).

I see.

So you keep N_tel_reco_direction = 2 up to the model training; then, depending on the analysis/IRF you want to produce, you choose, say, N_tel_reco_training = 3 (or 4), and from there on you keep N_tel_reco_cuts = N_tel_reco_training.

Am I correct?

Multiplicity doesn't influence any of the steps for the BDT training, so the assumption on multiplicity is applied before this step. From the BDT training onwards, I keep the multiplicity the same through all analysis steps.
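The scheme discussed in this exchange could be sketched roughly as follows: a loose multiplicity requirement up to the model training, then a single analysis-level cut chosen once and reused for training, cut optimization and IRF production. This is only an illustration of the idea, not code from EventDisplay or protopipe; the function name, the constants and the sample multiplicities are hypothetical.

```python
import numpy as np


def select_by_multiplicity(n_tels, min_tels):
    """Boolean mask of events with at least `min_tels` telescopes."""
    return np.asarray(n_tels) >= min_tels


# Number of telescopes per event (made-up values).
n_tels = np.array([2, 3, 4, 5, 2, 6])

# Loose requirement assumed up to the model training.
MIN_TELS_RECO = 2
# Analysis-level cut (e.g. 3 or 4), chosen once and then kept fixed
# from the training onwards.
MIN_TELS_ANALYSIS = 3

reco_mask = select_by_multiplicity(n_tels, MIN_TELS_RECO)
analysis_mask = select_by_multiplicity(n_tels, MIN_TELS_ANALYSIS)

# The same mask feeds the training sample and every later step,
# so training and cut optimization see the same multiplicity.
training_sample = n_tels[analysis_mask]
cut_optimization_sample = n_tels[analysis_mask]
```

Producing IRFs for a different multiplicity (for example 4 instead of 3) would then mean repeating the chain from the training step with another value of `MIN_TELS_ANALYSIS`.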