cta-observatory/ctapipe_io_magic

Reading multiple files


As discussed recently, reading only one subrun (at calibrated level) can cause problems if not enough drive reports are present. Therefore in #31 I am implementing reading multiple files from the same run, i.e. all the subruns belonging to that run. This is very similar to what LSTEventSource does (the difference is that LST has files from (up to) 4 different streams for the same run).

So, as it is now, MAGICEventSource would take as input the first subrun of a given run and then find all the remaining subruns of that run.

Would this be ok @jsitarek, @YoshikiOhtani?

I think it's the best we can do at the moment. The reason is mostly ctapipe-process, which is the tool I (we) would like to use to perform the calibrated-->DL1 step as soon as the MAGIC cleaning and hot/bad pixel treatment are included in ctapipe. In ctapipe-process, MAGICEventSource is used through EventSource, and the input_url, given either on the command line or in the config, is passed as part of the configuration. So it is not passed as one of the arguments of the MAGICEventSource init, but goes directly to the EventSource init. Therefore only a single (existing) file can be passed (passing a string with wildcards will throw an error), and from that file name all the other subruns are found.
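For reference, the discovery step can be pictured like the sketch below. This is a minimal illustration, not the actual ctapipe_io_magic code: the file-name pattern and the `find_subruns` helper are assumptions based on the usual MAGIC calibrated file naming.

```python
# Hypothetical sketch: find all subruns of a run starting from one subrun file.
# Assumes calibrated file names of the form
#   YYYYMMDD_MX_RRRRRRRR.SSS_Y_Source.root
# (the exact pattern used by ctapipe_io_magic may differ).
import re
from pathlib import Path

def find_subruns(input_url):
    """Given one calibrated subrun file, return all subruns of the same run."""
    path = Path(input_url)
    # Match e.g. 20210314_M1_05095172.001_Y_CrabNebula-W0.40+035.root
    m = re.match(r"(\d{8}_M\d_\d+)\.\d+(_Y_.*\.root)", path.name)
    if m is None:
        raise ValueError(f"Unrecognized file name: {path.name}")
    # Replace the subrun index with a wildcard and glob the directory
    pattern = f"{m.group(1)}.*{m.group(2)}"
    return sorted(path.parent.glob(pattern))
```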

What could be done would be to add an argument to MAGICEventSource where the user passes a list of runs to be processed, but since an input_url is always required, one would need to pass the path to an existing calibrated file, and then MAGICEventSource would look for subruns belonging to (possibly) different runs, which seems confusing/strange to me.

Hi @aleberti, thanks for the updates. I think processing the data as you said would be fine for me, but would it be possible to load only the drive information from all the subrun files while processing the events subrun-wise? Processing all the subrun files in a single job takes one or two hours, while subrun-wise processing takes just a few minutes, and the jobs can run simultaneously on a batch system. What do you think?

Hi Yoshiki, this should be possible, let me test it. I will commit a modification so that you can test it.

Thank you very much. That would make the processing much more efficient on a machine with a batch-job system. On a machine without such a system, one can instead launch a bash script that runs the subrun-wise processing in a for loop.

Hi Yoshiki, you can try the new changes. Remember, to get the behavior you want, i.e. a single subrun processed but the drive information extracted from all the subruns of that run, you have to add process_run=False when calling MAGICEventSource, as in the sketch below.
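A usage sketch (the file name is just an example; process_run is the option described above):

```python
# Process a single subrun, but build the drive information from all
# subruns of the same run by passing process_run=False.
from ctapipe_io_magic import MAGICEventSource

with MAGICEventSource(
    input_url="20210314_M1_05095172.001_Y_CrabNebula-W0.40+035.root",
    process_run=False,  # events from this subrun only; drive reports from all subruns
) as source:
    for event in source:
        ...  # per-event analysis goes here
```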

Thank you Alessio, I have tested it and it works fine.

By the way, since the MarsCalibratedRun object is now created subrun-wise, I think the initialization of the data container and the np.concatenate calls are no longer needed. So we could remove the related lines and simplify the code. This is not related to this issue, so if you agree I can make a pull request with the refactoring.


Yes, in principle we do not need them anymore. Indeed, even when multiple subruns are processed, MarsCalibratedRun is called for each subrun; I thought it was better like this memory-wise, in order not to end up with very big arrays.
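The memory trade-off can be pictured abstractly like this (a hypothetical sketch, not the actual reader code; `load_subrun` and both function names are placeholders):

```python
import numpy as np

def load_subrun(path):
    ...  # placeholder: read one calibrated subrun into an array

# (a) Concatenate everything up front: simple, but the array grows with
#     the number of subruns and can exhaust memory for long runs.
def read_run_concatenated(subrun_paths):
    return np.concatenate([load_subrun(p) for p in subrun_paths])

# (b) Process one subrun at a time: peak memory stays at roughly one
#     subrun, which is the behavior chosen here.
def read_run_streamed(subrun_paths):
    for path in subrun_paths:
        yield load_subrun(path)
```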

OK, thanks for your opinion. I agree with you that it's better to avoid such a case, since I ran into a memory problem in the past.