cms-sw/cms-bot

Add new label for DataFormat breaking PRs

francescobrivio opened this issue · 17 comments

From the operational point of view it would be very useful to have a breaks-dataformats (or something similar) label that can be assigned to PRs which involve the change of the RAW data-format or of any of the derived data-formats (AOD, ALCARECO...).
This label can be used to easily spot when we a processing-version or an Era update are needed in Tier0 when a new release is being deployed.

I'm not sure if the label creation can/should be automatized somehow by checking if changes are done to the DataFormat package, since I can imagine having PRs that touch this package without necessarily modifying the data-formats (or if there are other packages that might lead to similar changes in data formats).

Alternatively it can be left to L2s to assign the label to a PR when they review it.

cms-bot internal usage

A new Issue was created by @francescobrivio.

@Dr15Jones, @sextonkennedy, @smuzaffar, @makortel, @antoniovilela, @rappoccio can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

assign operations, core

New categories assigned: operations,core

@Dr15Jones,@davidlange6,@makortel,@rappoccio,@antoniovilela,@smuzaffar,@fabiocos you have been requested to review this Pull request/Issue and eventually sign? Thanks

On the automation side, would it be more important to try to cover all such PRs at the expense of some PRs being misclassified (i.e. some amount of false positives would be ok), or to avoid labeling PRs that do not break streamer file reading at the expense of missing some PRs that break (i.e. avoid false positives)?

Assuming the first option, I'd imagine we could easily label automatically all PRs touching DataFormats and IOPool/Streamer, with the ability of removing such label from those PRs that do not change the data layout.

@makortel , is there any way we can check this during PR tests e.g. comparing class versions /checksum ?

is there any way we can check this during PR tests e.g. comparing class versions /checksum ?

I guess theoretically the checksums of all classes defined in classes_def.xml files could be compared. This comparison would have to be different from the edmCheckClassVersion, because that script checks the checksums only for those classes for which a version and checksum are defined in the classes_def.xml, whereas in this check we'd want to compare everything.

On the automation side, would it be more important to try to cover all such PRs at the expense of some PRs being misclassified (i.e. some amount of false positives would be ok), or to avoid labeling PRs that do not break streamer file reading at the expense of missing some PRs that break (i.e. avoid false positives)?

From the operations point of view, it's definitely more important to cover all such PRs (i.e. some amount of false positives is ok), especially if we give the reviewers the option of removing such label in case it's not needed.

We should avaoid at all cost missing some of the PRs that could potentially break the data layout!

On the automation side, would it be more important to try to cover all such PRs at the expense of some PRs being misclassified (i.e. some amount of false positives would be ok), or to avoid labeling PRs that do not break streamer file reading at the expense of missing some PRs that break (i.e. avoid false positives)?

From the operations point of view, it's definitely more important to cover all such PRs (i.e. some amount of false positives is ok), especially if we give the reviewers the option of removing such label in case it's not needed.

Thanks. Then I'd look into some relatively simple way of labeling all PRs that can potentially break streamer file compatibility, and asking the PR reviewer(s) to remove the label if it is clear the data format itself remains unchanged. Maybe the bot could issue a specific message on that to remind the reviewers?


About the label name, I have mixed feelings about breaks-dataformats. We have other data format packages as well (SimDataFormats etc), whereas the use case seems to be heavily connected to data taking. I'm worried using a general name "data formats" could be easy to misunderstand to cover all data formats, rather than only those related to data taking.

Reading the issue description again I became a bit confused. I recall the discussion in ORP was about streamer file compatibility (which made me think breaks-streamer), but the issue description talks about RAW or AOD etc, i.e. the output of Tier0 (in ROOT files, in which case the "break" can be too strong because in many cases the old files can be read thanks to schema evolution).

What about event content definition changes? Do those need similar special attention in Tier0?

About the label name, I have mixed feelings about breaks-dataformats. We have other data format packages as well (SimDataFormats etc), whereas the use case seems to be heavily connected to data taking. I'm worried using a general name "data formats" could be easy to misunderstand to cover all data formats, rather than only those related to data taking.

Reading the issue description again I became a bit confused. I recall the discussion in ORP was about streamer file compatibility (which made me think breaks-streamer), but the issue description talks about RAW or AOD etc, i.e. the output of Tier0 (in ROOT files, in which case the "break" can be too strong because in many cases the old files can be read thanks to schema evolution).

What about event content definition changes? Do those need similar special attention in Tier0?

Hi Matti,
indeed the request I made is more connected to data-taking: I believe we are interested in marking also cases were the RAW or derived data formats are changed because this requires a change of Era or processing version in Tier0.
Quoting directly from the ORM Twiki (revision=106):

General guidelines for requesting an era change as opposed to a processing version change :
 - Change of ERA:
      - any change of RAW data-format;
      - [...other conditions not relevant to this discussion...]
 - Change of processing version:
      - update of derived data-formats (AOD, ALCARECO);
      - [...other conditions not relevant to this discussion...]

Additionally, we are definitely interested in marking the PRs that involve a change of the streamers format.

I have no strong opinion on the label name I proposed in the original issue description, any other name that can help identify quickly and easily such PRs is perfectly fine for me, I don't know if other have a different option.
(Maybe we could go with a "less strong" changes-dataformats?)

Hello, what's prognosis for this issue?