NNPDF/nnpdf

Ordering of groups_covmat

scarlehoff opened this issue · 4 comments

While testing #2051, I've noticed that groups_covmat is not following the ordering of the dataset inputs (it is not a problem for #2051 I think). You can check this quickly by simply putting an atlas dataset in the middle of cms.
This should not be a problem as long as we are always using it as a dataframe, but I thought it would be good to make you aware of this @RoyStegeman @andreab1997

I'm combing the code to figure out if it is a problem somewhere, and if it is not, I'll explicitly change the ordering of the dataset in the tests so that we are testing the case in which a dataset is in the middle of the others.

Thanks for pointing this out. #2051 doesn't change anything about the ordering so it shouldn't be a problem for that PR. I am aware the the groups_covmat ordering is different and when summing with the exp covmat we use reindex (which I think should be used everywhere we deal with covmats/predictions/datapoints, or at least checked if the ordering is the same, but that's another point).

It may indeed be good to add such a test.

Yes, I don't see any (obvious) problem in the code. The tests however rely on the order being the same at various places. I noticed when swapping a few datasets and #2051 was indeed ok but a few other tests fail.

For now I am going to explicitly mark those tests (and reorder within the test) as those where a change is expected. We can then inspect them with a bit more care.

The tests however rely on the order being the same at various places

Yes that's true some functions assume the input from an earlier function to be ordered in a certain way. I strongly dislike this, but obviously there is so much other stuff to do that addressing it never made it to the top of my todo list. Perhaps we can make this issue more general with as goal that no functions rely on the ordering of the input from the previous functions, but instead dataframe labels should be used everywhere (where possible).

In principle this is too strong a statement since sometimes we have a single functionality and just break it up into separate functions to keep things more readable. Perhaps a more suitable statement would be that no function that is used as a action in reportengine should depend on the ordering of the input, but organising that in practice is a lot of refactoring work and I'm not sure it would come at the cost of readability.

Nevertheless, improving it is possible and would not be a bad job for a new phd student to familiarise with the code. Unfortunately, also they generally don't have time for this sort of thing.

Perhaps a more suitable statement would be that no function that is used as a action should depend on the ordering of the input

This I like and can easily be tested.