scikit-hep/hist

[FEATURE] Add a function that gives values split into categories

Opened this issue · 2 comments

We're currently migrating our analysis framework from coffea histograms [1] to this package. One difference we've encountered is that in coffea the values method gives a mapping from the identifiers of the category axes to the corresponding bins:

>>> hist.values()
{('duck',): array(5.), ('goose',): array(6.)}

hist would just give the bare array:

>>> hist.values()
array([5., 6.])

Sometimes the latter is more useful, but it would also be nice to have a function that gives the first output, since working out which bins correspond to which categories can be hard for larger histograms. I'm not sure what this function should be called (or whether it should be an option of values), but if you feel this would be helpful, I'd be happy to try implementing it.

[1] https://coffeateam.github.io/coffea/modules/coffea.hist.html

You have an array for each category; could it hold more than one value, e.g. ('duck',): array([1,2,3])? If not, then I think that's just zip(hist.axes[0], hist.values()), and I'd rather document a simple procedure than add a method you have to learn and look up, unless it's quite natural and expected.
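For the case of a single category axis with one value per category, the zip procedure could look like the minimal sketch below. Plain Python lists stand in for the axis and the values array so that the snippet runs without the library; the names `labels`, `counts`, and `by_category` are illustrative, and the sketch assumes that iterating a category axis yields its labels.

```python
# Stand-ins for the histogram contents (assumptions, not a real hist object):
labels = ["duck", "goose"]  # what iterating hist.axes[0] is assumed to yield
counts = [5.0, 6.0]         # what hist.values() holds for a 1D histogram

# Pair each category label with its bin content and collect into a dict.
by_category = dict(zip(labels, counts))
print(by_category)
```

Wrapping the zip in dict gives something that can be inspected in the terminal, with bare labels as keys rather than the one-element tuples coffea uses.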

I think this would be a Stack method, actually. Maybe even .values on a Stack? Also we might not currently support a Stack of single bin histograms, but that could be fixed if we don't.

Yes, the arrays can hold more than one value. In general it's a mapping from a tuple, one for each combination of category-axis identifiers, to an array of the values along the remaining axes. For instance, for a hist with two category axes, "species" and "colour", and a regular axis with three bins, the output might be:

{('duck', 'red'): array([5., 4., 2.]), ('goose', 'red'): array([6., 3., 7.]), ('duck', 'blue'): array([3., 1., 5.]), ('goose', 'blue'): array([1., 2., 4.])}
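A rough sketch of how such a mapping could be built for any number of category axes is given below. It is not an existing hist function: `values_by_category` is a hypothetical helper, nested Python lists stand in for the values array, and it assumes the category axes are the leading dimensions (with a real histogram the axes might first need reordering).

```python
from itertools import product


def values_by_category(category_labels, values):
    """Map each combination of category labels to the remaining sub-array.

    category_labels: one list of labels per category axis, in axis order.
    values: nested lists whose leading dimensions follow those category axes.
    (Hypothetical helper for illustration; not part of hist.)
    """
    result = {}
    index_ranges = [range(len(labels)) for labels in category_labels]
    for indices in product(*index_ranges):
        # Walk down one level of nesting per category axis.
        block = values
        for i in indices:
            block = block[i]
        key = tuple(labels[i] for labels, i in zip(category_labels, indices))
        result[key] = block
    return result


# Example data matching the output above: species x colour x 3 regular bins.
species = ["duck", "goose"]
colour = ["red", "blue"]
values = [
    [[5.0, 4.0, 2.0], [3.0, 1.0, 5.0]],  # duck: red, blue
    [[6.0, 3.0, 7.0], [1.0, 2.0, 4.0]],  # goose: red, blue
]

table = values_by_category([species, colour], values)
for key, bins in table.items():
    print(key, bins)
```

Passing a single list of labels reduces this to the one-axis case, with one-element tuples as keys.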

I hadn't looked at the stack function until now, as it seems that currently the only operation one can do directly on a stack is plot it (and I would prefer the values as a dict to inspect in the terminal or manipulate in code). Adding a .values function there that gave a dict might be nice, though it wouldn't be sufficient for the case of multiple categories. Maybe one could also add the ability to call .stack on a Stack and then finally call .values, but I think it might be simpler to have a single function to do this.