scikit-hep/hist

[FEATURE] Add a function to integrate axes (including over partial ranges)

Dominic-Stafford opened this issue · 1 comment

It would be useful to have an integrate function, which could be used to do the following:

  1. Remove a single axis from a histogram, reducing its dimension by 1: h.integrate("y")
  2. Integrate over a range of an axis: h.integrate("y", i, j)
  3. Sum certain entries from a category axis: h.integrate("y", ["cats", "dogs"])

All of these are currently possible, but the syntax is unclear and there are a number of pitfalls:

  1. This is reasonably easy with h[{"y": sum}] or h[{"y": slice(None, None, sum)}], though a dedicated function would be nice for completeness.
  2. This can be achieved with h[{"y": slice(i, j, sum)}]. However, the more obvious h[:, i:j][{"y": sum}] gives the wrong result: slicing moves the excluded bins into the flow bins, and sum includes the flow bins, as noted in scikit-hep/boost-histogram#621.
  3. The corresponding h[{"y": ["cats", "dogs"]}][{"y": sum}] almost works, since with this kind of slice the unselected categories don't seem to be added to the overflow. However, if the overflow already contains entries, these are included in the sum, so seemingly the only way to get the correct result is to do the sum by hand: h[{"y": "cats"}] + h[{"y": "dogs"}], which could quickly become laborious. (It could also be done as h[{"y": ["cats", "dogs"]}][{"y": slice(0, len, sum)}].)
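
To make pitfalls 2 and 3 concrete, here is a toy model in plain Python. It is only a sketch of the flow-bin bookkeeping described above, not hist or boost-histogram code: the ToyAxis class and its method names are hypothetical, and the counts in the second case are borrowed from the draft test further down the thread.

```python
from dataclasses import dataclass


@dataclass
class ToyAxis:
    """Hypothetical 1D stand-in: counts laid out as underflow, bins, overflow."""

    underflow: int
    bins: list
    overflow: int

    def slice_keep_flow(self, i, j):
        # Like h[i:j] as described in the issue: dropped bins move into the flow bins.
        return ToyAxis(
            self.underflow + sum(self.bins[:i]),
            self.bins[i:j],
            self.overflow + sum(self.bins[j:]),
        )

    def pick(self, indices):
        # Like h[["cats", "dogs"]] as described: unpicked bins vanish, old overflow stays.
        return ToyAxis(self.underflow, [self.bins[k] for k in indices], self.overflow)

    def sum_with_flow(self):
        # Like h[sum]: underflow and overflow are part of the total.
        return self.underflow + sum(self.bins) + self.overflow

    def sum_range(self, i, j):
        # Like h[i:j:sum] with explicit endpoints: only bins in [i, j) are counted.
        return sum(self.bins[i:j])


# Pitfall 2: slicing first, then summing, gives back the grand total.
regular = ToyAxis(underflow=1, bins=[10, 20, 30, 40], overflow=2)
two_step = regular.slice_keep_flow(1, 3).sum_with_flow()
one_step = regular.sum_range(1, 3)

# Pitfall 3: a category axis (no underflow) whose overflow already has entries.
cats = ToyAxis(underflow=0, bins=[1, 2, 4], overflow=8)  # AB, BCC, BC
picked = cats.pick([0, 2])
with_flow = picked.sum_with_flow()
bins_only = picked.sum_range(0, len(picked.bins))
```

In this model the two-step range sum returns everything that was ever filled, and the category sum picks up the stale overflow, while the explicit-endpoint sums return only the requested bins. That is the behaviour an integrate function would need to guarantee.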

Related to this, it would be helpful if the project method let one specify whether to include the overflows when projecting out axes. If adding a new function is not desired, this would at least make some of the workarounds easier.

@fabriceMUKARAGE, here is a rough draft of what the method of BaseHist would look like.

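# Loc is int | str | ...
def integrate(self, name: int | str, i_or_list: Loc | list[str | int] | None = None, j: Loc | None = None) -> Self:
    # A list selects categories: pick them, then sum only the in-range bins (no flow).
    if isinstance(i_or_list, list):
        return self[{name: i_or_list}][{name: slice(0, len, sum)}]

    # Otherwise sum over the range from i_or_list to j on the named axis.
    return self[{name: slice(i_or_list, j, sum)}]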

Rough draft of tests:

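import hist

def test_integrate_simple_cat():
    h = hist.new.IntCat([4, 1, 2], name="x").StrCat(["AB", "BCC", "BC"], name="y").Int64()
    h.fill(4, "AB", weight=1)
    h.fill(4, "BCC", weight=2)
    h.fill(4, "BC", weight=4)
    # "X" is not a listed category, so this fill lands in the overflow bin of "y".
    h.fill(4, "X", weight=8)
    h1 = h.integrate("y", ["AB", "BC"])
    # Only AB (1) and BC (4) should be summed; BCC and the overflow are excluded.
    assert h1[4j] == 5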