scikit-hep/root_numpy

Accessing variable-width item in TTreeFormula

jpivarski opened this issue · 9 comments

I have a complex ROOT file (custom classes in TClonesArrays), but I know the field names, so I can access all values of a given leaf using TTreeFormulas with dots. For instance, my file of 47407 events contains a "Muon" branch (type TClonesArray) with, on average, 1.17 TMuon class instances in each event's array (total of 55560 muons). One of the fields in this class is pt.

I should be able to get all the muon pTs with

muonPt = root2array("ttbar.root", branches=["Muon.pt"])

but now I don't know which muon pTs correspond to which events. I can do

muonSize = root2array("ttbar.root", branches=["@Muon.size()"])

to get the number of muons in each event and run it through numpy.cumsum to get the muon array index corresponding to each event index.

The problem is that the muonPt and muonSize arrays have length 47407, which is the number of events, not the number of muons. I know this because I've checked it in ROOT, but I can also see that my cumsum result ends with 55560, as expected. Also, a field with only one value per event ("Info.nPU") results in an array of size 47407 as well.

I think ROOT is computing the whole array, but root_numpy is truncating it. When I do the same for "GenParticle.pt", which has an average of 56.8 particles per event, it takes a lot longer and a lot more intermediate memory, but gives me another small array.

It sounds like an easy fix. I can provide code to generate the list of leaf names, which could be wrapped up in a useful function that produces columnar data for arbitrarily complex TTrees (a feature request in this forum), once this bug is fixed. Would you like my test sample (220 MB) and scripts?

If the problem is not knowing how large to allocate an array before processing the TTreeFormula, could you add a user-supplied hint to the API? I can do the @Muon.size() calculation first and provide its numpy.sum as that hint.

Oh wait, I got it--- it's doing more work than I expected it to. It gave me an array whose length is the number of events because it's an array of arrays. I don't know how I didn't notice that before.

I'd like to understand this part of the code because I was trying to invent it elsewhere. (I'm looking for the lower-level functionality of going straight from whatever ROOT's internal format is to flat arrays with indexes.)

ndawe commented

Hi @jpivarski. This functionality is in converters.pyx. root_numpy hooks up the appropriate converter based on type and shape (and if shape is fixed or variable). All converters inherit from Converter and implement a write method that knows how to dump the branch data into the array. Fixed-length arrays are placed directly into the output array but variable-length arrays end up as PyObject* pointers to separate arrays.

I was thinking the same thing: that a converter would need to be overloaded to get the functionality I'm looking for.

Here's a graphic I made for a talk that illustrates what I want to extract. In addition to long arrays for each substructure element (e.g. pT, eta, phi of muons), we'd need two short arrays for the first index and number of elements (muons), or first and last index, or Dremel repetition levels and definition levels, etc.

Operations over these long arrays would be more efficient than an array of arrays because (1) the data are contiguous in memory and pass through CPU cache more predictably and (2) there are Numpy functions that could operate on them, without having to define any new ones in C. The two indexes allow us to reconstruct the original objects (perhaps with some fancy indexing or advanced Numpy functions).

It's the sort of thing that would be behind a higher-level API that lazily presents these to the user as objects, while the calculations run at full speed on the split arrays. It should be easier to get them, since this picture is basically what TBasket contents look like.

ndawe commented

Interesting. Would you place variables like muon_pt in their own tree to avoid branches of different lengths in the same tree and reading across multiple entries within the same tree?

ndawe commented

If you place each flattened collection in their own tree, then root_numpy can already handle this kind of scheme and you can perform all the indexing in pure numpy after first tree2array-ing each tree separately.

Assuming I don't have power to change the original root file.

ndawe commented

So if the features are in subarrays in your input file, then you can also use root_numpy.stretch() to flatten them out:

http://rootpy.github.io/root_numpy/reference/generated/root_numpy.stretch.html#root_numpy.stretch

With return_indices=True you have all info needed to invert the operation.

ndawe commented

Passing a view of the full array that only contains muon fields, for example, stretch will give you a recarray where each field is flattened (by one level of nestedness).