Integrating a selection from a TTree more efficiently than TTree::Draw()
gbesjes opened this issue · 42 comments
I'm looking for a fast way to select certain events from a TTree, including one or more weights, and then integrating them.
A typical use-case for this is for example a cutflow. I now achieve that by selecting events into a histogram in a utility function:
selection = "({0}) * ({1})".format(selection, weight)
tree.Draw(var,
selection=selection,
hist=hist)
where this is created and filled from another function that does the following:
hist = Hist(1, -1, 2)
branch = tree.GetListOfBranches()[0].GetName()
loadHistogramFromTree(tree, hist, '{0}=={0}'.format(branch), cut, weight)
return hist.Integral()
In other words, I just select a weighted set of entries from a TTree and am interested in what the total number of events passing a selection is.
I've looked into tree2array, but notice several downsides:
- weight_name is just a single weight, while I'm looking to multiply several. Perhaps a useful improvement would be the possibility to specify weight_names as a list of names?
- the weights are branches that exist in the TTree. What if I'd like to scale to lumi by doing something like 12.345/0.001 * normWeight (where this column is a weight derived from the cross-section of the sample and 12.345 is an example lumi)?
Solving these issues would also allow, for example, a more efficient plotter that selects events along the lines of something like
12.345 / 0.001 * join("*", weights) * (cut)
into histograms and stack them all.
Perhaps I misunderstand how to combine tree2array and array2hist to get a weighted histogram, but right now I'm stuck with the old-fashioned draw methods.
Would anybody have a suggestion? Anything that relies on something faster than TTree::Draw() would be great - that would allow me to benefit from the nice benchmark figures advertised :)
@gbesjes thanks a lot for this feedback. I'll try to clarify a bit how you can accomplish what you need.
Firstly, tree2array's weight_name is just to assign a name to the field in the output array that will hold the value of tree.GetWeight() (same value for each entire tree in a chain). We just wanted a way to conveniently extract that info from the tree so it can be treated like any other branches that represent a weight. weight_name is configurable to give the user the ability to avoid clashing with existing branch names in the tree.
To fill a histogram with weighted entries where the weights are products of weight branches (and any other factors) then try something like this:
weight_branches = ['your', 'weights']
arr = tree2array(tree)
weights = reduce(np.multiply, [arr[br] for br in weight_branches])
# reduce is removed in py3 but can use functools.reduce or explicit for loop
fill_hist(hist, arr['branch'], weights)
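For anyone who wants to try the weight product without a ROOT file at hand, here is a minimal numpy-only sketch. The structured array stands in for what tree2array would return, and the field names (mc_weight, pileup_weight, met) are made up for illustration.

```python
from functools import reduce

import numpy as np

# Stand-in for the structured array tree2array would return; the field
# names ("mc_weight", "pileup_weight", "met") are hypothetical.
arr = np.array(
    [(0.5, 2.0, 120.0), (1.0, 1.0, 80.0), (2.0, 0.5, 45.0)],
    dtype=[("mc_weight", "f8"), ("pileup_weight", "f8"), ("met", "f8")],
)

weight_branches = ["mc_weight", "pileup_weight"]

# Element-wise product of all weight fields: one weight per entry.
weights = reduce(np.multiply, [arr[br] for br in weight_branches])

# Constant factors (for example a luminosity scale) broadcast over entries.
lumi_scale = 12.345 / 0.001
scaled_weights = lumi_scale * weights
```

The resulting weights array has the same length as arr, so it can be handed to a histogram-filling call alongside any other field.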
Possibly even better is to just give tree2array the complete expression (at least the factors that are branches in the tree) that produce the entry weights:
weights = tree2array(tree, branches='branch1 * branch2 * 12.345 / 0.001')
Actually, it would be much more efficient to add an alias to the tree and retrieve the branch by that alias. At least that way, you're relying directly on ROOT to do this.
tree.SetAlias('alias','formula')
and root_numpy can access this normally without a problem. We might want to think of incorporating something like this to make this less involved.
But root_numpy anyway uses TTreeFormula for any expression that isn't a branch name. So an alias doesn't really change anything.
(oops clicked wrong button)
Ahh, I didn't see the second point in your post. Yeah, if you're using TTreeFormula, then it should just work as expected!
@ndawe thanks for the quick answer! That solves what I want to achieve. Would there be an obvious improvement in tree2array() if I specify a branch like "1==1" in this case? For a cutflow I'm not really interested in a variable, just a count is enough :)
And of course the same thing is true for a distribution: I assume that if I want to plot variables X, Y and Z the code will be a lot more performant if only those branches are thrown into numpy arrays. Is that indeed the case?
Yeah, "branches" can be a list of branch names and/or expressions. It should be able to handle anything that you can throw at TTree.Draw(). We named the argument "branches" before expressions were supported. In hindsight something like "fields" might have been more appropriate.
Excellent! I thought this wasn't possible because of the name "branches". I'll try it out after a few meetings. Perhaps the docs can clarify that any valid expression also works, in case other people run into the same.
For a cutflow you could also just sum up the array lengths. Yes, if you only want a subset of the branches (possibly mixed with expressions) then specifying them with the branches argument will lead to the conversion only reading in and including those particular branches and expressions. If you only need O(10) branches in an ntuple containing thousands of branches, this can be a huge speedup.
I'll improve the docs on branches. No problem.
I believe there's also a special $Count(branch) formula you can use as well.
For a raw cutflow I agree, but not if I'd like to multiply these events with something like their mcWeight and scale factor. Then the data are not [1, 1, 1, 1, ...] but they're weighted e.g. to [0.7, 0.8, 0.7, 0.9, ...]. Unless I'm overlooking something extremely obvious here?
Ah, yes indeed. Just use arr['weight_expression'].sum() for a weighted cutflow.
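As a small illustration of the difference between a raw and a weighted cutflow, here is a numpy-only sketch. The array is a hypothetical stand-in for converted tree data, and the field names weight and njets are assumptions.

```python
import numpy as np

# Hypothetical stand-in for tree2array output with a per-event weight.
arr = np.array(
    [(0.7, 3), (0.8, 1), (0.7, 4), (0.9, 2)],
    dtype=[("weight", "f8"), ("njets", "i4")],
)

passed = arr["njets"] >= 2                    # boolean mask, one flag per entry
raw_count = int(passed.sum())                 # unweighted number of passing entries
weighted_yield = arr["weight"][passed].sum()  # sum of the weights that pass
```

The raw count treats every entry as 1, while the weighted yield sums the per-event weights of the same entries.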
The selection actually doesn't deal too nicely with vectorial indices. When a branch electrons_pt[0] is asked for, the data is structured as follows:
[array([ 85.83097839]) array([ 174.27775574]) array([ 87.52495575]) ...,
array([ 711.8416748]) array([ 734.52056885]) array([ 107.2477417])]
That strikes me as wrong: since I asked for a specific index, shouldn't each of these individual arrays be a float instead? Of course in a post-processing step this is pretty easy to achieve, but it's not a structure that I had expected to get back.
Can you call foo.flatten() on that structure instead? Should probably work nicely.
This last issue is coincidentally something I've been thinking about recently. The issue is that ROOT specifies a "multiplicity" for an expression, telling us how many values to expect for each entry in the tree. In this case you expect either 1 or 0 values per entry, since in some entries your electrons_pt array might be empty. But at the moment root_numpy doesn't have a mechanism to specify default values for when the expression yields an empty array for a particular tree entry. I agree we need something like this, so that we produce single elements with default values instead of nested arrays, which are awkward to deal with.
For now, you can use root_numpy's stretch function: http://rootpy.github.io/root_numpy/reference/generated/root_numpy.stretch.html#root_numpy.stretch
electrons_pt = stretch(arr, fields=['electrons_pt[0]'])['electrons_pt[0]']
@kratsg, @ndawe : nope, it indeed won't:
print arr["electrons_pt[0]"].flatten()
[array([ 85.83097839]) array([ 174.27775574]) array([ 87.52495575]) ...,
array([ 711.8416748]) array([ 734.52056885]) array([ 107.2477417])]
I suspected it had something to do with ROOT's internals indeed. Perhaps as an intermediate solution there could be a way for the user to specify that they're requesting a single entry from a vector, so that stretch() may be called automagically?
yup, not on nested dtype=object. Use root_numpy.stretch.
If you use the latest master of root_numpy you can also use the shorter:
electrons_pt = stretch(arr, 'electrons_pt[0]')
@kratsg if an array is dtype=object then it is already "flattened" since each element is just a PyObject pointer to whatever. numpy knows nothing about the shape/type of what is in each array element.
I wasn't aware that the array objects were the C-type array objects and not the np.array objects. The latter case flattens correctly, the former doesn't.
root_numpy uses dtype=object since in general the nested subarrays are variable-length, and can be doubly nested too.
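To make the dtype=object behaviour concrete, here is a numpy-only sketch with made-up values. It is not how root_numpy's stretch is implemented; it just shows why flatten() has no effect and what a manual workaround could look like.

```python
import numpy as np

# Build a dtype=object array whose elements are variable-length arrays,
# mimicking what a vector branch looks like after conversion.
nested = np.empty(3, dtype=object)
nested[0] = np.array([85.8])
nested[1] = np.array([174.3, 60.1])
nested[2] = np.array([87.5])

# flatten() only sees three opaque Python objects, so nothing changes.
still_nested = nested.flatten()

# Concatenating the sub-arrays gives one flat float array of all values.
all_values = np.concatenate(list(nested))

# Leading element per entry, with NaN where an entry is empty.
leading = np.array([sub[0] if len(sub) else np.nan for sub in nested])
```

Using NaN for empty entries is only one possible default; any sentinel value would do as long as downstream cuts account for it.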
stretch is a very handy function when dealing with nested subarrays for objects within events.
This works like a charm! Let's say that I've converted an entire tree into a structured numpy array. How do I best slice that with an additional cut? Take the cutflow example again: I can do a selection on "A > 5". If that is to be followed by a cut "B < 20" and then "C > 100", these can of course be added and I can re-run tree2array() - but then I'm doing that N times. Is there a cleverer way in pure numpy to achieve that, if the cuts are strings?
I've used https://github.com/pydata/numexpr for things like that, or even just python's builtin eval function with an appropriate setting of globals/locals.
http://docs.scipy.org/doc/numpy/reference/generated/numpy.where.html
So if your array is foo, then you can do something like
foo[np.where(foo.njets > 2)]
where np.where(foo.njets > 2) is a conditional. You can also just do
np.where(np.logical_and(x >= 10, x <= 25))
as well.
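For the sequential-cut question, one option is to keep a cumulative boolean mask instead of re-running the conversion for each step. The sketch below uses a hypothetical array with fields A, B, C and weight, and the cuts are written out as numpy comparisons rather than parsed from strings.

```python
import numpy as np

# Hypothetical fields A, B, C standing in for converted branches.
arr = np.array(
    [(6, 10, 150, 1.0), (7, 30, 200, 0.5), (2, 5, 300, 2.0), (9, 15, 50, 1.5)],
    dtype=[("A", "f8"), ("B", "f8"), ("C", "f8"), ("weight", "f8")],
)

cuts = [
    ("A > 5", arr["A"] > 5),
    ("B < 20", arr["B"] < 20),
    ("C > 100", arr["C"] > 100),
]

mask = np.ones(len(arr), dtype=bool)
cutflow = []
for label, passes in cuts:
    mask &= passes  # cumulative AND with all earlier cuts
    cutflow.append((label, int(mask.sum()), float(arr["weight"][mask].sum())))
```

Each cutflow row holds the cut label, the raw count and the weighted yield after that cut, and the tree is only converted once.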
If the cuts are strings, then numexpr.evaluate() or eval() are probably your best bet. This is interesting and provides some example code: https://mail.scipy.org/pipermail/scipy-user/2010-November/027276.html
Basically:
passes = numexpr.evaluate('some_very_complex_expression_string', array)
passed_events = array[passes]
@kratsg: that's easy when writing it out - but it's a bit harder if the user specifies a configuration file: that would involve some nasty cut parsing. And accessing the vectorial branches would be more hackish.
Thanks for the numexpr suggestion! That works perfectly. Now I just gotta figure out how TMath::Phi_mpi_pi() can be implemented there :)
@gbesjes -- I use numexpr for https://github.com/kratsg/Optimization which uses cutstrings that are configurable. The easiest way to incorporate Pi is pretty straightforward:
numexpr.evaluate('2*pi', {'pi': np.pi})
works just as well. Think of the dictionary you're adding in not as your data, but the namespace for your numerical expression.
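The same namespace idea also works with Python's builtin eval, which was mentioned earlier in the thread. The sketch below uses eval rather than numexpr so that it only needs numpy. The phi_mpi_pi helper and the field names phi1 and phi2 are hypothetical, and eval should only be used on cut strings from a trusted configuration.

```python
import numpy as np

# Hypothetical stand-in array with two angle fields.
arr = np.array(
    [(1.0, 3.0), (2.5, -3.0), (0.2, 0.1)],
    dtype=[("phi1", "f8"), ("phi2", "f8")],
)


def phi_mpi_pi(phi):
    """Wrap angles into [-pi, pi) (hypothetical helper, not a ROOT function)."""
    return (phi + np.pi) % (2 * np.pi) - np.pi


# The namespace holds one array per field plus any constants and helpers.
namespace = {name: arr[name] for name in arr.dtype.names}
namespace.update({"pi": np.pi, "abs": np.abs, "phi_mpi_pi": phi_mpi_pi})

cut = "abs(phi_mpi_pi(phi1 - phi2)) < pi / 2"
# Empty __builtins__ keeps the expression limited to the names above.
passes = eval(cut, {"__builtins__": {}}, namespace)
selected = arr[passes]
```

Custom functions go into the namespace next to the data, so a configurable cut string can call them by name.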
@kratsg : how do you deal with vectorial branches? numexpr doesn't appear to support indexing of them.
@gbesjes you might need to first massage your array into another array with some things stretched flat or cropped to fixed length before passing to numexpr. I have a few functions to make these operations easier in personal code that I've been considering placing in root_numpy eventually. Hopefully soon!
@gbesjes - what do you mean "vectorial" branches? Vector of vectors? Numpy doesn't handle these very well unfortunately (which means numexpr can't as easily). My workaround is just to make sure everything is "flattened" when I make it, so I end up having branches like jet_pt_0, jet_pt_1, ...
@kratsg: a std::vector branch, like electrons_pt[]. I'll do the same thing and stretch them. Is there an automated way to achieve this? My numpy knowledge is obviously lacking too much for what I want to do currently.
@gbesjes here is an example function from some personal code:
import numpy as np

def subfixedlength(rec, length, fill_value=None, return_indices=False):
    """
    Truncate variable-length object fields to fixed length
    Cythonized version of this function will be introduced in root_numpy
    If length==1 then the subarray will become a scalar.
    """
    if not rec.shape[0]:
        raise ValueError("cannot truncate empty structured array")
    first_rec = rec[0]
    if length == 1:
        # make this a scalar
        dtype = [(rec.dtype.names[i], first_rec[i].dtype)
                 for i in range(len(first_rec))]
    else:
        dtype = [(rec.dtype.names[i], first_rec[i].dtype, (length,))
                 for i in range(len(first_rec))]
    out = np.empty(rec.shape[0], dtype=dtype)
    if fill_value is not None:
        if isinstance(fill_value, dict):
            # per-field fill values: iterate over (name, value) pairs
            for name, value in fill_value.items():
                out[name].fill(value)
        else:
            out.fill(fill_value)
    # indices[i] is False when entry i has too few elements
    indices = np.ones(rec.shape[0], dtype=bool)
    idx = 0
    if length == 1:
        for record in rec:
            if record[0].shape[0] == 0:
                indices[idx] = False
            else:
                for ifield, field in enumerate(record):
                    out[idx][ifield] = field[0]
            idx += 1
    else:
        for record in rec:
            if record[0].shape[0] < length:
                indices[idx] = False
            for ifield, field in enumerate(record):
                out[idx][ifield][:min(field.shape[0], length)] = field[:length]
            idx += 1
    if return_indices:
        return out, indices
    return out
Clearly a bit slow since it's looping in python, but I want to Cythonize this and put something like it in root_numpy soon.
root_numpy's stretch also has a return_indices argument (default False) which, when True, also returns the original indices of the elements in the subarrays. This indirectly allows you to associate the object elements with the original event indices.
For [[2, 3, 1], [0, 2]] as an object field in a structured array, stretch(a, return_indices=True) would give [2, 3, 1, 0, 2] and [0, 1, 2, 0, 1] as the indices (0 repeats for each first element in the subarrays).
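The values and indices in that example can be reproduced with plain numpy, which may help to see what they mean. This is only an emulation for illustration, not root_numpy's implementation. The extra entry array is an addition here that maps each value back to its event.

```python
import numpy as np

# An object field holding the sub-arrays from the example above.
field = np.empty(2, dtype=object)
field[0] = np.array([2, 3, 1])
field[1] = np.array([0, 2])

# All values in one flat array.
values = np.concatenate(list(field))
# Position of each value inside its own sub-array.
within = np.concatenate([np.arange(len(sub)) for sub in field])
# Which entry (event) each value came from.
entry = np.repeat(np.arange(len(field)), [len(sub) for sub in field])
```

A new sub-array starts wherever within is 0, which is how the flat values can be regrouped per event.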
Hi @ndawe,
I'm afraid that functionality is currently broken for me (it crashes when accessing record[0].shape). It also flattens all the vector branches to the same maximum length, which is not exactly desirable. It would be great to have a function in root_numpy which ensures that each column - let's say taus_eta - is truncated to whatever the maximum is for that column (or a user-specified column-dependent maximum) and which allows the user to set the type.
That is, for a certain array, something like
transform_array(array, maxima={"taus_eta" : 3, "electrons_eta" : 11}, types={"taus_eta": np.float32, "electrons_eta" : np.float32})
as that would give the user full control over the data, while massaging it into a format that numpy can take.
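A rough numpy-only sketch of what such a function could look like is below. To be clear, transform_array does not exist in root_numpy; the name, the signature and the NaN padding are assumptions taken from the proposal above, and the Python loop would be slow on large arrays.

```python
import numpy as np


def transform_array(array, maxima, types, fill_value=np.nan):
    """Hypothetical helper: pad or truncate object fields per column.

    maxima maps field name to fixed length, types maps field name to dtype.
    The NaN default only makes sense for floating-point types.
    """
    dtype = [(name, types[name], (maxima[name],)) for name in maxima]
    out = np.empty(len(array), dtype=dtype)
    for name, length in maxima.items():
        out[name] = fill_value  # pad everything first
        for i, sub in enumerate(array[name]):
            n = min(len(sub), length)
            out[name][i, :n] = sub[:n]  # copy at most `length` values
    return out


# Made-up input: two events with variable-length vector fields.
events = np.empty(2, dtype=[("taus_eta", object), ("electrons_eta", object)])
events["taus_eta"][0] = np.array([0.1, -1.2, 2.0, 0.4])
events["taus_eta"][1] = np.array([0.5])
events["electrons_eta"][0] = np.array([1.0])
events["electrons_eta"][1] = np.array([])

result = transform_array(
    events,
    maxima={"taus_eta": 3, "electrons_eta": 2},
    types={"taus_eta": np.float32, "electrons_eta": np.float32},
)
```

Each field ends up as a regular 2-D float array (entries by fixed length), which numpy and numexpr can index directly.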
I'll fix that crash due to record[0].shape. Regarding the truncation, this is ongoing work around #266 (see option 3 there).
Can you also just paste the stacktrace you saw here?