scikit-hep/root_numpy

tree2array: propagating statistical errors

gbesjes opened this issue · 3 comments

Hi,

I'm having a conceptual problem with dealing with statistical errors.

Let's say that I have some tree with a bunch of events. I need to select certain events and then pass them into a histogram. However, I have plenty of shared cuts: effectively, I'm trying to divide one sample into a pass and fail category.

Selecting the branch that holds the pass/fail flag in tree2array and then splitting on it with numpy arrays is much faster than e.g. running TTree::Draw twice, i.e.

passed = arr[arr['foo'] == 0]   # 'pass' is a reserved word in Python, hence 'passed'
failed = arr[arr['foo'] == 1]

However, how do I assign the statistical errors properly to the histograms that I create this way? Is there a simple procedure to keep track of this that I have completely overlooked?

Cheers,
Geert-Jan

Just a side note that you can definitely pass a selection straight into the root2array/tree2array functions (https://rootpy.github.io/root_numpy/reference/generated/root_numpy.root2array.html), so you only grab the entries you would otherwise pick out with TTree::Draw.

Depending on how you assign the errors, it might be easy to do errors = calculate_error(arr['branch_of_interest']) which creates an array and then loop over that to set the errors separately:

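   // h1 and h2 are existing histograms; note that bin 0 is the underflow bin
   for (int i = 0; i < 11; i++) {
      h1->SetBinContent(i, 1.5 - i / 10.0);   // 10.0 avoids integer division
      h1->SetBinError(i, 0.5 * i);
      h2->SetBinContent(i, 10.5 - i / 10.0);
      h2->SetBinError(i, 0.7 * i);
   }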

for example.

Hi @kratsg ,

Thanks for the very quick reply! I'm aware that I can do a selection, and in fact I'm using that already :-)

However, subdividing a sample 4-way (real-pass, real-fail, fake-pass, and fake-fail) is much slower than selecting using only the more inclusive kinematic cuts and then doing that split on the numpy arrays.

I'm not sure that I understand your example. What is this meant to do?

ndawe commented

@gbesjes the statistical errors will still be set appropriately by ROOT when you fill your arrays into your histograms, no matter what path those arrays took on their way to being chopped up into your various analysis regions. Just use fill_hist on each slice, as you would normally, and those histograms will have the statistical errors set correctly. fill_hist also accepts weights, which should be sliced the same way as your main array.
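
To illustrate why the slicing does not matter, here is a minimal numpy-only sketch (it does not call ROOT or fill_hist). It assumes the usual convention for weighted histograms, where the error on a bin is the square root of the sum of squared weights in that bin, which is what ROOT tracks when Sumw2 is enabled. The array, the field names 'foo', 'x' and 'weight', and the helper hist_with_errors are all made up for the example; a real array would come from tree2array.

```python
import numpy as np

# Hypothetical structured array standing in for the output of tree2array.
# The field names 'foo', 'x' and 'weight' are assumptions for illustration.
rng = np.random.default_rng(42)
n = 1000
arr = np.zeros(n, dtype=[("foo", "i4"), ("x", "f8"), ("weight", "f8")])
arr["foo"] = rng.integers(0, 2, size=n)        # 0 = pass, 1 = fail
arr["x"] = rng.uniform(0.0, 10.0, size=n)      # variable to histogram
arr["weight"] = rng.uniform(0.5, 1.5, size=n)  # per-event weight


def hist_with_errors(values, weights, bins):
    """Return per-bin contents and errors, with error = sqrt(sum of w^2)."""
    contents, _ = np.histogram(values, bins=bins, weights=weights)
    sumw2, _ = np.histogram(values, bins=bins, weights=weights ** 2)
    return contents, np.sqrt(sumw2)


bins = np.linspace(0.0, 10.0, 11)

# Slice once; the weights travel with the rows because they are a field
# of the same structured array.
passed = arr[arr["foo"] == 0]
failed = arr[arr["foo"] == 1]

pass_contents, pass_errors = hist_with_errors(passed["x"], passed["weight"], bins)
fail_contents, fail_errors = hist_with_errors(failed["x"], failed["weight"], bins)
all_contents, all_errors = hist_with_errors(arr["x"], arr["weight"], bins)
```

Each event contributes its weight and its squared weight to exactly one bin of exactly one slice, so the per-bin contents of the slices add up to the inclusive histogram and the squared errors add up to the inclusive squared error. If the weights live in a separate array rather than in the same structured array, they need to be indexed with the same boolean mask as the events.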