scikit-hep/uproot5

uproot4 to uproot5 TTree to DataFrame arrays slower

masterfelu opened this issue · 14 comments

Hello.

I run a script that loads a TTree with 5 million rows and 3 columns into a DataFrame and measures the time taken. Here's an MWE for IPython:

%timeit uproot.open('tree.root')['time'].arrays(['year','month','day'],library='pd')

For uproot 5.3.7 with pandas 2.2.2, 19.2 s ± 177 ms per loop
For uproot 4.1.9 with pandas 1.3.5, 1.61 s ± 9.27 ms per loop

The values reported are mean ± std. dev. of 7 runs, 1 loop each. Going by these numbers, uproot5 is about 12 times slower than uproot4. The discrepancy grows as more columns are loaded: what previously took seconds now takes more than ten minutes.
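Outside IPython, the same kind of measurement can be approximated with the standard library. The sketch below is only an illustration: `time_call`, `load`, and `slowdown` are hypothetical helper names, and reading with `library='np'` alongside `library='pd'` is a suggested way to separate the read from the DataFrame conversion, not something measured in this report.

```python
import statistics
import time


def time_call(fn, repeat=7):
    """Run fn() `repeat` times; return (mean, stdev) of wall-clock seconds."""
    samples = []
    for _ in range(repeat):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.mean(samples), statistics.stdev(samples)


def load(library):
    """Hypothetical helper: read the three branches from the MWE."""
    import uproot  # imported lazily so time_call works without uproot

    tree = uproot.open("tree.root")["time"]
    return tree.arrays(["year", "month", "day"], library=library)


def slowdown(slow_seconds, fast_seconds):
    """Ratio of two mean timings."""
    return slow_seconds / fast_seconds


# Ratio of the two mean timings quoted in this report.
ratio = slowdown(19.2, 1.61)
```

Calling `time_call(lambda: load("np"))` and `time_call(lambda: load("pd"))` in the same environment would show whether the extra time is spent reading the branches or building the DataFrame.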

I'd like to know whether there has been a major change that could cause such an increase in load time when converting a large TTree to a DataFrame. If there are more checks I should run before drawing a conclusion, pointers would be very helpful.

Thanks a lot for your time.

It might also be important to note the difference in pandas versions (1.3.5 vs 2.2.2) when reproducing the results.
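Since the two environments differ in more than one package, it may help to record the installed versions next to each timing. A minimal sketch using only the standard library follows; `installed_versions` is a hypothetical helper name, and the default package list is an assumption about what is relevant here.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages=("uproot", "awkward", "pandas", "numpy")):
    """Map each package name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


versions = installed_versions()
for name, ver in versions.items():
    print(f"{name}: {ver if ver is not None else 'not installed'}")
```

Running this in both the uproot 4 and uproot 5 environments gives a side-by-side list to attach to the report.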

I've also noticed the same issue with uproot.open('someNANOAOD12.root:Events').arrays(library='ak'). This file has 680 entries and a size of 4.053 MB. I am timing only the call that gets the arrays from the tree object, and it takes 20.8 seconds.
I never had such issues before, even with much larger files, on an older version of uproot (I think it was still version 5, but an earlier release). I am now using version 5.4.1. I am not sure whether the issue is in uproot or awkward. I lean toward uproot, since my awkward version has stayed fixed in my virtual env.