scikit-hep/root_pandas

read_root with large number of input files scales badly

Nargoth-zz opened this issue · 1 comment

Hi,

I am using read_root to get dataframes containing chunks (100000 rows each) of my data from a large TChain (several thousand files). I noticed that the function scales badly with the number of files passed to it to build the TChain. I used cProfile to test this hypothesis; two screenshots of the output are attached. One was made with my full data sample, the other with only 5% of my data but the same chunksize. You can see that the time per call of genchunks is around 8 times higher for the full sample than for the 5% reference sample.
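For anyone who wants to reproduce the measurement, here is a rough sketch of how the chunk iteration could be profiled with cProfile. The helper name `profile_chunk_iteration` and its parameters are my own for illustration; they are not part of root_pandas, and this is not necessarily the exact setup used for the screenshots.

```python
import cProfile
import io
import pstats
from itertools import islice


def profile_chunk_iteration(chunks, max_chunks=10, top=15):
    """Consume up to `max_chunks` items from `chunks` under cProfile.

    `chunks` can be any iterator, for example the generator that
    read_root returns when a chunksize is given (assumption: it is lazy,
    so the setup cost is paid inside the profiled region).
    Returns the number of chunks consumed and the profile report text.
    """
    profiler = cProfile.Profile()
    profiler.enable()
    n_consumed = sum(1 for _ in islice(chunks, max_chunks))
    profiler.disable()

    report = io.StringIO()
    stats = pstats.Stats(profiler, stream=report)
    # Sort by cumulative time so the slowest call chains are listed first.
    stats.sort_stats("cumulative").print_stats(top)
    return n_consumed, report.getvalue()
```

Running this once with the full file list and once with a 5% subset, using the same chunksize and the same `max_chunks`, should make the per-call difference in genchunks visible in the two reports.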

I can work around this behaviour by splitting the file list into batches before passing them to read_root, and I was wondering whether this is a bug in root_pandas or in root2array.
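A minimal sketch of that workaround, assuming the reader is passed in as a callable so the batching logic stays independent of root_pandas. The names `file_batches`, `read_in_file_batches` and `files_per_batch`, as well as the tree name `"tree"` in the usage comment, are hypothetical and only serve the illustration.

```python
from itertools import islice


def file_batches(paths, files_per_batch):
    """Yield successive lists of at most `files_per_batch` paths."""
    if files_per_batch < 1:
        raise ValueError("files_per_batch must be at least 1")
    it = iter(paths)
    while True:
        batch = list(islice(it, files_per_batch))
        if not batch:
            return
        yield batch


def read_in_file_batches(paths, reader, files_per_batch=100):
    """Call `reader` once per batch of files and chain the resulting chunks.

    `reader` takes a list of paths and returns an iterable of chunks,
    so each underlying chain only ever contains `files_per_batch` files.
    """
    for batch in file_batches(paths, files_per_batch):
        yield from reader(batch)


# Possible usage with root_pandas (not executed here; "tree" is a
# placeholder for the actual tree name):
#
#   from root_pandas import read_root
#   reader = lambda batch: read_root(batch, key="tree", chunksize=100000)
#   for df in read_in_file_batches(all_files, reader, files_per_batch=100):
#       ...
```

Note that chunks do not span batch boundaries with this approach, so the last chunk of each batch can have fewer rows than the requested chunksize.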

Cheers!
[Screenshot: 5percent_data]
[Screenshot: all_data]

As you may be aware, this package has been superseded by https://github.com/scikit-hep/uproot. I'm closing this task as "won't do" given what uproot provides.