scikit-hep/root_numpy

beginner's question

matxil opened this issue · 17 comments

Hi,

I have a 2D NumPy array "backgr_x", created elsewhere, and I now want to convert it to a ROOT tree. I am using this code:

print "backgr_x shape ndim size: ", backgr_x.shape, " " , backgr_x.ndim, " ", backgr_x.size
tree = array2tree(backgr_x, name='tree')
tree.Scan()

The output is:

backgr_x shape ndim size:  (200, 784)   2   156800
Traceback (most recent call last):
  File "TranslateH5ToRoot.py", line 38, in <module>
    tree = array2tree(backgr_x, name='tree')
  ...
  File "root_numpy/src/tree.pyx", line 673, in _librootnumpy.array2tree
TypeError: object of type 'NoneType' has no len()

I am guessing I have to add names to the columns in the array, but I don't know how.
The line "backgr_x.dtype.names = ('x') " does not work.

This might be incredibly naive of me, but when I do:

print "backgr_x shape ndim size: ", backgr_x.shape, " " , backgr_x.ndim, " ", backgr_x.size
backgr_x = backgr_x.view(np.recarray)
tree = array2tree(backgr_x, name='tree')

I still get the same error ("NoneType" has no len())
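
A note on why both attempts fail: a plain 2D array has no field names, so its `dtype.names` is `None`, and viewing it as an `np.recarray` does not add any. The sketch below uses a zero-filled stand-in of the same shape (the real `backgr_x` is not shown in the thread). That `array2tree` fails when it asks for the length of that `None` is an inference from the error message, not something checked against the root_numpy source.

```python
import numpy as np

# Stand-in for backgr_x: same shape as in the post, contents are made up.
backgr_x = np.zeros((200, 784))

# A plain ndarray has no named fields.
names_plain = backgr_x.dtype.names

# Viewing as a recarray changes the array class, not the dtype,
# so there are still no field names.
names_view = backgr_x.view(np.recarray).dtype.names

print(names_plain, names_view)
```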

Did you read the documentation I linked you? You don't want a view. You need to create the array with named columns first and then that can be passed in:

>>> x = np.array([(1.0, 2), (3.0, 4)], dtype=[('x', float), ('y', int)])
>>> x
array([(1.0, 2), (3.0, 4)],
      dtype=[('x', '<f8'), ('y', '<i4')])

I didn't want to go into details, but I cannot create an array; I already have it. Moreover, it can have hundreds of columns and thousands of rows.
Somehow, the columns should get names like x1, x2, x3 etc., or just be called 1, 2, 3, 4 etc. automatically.
So the only thing I can do is somehow translate the array that I already have into a format that can then be used to make a tree.
If that's not possible, the only other way is to write out an ASCII file, then read it in from C++ code and write a TTree there.

@matxil I'm still not understanding. You must have the names of the branches (columns) somehow to make a TTree. You cannot have a TTree without named columns.

Okay. These are the details. I have an image. It's huge. It might be 256 x 256 pixels. I read thousands of those images. Each image has to be put in a TTree. Indeed, each column has to get a name.
But what I have is only an array of a thousand rows and (e.g.) 65536 columns.
Is there a way in Python to change that array into a format where all columns get assigned a label (e.g. 1, 2, 3, 4, etc.) and then translate it into a TTree?
If that's not possible, fair enough. I can write out the array in an ASCII file and read it in from a C++ program, and in C++ it is perfectly easy to write a loop that gives names to each one of the columns (i.e. TTree branches) without literally typing them all in.
But in Python I don't know how to make a call to TTree* tree = new TTree(....), so instead I have to call the root_numpy array2tree method, which requires an array of a certain format with named columns. So my question is how to give names to 65536 columns of an already existing array without going insane...

I don't think you should have so many (256*256) columns here. ROOT will perform really badly. I think what you're trying to do with the TTree is inefficiently designed and there are much better ways to do this -- such as storing a std::vector<std::vector<int>> in a single branch. This is something that can be done.

Yes, a vector might be another option (just a single vector would be enough, I think) but - again - I know how to do that in C++ but not in Python. I thought just passing an array into "array2tree" would do exactly that: put the entire (256*256) values into a single variable-length branch. Apparently not.

It will, but you first need to design the numpy array correctly to do that. The way you've designed it, it's just 2-dimensional, but you want a nested numpy array. You can also generate the column names dynamically if you really want to...

dtype=[('col{0:d}'.format(i), int) for i in range(65536)]

but I would just start with a simpler problem of storing a 4x4 into a single branch and then it's very easy to scale from there.

Okay..., I am not sure I understand what you mean, but it's late, maybe on Monday I will look at it again. For one thing, how do I add this "dtype = [(etc....)]" to my already existing array?
Anyway, thanks for your help. As I said, I am a beginner and not particularly keen on python anyway (unfortunately I don't have a choice) so maybe I just do something quick and dirty in C++ that will do the trick. I don't see much point in using ROOT in python, but as I said, I don't have a choice in this.

You can already do it pretty easily if you have your data in python..

>>> x = np.array(imagedata, dtype=[('col{0:d}'.format(i), int) for i in range(65536)])

Okay, that would even work if imagedata is an np.array already? Cool, that is exactly what I want then. I am sorry that my first question was not very clear, but since I am not familiar with python and numpy, I didn't really know the possibilities. Many thanks for your help

Yes. You should definitely learn numpy + python for sure.
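
One caveat for anyone following along: as far as I can tell, when `imagedata` is already a plain 2D ndarray, passing it to `np.array` with a structured dtype does not turn each row into one record in current NumPy versions. Each scalar is instead copied into every field and the result keeps its 2D shape. A minimal sketch of one way to get one record per row uses `numpy.lib.recfunctions.unstructured_to_structured` (NumPy 1.16 or later). The array contents, the small column count and the `col{i}` names are placeholders.

```python
import numpy as np
from numpy.lib import recfunctions as rfn

# Placeholder data: 3 "images" of 4 pixels each (the thread has 65536).
imagedata = np.arange(12).reshape(3, 4)

# One named column per pixel, generated rather than typed out.
dtype = np.dtype([('col{0:d}'.format(i), imagedata.dtype)
                  for i in range(imagedata.shape[1])])

# The last axis of the plain array becomes the fields of each record.
x = rfn.unstructured_to_structured(imagedata, dtype=dtype)

print(x.shape)        # one record per row
print(x.dtype.names)  # generated column names
print(x['col2'])      # third pixel of every image
```

The resulting `x` has named columns, which is the kind of array `array2tree` was said above to expect.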

Well, I am hoping that soon my project will move back to the beautiful clean world of C++ but, yeah, I guess you're right that just in case, I will have to learn a bit more python too

I am sorry, one last question. Suppose that instead of having 256 x 256 columns, each with a different name, I just want one single list of values: a thousand images would end up as a thousand entries in the Tree, and each entry would have one single variable (e.g. "*pixel_array") holding a list of 256 x 256 pixel values.
How would I do that?

Flatten the 256x256 into a single 1D vector of length 65536. So something like this:

>>> import numpy as np
>>> import root_numpy as rnp
>>> image_data = np.arange(0, 5*5).reshape(5, 5)
>>> data = np.array([(0, image_data.reshape(-1)),], dtype=[('image_id', 'int'), ('image_data', 'int', image_data.size)])
>>> tree = rnp.array2tree(data)
>>> tree.Print()
******************************************************************************
*Tree    :tree      : tree                                                   *
*Entries :        1 : Total =            1862 bytes  File  Size =          0 *
*        :          : Tree compression factor =   1.00                       *
******************************************************************************
*Br    0 :image_id  : image_id/L                                             *
*Entries :        1 : Total  Size=        679 bytes  One basket in memory    *
*Baskets :        0 : Basket Size=      32000 bytes  Compression=   1.00     *
*............................................................................*
*Br    1 :image_data : image_data[25]/L                                      *
*Entries :        1 : Total  Size=        891 bytes  One basket in memory    *
*Baskets :        0 : Basket Size=      32000 bytes  Compression=   1.00     *
*............................................................................*

I did a 5x5 for demonstration purposes.
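
To extend this to the case asked about, with one tree entry per image, a sketch along the following lines should work. It assumes the images are already flattened into a 2D array of shape (number of images, pixels per image), as `backgr_x` is; the contents and the small sizes are placeholders. Only the NumPy part is shown; the resulting array would be passed to `array2tree` in the same way as `data` in the example above.

```python
import numpy as np

# Placeholder for the real data: 4 images of 5x5 pixels, already flattened.
n_images, n_pixels = 4, 5 * 5
images = np.arange(n_images * n_pixels).reshape(n_images, n_pixels)

# One record per image: an id plus a fixed-length array of pixel values.
dtype = [('image_id', np.int64), ('image_data', np.int64, (n_pixels,))]
data = np.empty(n_images, dtype=dtype)
data['image_id'] = np.arange(n_images)
data['image_data'] = images

print(data.shape)                # (4,)
print(data['image_data'].shape)  # (4, 25)
```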

Wow, I am amazed about how complicated this all is. I would never ever have figured this out. Many thanks, I will try this out next week but it looks like it's exactly what I want.
At least the "image_data" looks like what I want. I can play with it next week and see what happens.
Thank you very much for your help!