scikit-hep/uproot3

AttributeError: no column named 'reshape' for ChunkedArray with jagged content

raymondEhlers opened this issue · 4 comments

When attempting to process a ChunkedArray containing jagged values loaded via uproot.lazyarrays(...), I receive: AttributeError: no column named 'reshape'. The full traceback can be seen in the following example:

In [14]: arrays["data_z"]
Out[14]: <ChunkedArray [[0.22628734 0.094208576 0.11197069 0.23886986] [0.2817931 0.02485017 0.31829283 ... 0.37544665 0.39175743 0.063571155] [0.084453866 0.022292202] ... [0.0815998 0.48151806 0.27003774 0.35759225 0.47045997 0.29828948] [0.049144566 0.26100245 0.37040257 ... 0.11787954 0.3755424 0.45226774] [0.4637775 0.3404903 0.3402615 ... 0.20760414 0.16756321 0.47770718]] at 0x00010f6814d0>

In [15]: np.sin(arrays["data_z"]) / arrays["data_z"]
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
test.py in <module>
----> 1 np.sin(arrays["data_z"]) / arrays["data_z"]

.venv/lib/python3.7/site-packages/numpy/lib/mixins.py in func(self, other)
     23         if _disables_array_ufunc(other):
     24             return NotImplemented
---> 25         return ufunc(self, other)
     26     func.__name__ = '__{}__'.format(name)
     27     return func

.venv/lib/python3.7/site-packages/awkward/array/chunked.py in __array_ufunc__(self, ufunc, method, *inputs, **kwargs)
    560         types = {}
    561         for batch in batches:
--> 562             result = getattr(ufunc, method)(*batch, **kwargs)
    563
    564             if isinstance(result, tuple):

.venv/lib/python3.7/site-packages/awkward/array/chunked.py in __array_ufunc__(self, ufunc, method, *inputs, **kwargs)
    560         types = {}
    561         for batch in batches:
--> 562             result = getattr(ufunc, method)(*batch, **kwargs)
    563
    564             if isinstance(result, tuple):

.venv/lib/python3.7/site-packages/awkward/array/jagged.py in __array_ufunc__(self, ufunc, method, *inputs, **kwargs)
   1025                         return content
   1026
-> 1027                 content = recurse(data)
   1028
   1029                 inputs[i] = self.JaggedArray(starts, stops, content)

.venv/lib/python3.7/site-packages/awkward/array/jagged.py in recurse(x)
   1014                             content = self.numpy.full(len(parents), x, dtype=x.dtype)
   1015                         else:
-> 1016                             content = x.reshape(-1)[parents]
   1017                         return content
   1018

.venv/lib/python3.7/site-packages/awkward/array/base.py in __getattr__(self, where)
    254                     raise AttributeError("while trying to get column {0}, an exception occurred:\n{1}: {2}".format(repr(where), type(err), str(err)))
    255             else:
--> 256                 raise AttributeError("no column named {0}".format(repr(where)))
    257
    258     def __dir__(self):

AttributeError: no column named 'reshape'

I can trigger it with some operations (such as the above), but not in every case (simple division works, but dividing three terms fails). As the traceback hints, it seems to be related to processing chunks: if I load the data using arrays(...), the operation works fine. Perhaps I've missed a detail, but I'm at a loss as to why this doesn't work. Any suggestions would be greatly appreciated!

You're seeing something real, but I'm going to recommend sticking with non-lazy arrays. It's failing because some jagged operations assume that their contents are NumPy-like (with a reshape method), and chunked arrays don't fully satisfy this contract. (Lazy arrays are "chunked virtual": the chunking is for files/baskets and the virtual part is for delayed reading.) Some of these bugs have already been patched, but it's a deep rabbit hole.
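To illustrate the mechanism, here is a minimal sketch of how that error message can arise. It does not use Awkward Array itself: ColumnLookupArray and flatten_and_index are hypothetical stand-ins, assuming only what the traceback shows, namely that attribute lookups which fail are reported as missing columns and that the jagged code calls x.reshape(-1)[parents].

```python
import numpy as np


class ColumnLookupArray:
    """Hypothetical stand-in for an array type that treats unknown
    attributes as column names (not the real ChunkedArray)."""

    def __init__(self, chunks):
        self.chunks = chunks

    def __getattr__(self, where):
        # Only called when normal attribute lookup fails, e.g. for "reshape".
        raise AttributeError("no column named {0}".format(repr(where)))


def flatten_and_index(x, parents):
    # Same shape as the failing line in the traceback: assumes x is NumPy-like.
    return x.reshape(-1)[parents]


parents = np.array([0, 0, 1])

# A real NumPy array satisfies the contract.
numpy_result = flatten_and_index(np.array([10.0, 20.0]), parents)

# The stand-in has no reshape method, so the lookup is reported as a missing column.
try:
    flatten_and_index(ColumnLookupArray([np.array([10.0]), np.array([20.0])]), parents)
    error_message = None
except AttributeError as err:
    error_message = str(err)

print(numpy_result)
print(error_message)
```

The point of the sketch is only that the message is about a missing method on the content, not about a missing branch in the file.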

It was exactly this inconsistency that motivated a complete rewrite of Awkward Array, which is nearly finished. I'm scheduled to port Uproot to the new Awkward Array in April. The quickest way to fix all of these issues is probably to focus on the update.

Lazy arrays are a more implicit way to do what uproot.iterate does. Are you able to get your analysis work done with that?
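A rough sketch of the explicit, chunk-at-a-time version of the calculation above. The helper sin_over_z is hypothetical, and plain NumPy arrays stand in for the jagged chunks so that the snippet runs without any ROOT files; the uproot.iterate call in the comment uses a made-up file pattern and tree name and is written from memory of the Uproot 3 interface, so check it against the documentation.

```python
import numpy as np


def sin_over_z(chunks, branch="data_z"):
    """Apply the calculation to each chunk separately.

    `chunks` is any iterable of dict-like objects mapping branch names to
    arrays, which is the assumed shape of what uproot.iterate yields.
    """
    for arrays in chunks:
        z = arrays[branch]
        yield np.sin(z) / z


# With Uproot 3 this would be something like (file pattern and tree name are hypothetical):
#   chunks = uproot.iterate("files-*.root", "tree", ["data_z"], namedecode="utf-8")
# Here, flat NumPy arrays stand in for the per-chunk jagged arrays.
fake_chunks = [
    {"data_z": np.array([0.1, 0.2])},
    {"data_z": np.array([0.3])},
]

results = list(sin_over_z(fake_chunks))
print(results)
```

Because each chunk is an ordinary (non-chunked) array, the ufunc never sees a ChunkedArray, which is why this route avoids the error.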

Thanks for your super rapid and detailed response! (as always!)

To back up a bit, I ran into this issue because I was trying to work around my other issue related to doubly jagged arrays requiring a Python loop. My goal was to calculate them once, and then store them in HDF5. The persistvirtual functionality of lazyarrays seemed ideal, so I was trying to use that to avoid reinventing the wheel.

Since this won't work until uproot 4 / awkward 1, I think I can use iterate for my analysis - I'll just work around it by calculating and converting everything to HDF5 straightaway.

Thanks for all your efforts! Even with hitting a few edge cases, these packages are tremendously useful! From my perspective, this can be closed, but I'll leave it up to you to decide since it won't be resolved until uproot 4.

Thanks for letting me know! It will be solved in a way that won't necessarily be recognizable: jagged array implementations simply won't be using functions like reshape, so I might want to close it now so that I don't forget or have to reevaluate whether it's relevant. On the other hand, other users might encounter this bug and I want to make it easier for them to find this recommendation, so for that reason, I'll leave it open until Uproot 4.

I've decided to close it because I'm trying to figure out what, exactly, is outstanding.