Jagged indexing doesn't work for ChunkedArrays
This is somewhat similar to the problem in #180. The problem here is that a ChunkedArray made out of JaggedArrays cannot be indexed by JaggedArrays:
>>> import awkward
>>> j = awkward.fromiter([[1]])
>>> j
<JaggedArray [[1]] at 0x7f68a3d71f60>
>>> c = awkward.ChunkedArray([j])
>>> c
<ChunkedArray [[1]] at 0x7f68b14497f0>
>>> mask = c.ones_like().astype(bool)
>>> mask
<ChunkedArray [[True]] at 0x7f68a3c85550>
>>> c[mask]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/mproffit/anaconda3/lib/python3.7/site-packages/awkward/array/chunked.py", line 480, in __getitem__
    raise TypeError("cannot interpret shape {0}, dtype {1} as a fancy index or mask".format(head.shape, head.dtype))
TypeError: cannot interpret shape (1, 1), dtype object as a fancy index or mask
This of course works just fine for standard JaggedArrays. I run into this problem with uproot because it's much more convenient to use lazyarrays() (which returns ChunkedArrays), but I can't use them while this problem exists.
+1 I would like to support this request
That's understandable, but I'm in a bit of a bind: I can either fix bugs on this version of awkward-array or make progress on the version that will replace it. Notice that all of the open issues that aren't enhancements involve ChunkedArray: this is a difficult one to get right in the present framework.
The redesign represented by awkward-1.0 is directly addressing this: the "layout" (e.g. JaggedArray vs IndexedArray of JaggedArray vs ChunkedArray of JaggedArray) will become an implementation detail inside a general Array class that abstracts over different layouts providing the same "high-level type" (e.g. jagged vs not-jagged). Specifically for this case, __getitem__ won't be asking whether the object in square brackets is a JaggedArray or a ChunkedArray of JaggedArray or a ChunkedArray of ChunkedArray of JaggedArray (there's an infinite set of possibilities); it will be asking if the high-level type is jagged. When ChunkedArrays report their type, they report the type of their contents.
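To make the dispatch idea concrete, here is a minimal sketch in plain Python. The classes `Jagged`, `Flat`, and `Chunked`, the `high_level_type` property, and `is_jagged_index` are all hypothetical stand-ins invented for illustration; they are not the awkward-array or awkward-1.0 API. The sketch also assumes every chunked layout has at least one chunk.

```python
# Hypothetical stand-ins for layouts; NOT the awkward-array classes.
class Jagged:
    def __init__(self, lists):
        self.lists = lists

    @property
    def high_level_type(self):
        return "jagged"


class Flat:
    def __init__(self, values):
        self.values = values

    @property
    def high_level_type(self):
        return "flat"


class Chunked:
    def __init__(self, chunks):
        # Assumption: at least one chunk, and all chunks share one type.
        self.chunks = chunks

    @property
    def high_level_type(self):
        # A chunked layout reports the type of its contents, so nesting
        # Chunked inside Chunked never changes the answer.
        return self.chunks[0].high_level_type


def is_jagged_index(index):
    # Dispatch on the high-level type, not on the concrete layout class.
    return index.high_level_type == "jagged"


nested = Chunked([Chunked([Jagged([[True], []])])])
print(is_jagged_index(nested))  # the nesting depth does not matter
```

With this shape, a `__getitem__` needs one branch for "jagged" instead of one branch per possible nesting of layout classes.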
Any work that I do to fix this version is a temporary patch, like fixing Python 2.7 in the era of Python 3 (except that I'm a team of one). Given that this would be a temporary solution anyway, could you do something like
array = awkward.concatenate(array.chunks)
to turn your ChunkedArray into a non-chunked array? Or do you run out of memory because it's a ChunkedArray of VirtualArray (i.e. a "lazy" array) and you really need the laziness? In uproot, iterate is a more explicit way of walking over chunks than lazyarray.
(Again, I understand that you have an immediate physics problem to solve and can't wait for pie-in-the-sky future developments; I'm just trying to find the best trade-offs. Sometimes you have to give up on some things in the present to reduce technical debt in the future.)
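If concatenating everything is too expensive, the other route mentioned above is to walk over the chunks explicitly and mask each one on its own. The sketch below shows only the logic, using nested Python lists as a stand-in for jagged arrays; `mask_jagged` and `mask_chunked` are hypothetical helpers, not awkward-array or uproot functions. It assumes the data and the mask are split at the same chunk boundaries.

```python
def mask_jagged(lists, masks):
    """Apply a jagged boolean mask to a jagged list, sublist by sublist."""
    return [
        [value for value, keep in zip(sublist, submask) if keep]
        for sublist, submask in zip(lists, masks)
    ]


def mask_chunked(chunks, mask_chunks):
    """Apply the mask one chunk at a time; chunk boundaries must line up."""
    if len(chunks) != len(mask_chunks):
        raise ValueError("data and mask must have the same number of chunks")
    return [mask_jagged(c, m) for c, m in zip(chunks, mask_chunks)]


chunk = [[1, 2, 3], [], [4, 5]]
chunks = [chunk, chunk]  # two chunks with identical contents

# Keep only even values, computed chunk by chunk.
mask_chunks = [[[x % 2 == 0 for x in sub] for sub in c] for c in chunks]

selected = mask_chunked(chunks, mask_chunks)
print(selected)
```

Because each chunk is handled independently, only one chunk needs to be in memory at a time, which is the property that matters for lazy arrays.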
From my side this is for development / testing of a future framework, so it is not needed immediately and I can wait. Do you have an idea of the timeline for the new version to be ready to use?
P.S. My use case is with lazyarrays (ChunkedArray of VirtualArray) since only lazyarrays has a profile option to pass in your own filling; iterate doesn't have this. Would it make sense to add this profile option to iterate?
In that case, it would be better to wait because you don't want to develop a framework around something that will change this much.
When I started a month ago, I said I expected it to take 6 months, and that still seems like a reasonable estimate for physics users. I was also trying to get a minimally testable product out in one month, and I'm close to what I meant by that, though a potential user would have to ask, "What is meant by minimal?"
What I expect to finish in a week is:
- awkward1.layout.NumpyArray, awkward1.layout.ListArray, awkward1.layout.ListOffsetsArray (which together represent jagged arrays), and awkward1.layout.TensorArray (which represents a regularly "shaped" array object, but not necessarily a Numpy array), with __iter__ and __getitem__ only (and only Numpy-like semantics; no jagged arrays in the square brackets yet).
This is the minimum set of layout classes that are needed to implement __getitem__ with Numpy-like semantics (beyond only NumpyArray, which wouldn't be interesting in itself because it just reproduces what Numpy can already do). I don't foresee wrapping these in a user-friendly awkward1.Array soon, though that is how they will eventually be used.
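For readers unfamiliar with how a jagged array can be built from flat pieces, here is a rough sketch of the offsets idea with only __iter__ and __getitem__, matching the scope described above. `OffsetsList` is a hypothetical pure-Python class written for this illustration; it is not the awkward1.layout implementation, and it handles only integer indices and step-1 slices.

```python
class OffsetsList:
    """Hypothetical jagged layout: flat content plus offsets marking sublist edges.

    Sublist i is content[offsets[i]:offsets[i + 1]].
    """

    def __init__(self, offsets, content):
        self.offsets = offsets
        self.content = content

    def __len__(self):
        return len(self.offsets) - 1

    def __getitem__(self, where):
        if isinstance(where, slice):
            start, stop, step = where.indices(len(self))
            if step != 1:
                raise NotImplementedError("only step-1 slices in this sketch")
            stop = max(start, stop)
            # Slicing only trims the offsets; the content is shared, not copied.
            return OffsetsList(self.offsets[start:stop + 1], self.content)
        if where < 0:
            where += len(self)
        if not 0 <= where < len(self):
            raise IndexError("index out of range")
        return self.content[self.offsets[where]:self.offsets[where + 1]]

    def __iter__(self):
        for i in range(len(self)):
            yield self[i]


# Represents [[1, 2, 3], [], [4, 5]]
array = OffsetsList([0, 3, 3, 5], [1, 2, 3, 4, 5])
print(list(array))
print(array[-1])
print(list(array[1:]))
```

The point of the design is that an empty sublist costs nothing in the content and a slice never touches the content at all.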
After that, I have to make a decision:
- Update the Numba wrappers, to keep Numba and C++ development in sync, ensuring that I don't make any C++ designs that would be hard to carry over to Numba?
- Implement awkward1.fromiter, since this is useful for making tests and demonstrating what awkward-array is good for.
Beyond those two, I'll be addressing RecordArray (the new name for Table). ChunkedArrays are quite a bit further down the list.
> P.S. My use case is with lazyarrays (ChunkedArray of VirtualArray) since only lazyarrays has a profile option to pass in your own filling; iterate doesn't have this. Would it make sense to add this profile option to iterate?
Oh! That's different: you're not interested in ChunkedArrays at all. You should be able to do the awkward.concatenate(array.chunks) trick to turn the lazy array into an eager array with no chunks.
Also, yes, there should be a profile option for eager arrays in uproot (tree.array, tree.arrays, tree.iterate, uproot.iterate...). In fact, that would be simpler to implement because it wouldn't involve all the indirection of trying to keep VirtualArrays from being prematurely materialized. It's just a matter of changing names and using IndexedArray to point to associated collections (e.g. jets associated with a muon). The biggest conceptual difficulty with that is that a user can ask for subsets of arrays, which might or might not include the associated collections you want to link across.
Maybe that shouldn't be something that uproot provides (because the eager-array uproot functions all have an interface that lets the user choose which arrays to read) but a renaming facility that happens afterward. That is, a function that looks at a dict of arrays and says, "If there are any named "Muon_(.*)", then JaggedArray.zip them into a collection named "muons" with "\1" as subnames," and "If "Muons_JetIdx" and "Jets_(.*)" are both present, then make the IndexedArray to associate one with the other." That's what the lazy-array profile is doing, except that all arrays are always available in VirtualArray form, and they have to be turned into new VirtualArrays that do this renaming on the fly. A separate function would just be a suite of if-thens that connect and rename arrays.
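As a starting point for such a function, here is a minimal sketch of the renaming step alone. `group_collections` is a hypothetical helper, and a nested dict stands in for what JaggedArray.zip would build; the IndexedArray cross-linking is left out. The branch names in the usage example are made up for illustration.

```python
import re


def group_collections(arrays, pattern, name):
    """Move every array whose key matches `pattern` into a sub-dict under `name`.

    The first regex group becomes the subname, mirroring the "\\1" idea.
    A plain dict stands in for a zipped collection in this sketch.
    """
    regex = re.compile(pattern)
    grouped = {}
    result = {}
    for key, value in arrays.items():
        match = regex.fullmatch(key)
        if match:
            grouped[match.group(1)] = value
        else:
            result[key] = value
    # Only create the collection if at least one branch matched.
    if grouped:
        result[name] = grouped
    return result


# Illustrative input: branch names and values are invented for this example.
arrays = {
    "Muon_pt": [[30.1, 12.5], []],
    "Muon_eta": [[0.4, -1.2], []],
    "run": [1, 1],
}
out = group_collections(arrays, r"Muon_(.*)", "muons")
print(sorted(out))
print(sorted(out["muons"]))
```

Because the function runs on whatever dict it is given, it copes naturally with a user who read only a subset of branches: collections that are absent are simply not created.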
Thanks for the info. I am only starting to play with this and this sounds interesting indeed. I want to make essentially a class-like table of event.muons.variable etc, like you do for the CMS nanoAOD but for an ATLAS format. Ideally, at the moment I think that on the "event" I want to be able to select events using the muons.variables etc but on the "muons" I also want to be able to select muons using muons.variables to make skimmed collections of muons, if that makes sense. I am going to PyHEP so maybe I will talk to you there about this some more. In the meantime I might have a go at such a function and any pointers as to how this should look (since I am not yet so familiar) would be useful.
That sounds great! Check into the coffea project, which has a kind of "DataFrame" for awkward-arrays, and it's not supposed to be exclusively CMS. You might find it an easier place to start generalizing.
An ATLAS profile would be very welcome, though as you can see we need better ChunkedArrays for the lazy profiles to be useful. Fortunately, the idea of this renaming can be separated from the idea of laziness.
However, do you know if you can read ATLAS data files in uproot? I had a lot of trouble trying to parse xAOD and DxAOD samples sent to me from Attila. Even the most understandable branches had types like std::vector<std::vector<X>>, which are not serialized by ROOT in a Numpy-friendly way (no way in Python to deserialize them without a for loop).
But we can definitely talk at PyHEP. See you there!
Thanks.
I should be clear that this is more of a personal project, aimed at a dataset further derived from the DxAOD.
@masonproffitt Here's a better example, though your example works, too:
import awkward
j = awkward.fromiter([[1, 2, 3], [], [4, 5]]) # ensure that the array is really jagged
c = awkward.ChunkedArray([j, j]) # let's have more than one chunk
mask = (c % 2 == 0) # a mask that isn't all True
mask
# → <ChunkedArray [[False True False] [] [True False] [False True False] [] [True False]]>
c[mask]
# → <ChunkedArray [[2] [] [4] [2] [] [4]]>