scikit-hep/awkward-0.x

Jagged indexing doesn't work for ChunkedArrays

Closed this issue · 11 comments

This is somewhat similar to the problem in #180. The problem here is that a ChunkedArray made out of JaggedArrays cannot be indexed by JaggedArrays:

>>> import awkward
>>> j = awkward.fromiter([[1]])
>>> j
<JaggedArray [[1]] at 0x7f68a3d71f60>
>>> c = awkward.ChunkedArray([j])
>>> c
<ChunkedArray [[1]] at 0x7f68b14497f0>
>>> mask = c.ones_like().astype(bool)
>>> mask
<ChunkedArray [[True]] at 0x7f68a3c85550>
>>> c[mask]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/mproffit/anaconda3/lib/python3.7/site-packages/awkward/array/chunked.py", line 480, in __getitem__
    raise TypeError("cannot interpret shape {0}, dtype {1} as a fancy index or mask".format(head.shape, head.dtype))
TypeError: cannot interpret shape (1, 1), dtype object as a fancy index or mask

This of course works just fine for standard JaggedArrays. I run into this problem with uproot because it's much more convenient to use lazyarrays() (which returns ChunkedArrays), but this problem prevents me from using them.

+1 I would like to support this request

That's understandable, but I'm in a bit of a bind: I can either fix bugs on this version of awkward-array or make progress on the version that will replace it. Notice that all of the open issues that aren't enhancements involve ChunkedArray: this is a difficult one to get right in the present framework.

The redesign represented by awkward-1.0 is directly addressing this—the "layout" (e.g. JaggedArray vs IndexedArray of JaggedArray vs ChunkedArray of JaggedArray) will become an implementation detail inside a general Array class that abstracts over different layouts providing the same "high-level type" (e.g. jagged vs not-jagged). Specifically for this case, __getitem__ won't be asking whether the object in square brackets is a JaggedArray or a ChunkedArray of JaggedArray or a ChunkedArray of ChunkedArray of JaggedArray (there's an infinite set of possibilities); it will be asking if the high-level type is jagged. When ChunkedArrays report their type, they report the type of their contents.
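As a rough illustration of the idea in the previous paragraph, here is a minimal pure-Python sketch. The classes `Jagged` and `Chunked` and the `high_level_type` property are hypothetical stand-ins invented for this example; they are not the awkward-array or awkward-1.0 API. The point is only that `__getitem__` asks "is the index jagged?" rather than checking which layout class it is, so arbitrarily nested chunking needs no special cases.

```python
# Hypothetical stand-ins for illustration only; NOT the awkward-array classes.

class Jagged:
    """A list of variable-length lists."""

    def __init__(self, lists):
        self.lists = [list(x) for x in lists]

    @property
    def high_level_type(self):
        return "jagged"

    def tolist(self):
        return [list(x) for x in self.lists]

    def __getitem__(self, where):
        # Dispatch on the high-level type of the index, not on its class.
        if getattr(where, "high_level_type", None) == "jagged":
            mask = where.tolist()
            if [len(m) for m in mask] != [len(x) for x in self.lists]:
                raise ValueError("jagged mask does not match the array's structure")
            return Jagged([[v for v, keep in zip(row, m) if keep]
                           for row, m in zip(self.lists, mask)])
        return self.lists[where]


class Chunked:
    """A sequence of chunks; each chunk may itself be Chunked."""

    def __init__(self, chunks):
        self.chunks = list(chunks)

    @property
    def high_level_type(self):
        # A chunked array reports the type of its contents.
        return self.chunks[0].high_level_type if self.chunks else "unknown"

    def tolist(self):
        out = []
        for chunk in self.chunks:
            out.extend(chunk.tolist())
        return out

    def __getitem__(self, where):
        if getattr(where, "high_level_type", None) == "jagged":
            # Same rule as Jagged: only the high-level type matters.
            return Jagged(self.tolist())[where]
        return self.tolist()[where]


j = Jagged([[1, 2, 3], [], [4, 5]])
m = Jagged([[False, True, False], [], [True, False]])
nested = Chunked([j, Chunked([j])])   # chunked array containing a chunked array
selected = nested[Chunked([m, m])]    # the mask's chunking differs from the array's
```

Here `selected.tolist()` contains the even values of each sublist, regardless of how the array or the mask happens to be chunked.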

Any work that I do to fix this version is a temporary patch, like fixing Python 2.7 in the era of Python 3 (except that I'm a team of one). Given that this would be a temporary solution anyway, could you do something like

array = awkward.concatenate(array.chunks)

to turn your ChunkedArray into a non-chunked array? Or do you run out of memory because it's a ChunkedArray of VirtualArray (i.e. a "lazy" array) and you really need the laziness? In uproot, iterate is a more explicit way of walking over chunks than lazyarray.
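The trade-off between the two options can be sketched in plain Python. Nested lists stand in for jagged arrays here, and `iter_chunks` and `select_even` are hypothetical helpers made up for this illustration; they are not uproot or awkward-array functions. Concatenating first holds every chunk in memory at once, while walking over chunks keeps only one chunk in flight and gives the same result.

```python
def iter_chunks(entries, chunk_size):
    """Yield consecutive slices of `entries` (stand-in for a chunk-wise reader)."""
    for start in range(0, len(entries), chunk_size):
        yield entries[start:start + chunk_size]


def select_even(chunk):
    """Apply a jagged selection (keep even values) within one chunk."""
    return [[v for v in event if v % 2 == 0] for event in chunk]


events = [[1, 2, 3], [], [4, 5], [6], [7, 8]]

# Option 1: concatenate all chunks into one array, then select once.
concatenated = [event for chunk in iter_chunks(events, 2) for event in chunk]
eager = select_even(concatenated)

# Option 2: walk over the chunks and select inside each one.
chunkwise = []
for chunk in iter_chunks(events, 2):
    chunkwise.extend(select_even(chunk))
```

Both `eager` and `chunkwise` end up with the same selected values; only the peak memory use differs.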

(Again, I understand that you have an immediate physics problem to solve and can't wait for pie-in-the-sky future developments; I'm just trying to find the best trade-offs. Sometimes you have to give up on some things in the present to reduce technical debt in the future.)

From my side this is for development / testing of a future framework, so it is not needed immediately and I can wait. Do you have an idea of the timeline for the new version to be ready to use?

P.S. My use case is with lazyarrays (ChunkedArray of VirtualArray) since only lazy array has a profile method to pass in your own filling; iterate doesn't have this. Would it make sense to add this profile method to iterate?

In that case, it would be better to wait because you don't want to develop a framework around something that will change this much.

When I started a month ago, I said I expected it to take 6 months, and that still seems like a reasonable estimate for physics users. I was also trying to get a minimally testable product out in one month, and I'm close to what I meant by that, though a potential user would have to ask, "What is meant by minimal?"

What I expect to finish in a week is:

  • awkward1.layout.NumpyArray, awkward1.layout.ListArray, awkward1.layout.ListOffsetsArray (which together represent jagged arrays), and awkward1.layout.TensorArray (which represents a regularly "shaped" array object, but not necessarily a Numpy array), with __iter__ and __getitem__ only (and only Numpy-like semantics; no jagged arrays in the square brackets yet).

This is the minimum set of layout classes that are needed to implement __getitem__ with Numpy-like semantics (beyond only NumpyArray, which wouldn't be interesting in itself because it just reproduces what Numpy can already do). I don't foresee wrapping these in a user-friendly awkward1.Array soon, though that is how they will eventually be used.
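To make "offsets plus content" concrete, here is a small pure-Python sketch of how a jagged array can be stored as a flat content buffer and an offsets list, with `__iter__` and `__getitem__` only. The class name `OffsetsJagged` is hypothetical and this is not the awkward1.layout implementation; it only illustrates the general technique, assuming contiguous slices and integer indexes.

```python
class OffsetsJagged:
    """Jagged array stored as flat `content` plus `offsets` (hypothetical sketch).

    Sublist i is content[offsets[i]:offsets[i + 1]].
    """

    def __init__(self, offsets, content):
        offsets = list(offsets)
        if len(offsets) < 1 or any(a > b for a, b in zip(offsets, offsets[1:])):
            raise ValueError("offsets must be non-empty and non-decreasing")
        if offsets[0] < 0 or offsets[-1] > len(content):
            raise ValueError("offsets must point inside content")
        self.offsets = offsets
        self.content = content

    def __len__(self):
        return len(self.offsets) - 1

    def __getitem__(self, where):
        if isinstance(where, slice):
            start, stop, step = where.indices(len(self))
            if step != 1:
                raise NotImplementedError("only contiguous slices in this sketch")
            stop = max(stop, start)
            # Slicing only trims the offsets; the content buffer is shared.
            return OffsetsJagged(self.offsets[start:stop + 1], self.content)
        if where < 0:
            where += len(self)
        if not 0 <= where < len(self):
            raise IndexError("index out of range")
        return self.content[self.offsets[where]:self.offsets[where + 1]]

    def __iter__(self):
        for i in range(len(self)):
            yield self[i]


array = OffsetsJagged([0, 3, 3, 5], [1, 2, 3, 4, 5])
sublists = list(array)        # three sublists; the middle one is empty
tail = list(array[1:])        # a slice reuses the same content buffer
```

Because a slice only produces a new offsets list, no content is copied until an individual sublist is requested.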

After that, I have to make a decision:

  • Update the Numba wrappers, to keep Numba and C++ development in sync and ensure that I don't make any C++ design choices that would be hard to carry over to Numba.
  • Implement awkward1.fromiter, since this is useful for making tests and demonstrating what awkward-array is good for.

Beyond those two, I'll be addressing RecordArray (the new name for Table). ChunkedArrays are quite a bit further down the list.

> P.S. My use case is with lazyarrays (ChunkedArray of VirtualArray) since only lazy array has a profile method to pass in your own filling; iterate doesn't have this. Would it make sense to add this profile method to iterate?

Oh! That's different—you're not interested in ChunkedArrays at all. You should be able to do the awkward.concatenate(array.chunks) trick to turn the lazy array into an eager array with no chunks.

Also, yes, there should be a profile option for eager arrays in uproot (tree.array, tree.arrays, tree.iterate, uproot.iterate...). In fact, that would be simpler to implement because it wouldn't involve all the indirection of trying to keep VirtualArrays from being prematurely materialized. It's just a matter of changing names and using IndexedArray to point to associated collections (e.g. jets associated with a muon). The biggest conceptual difficulty with that is that a user can ask for subsets of arrays, which might or might not include the associated collections you want to link across.

Maybe that shouldn't be something that uproot provides (because the eager-array uproot functions all have an interface that lets the user choose which arrays to read) but a renaming facility that happens afterward. That is, a function that looks at a dict of arrays and says, "If there are any named 'Muon_(.*)', then JaggedArray.zip them into a collection named 'muons' with '\1' as subnames," and "If 'Muons_JetIdx' and 'Jets_(.*)' are both present, then make the IndexedArray to associate one with the other." That's what the lazy-array profile is doing, except that all arrays are always available in VirtualArray form, and they have to be turned into new VirtualArrays that do this renaming on the fly. A separate function would just be a suite of if-thens that connect and rename arrays.
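A minimal sketch of such an after-the-fact renaming function, under loud assumptions: plain dicts and nested lists stand in for JaggedArray.zip and IndexedArray, and the names `group_collections`, `link_by_index`, and the branch names in the example are hypothetical, chosen only for illustration.

```python
import re


def group_collections(arrays, prefix_to_name):
    """Group keys like 'Muon_pt' into {'muons': {'pt': ...}}; other keys pass through."""
    out = {}
    for key, array in arrays.items():
        for prefix, name in prefix_to_name.items():
            match = re.fullmatch(re.escape(prefix) + r"_(.*)", key)
            if match:
                out.setdefault(name, {})[match.group(1)] = array
                break
        else:
            out[key] = array   # no prefix matched: keep the array under its own name
    return out


def link_by_index(index, target):
    """Per event, look up target values by index; a negative index means no match."""
    return [[t[i] if i >= 0 else None for i in idx]
            for idx, t in zip(index, target)]


# Hypothetical branch names and values, two events.
arrays = {
    "Muon_pt": [[10.0, 20.0], []],
    "Muon_jetIdx": [[1, -1], []],
    "Jet_pt": [[50.0, 60.0], [70.0]],
    "run": [1, 1],
}

grouped = group_collections(arrays, {"Muon": "muons", "Jet": "jets"})

# The if-then part: only link collections when both sides were actually read.
if "jetIdx" in grouped.get("muons", {}) and "jets" in grouped:
    grouped["muons"]["jet_pt"] = link_by_index(
        grouped["muons"]["jetIdx"], grouped["jets"]["pt"]
    )
```

Because the function only inspects which keys are present, it works whether the user read all arrays or just a subset; links are simply skipped when one side is missing.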

Thanks for the info. I am only starting to play with this and this sounds interesting indeed. I want to make essentially a class-like table of event.muons.variable etc., like you do for the CMS nanoAOD but for an ATLAS format. Ideally, on the "event" I want to be able to select events using muons.variable etc., but on the "muons" I also want to be able to select muons using muons.variable to make skimmed collections of muons, if that makes sense. I am going to PyHEP so maybe I will talk to you there about this some more. In the meantime I might have a go at such a function, and any pointers as to how this should look (since I am not yet so familiar) would be useful.

That sounds great! Check into the coffea project, which has a kind of "DataFrame" for awkward-arrays, and it's not supposed to be exclusively CMS. You might find it an easier place to start generalizing.

An ATLAS profile would be very welcome, though as you can see we need better ChunkedArrays for the lazy profiles to be useful. Fortunately, the idea of this renaming can be separated from the idea of laziness.

However, do you know if you can read ATLAS data files in uproot? I had a lot of trouble trying to parse xAOD and DxAOD samples sent to me from Attila. Even the most understandable branches had types like std::vector<std::vector<X>>, which are not serialized by ROOT in a Numpy-friendly way (no way in Python to deserialize them without a for loop).

But we can definitely talk at PyHEP. See you there!

Thanks.

I should be clear that this is more of a personal project and is aimed at a further derived dataset from the DxAOD.

@masonproffitt Here's a better example, though your example works, too:

import awkward

j = awkward.fromiter([[1, 2, 3], [], [4, 5]])  # ensure that the array is really jagged
c = awkward.ChunkedArray([j, j])               # let's have more than one chunk
mask = (c % 2 == 0)                            # a mask that isn't all True
mask
# → <ChunkedArray [[False True False] [] [True False] [False True False] [] [True False]]>

c[mask]
# → <ChunkedArray [[2] [] [4] [2] [] [4]]>

Fixed in PR #193, which will be part of 0.12.11.