scikit-hep/awkward-0.x

Can't use gather operation on a filtered array

beojan opened this issue · 28 comments

I have code that looks like the following:

jets = tree['jets'][have_four_jets]
quarks = tree['quarks'][have_four_jets]

assoc = tree['jet_quark_assoc'][have_four_jets]
assoc = assoc[assoc != 99]

assoc_quarks = quarks[assoc]

However, np.size(quarks.content) != np.size(assoc.content) since quarks.content still points to the original unfiltered array, so this gives the error

IndexError: some indexes out of bounds for length 200000

where 200000 is the length of quarks.content.

I don't know enough about quarks and assoc to know if there is a problem. Are they both JaggedArrays? (I would assume so because you're talking about their .content).

If the filter have_four_jets is not jagged, then it should not change the .content of the jagged array to which it is applied. Flat filters should replace only the starts and stops, leaving content alone.
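A minimal numpy sketch of that behavior (illustrative only, not the awkward-array source) might look like this, where a jagged array is represented as (starts, stops, content):

```python
import numpy as np

# Illustrative sketch (not awkward-array's implementation): a flat boolean
# filter selects whole events, so only starts and stops are filtered and
# content is left alone.
content = np.arange(10.0)                    # 10 flattened values
starts = np.array([0, 3, 5, 9])              # four events
stops = np.array([3, 5, 9, 10])
mask = np.array([True, False, True, False])  # a flat (non-jagged) filter

new_starts, new_stops = starts[mask], stops[mask]
# content is untouched: surviving events still index into the original array
```

The surviving events keep their original offsets, which is why the filtered array's .content can still be the full 200000-element array.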

Can you get the error for a small slice of 10 or so events (first 10, last 10, somewhere in between) so that you can print them out? I always print out slices of my arrays to solve indexing problems. If quarks[assoc] fails, then (quarks[s])[assoc[s]] fails for some slice s, and finding an example slice will make it very clear what's wrong.

Oh! Hey! It just took a bit longer for me to think about it—there could be a bad assumption (i.e. assuming contiguous starts and stops) in the implementation of __getitem__ by jagged integer indexes that makes it fail (as above) for arrays that should logically work.

Try using .compact() on all of your jagged arrays, both the quarks and the assoc. If that makes it work, then that's the bad assumption and we need to do .compact() on either the quarks or the assoc (or both) in __getitem__.

What .compact() does is simplify the starts and stops, making them contiguous, and filter the content to give you a JaggedArray with the same meaning. That's exactly what you were asking for. (It should not happen automatically when you apply [have_four_jets], but it should be done internally by code that wants to assume contiguous starts and stops, and we may just be missing test cases for that.)
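Conceptually, compacting can be sketched in plain numpy like this (an illustration of the idea, not awkward-array's actual code):

```python
import numpy as np

# Illustrative sketch (not awkward-array's implementation) of compacting:
# rebuild contiguous offsets and filter the content to match.
content = np.arange(10.0)
starts = np.array([0, 5])   # non-contiguous after a flat filter
stops = np.array([3, 9])    # (the ranges [3, 5) and [9, 10) were dropped)

counts = stops - starts
offsets = np.concatenate([[0], np.cumsum(counts)])
compact_content = np.concatenate([content[a:b] for a, b in zip(starts, stops)])
compact_starts, compact_stops = offsets[:-1], offsets[1:]
# same logical JaggedArray, but now each stop equals the next start and
# compact_content carries no unused values
```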

That gives the error:

some indexes out of bounds for length 156264

156264 is the length of both .content arrays.

EDIT: In fact, using the slice 0:33 gives this error, while 0:32 works.

So it's not just a matter of the algorithm assuming contiguous starts and stops, because you provided that with .compact() and it still fails.

My original thought was that it's an error in the values of the assoc index—look at the values in element 33. You might see why it shouldn't work. At least print them out—we can't tell what's going on without knowing the values and the structure of these arrays.

I did print both element 32 and 33 of both arrays, and they both look fine:

np.size(q.content):  156264 , np.size(assoc.content):  156264

q[32]:  [TLorentzVector(-5.0028, 30.303, -46.777, 56.156) TLorentzVector(-84.361, -100.2, -64.974, 146.29) TLorentzVector(145.15, -84.904, 78.628, 185.69) TLorentzVector(35.454, -13.55, -43.099, 57.621)]
q[33]:  [TLorentzVector(75.276, 81.816, 6.6091, 111.47) TLorentzVector(-12.861, 152.63, -71.253, 169) TLorentzVector(-11.143, -124.26, 170.76, 211.53) TLorentzVector(2.1352, -33.484, -20.679, 39.692)]

assoc[32]:  [2 1 3 0] , assoc[33]:  [1 0 3 2]

Are there two implementations somewhere, one for arrays with 32 or fewer elements and one for arrays with more?

No, there aren't. It's one implementation for all sizes. 32 isn't at a conspicuous integer size boundary, like 256, either. (5-bit integers aren't an option.)

The reason for printing out an example of what fails is for me to try to reproduce it—I can't do much without that kind of information. This is the first that I learned that they're TLorentzVectors, for instance.

I'm trying to reproduce the effect by simulating each of your steps, but I'm not finding any errors:

import awkward

# "a" is like "quarks"
a = awkward.fromiter([[0.0, 1.1, 2.2], [], [3.3, 4.4], [5.5, 6.6, 7.7, 8.8], [9.9]])
# "i" is like "assoc"
i = awkward.fromiter([[1, 0, 0], [], [1], [3, 3, 0, 0], [0, 0, 0]])
# "m1" is like "have_four_jets"
m1 = awkward.fromiter([True, True, False, False, True])
# "m2" is like "assoc != 99" (after filtering for "have_four_jets")
m2 = awkward.fromiter([[True, False, True], [], [False, False, True]])

# filter like "have_four_jets"
am1 = a[m1]
im1 = i[m1]
# am1 is <JaggedArray [[0.0 1.1 2.2] [] [9.9]] at 0x7d961115ed30>
# im1 is <JaggedArray [[1 0 0] [] [0 0 0]] at 0x7d961115ee80>

# filter like "assoc != 99"
im2 = im1[m2]
# im2 is <JaggedArray [[1 0] [] [0]] at 0x7d961116cd30>

# in the end,
# a[i] is <JaggedArray [[1.1 0.0 0.0] [] [4.4] [8.8 8.8 5.5 5.5] [9.9 9.9 9.9]] at 0x7d961116c908>
# am1[im1] is <JaggedArray [[1.1 0.0 0.0] [] [9.9 9.9 9.9]] at 0x7d961116c2e8>
# am1[im2] is <JaggedArray [[1.1 0.0] [] [9.9]] at 0x7d961116ca58>

All of this is as expected—not just the lack of errors, but the specific values are what we expect them to be. (You don't deal in negative indexes, do you? That's why I didn't test them in this example.) I was thinking that if there was an error in indexing, even though I'm working with small array sizes, I would see some off-by-one error here that accumulates to an out-of-bounds in your example. No such luck.

I think I'm going to need to see the original file, and that might delay things by another day because I'm travelling.

I found the same thing with plain numbers. I'm trying to reproduce with TLorentzVectors but I'm having trouble making a JaggedArray out of a TLorentzVectorArray.

I didn't mean that I think it's relevant that they're TLorentzVectors, just illustrating how little I know about your example. Sorry for the confusion.

(Anyway, the way to make a jagged array of TLorentzVectors is to pass JaggedArrays as the components in TLorentzVectorArray.from_ptetaphim or one of the other constructors...)

Also, I noticed that the implementation,

https://github.com/scikit-hep/awkward-array/blob/master/awkward/array/jagged.py#L529-L548

is doing .tocompact() as a first step (through offsets derived from counts, passed to _tojagged with an underscore, which was how I did it before .tocompact() was introduced).

You didn't tell me on which line number you get the error. I might be able to figure it out from that without downloading the whole file. Which line is it?

IndexError                                Traceback (most recent call last)
<ipython-input-41-091abd66002d> in <module>
----> 1 quarks[truth_assoc]

/usr/lib/python3.7/site-packages/awkward/array/jagged.py in __getitem__(self, where)
    546                 indexes += head.tojagged(self._starts)._content
    547 
--> 548                 return self.copy(starts=head._starts, stops=head._stops, content=self._content[indexes])
    549 
    550             elif len(head.shape) == 1 and issubclass(head._content.dtype.type, (self.numpy.bool, self.numpy.bool_)):

/usr/lib/python3.7/site-packages/awkward/array/objects.py in __getitem__(self, where)
    188         head, tail = where[0], where[1:]
    189 
--> 190         content = self._content[head]
    191         if self._util_isinteger(head):
    192             if isinstance(tail, tuple) and tail == ():

/usr/lib/python3.7/site-packages/awkward/array/table.py in __getitem__(self, where)
    610             raise NotImplementedError("multidimensional index through a Table (TODO: needed for [0, n) -> [0, m) -> \"table\" -> ...)")
    611 
--> 612         newslice = self._newslice(head)
    613 
    614         if self._util_isinteger(newslice):

/usr/lib/python3.7/site-packages/awkward/array/table.py in _newslice(self, head)
    523                     head[negative] += length
    524                 if not self.numpy.bitwise_and(0 <= head, head < length).all():
--> 525                     raise IndexError("some indexes out of bounds for length {0}".format(length))
    526 
    527                 if self._view is None:

IndexError: some indexes out of bounds for length 200000

What's the stack trace when you do it with plain numbers?

None. It works with plain numbers.

That is, when I said "same thing" I meant same thing as you.

Oh, I see. Then JaggedArray.__getitem__ is probably the wrong place to look... (That's where I had been looking.)

It's probably in Table._newslice(head).

So it's probably true that quarks.mass[assoc] succeeds while quarks[assoc] fails, right? (So it is relevant that these are Tables, though not necessarily TLorentzVector.)

Yes. quarks.mass[assoc] works.

I've thought about a few other questions I can ask to try to remotely solve it ("debugging via telegraph"), but it would be easier if you could just send me the file. I assume that your original example

quarks = tree['quarks'][have_four_jets]
assoc = tree['jet_quark_assoc'][have_four_jets]
assoc = assoc[assoc != 99]
quarks[assoc]

is sufficient to reproduce the error if tree is the "Events" tree in this file? How is have_four_jets defined? Like this?

have_four_jets = (tree['jets'].counts >= 4)

Actually it's (assoc != 99).count_nonzero() == 4 (i.e., do we have all four associated jets?).

What I'm trying to do is get the quarks in the same order (in each event) as the jets, while removing events where I don't have four associated jets. quarks always has four TLorentzVectors per event.
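The intended per-event reordering can be sketched in plain Python (names and data are illustrative, assuming one assoc entry per jet giving the matched quark's index):

```python
# Plain-Python sketch of the intended jagged gather (illustrative names,
# assuming assoc[event][j] gives the index of the quark matched to jet j):
quarks = [["q0", "q1", "q2", "q3"],
          ["q0", "q1", "q2", "q3"]]
assoc = [[2, 1, 3, 0],
         [1, 0, 3, 2]]

# quarks[assoc] in awkward-array is equivalent to this per-event gather:
assoc_quarks = [[qs[k] for k in idx] for qs, idx in zip(quarks, assoc)]
```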

Anyway, I get the same error even if I remove the have_four_jets complication.

That sounds good, I'll use that to reproduce your error. But I still need the file.

I'm in a session of talks for the next 2 hours. The wifi here seems okay, so as long as the file is 10s or 100s of MB, it should be okay (I only need three branches). After this session, I'll be flying and out of contact, so I'll need the file soon or I'll have to leave this for tomorrow.

You know, if the file is too large, you can pull out the relevant data and save them in awkward-array's native file format:

import uproot, awkward
tree = uproot.open("~/irishep/uproot/tests/samples/HZZ-objects.root")["events"]

# turn the TLorentzVectorArray into a generic Table
muonp4 = tree.array("muonp4")
muonp4 = awkward.JaggedArray.zip(pt=muonp4.pt, eta=muonp4.eta, phi=muonp4.phi, mass=muonp4.mass)

# or if the cartesian coordinates are more numerically stable
muonp4 = tree.array("muonp4")
muonp4 = awkward.JaggedArray.zip(px=muonp4.x, py=muonp4.y, pz=muonp4.z, E=muonp4.t)
jetp4 = tree.array("jetp4")
jetp4 = awkward.JaggedArray.zip(px=jetp4.x, py=jetp4.y, pz=jetp4.z, E=jetp4.t)

# collect them into an event Table
events = awkward.Table(muons=muonp4, jets=jetp4)

# and save it
awkward.save("smaller.awkd", events)

# this is how you get it back
events2 = awkward.load("smaller.awkd")

# see?
events[0].tolist(), events2[0].tolist()
# {'muons': [{'px': -52.89945602416992, 'py': -11.654671669006348, 'pz': -8.16079330444336, 'E': 54.77949905395508}, {'px': 37.7377815246582, 'py': 0.6934735774993896, 'pz': -11.307581901550293, 'E': 39.401695251464844}], 'jets': []}
# {'muons': [{'px': -52.89945602416992, 'py': -11.654671669006348, 'pz': -8.16079330444336, 'E': 54.77949905395508}, {'px': 37.7377815246582, 'py': 0.6934735774993896, 'pz': -11.307581901550293, 'E': 39.401695251464844}], 'jets': []}

The current theory is that it's the Table that's causing the issue, not the TLorentzVector (which is a Table with extra stuff). If so, I should be able to reproduce the error from the "smaller.awkd" file. If not, then I'll have to build the TLorentzVectorArray, which is not a problem, either.

Here are the columns in Parquet format.

test.zip

Can you do them as awkd files? (Just replace the awkward.toparquet with awkward.save; the latter is one-to-one with the original awkward arrays.) The Parquet file format introduces masking at every level, which I have to remove to get back to your original example. (The masks mask out nothing, but structure is important for this bug.)

I've attempted to reconstruct your data like this:

x = awkward.fromparquet("quark_x.parquet").chunks[0].array
i = awkward.fromparquet("assoc.parquet").chunks[0].array
x = awkward.JaggedArray(x.content.starts, x.content.stops, x.content.content.content)
i = awkward.JaggedArray(i.content.starts, i.content.stops, i.content.content.content)

x.layout
#  layout 
# [    ()] JaggedArray(starts=layout[0], stops=layout[1], content=layout[2])
# [     0]   ndarray(shape=50000, dtype=dtype('int32'))
# [     1]   ndarray(shape=50000, dtype=dtype('int32'))
# [     2]   ndarray(shape=200000, dtype=dtype('float64'))

i.layout
#  layout 
# [    ()] JaggedArray(starts=layout[0], stops=layout[1], content=layout[2])
# [     0]   ndarray(shape=50000, dtype=dtype('int32'))
# [     1]   ndarray(shape=50000, dtype=dtype('int32'))
# [     2]   ndarray(shape=182108, dtype=dtype('int8'))

but I can't be sure I have the same structure as you because I had to do some trickery to work around the bit masks that Parquet introduced.

Another possible Parquet artifact: your assoc probably doesn't have type int8 (1-byte), does it? Parquet probably did that for compression.

I think I've reproduced the bug, and also narrowed it down to the issue of the int8 (1-byte) indexes, if that's what you have in your dataset.

x[i]                            # no error, as you reported for non-Tables
awkward.JaggedArray.zip(x)[i]   # raises an error because it's a Table

This is the error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/jpivarski/irishep/awkward-array/awkward/array/jagged.py", line 548, in __getitem__
    return self.copy(starts=head._starts, stops=head._stops, content=self._content[indexes])
  File "/home/jpivarski/irishep/awkward-array/awkward/array/table.py", line 612, in __getitem__
    newslice = self._newslice(head)
  File "/home/jpivarski/irishep/awkward-array/awkward/array/table.py", line 525, in _newslice
    raise IndexError("some indexes out of bounds for length {0}".format(length))
IndexError: some indexes out of bounds for length 200000

And in particular, this error does not happen if I prepare i like this:

i = awkward.JaggedArray(i.content.starts, i.content.stops, i.content.content.content.astype("i8"))

and I see a point in JaggedArray.__getitem__ that doesn't handle small integer types well.

In your dataset, are the assoc indexes 1-byte? It wouldn't be too surprising if they are, since local indexes easily fit in a single byte, but when I convert them to global indexes, that's not a big enough type.
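The suspected failure mode can be demonstrated with numpy alone (a sketch of the overflow, not awkward-array's actual code): adding an event's global start offset to int8 local indexes in int8 arithmetic silently wraps around.

```python
import numpy as np

# Sketch of the suspected overflow (not awkward-array's code): local indexes
# fit comfortably in int8, but adding a global start offset in int8 arithmetic
# silently wraps, producing negative "global" indexes that fail bounds checks.
local = np.array([2, 1, 3, 0], dtype=np.int8)  # like one event of assoc
start = np.array(126, dtype=np.int8)           # event begins at global index 126

wrong = local + start             # stays int8: 126 + 2 wraps around to -128
right = local.astype(np.int64) + 126  # widen first: correct global indexes
```

Negative "global" indexes like these would trip exactly the kind of bounds check seen in Table._newslice.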

Does the pull request (#174, above) fix your issue? (Another way of asking what I asked in the previous comment.)

Thanks, that's fixed it.

Thanks for your help, too!