Can't use gather operation on a filtered array
beojan opened this issue · 28 comments
I have code that looks like the following:
```python
jets = tree['jets'][have_four_jets]
quarks = tree['quarks'][have_four_jets]
assoc = tree['jet_quark_assoc'][have_four_jets]
assoc = assoc[assoc != 99]
assoc_quarks = quarks[assoc]
```
However, `np.size(quarks.content) != np.size(assoc.content)`, since `quarks.content` still points to the original unfiltered array, so this gives the error `IndexError: some indexes out of bounds for length 200000`, where 200000 is the length of `quarks.content`.
I don't know enough about `quarks` and `assoc` to know if there is a problem. Are they both JaggedArrays? (I would assume so, because you're talking about their `.content`.)
If the filter `have_four_jets` is not jagged, then it should not change the `.content` of the jagged array to which it is applied. Flat filters should replace only the `starts` and `stops`, leaving `content` alone.
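This behavior can be sketched with plain numpy (a toy illustration of the idea, not awkward-array internals; all names here are made up):

```python
import numpy as np

# Toy jagged array: three events with 2, 1, and 3 entries.
content = np.array([10, 11, 20, 30, 31, 32])
starts = np.array([0, 2, 3])
stops = np.array([2, 3, 6])

# A flat (one-per-event) boolean filter only selects rows of
# starts/stops; content is left alone, so it keeps its full length.
keep = np.array([True, False, True])
new_starts = starts[keep]   # [0, 3]
new_stops = stops[keep]     # [2, 6]

# The filtered jagged array still points into the original content:
events = [content[a:b] for a, b in zip(new_starts, new_stops)]
# events -> [array([10, 11]), array([30, 31, 32])]
```

Note that after filtering, `starts`/`stops` are no longer contiguous: the skipped event's slice of `content` is still there, just never referenced.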
Can you get an error for a small slice of 10 or so events (first 10, last 10, somewhere in between) so that you can print them out? I always print out slices of my arrays to solve indexing problems. If you can't do `quarks[assoc]`, then you can't do `(quarks[s])[assoc[s]]` for some slice `s`, and finding an example slice will make it very clear what's wrong.
Oh! Hey! It just took a bit longer for me to think about it: there could be a bad assumption (i.e. assuming contiguous `starts` and `stops`) in the implementation of `__getitem__` by jagged integer indexes that makes it fail (as above) for arrays that should logically work.

Try using `.compact()` on all of your jagged arrays, both the `quarks` and the `assoc`. If that makes it work, then that's the bad assumption and we need to do `.compact()` on either the `quarks` or the `assoc` (or both) in `__getitem__`.
What `.compact()` does is simplify the `starts` and `stops`, making them contiguous, and filter the `content` to give you a JaggedArray with the same meaning. That's exactly what you were asking for. (It should not happen automatically when you apply `[have_four_jets]`, but it should be done internally by code that wants to assume contiguous `starts` and `stops`, and we may just be missing test cases for that.)
That gives the error `some indexes out of bounds for length 156264`, where 156264 is the length of both `.content` arrays.

EDIT: In fact, using the slice `0:33` gives this error, while `0:32` works.
So it's not just a matter of the algorithm assuming contiguous `starts` and `stops`, because you provided that with `.compact()` and it still fails.
My original thought was that it's an error in the values of the `assoc` index: look at the values in element `33`. You might see why it shouldn't work. At least print them out; we can't tell what's going on without knowing the values and the structure of these arrays.
I did print elements 32 and 33 of both arrays, and they both look fine:

```
np.size(q.content): 156264 , np.size(assoc.content): 156264
q[32]: [TLorentzVector(-5.0028, 30.303, -46.777, 56.156) TLorentzVector(-84.361, -100.2, -64.974, 146.29) TLorentzVector(145.15, -84.904, 78.628, 185.69) TLorentzVector(35.454, -13.55, -43.099, 57.621)]
q[33]: [TLorentzVector(75.276, 81.816, 6.6091, 111.47) TLorentzVector(-12.861, 152.63, -71.253, 169) TLorentzVector(-11.143, -124.26, 170.76, 211.53) TLorentzVector(2.1352, -33.484, -20.679, 39.692)]
assoc[32]: [2 1 3 0] , assoc[33]: [1 0 3 2]
```
Are there two implementations somewhere, one for arrays with 32 or fewer elements and one for arrays with more?
No, there aren't. It's one implementation for all sizes. 32 isn't at a conspicuous integer size boundary, like 256, either. (5-bit integers aren't an option.)
The reason for printing out an example of what fails is for me to try to reproduce it—I can't do much without that kind of information. This is the first that I learned that they're TLorentzVectors, for instance.
I'm trying to reproduce the effect by simulating each of your steps, but I'm not finding any errors:
```python
import awkward

# "a" is like "quarks"
a = awkward.fromiter([[0.0, 1.1, 2.2], [], [3.3, 4.4], [5.5, 6.6, 7.7, 8.8], [9.9]])

# "i" is like "assoc"
i = awkward.fromiter([[1, 0, 0], [], [1], [3, 3, 0, 0], [0, 0, 0]])

# "m1" is like "have_four_jets"
m1 = awkward.fromiter([True, True, False, False, True])

# "m2" is like "assoc != 99" (after filtering for "have_four_jets")
m2 = awkward.fromiter([[True, False, True], [], [False, False, True]])

# filter like "have_four_jets"
am1 = a[m1]
im1 = i[m1]
# am1 is <JaggedArray [[0.0 1.1 2.2] [] [9.9]] at 0x7d961115ed30>
# im1 is <JaggedArray [[1 0 0] [] [0 0 0]] at 0x7d961115ee80>

# filter like "assoc != 99"
im2 = im1[m2]
# im2 is <JaggedArray [[1 0] [] [0]] at 0x7d961116cd30>

# in the end,
# a[i]      is <JaggedArray [[1.1 0.0 0.0] [] [4.4] [8.8 8.8 5.5 5.5] [9.9 9.9 9.9]] at 0x7d961116c908>
# am1[im1]  is <JaggedArray [[1.1 0.0 0.0] [] [9.9 9.9 9.9]] at 0x7d961116c2e8>
# am1[im2]  is <JaggedArray [[1.1 0.0] [] [9.9]] at 0x7d961116ca58>
```
All of this is as expected—not just the lack of errors, but the specific values are what we expect them to be. (You don't deal in negative indexes, do you? That's why I didn't test them in this example.) I was thinking that if there was an error in indexing, even though I'm working with small array sizes, I would see some off-by-one error here that accumulates to an out-of-bounds in your example. No such luck.
I think I'm going to need to see the original file, and that might delay by another day because I'm travelling.
I found the same thing with plain numbers. I'm trying to reproduce with `TLorentzVector`s, but I'm having trouble making a `JaggedArray` out of a `TLorentzVectorArray`.
I didn't mean that I think it's relevant that they're TLorentzVectors, just illustrating how little I know about your example. Sorry for the confusion.
(Anyway, the way to make a jagged array of TLorentzVectors is to pass JaggedArrays as the components to `TLorentzVectorArray.from_ptetaphim` or one of the other constructors...)
Also, I noticed that the implementation,

https://github.com/scikit-hep/awkward-array/blob/master/awkward/array/jagged.py#L529-L548

is doing `.tocompact()` as a first step (through offsets derived from counts, passed to `_tojagged` with an underscore, which was how I did it before `.tocompact()` was introduced).
You didn't tell me on which line number you get the error. I might be able to figure it out from that without downloading the whole file. Which line is it?
```
IndexError                                Traceback (most recent call last)
<ipython-input-41-091abd66002d> in <module>
----> 1 quarks[truth_assoc]

/usr/lib/python3.7/site-packages/awkward/array/jagged.py in __getitem__(self, where)
    546                 indexes += head.tojagged(self._starts)._content
    547
--> 548                 return self.copy(starts=head._starts, stops=head._stops, content=self._content[indexes])
    549
    550             elif len(head.shape) == 1 and issubclass(head._content.dtype.type, (self.numpy.bool, self.numpy.bool_)):

/usr/lib/python3.7/site-packages/awkward/array/objects.py in __getitem__(self, where)
    188         head, tail = where[0], where[1:]
    189
--> 190         content = self._content[head]
    191         if self._util_isinteger(head):
    192             if isinstance(tail, tuple) and tail == ():

/usr/lib/python3.7/site-packages/awkward/array/table.py in __getitem__(self, where)
    610             raise NotImplementedError("multidimensional index through a Table (TODO: needed for [0, n) -> [0, m) -> \"table\" -> ...)")
    611
--> 612         newslice = self._newslice(head)
    613
    614         if self._util_isinteger(newslice):

/usr/lib/python3.7/site-packages/awkward/array/table.py in _newslice(self, head)
    523             head[negative] += length
    524             if not self.numpy.bitwise_and(0 <= head, head < length).all():
--> 525                 raise IndexError("some indexes out of bounds for length {0}".format(length))
    526
    527         if self._view is None:

IndexError: some indexes out of bounds for length 200000
```
What's the stack trace when you do it with plain numbers?
None. It works with plain numbers.
That is, when I said "same thing" I meant same thing as you.
Oh, I see. Then `JaggedArray.__getitem__` is probably the wrong place to look... (That's where I had been looking.) It's probably in `Table._newslice(head)`.

So it's probably true that `quarks.mass[assoc]` succeeds while `quarks[assoc]` fails, right? (So it is relevant that these are `Table`s, though not necessarily `TLorentzVector`s.)
Yes. `quarks.mass[assoc]` works.
I've thought about a few other questions I can ask to try to remotely solve it ("debugging via telegraph"), but it would be easier if you could just send me the file. I assume that your original example

```python
quarks = tree['quarks'][have_four_jets]
assoc = tree['jet_quark_assoc'][have_four_jets]
assoc = assoc[assoc != 99]
quarks[assoc]
```

is sufficient to reproduce the error if `tree` is the `"Events"` tree in this file? How is `have_four_jets` defined? Like this?

```python
have_four_jets = (tree['jets'].counts >= 4)
```
Actually it's `(assoc != 99).count_nonzero() != 4` (i.e. do we have at least four associated jets).

What I'm trying to do is get the quarks in the same order (in each event) as the jets, while removing events where I don't have four associated jets. `quarks` always has four `TLorentzVector`s per event.

Anyway, I get the same error even if I remove the `have_four_jets` complication.
That sounds good, I'll use that to reproduce your error. But I still need the file.
I'm in a session of talks for the next 2 hours. The wifi here seems okay, so as long as the file is 10s or 100s of MB, it should be okay (I only need three branches). After this session, I'll be flying and out of contact, so I'll need the file soon or I'll have to leave this for tomorrow.
You know, if the file is too large, you can pull out the relevant data and save them in awkward-array's native file format:
```python
import uproot, awkward

tree = uproot.open("~/irishep/uproot/tests/samples/HZZ-objects.root")["events"]

# turn the TLorentzVectorArray into a generic Table
muonp4 = tree.array("muonp4")
muonp4 = awkward.JaggedArray.zip(pt=muonp4.pt, eta=muonp4.eta, phi=muonp4.phi, mass=muonp4.mass)

# or if the cartesian coordinates are more numerically stable
muonp4 = tree.array("muonp4")
muonp4 = awkward.JaggedArray.zip(px=muonp4.x, py=muonp4.y, pz=muonp4.z, E=muonp4.t)

jetp4 = tree.array("jetp4")
jetp4 = awkward.JaggedArray.zip(px=jetp4.x, py=jetp4.y, pz=jetp4.z, E=jetp4.t)

# collect them into an event Table
events = awkward.Table(muons=muonp4, jets=jetp4)

# and save it
awkward.save("smaller.awkd", events)

# this is how you get it back
events2 = awkward.load("smaller.awkd")

# see?
events[0].tolist(), events2[0].tolist()
# {'muons': [{'px': -52.89945602416992, 'py': -11.654671669006348, 'pz': -8.16079330444336, 'E': 54.77949905395508}, {'px': 37.7377815246582, 'py': 0.6934735774993896, 'pz': -11.307581901550293, 'E': 39.401695251464844}], 'jets': []}
# {'muons': [{'px': -52.89945602416992, 'py': -11.654671669006348, 'pz': -8.16079330444336, 'E': 54.77949905395508}, {'px': 37.7377815246582, 'py': 0.6934735774993896, 'pz': -11.307581901550293, 'E': 39.401695251464844}], 'jets': []}
```
The current theory is that it's the `Table` that's causing the issue, not the `TLorentzVector` (which is a `Table` with extra stuff). If so, I should be able to reproduce the error from the `"smaller.awkd"` file. If not, then I'll have to build the `TLorentzVectorArray`, which is not a problem, either.

Can you do them as awkd files? (Just replace the `awkward.toparquet` with `awkward.save`. The latter is one-to-one with the original awkward arrays.) The Parquet file format introduces masking at every level that I have to remove to get back to your original example. (The masks are masking out nothing, but structure is important for this bug.)
I've attempted to reconstruct your data like this:
```python
x = awkward.fromparquet("quark_x.parquet").chunks[0].array
i = awkward.fromparquet("assoc.parquet").chunks[0].array
x = awkward.JaggedArray(x.content.starts, x.content.stops, x.content.content.content)
i = awkward.JaggedArray(i.content.starts, i.content.stops, i.content.content.content)

x.layout
# layout
# [    ()] JaggedArray(starts=layout[0], stops=layout[1], content=layout[2])
# [     0]   ndarray(shape=50000, dtype=dtype('int32'))
# [     1]   ndarray(shape=50000, dtype=dtype('int32'))
# [     2]   ndarray(shape=200000, dtype=dtype('float64'))

i.layout
# layout
# [    ()] JaggedArray(starts=layout[0], stops=layout[1], content=layout[2])
# [     0]   ndarray(shape=50000, dtype=dtype('int32'))
# [     1]   ndarray(shape=50000, dtype=dtype('int32'))
# [     2]   ndarray(shape=182108, dtype=dtype('int8'))
```
but I can't be sure I have the same structure as you because I had to do some trickery to work around the bit masks that Parquet introduced.
Another possible Parquet artifact: your `assoc` probably doesn't have type `int8` (1-byte), does it? Parquet probably did that for compression.
I think I've reproduced the bug, and also narrowed it down to the issue of the `int8` (1-byte) indexes, if that's what you have in your dataset.

```python
x[i]                           # no error, as you reported for non-Tables
awkward.JaggedArray.zip(x)[i]  # raises an error because it's a Table
```
This is the error:

```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/jpivarski/irishep/awkward-array/awkward/array/jagged.py", line 548, in __getitem__
    return self.copy(starts=head._starts, stops=head._stops, content=self._content[indexes])
  File "/home/jpivarski/irishep/awkward-array/awkward/array/table.py", line 612, in __getitem__
    newslice = self._newslice(head)
  File "/home/jpivarski/irishep/awkward-array/awkward/array/table.py", line 525, in _newslice
    raise IndexError("some indexes out of bounds for length {0}".format(length))
IndexError: some indexes out of bounds for length 200000
```
And in particular, this error does not happen if I prepare `i` like this:

```python
i = awkward.JaggedArray(i.content.starts, i.content.stops, i.content.content.content.astype("i8"))
```

and I see a point in `JaggedArray.__getitem__` that doesn't handle small integer types well.

In your dataset, are the `assoc` indexes 1-byte? It wouldn't be too surprising if they are, since local indexes are likely to be below 255, but when I convert them to global indexes, that's not a big enough type.
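The suspected failure mode can be demonstrated with plain numpy (a sketch with made-up numbers, not the actual awkward code path): adding a large per-event start offset to an `int8` index array in place keeps the small dtype and wraps around instead of widening.

```python
import numpy as np

# Local (per-event) indexes fit comfortably in int8...
local = np.array([1, 0, 3, 2], dtype=np.int8)

# ...but converting them to global indexes by adding an event's start
# offset *in place* keeps the int8 dtype and silently wraps around.
start = np.int64(200)
global_bad = local.copy()
global_bad += start   # still int8: e.g. 1 + 200 = 201 -> 201 - 256 = -55

# Widening to a 64-bit dtype first gives the correct global indexes.
global_ok = local.astype(np.int64) + start   # [201, 200, 203, 202]
```

The wrapped negative values then fail the bounds check when used to index the content, which matches the "some indexes out of bounds" error above; casting with `.astype("i8")` before the addition avoids it.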
Does the pull request (#174, above) fix your issue? (Another way of asking what I asked in the previous comment.)
Thanks, that's fixed it.
Thanks for your help, too!