virtual/base/chunked array crash going through uproot.daskframe
douglasdavis opened this issue · 8 comments
I'm seeing a crash via uproot.daskframe that appears to originate in awkward, but I'm not entirely sure, so please correct me if I'm wrong.
This happens with uproot.daskframe on some very large "flat" (scalar branches only) TTrees. The crash copied below comes from just trying to grab 2 branches from a list of 3 files (each file is a few GB with about 300 branches in total; I'm happy to put one in a public place if necessary). I've tried 1, 2, 5, 10, and all branches, and I get the same crash every time.
>>> uproot.daskframe(files, tree_name, branches=["pT_lep1", "pT_lep2"])
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/ddavis/.pyenv/versions/3.7.4/lib/python3.7/site-packages/uproot/tree.py", line 1985, in daskframe
array = dask.array.from_array(x, x.shape, fancy=True)
File "/home/ddavis/.pyenv/versions/3.7.4/lib/python3.7/site-packages/dask/array/core.py", line 2645, in from_array
asarray = not hasattr(x, "__array_function__")
File "/home/ddavis/.pyenv/versions/3.7.4/lib/python3.7/site-packages/awkward/array/base.py", line 241, in __getattr__
if where in self.columns:
File "/home/ddavis/.pyenv/versions/3.7.4/lib/python3.7/site-packages/awkward/array/base.py", line 648, in columns
return self._util_columns(set())
File "/home/ddavis/.pyenv/versions/3.7.4/lib/python3.7/site-packages/awkward/array/chunked.py", line 727, in _util_columns
return self._util_columns_descend(self._chunks[chunkid], seen)
File "/home/ddavis/.pyenv/versions/3.7.4/lib/python3.7/site-packages/awkward/array/base.py", line 644, in _util_columns_descend
return array._util_columns(seen)
File "/home/ddavis/.pyenv/versions/3.7.4/lib/python3.7/site-packages/awkward/array/virtual.py", line 476, in _util_columns
return self._util_columns_descend(self.array, seen)
File "/home/ddavis/.pyenv/versions/3.7.4/lib/python3.7/site-packages/awkward/array/base.py", line 644, in _util_columns_descend
return array._util_columns(seen)
File "/home/ddavis/.pyenv/versions/3.7.4/lib/python3.7/site-packages/awkward/array/chunked.py", line 727, in _util_columns
return self._util_columns_descend(self._chunks[chunkid], seen)
File "/home/ddavis/.pyenv/versions/3.7.4/lib/python3.7/site-packages/awkward/array/base.py", line 644, in _util_columns_descend
return array._util_columns(seen)
File "/home/ddavis/.pyenv/versions/3.7.4/lib/python3.7/site-packages/awkward/array/virtual.py", line 476, in _util_columns
return self._util_columns_descend(self.array, seen)
File "/home/ddavis/.pyenv/versions/3.7.4/lib/python3.7/site-packages/awkward/array/virtual.py", line 295, in array
return self.materialize()
File "/home/ddavis/.pyenv/versions/3.7.4/lib/python3.7/site-packages/awkward/array/virtual.py", line 326, in materialize
array = self._util_toarray(self._generator(*self._args, **self._kwargs), self.DEFAULTTYPE)
File "/home/ddavis/.pyenv/versions/3.7.4/lib/python3.7/site-packages/uproot/tree.py", line 1912, in __call__
return self.branch.array(interpretation=self.interpretation, entrystart=entrystart, entrystop=entrystop, flatten=self.flatten, awkwardlib=self.awkwardlib, cache=None, basketcache=self.basketcache, keycache=self.keycache, executor=self.executor, blocking=True)
File "/home/ddavis/.pyenv/versions/3.7.4/lib/python3.7/site-packages/uproot/tree.py", line 1395, in array
_delayedraise(fill(j))
File "/home/ddavis/.pyenv/versions/3.7.4/lib/python3.7/site-packages/uproot/tree.py", line 58, in _delayedraise
raise err.with_traceback(trc)
File "/home/ddavis/.pyenv/versions/3.7.4/lib/python3.7/site-packages/uproot/tree.py", line 1363, in fill
source = self._basket(i, interpretation, local_entrystart, local_entrystop, awkward, basketcache, keycache)
File "/home/ddavis/.pyenv/versions/3.7.4/lib/python3.7/site-packages/uproot/tree.py", line 1149, in _basket
basketcache[basketcachekey] = basketdata
File "/home/ddavis/.pyenv/versions/3.7.4/lib/python3.7/site-packages/uproot/cache.py", line 67, in __setitem__
self._cache[where] = what
File "/home/ddavis/.pyenv/versions/3.7.4/lib/python3.7/site-packages/cachetools/lru.py", line 21, in __setitem__
cache_setitem(self, key, value)
File "/home/ddavis/.pyenv/versions/3.7.4/lib/python3.7/site-packages/cachetools/cache.py", line 49, in __setitem__
raise ValueError('value too large')
ValueError: value too large
Lazy arrays in uproot (which is what uproot.daskframe produces) have a default basketcache of 1 MB in case you reread the same arrays. I should rethink that policy, because the default behavior of the VirtualArrays that lazily load data is to hold onto it indefinitely: unless you specify a more limited cache for the finalized arrays, you'll never go back to the baskets from which they are constructed.
The "value too large" error comes from cachetools when putting a single item into a cache would exceed the cache's limit. In principle, this is telling us that your ROOT file has at least one basket that is larger than 1 MB, which is unusual in my experience, but not impossible.
Here's something to try: set basketcache=uproot.ArrayCache("1 GB") to give the basketcache a 1 GB limit, or basketcache={} to let it hold everything indefinitely, or, maybe better, basketcache=cachetools.LRUCache(1) to hold just one item (one basket), independent of its size.
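To make the failure mode and the three workarounds concrete, here is a toy, standard-library-only model of a size-bounded LRU cache. ToyLRUCache and its getsizeof hook are hypothetical stand-ins written for this illustration, not the cachetools or uproot implementation. The sketch assumes that a byte-limited cache measures each value by its size in bytes, while a plain item-counting cache treats every value as size 1.

```python
from collections import OrderedDict


class ToyLRUCache:
    """Hypothetical stand-in for a size-bounded LRU cache (not the cachetools API)."""

    def __init__(self, maxsize, getsizeof=None):
        self.maxsize = maxsize
        # With no size function, every value counts as 1, so maxsize is an item count.
        self.getsizeof = getsizeof if getsizeof is not None else (lambda value: 1)
        self._data = OrderedDict()
        self._sizes = {}
        self.currsize = 0

    def __setitem__(self, key, value):
        size = self.getsizeof(value)
        if size > self.maxsize:
            # A single value that can never fit is rejected outright.
            raise ValueError("value too large")
        if key in self._data:
            self.currsize -= self._sizes.pop(key)
            del self._data[key]
        # Evict least recently used entries until the new value fits.
        while self.currsize + size > self.maxsize:
            old_key, _ = self._data.popitem(last=False)
            self.currsize -= self._sizes.pop(old_key)
        self._data[key] = value
        self._sizes[key] = size
        self.currsize += size

    def __getitem__(self, key):
        self._data.move_to_end(key)
        return self._data[key]

    def __len__(self):
        return len(self._data)


MB = 1024 ** 2
big_basket = bytes(2 * MB)  # stands in for a basket larger than 1 MB

# A 1 MB byte-limited cache cannot hold a 2 MB value at all.
byte_limited = ToyLRUCache(1 * MB, getsizeof=len)
try:
    byte_limited["basket-0"] = big_basket
    outcome = "stored"
except ValueError as err:
    outcome = str(err)

# A larger byte limit accepts the same value.
roomy = ToyLRUCache(1024 * MB, getsizeof=len)
roomy["basket-0"] = big_basket

# An item-counting cache of size 1 holds one value of any size.
one_item = ToyLRUCache(1)
one_item["basket-0"] = big_basket
one_item["basket-1"] = big_basket  # evicts basket-0
```

In this model the oversized value is rejected only when the cache measures bytes and the limit is smaller than the value, which is the situation the error message points to.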
I'm having trouble thinking of a good reason why there should be any default basketcache at all, considering the default behavior of VirtualArray to hold what it has read indefinitely. The only time you'd ever want a basketcache is if you've explicitly limited VirtualArray's cache through the cache parameter, which would be a non-default case.
If your file does not have > 1 MB baskets, then this could be related to scikit-hep/uproot3#317.
Thanks for the pointers. The "value too large" crash goes away with either uproot.ArrayCache("1 GB") or cachetools.LRUCache(1) as my basketcache argument (I tried both), but now Dask's meta_from_array appears to break because the input argument has no ndim attribute.
In [16]: cache = cachetools.LRUCache(1)
In [17]: uproot.daskframe(files, tree_name, basketcache=cache, branches=["pT_lep1", "pT_lep2"])
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-17-f2190a4d84a2> in <module>
----> 1 uproot.daskframe(files, tree_name, basketcache=cache, branches=["pT_lep1", "pT_lep2"])
~/.pyenv/versions/3.7.4/lib/python3.7/site-packages/uproot/tree.py in daskframe(path, treepath, branches, namedecode, entrysteps, flatten, awkwardlib, cache, basketcache, keycache, executor, localsource, xrootdsource, httpsource, **options)
1983 x = out[n]
1984 if len(x.shape) == 1:
-> 1985 array = dask.array.from_array(x, x.shape, fancy=True)
1986 series.append(dask.dataframe.from_dask_array(array, columns=n))
1987 else:
~/.pyenv/versions/3.7.4/lib/python3.7/site-packages/dask/array/core.py in from_array(x, chunks, name, lock, asarray, fancy, getitem, meta)
2698 meta = x
2699
-> 2700 return Array(dsk, name, chunks, meta=meta, dtype=getattr(x, "dtype", None))
2701
2702
~/.pyenv/versions/3.7.4/lib/python3.7/site-packages/dask/array/core.py in __new__(cls, dask, name, chunks, dtype, meta, shape)
1007 self.dask = dask
1008 self.name = name
-> 1009 meta = meta_from_array(meta, dtype=dtype)
1010
1011 if (
~/.pyenv/versions/3.7.4/lib/python3.7/site-packages/dask/array/utils.py in meta_from_array(x, ndim, dtype)
84
85 if ndim is None:
---> 86 ndim = x.ndim
87
88 try:
~/.pyenv/versions/3.7.4/lib/python3.7/site-packages/awkward/array/base.py in __getattr__(self, where)
245 raise AttributeError("while trying to get column {0}, an exception occurred:\n{1}: {2}".format(repr(where), type(err), str(err)))
246 else:
--> 247 raise AttributeError("no column named {0}".format(repr(where)))
248
249 def __dir__(self):
AttributeError: no column named 'ndim'
I guess that means that you do, indeed, have baskets larger than 1 MB.
My first action item is to remove the default basketcache from all lazy arrays because I don't see how it does any good, and it can cause this hard-to-understand issue for large baskets.
The second action item is to add an ndim property to all awkward-arrays. It should be len(self.shape).
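Why a missing ndim surfaces as "no column named 'ndim'" can be seen in a small sketch. ToyTable below is a hypothetical class, not awkward's implementation; it only assumes, as the tracebacks above suggest, that unknown attribute names fall through to __getattr__ and are treated as column lookups. The subclass adds the proposed property.

```python
class ToyTable:
    """Hypothetical array-like whose unknown attributes are treated as column names."""

    def __init__(self, columns, shape):
        self._columns = dict(columns)
        self.shape = shape

    def __getattr__(self, where):
        # Python calls this only when normal attribute lookup has failed.
        columns = self.__dict__.get("_columns", {})
        if where in columns:
            return columns[where]
        raise AttributeError("no column named {0}".format(repr(where)))


class ToyTableWithNdim(ToyTable):
    @property
    def ndim(self):
        # Number of dimensions, derived from the shape as proposed above.
        return len(self.shape)


plain = ToyTable({"pT_lep1": [1.0, 2.0]}, shape=(2,))
patched = ToyTableWithNdim({"pT_lep1": [1.0, 2.0]}, shape=(2,))

# Without the property, asking for ndim is handled as a column lookup and fails.
try:
    plain.ndim
    message = None
except AttributeError as err:
    message = str(err)

# hasattr() probes from other libraries also go through __getattr__.
has_array_function = hasattr(plain, "__array_function__")
```

Because a real property is found by normal lookup, it never reaches the column fallback, so patched.ndim works while plain.ndim does not.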
I went ahead and added a patch to awkward/array/base.py introducing an ndim property. Now it looks like self.columns is None (crash below). I couldn't figure out why that might be the case from searching the code for uses of columns.
In [11]: uproot.daskframe(files, tree_name, basketcache=cache)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-11-3beb08962cc8> in <module>
----> 1 uproot.daskframe(files, tree_name, basketcache=cache)
~/.pyenv/versions/3.7.4/lib/python3.7/site-packages/uproot/tree.py in daskframe(path, treepath, branches, namedecode, entrysteps, flatten, awkwardlib, cache, basketcache, keycache, executor, localsource, xrootdsource, httpsource, **options)
1983 x = out[n]
1984 if len(x.shape) == 1:
-> 1985 array = dask.array.from_array(x, x.shape, fancy=True)
1986 series.append(dask.dataframe.from_dask_array(array, columns=n))
1987 else:
~/.pyenv/versions/3.7.4/lib/python3.7/site-packages/dask/array/core.py in from_array(x, chunks, name, lock, asarray, fancy, getitem, meta)
2698 meta = x
2699
-> 2700 return Array(dsk, name, chunks, meta=meta, dtype=getattr(x, "dtype", None))
2701
2702
~/.pyenv/versions/3.7.4/lib/python3.7/site-packages/dask/array/core.py in __new__(cls, dask, name, chunks, dtype, meta, shape)
1022 raise ValueError(CHUNKS_NONE_ERROR_MESSAGE)
1023
-> 1024 self._meta = meta_from_array(meta, ndim=self.ndim, dtype=dtype)
1025
1026 for plugin in config.get("array_plugins", ()):
~/.pyenv/versions/3.7.4/lib/python3.7/site-packages/dask/array/utils.py in meta_from_array(x, ndim, dtype)
49 # If using x._meta, x must be a Dask Array, some libraries (e.g. zarr)
50 # implement a _meta attribute that are incompatible with Dask Array._meta
---> 51 if hasattr(x, "_meta") and isinstance(x, Array):
52 x = x._meta
53
~/.pyenv/versions/3.7.4/lib/python3.7/site-packages/awkward/array/base.py in __getattr__(self, where)
243 return super(AwkwardArray, self).__getattribute__(where)
244 else:
--> 245 if where in self.columns:
246 try:
247 return self[where]
TypeError: argument of type 'NoneType' is not iterable
I did a quick search of all the places where it might be.
- If ChunkedArray._util_columns encounters a case with zero self._chunks (weird; I don't remember if that's allowed), then it would implicitly return None (which then goes into defining ChunkedArray.columns).
That's the only case I found. A DaskFrame is made of lazy arrays, and a lazy array is a ChunkedArray of VirtualArrays, so this apparent mistake is probably it.
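The suspected mechanism can be reproduced in isolation. The function below is a hypothetical, simplified stand-in for the column lookup (chunks are plain dicts here, not awkward arrays); it assumes only that the loop returns from inside an if and has no return statement after it.

```python
def first_nonempty_columns(chunks):
    """Hypothetical, simplified lookup: columns of the first non-empty chunk."""
    for chunk in chunks:
        if len(chunk["rows"]) > 0:
            return chunk["columns"]
    # No return here: with zero chunks, or only empty ones, Python returns None.


with_data = [{"rows": [], "columns": ["pT"]}, {"rows": [1, 2], "columns": ["pT"]}]
only_empty = [{"rows": [], "columns": ["pT"]}]

found = first_nonempty_columns(with_data)
missing_no_chunks = first_nonempty_columns([])
missing_only_empty = first_nonempty_columns(only_empty)

# A membership test against None raises TypeError, as in the traceback above.
try:
    "ndim" in missing_only_empty
    raised = False
except TypeError:
    raised = True
```

Note that in this sketch a list of only empty chunks falls through the loop just as a list with no chunks does.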
Some print debugging in ChunkedArray._util_columns shows that a chunked array that displays as empty registers a len equal to 1.
diff --git a/awkward/array/chunked.py b/awkward/array/chunked.py
index 95b811b..1b33ff8 100644
--- a/awkward/array/chunked.py
+++ b/awkward/array/chunked.py
@@ -722,6 +722,10 @@ class ChunkedArray(awkward.array.base.AwkwardArray):
         if id(self) in seen:
             return []
         seen.add(id(self))
+        print(self._chunks)
+        print(len(self._chunks))
+        if len(self._chunks) == 0:
+            return []
         for chunkid in range(len(self._chunks)):
             self.knowchunksizes(chunkid + 1)
             if self._chunksizes[chunkid] > 0:
Giving me this output:
[<VirtualArray [284500 284500 284500 ... 284500 284500 284500] at 0x7f79f6ba09d0>, <VirtualArray [310000 310000 310000 ... 310000 310000 310000] at 0x7f79f6905d50>, <VirtualArray [300000 300000 300000 ... 300000 300000 300000] at 0x7f79f6905dd0>]
3
[<VirtualArray [284500 284500 284500 ... 284500 284500 284500] at 0x7f788136b5d0>]
1
[<VirtualArray [284500 284500 284500 ... 284500 284500 284500] at 0x7f79f6ba09d0>, <VirtualArray [310000 310000 310000 ... 310000 310000 310000] at 0x7f79f6905d50>, <VirtualArray [300000 300000 300000 ... 300000 300000 300000] at 0x7f79f6905dd0>]
3
[<VirtualArray [284500 284500 284500 ... 284500 284500 284500] at 0x7f788136b5d0>]
1
[<VirtualArray [284500 284500 284500 ... 284500 284500 284500] at 0x7f79f6ba09d0>, <VirtualArray [310000 310000 310000 ... 310000 310000 310000] at 0x7f79f6905d50>, <VirtualArray [300000 300000 300000 ... 300000 300000 300000] at 0x7f79f6905dd0>]
3
[<VirtualArray [284500 284500 284500 ... 284500 284500 284500] at 0x7f788136b5d0>]
1
[<ChunkedArray [] at 0x7f7b6e3abb10>]
1
before crashing again. Looks like ChunkedArray.__len__ will always be at least one because of offsets.
def __len__(self):
    self.knowchunksizes()
    return self.offsets[-1]
So then I went ahead with this patch:
diff --git a/awkward/array/chunked.py b/awkward/array/chunked.py
index 95b811b..f195092 100644
--- a/awkward/array/chunked.py
+++ b/awkward/array/chunked.py
@@ -726,6 +726,8 @@ class ChunkedArray(awkward.array.base.AwkwardArray):
             self.knowchunksizes(chunkid + 1)
             if self._chunksizes[chunkid] > 0:
                 return self._util_columns_descend(self._chunks[chunkid], seen)
+            else:
+                return []

     def _util_rowname(self, seen):
         if id(self) in seen:
and the crash is gone. If you think this makes sense, I'll open a PR from my branch.
I found some tests to confirm that empty chunks are legal. I think you're getting them because some files or some clusters are empty.
The offsets of a completely empty ChunkedArray (i.e. no chunks at all) is numpy.array([0]) because the length of the offsets array is always one more than the length of what it's counting and it always starts with zero. So the __len__ of a ChunkedArray can be zero; that's what you'd get from offsets[-1] of a completely empty ChunkedArray. (Incidentally, if the ChunkedArray were full of empty chunks, then offsets would be numpy.array([0, 0, ..., 0]), so the length of such a thing would also be zero. There are multiple ways to get a ChunkedArray of length zero.)
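The offsets bookkeeping described above can be sketched with plain lists. The helper names are hypothetical, and this is only an illustration of cumulative offsets, not awkward's code: offsets is a running sum of chunk sizes with a leading zero, and the total length is its last entry.

```python
from itertools import accumulate


def offsets_from_chunksizes(chunksizes):
    """Cumulative offsets: one entry more than chunksizes, always starting at 0."""
    return [0] + list(accumulate(chunksizes))


def total_length(chunksizes):
    # The total length is the last offset, mirroring offsets[-1] above.
    return offsets_from_chunksizes(chunksizes)[-1]


no_chunks = offsets_from_chunksizes([])          # no chunks at all
all_empty = offsets_from_chunksizes([0, 0, 0])   # three empty chunks
mixed = offsets_from_chunksizes([3, 0, 2])       # an empty chunk between two full ones
```

Both the no-chunk case and the all-empty case end in a final offset of zero, so either one gives a length of zero even though the number of chunks differs.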
I like your fix except that I think it should be indented one less than it is. The for loop over chunkid in line 725 is to look for the first non-empty chunk so that it can return the columns of that chunk. You want it to continue looking for a non-empty chunk until it finds one. So do this:
def _util_columns(self, seen):
    if id(self) in seen:
        return []
    seen.add(id(self))
    for chunkid in range(len(self._chunks)):
        self.knowchunksizes(chunkid + 1)
        if self._chunksizes[chunkid] > 0:
            return self._util_columns_descend(self._chunks[chunkid], seen)
    return []
(or equivalently, give the for loop an else clause, but I think this is clearer).
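The difference between the three placements of the fallback return can be checked on toy data. The functions below are hypothetical simplifications in which each chunk is a (rows, columns) pair; they are meant only to contrast the control flow, not to reproduce the awkward code.

```python
def columns_return_after_loop(chunks):
    # Fallback after the loop: reached only if no chunk was non-empty.
    for rows, columns in chunks:
        if len(rows) > 0:
            return columns
    return []


def columns_for_else(chunks):
    # Same behavior: the else clause of a for loop runs when the loop was not left early.
    for rows, columns in chunks:
        if len(rows) > 0:
            return columns
    else:
        return []


def columns_if_else(chunks):
    # Fallback attached to the if: gives up at the first empty chunk.
    for rows, columns in chunks:
        if len(rows) > 0:
            return columns
        else:
            return []


empty_then_full = [([], ["pT"]), ([1, 2], ["pT"])]

after_loop = columns_return_after_loop(empty_then_full)
for_else = columns_for_else(empty_then_full)
if_else = columns_if_else(empty_then_full)
```

With an empty chunk ahead of a non-empty one, the first two variants keep searching and find the columns, while the if/else variant stops at the empty chunk. With no chunks at all, the if/else variant still falls through and returns None.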
With that one change and the ndim property on awkward.array.base.AwkwardArray, yes, please do open a pull request, and I'll merge it. It would be even better if you increase the version number to 0.12.7 so I can deploy it with the new version.
Thank you!
Great! PR made with the adjusted empty-list return.