scikit-hep/awkward-0.x

virtual/base/chunked array crash going through uproot.daskframe

douglasdavis opened this issue · 8 comments

I'm seeing a crash via uproot.daskframe that appears to be originating from awkward but I'm not entirely sure so please correct me if I'm wrong there.

This is coming from uproot.daskframe for some very large "flat" (scalar branches only) TTrees. The crash copied below is from just trying to grab 2 branches from a list of 3 files (each file is a few GB total with about 300 branches total; I'm happy to put one in a public place if necessary). I've experimented with 1, 2, 5, 10, and all branches: same crash every time.

>>> uproot.daskframe(files, tree_name, branches=["pT_lep1", "pT_lep2"])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/ddavis/.pyenv/versions/3.7.4/lib/python3.7/site-packages/uproot/tree.py", line 1985, in daskframe
    array = dask.array.from_array(x, x.shape, fancy=True)
  File "/home/ddavis/.pyenv/versions/3.7.4/lib/python3.7/site-packages/dask/array/core.py", line 2645, in from_array
    asarray = not hasattr(x, "__array_function__")
  File "/home/ddavis/.pyenv/versions/3.7.4/lib/python3.7/site-packages/awkward/array/base.py", line 241, in __getattr__
    if where in self.columns:
  File "/home/ddavis/.pyenv/versions/3.7.4/lib/python3.7/site-packages/awkward/array/base.py", line 648, in columns
    return self._util_columns(set())
  File "/home/ddavis/.pyenv/versions/3.7.4/lib/python3.7/site-packages/awkward/array/chunked.py", line 727, in _util_columns
    return self._util_columns_descend(self._chunks[chunkid], seen)
  File "/home/ddavis/.pyenv/versions/3.7.4/lib/python3.7/site-packages/awkward/array/base.py", line 644, in _util_columns_descend
    return array._util_columns(seen)
  File "/home/ddavis/.pyenv/versions/3.7.4/lib/python3.7/site-packages/awkward/array/virtual.py", line 476, in _util_columns
    return self._util_columns_descend(self.array, seen)
  File "/home/ddavis/.pyenv/versions/3.7.4/lib/python3.7/site-packages/awkward/array/base.py", line 644, in _util_columns_descend
    return array._util_columns(seen)
  File "/home/ddavis/.pyenv/versions/3.7.4/lib/python3.7/site-packages/awkward/array/chunked.py", line 727, in _util_columns
    return self._util_columns_descend(self._chunks[chunkid], seen)
  File "/home/ddavis/.pyenv/versions/3.7.4/lib/python3.7/site-packages/awkward/array/base.py", line 644, in _util_columns_descend
    return array._util_columns(seen)
  File "/home/ddavis/.pyenv/versions/3.7.4/lib/python3.7/site-packages/awkward/array/virtual.py", line 476, in _util_columns
    return self._util_columns_descend(self.array, seen)
  File "/home/ddavis/.pyenv/versions/3.7.4/lib/python3.7/site-packages/awkward/array/virtual.py", line 295, in array
    return self.materialize()
  File "/home/ddavis/.pyenv/versions/3.7.4/lib/python3.7/site-packages/awkward/array/virtual.py", line 326, in materialize
    array = self._util_toarray(self._generator(*self._args, **self._kwargs), self.DEFAULTTYPE)
  File "/home/ddavis/.pyenv/versions/3.7.4/lib/python3.7/site-packages/uproot/tree.py", line 1912, in __call__
    return self.branch.array(interpretation=self.interpretation, entrystart=entrystart, entrystop=entrystop, flatten=self.flatten, awkwardlib=self.awkwardlib, cache=None, basketcache=self.basketcache, keycache=self.keycache, executor=self.executor, blocking=True)
  File "/home/ddavis/.pyenv/versions/3.7.4/lib/python3.7/site-packages/uproot/tree.py", line 1395, in array
    _delayedraise(fill(j))
  File "/home/ddavis/.pyenv/versions/3.7.4/lib/python3.7/site-packages/uproot/tree.py", line 58, in _delayedraise
    raise err.with_traceback(trc)
  File "/home/ddavis/.pyenv/versions/3.7.4/lib/python3.7/site-packages/uproot/tree.py", line 1363, in fill
    source = self._basket(i, interpretation, local_entrystart, local_entrystop, awkward, basketcache, keycache)
  File "/home/ddavis/.pyenv/versions/3.7.4/lib/python3.7/site-packages/uproot/tree.py", line 1149, in _basket
    basketcache[basketcachekey] = basketdata
  File "/home/ddavis/.pyenv/versions/3.7.4/lib/python3.7/site-packages/uproot/cache.py", line 67, in __setitem__
    self._cache[where] = what
  File "/home/ddavis/.pyenv/versions/3.7.4/lib/python3.7/site-packages/cachetools/lru.py", line 21, in __setitem__
    cache_setitem(self, key, value)
  File "/home/ddavis/.pyenv/versions/3.7.4/lib/python3.7/site-packages/cachetools/cache.py", line 49, in __setitem__
    raise ValueError('value too large')
ValueError: value too large

Lazy arrays in uproot (which is what uproot.daskframe is) have a default basketcache of 1 MB in case you reread the same arrays. I should rethink that policy because the default behavior of the VirtualArrays that lazily load data is to hold onto it indefinitely—unless you specify a more limited cache for the finalized arrays, you'll never be going back to the baskets from which they are constructed.

The "value too large" error comes from cachetools when putting a single item into a cache would exceed the cache's limit. In principle, this is telling us that your ROOT file has at least one basket that is larger than 1 MB, which is unusual in my experience, but not impossible.
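The failure mode is easy to reproduce without uproot. The sketch below uses a hypothetical stand-in class, SizeLimitedCache, that imitates the check described above (it is not the cachetools implementation): an item whose size alone exceeds the cache limit is rejected no matter how empty the cache is.

```python
from collections import OrderedDict


class SizeLimitedCache:
    """Hypothetical stand-in for a size-limited cache (not cachetools itself)."""

    def __init__(self, maxsize, getsizeof=len):
        self.maxsize = maxsize
        self.getsizeof = getsizeof
        self._data = OrderedDict()
        self.currsize = 0

    def __setitem__(self, key, value):
        size = self.getsizeof(value)
        # A single item larger than the whole cache can never fit,
        # so evicting other entries would not help.
        if size > self.maxsize:
            raise ValueError("value too large")
        if key in self._data:
            self.currsize -= self.getsizeof(self._data.pop(key))
        # Evict the oldest entries until the new item fits.
        while self.currsize + size > self.maxsize:
            _, old = self._data.popitem(last=False)
            self.currsize -= self.getsizeof(old)
        self._data[key] = value
        self.currsize += size

    def __len__(self):
        return len(self._data)


ONE_MB = 1024 * 1024
cache = SizeLimitedCache(ONE_MB)
cache["small basket"] = bytes(512 * 1024)      # fits under the 1 MB limit
try:
    cache["big basket"] = bytes(2 * ONE_MB)    # larger than the limit itself
    error_message = None
except ValueError as err:
    error_message = str(err)
```

Under these assumptions, the small item is stored and the oversized one raises the same "value too large" message seen in the traceback.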

Here's something to try: set basketcache=uproot.ArrayCache("1 GB") to give the basketcache a 1 GB limit, or basketcache={} to let it hold everything indefinitely, or maybe better basketcache=cachetools.LRUCache(1) to just hold one item, one basket, independent of its size.

I'm having trouble thinking of a good reason why there should be any default basketcache at all, considering the default behavior of VirtualArray to hold what it has read indefinitely. The only times you'd ever want a basketcache is if you've explicitly limited VirtualArray's cache through the cache parameter, which would be a non-default case.

If your file does not have > 1 MB baskets, then this could be related to scikit-hep/uproot3#317.

Thanks for the pointers. The "value too large" crash goes away with either uproot.ArrayCache("1 GB") or cachetools.LRUCache(1) as my basketcache argument (I tried both), but now Dask's meta_from_array appears to break because the input argument has no ndim.

In [16]: cache = cachetools.LRUCache(1)

In [17]: uproot.daskframe(files, tree_name, basketcache=cache, branches=["pT_lep1", "pT_lep2"])
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-17-f2190a4d84a2> in <module>
----> 1 uproot.daskframe(files, tree_name, basketcache=cache, branches=["pT_lep1", "pT_lep2"])

~/.pyenv/versions/3.7.4/lib/python3.7/site-packages/uproot/tree.py in daskframe(path, treepath, branches, namedecode, entrysteps, flatten, awkwardlib, cache, basketcache, keycache, executor, localsource, xrootdsource, httpsource, **options)
   1983         x = out[n]
   1984         if len(x.shape) == 1:
-> 1985             array = dask.array.from_array(x, x.shape, fancy=True)
   1986             series.append(dask.dataframe.from_dask_array(array, columns=n))
   1987         else:

~/.pyenv/versions/3.7.4/lib/python3.7/site-packages/dask/array/core.py in from_array(x, chunks, name, lock, asarray, fancy, getitem, meta)
   2698         meta = x
   2699
-> 2700     return Array(dsk, name, chunks, meta=meta, dtype=getattr(x, "dtype", None))
   2701
   2702

~/.pyenv/versions/3.7.4/lib/python3.7/site-packages/dask/array/core.py in __new__(cls, dask, name, chunks, dtype, meta, shape)
   1007         self.dask = dask
   1008         self.name = name
-> 1009         meta = meta_from_array(meta, dtype=dtype)
   1010
   1011         if (

~/.pyenv/versions/3.7.4/lib/python3.7/site-packages/dask/array/utils.py in meta_from_array(x, ndim, dtype)
     84
     85     if ndim is None:
---> 86         ndim = x.ndim
     87
     88     try:

~/.pyenv/versions/3.7.4/lib/python3.7/site-packages/awkward/array/base.py in __getattr__(self, where)
    245                     raise AttributeError("while trying to get column {0}, an exception occurred:\n{1}: {2}".format(repr(where), type(err), str(err)))
    246             else:
--> 247                 raise AttributeError("no column named {0}".format(repr(where)))
    248
    249     def __dir__(self):

AttributeError: no column named 'ndim'

I guess that means that you do, indeed, have baskets larger than 1 MB.

My first action item is to remove the default basketcache from all lazy arrays because I don't see how it does any good, and it can cause this hard-to-understand issue for large baskets.

The second action item is to add an ndim property to all awkward-arrays. It should be len(self.shape).
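To see why Dask's x.ndim lookup ended in "no column named 'ndim'" and how a property avoids it, here is a minimal sketch. MiniArray and WithNdim are hypothetical stand-ins, not awkward classes; they only imitate the pattern visible in the traceback, where __getattr__ treats any unknown attribute as a column name.

```python
class MiniArray:
    """Hypothetical stand-in that resolves unknown attributes as columns."""

    def __init__(self, shape, columns=()):
        self.shape = shape
        self.columns = list(columns)

    def __getattr__(self, where):
        # Only called when normal attribute lookup fails.
        if where in self.columns:
            return "column " + where
        raise AttributeError("no column named {0}".format(repr(where)))


class WithNdim(MiniArray):
    @property
    def ndim(self):
        # Number of dimensions, as proposed above.
        return len(self.shape)


plain = MiniArray(shape=(5,))
fixed = WithNdim(shape=(5,))

try:
    plain.ndim
    message = None
except AttributeError as err:
    message = str(err)
```

Because a property is found by normal attribute lookup, __getattr__ is never consulted for fixed.ndim. The same fallback is why hasattr(x, "__array_function__") in the first traceback ended up evaluating the columns.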

I went ahead and added a patch to awkward/array/base.py introducing an ndim property. Now it looks like self.columns is None (crash below). I couldn't figure out why that might be the case from searching the code for uses of columns.

In [11]: uproot.daskframe(files, tree_name, basketcache=cache)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-11-3beb08962cc8> in <module>
----> 1 uproot.daskframe(files, tree_name, basketcache=cache)

~/.pyenv/versions/3.7.4/lib/python3.7/site-packages/uproot/tree.py in daskframe(path, treepath, branches, namedecode, entrysteps, flatten, awkwardlib, cache, basketcache, keycache, executor, localsource, xrootdsource, httpsource, **options)
   1983         x = out[n]
   1984         if len(x.shape) == 1:
-> 1985             array = dask.array.from_array(x, x.shape, fancy=True)
   1986             series.append(dask.dataframe.from_dask_array(array, columns=n))
   1987         else:

~/.pyenv/versions/3.7.4/lib/python3.7/site-packages/dask/array/core.py in from_array(x, chunks, name, lock, asarray, fancy, getitem, meta)
   2698         meta = x
   2699
-> 2700     return Array(dsk, name, chunks, meta=meta, dtype=getattr(x, "dtype", None))
   2701
   2702

~/.pyenv/versions/3.7.4/lib/python3.7/site-packages/dask/array/core.py in __new__(cls, dask, name, chunks, dtype, meta, shape)
   1022             raise ValueError(CHUNKS_NONE_ERROR_MESSAGE)
   1023
-> 1024         self._meta = meta_from_array(meta, ndim=self.ndim, dtype=dtype)
   1025
   1026         for plugin in config.get("array_plugins", ()):

~/.pyenv/versions/3.7.4/lib/python3.7/site-packages/dask/array/utils.py in meta_from_array(x, ndim, dtype)
     49     # If using x._meta, x must be a Dask Array, some libraries (e.g. zarr)
     50     # implement a _meta attribute that are incompatible with Dask Array._meta
---> 51     if hasattr(x, "_meta") and isinstance(x, Array):
     52         x = x._meta
     53

~/.pyenv/versions/3.7.4/lib/python3.7/site-packages/awkward/array/base.py in __getattr__(self, where)
    243             return super(AwkwardArray, self).__getattribute__(where)
    244         else:
--> 245             if where in self.columns:
    246                 try:
    247                     return self[where]

TypeError: argument of type 'NoneType' is not iterable

I did a quick search of all the places where it might be.

  • if ChunkedArray._util_columns encounters a case with zero self._chunks (weird; I don't remember if that's allowed), then it would implicitly return None (which then goes into defining ChunkedArray.columns).

That's the only case I found. A DaskFrame is made of lazy arrays, and a lazy array is a ChunkedArray of VirtualArrays, so this apparent mistake is probably it.
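The implicit None is ordinary Python behaviour: a function whose only return sits inside a loop falls off the end and returns None when the loop never reaches it. A minimal sketch, with first_nonempty_columns as a hypothetical stand-in for the column lookup and plain lists in place of chunks:

```python
def first_nonempty_columns(chunks):
    """Hypothetical stand-in: return the first non-empty chunk as a list."""
    for chunk in chunks:
        if len(chunk) > 0:
            return list(chunk)
    # No explicit return here, so Python returns None.


no_chunks = first_nonempty_columns([])
only_empty = first_nonempty_columns([[], []])

try:
    "ndim" in no_chunks      # same kind of membership test as in __getattr__
    message = None
except TypeError as err:
    message = str(err)
```

Note that a list of only empty chunks falls through in the same way as an empty list of chunks.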

Some print debugging in ChunkedArray._util_columns shows that a chunked array that prints as empty still registers a len equal to 1.

diff --git a/awkward/array/chunked.py b/awkward/array/chunked.py
index 95b811b..1b33ff8 100644
--- a/awkward/array/chunked.py
+++ b/awkward/array/chunked.py
@@ -722,6 +722,10 @@ class ChunkedArray(awkward.array.base.AwkwardArray):
         if id(self) in seen:
             return []
         seen.add(id(self))
+        print(self._chunks)
+        print(len(self._chunks))
+        if len(self._chunks) == 0:
+            return []
         for chunkid in range(len(self._chunks)):
             self.knowchunksizes(chunkid + 1)
             if self._chunksizes[chunkid] > 0:

Giving me this output:

[<VirtualArray [284500 284500 284500 ... 284500 284500 284500] at 0x7f79f6ba09d0>, <VirtualArray [310000 310000 310000 ... 310000 310000 310000] at 0x7f79f6905d50>, <VirtualArray [300000 300000 300000 ... 300000 300000 300000] at 0x7f79f6905dd0>]
3
[<VirtualArray [284500 284500 284500 ... 284500 284500 284500] at 0x7f788136b5d0>]
1
[<VirtualArray [284500 284500 284500 ... 284500 284500 284500] at 0x7f79f6ba09d0>, <VirtualArray [310000 310000 310000 ... 310000 310000 310000] at 0x7f79f6905d50>, <VirtualArray [300000 300000 300000 ... 300000 300000 300000] at 0x7f79f6905dd0>]
3
[<VirtualArray [284500 284500 284500 ... 284500 284500 284500] at 0x7f788136b5d0>]
1
[<VirtualArray [284500 284500 284500 ... 284500 284500 284500] at 0x7f79f6ba09d0>, <VirtualArray [310000 310000 310000 ... 310000 310000 310000] at 0x7f79f6905d50>, <VirtualArray [300000 300000 300000 ... 300000 300000 300000] at 0x7f79f6905dd0>]
3
[<VirtualArray [284500 284500 284500 ... 284500 284500 284500] at 0x7f788136b5d0>]
1
[<ChunkedArray [] at 0x7f7b6e3abb10>]
1

before crashing again. Looks like ChunkedArray.__len__ will always be at least one from offsets.

    def __len__(self):
        self.knowchunksizes()
        return self.offsets[-1]

So then I went ahead with this patch:

diff --git a/awkward/array/chunked.py b/awkward/array/chunked.py
index 95b811b..f195092 100644
--- a/awkward/array/chunked.py
+++ b/awkward/array/chunked.py
@@ -726,6 +726,8 @@ class ChunkedArray(awkward.array.base.AwkwardArray):
             self.knowchunksizes(chunkid + 1)
             if self._chunksizes[chunkid] > 0:
                 return self._util_columns_descend(self._chunks[chunkid], seen)
+            else:
+                return []

     def _util_rowname(self, seen):
         if id(self) in seen:

and the crash is gone. If you think this makes sense, I'll open a PR from my branch.

I found some tests to confirm that empty chunks are legal. I think you're getting them because some files or some clusters are empty.

The offsets of a completely empty ChunkedArray (i.e. no chunks at all) is numpy.array([0]) because the length of the offsets array is always one more than the length of what it's counting and it always starts with zero. So the __len__ of a ChunkedArray can be zero; that's what you'd get from offsets[-1] of a completely empty ChunkedArray. (Incidentally, if the ChunkedArray were full of empty chunks, then offsets would be numpy.array([0, 0, ..., 0]), so the length of such a thing would also be zero. There are multiple ways to get a ChunkedArray of length zero.)
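The offsets bookkeeping can be sketched in a few lines. This uses plain Python lists rather than NumPy arrays, and offsets_from_sizes is a hypothetical helper name, not an awkward function:

```python
from itertools import accumulate


def offsets_from_sizes(chunksizes):
    """Hypothetical helper: cumulative offsets, always starting at zero."""
    return [0] + list(accumulate(chunksizes))


no_chunks = offsets_from_sizes([])             # no chunks at all
empty_chunks = offsets_from_sizes([0, 0, 0])   # three empty chunks
mixed = offsets_from_sizes([0, 3, 0, 2])       # empty and non-empty chunks

# The total length is the last offset in every case.
lengths = [offsets[-1] for offsets in (no_chunks, empty_chunks, mixed)]
```

Both zero-length cases give a last offset of zero even though the number of chunks differs.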

I like your fix except that I think it should be indented one less than it is. The for loop over chunkid in line 725 is to look for the first non-empty chunk so that it can return the columns of that chunk. You want it to continue looking for a non-empty chunk until it finds one. So do this:

    def _util_columns(self, seen):
        if id(self) in seen:
            return []
        seen.add(id(self))
        for chunkid in range(len(self._chunks)):
            self.knowchunksizes(chunkid + 1)
            if self._chunksizes[chunkid] > 0:
                return self._util_columns_descend(self._chunks[chunkid], seen)
        return []

(or equivalently, give the for loop an else clause, but I think this is clearer).
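To make the indentation point concrete, the sketch below compares the three variants on plain lists. The function names are hypothetical stand-ins, with a list of lists in place of the chunks:

```python
def columns_else_on_if(chunks):
    # The first patch: the else belongs to the if, so the search stops
    # at the first chunk, empty or not.
    for chunk in chunks:
        if len(chunk) > 0:
            return list(chunk)
        else:
            return []


def columns_return_after_loop(chunks):
    # Suggested form: keep looking, fall back to [] only after the loop.
    for chunk in chunks:
        if len(chunk) > 0:
            return list(chunk)
    return []


def columns_for_else(chunks):
    # Equivalent: the else clause of a for loop runs when the loop
    # finishes without leaving early.
    for chunk in chunks:
        if len(chunk) > 0:
            return list(chunk)
    else:
        return []


chunks = [[], ["pT_lep1", "pT_lep2"]]
```

With a leading empty chunk, the first variant reports no columns while the other two find them. The first variant also still returns None when there are no chunks at all.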

With that one change and the ndim property on awkward.array.base.AwkwardArray, yes, please do open a pull request, and I'll merge it. It would be even better if you increase the version number to 0.12.7 so I can deploy it with the new version.

Thank you!

Great! PR made with the adjusted empty-list return.