dask-contrib/dask-histogram

Unserializable histogram generated if computed without fill call

Originally posted here: scikit-hep/hist#586
This issue can be reproduced entirely upstream, using only dask-histogram:

import pickle

import boost_histogram
import dask
import dask_histogram.boost
import numpy

h = dask_histogram.boost.Histogram(boost_histogram.axis.Regular(10, 0, 1))
# h.fill( # Toggle to include a fill call
#     dask.array.from_array(numpy.zeros(shape=(10))),
#     weight=dask.array.from_array(numpy.zeros(shape=(10))),
# )

o = dask.compute(h)  # dask.compute returns a tuple of results
with open("hist_dask_test.pkl", "wb") as f:  # close the file handle reliably
    pickle.dump(o, f)

Without a fill call, the resulting histogram will have a dangling self._dask field that breaks serialization.
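
To picture the failure mode in isolation, here is a minimal standard-library sketch. It does not use dask-histogram: `types.SimpleNamespace` is a hypothetical stand-in for the computed histogram, the helper `strip_dangling` is made up for illustration, and storing a lambda in `_dask` is only an assumption about the kind of content that plain pickle rejects.

```python
import pickle
import types

# Hypothetical stand-in for a computed histogram (not the real class).
result = types.SimpleNamespace(counts=[0] * 10)

# Assumption for illustration: the leftover attribute holds something the
# standard pickle module cannot serialize, such as a lambda.
result._dask = {"task": lambda: None}


def strip_dangling(obj, name="_dask"):
    """Remove a leftover attribute if it is present, then return the object."""
    if hasattr(obj, name):
        delattr(obj, name)
    return obj


# With the leftover attribute in place, plain pickle refuses the object.
try:
    pickle.dumps(result)
    pickled_with_attr = True
except (pickle.PicklingError, AttributeError, TypeError):
    pickled_with_attr = False

# Once the attribute is removed, the object round-trips normally.
restored = pickle.loads(pickle.dumps(strip_dangling(result)))
```

In this sketch the object becomes picklable as soon as the stray attribute is gone, which is why the leftover field alone is enough to break an otherwise plain in-memory result.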

Hm, the computed object shouldn't have any ties to dask at all.

Note that cloudpickle (and dask's serialize) is able to serialise this object.

The following fixes it, but I would like to figure out why the attribute is there in the first place:

--- a/src/dask_histogram/boost.py
+++ b/src/dask_histogram/boost.py
@@ -137,7 +137,12 @@ class Histogram(bh.Histogram, DaskMethodsMixin, family=dask_histogram):
         return self.dask_name

     def __dask_postcompute__(self) -> Any:
-        return lambda x: self._in_memory_type(first(x)), ()
+        def f(x):
+            out = self._in_memory_type(first(x))
+            if hasattr(out, "_dask"):
+                del out._dask
+            return out
+        return f, ()

     def __dask_postpersist__(self) -> Any:

I have not yet come up with any better solution, so if you would check what I put above, I would appreciate it.

Just confirmed that this indeed allows empty histograms to be pickled/unpickled as usual. Thanks!

OK, let's make that PR for now, as certainly it doesn't seem to have a downside.