Pickling class instance in dict fails unless you try pickling the object directly first
Closed this issue · 10 comments
I've found a very bizarre issue where, if I try to pickle an object (with various instance attributes) in a dictionary using dill, it fails on the first attempt. When un-pickling the object, it appears none of the instance attributes have been pickled along with the object. No matter how many times I re-run that cell (in a jupyter notebook), it fails. If I then try to pickle the same object directly, it works, and a subsequent attempt at pickling it in a dictionary now works too.
I've attached reproducible code below, and you can run it in this Colab notebook.
Note, I am the author of gumbi, so please feel free to let me know if this is some issue with gumbi itself!
Imports and setup
!pip install gumbi
import dill
import numpy as np
import gumbi as gmb
stdzr = gmb.Standardizer(d={'μ': -0.307, 'σ': 0.158}, log_vars=['d'])
pa = gmb.parray(d=np.arange(5,10)/10, stdzr=stdzr)
pa
# ('d',): [(0.5,) (0.6,) (0.7,) (0.8,) (0.9,)]
First attempt
Saving the object in a dictionary fails, the instance attributes seemingly don't get pickled:
with open('test.pkl', 'wb') as f:
dill.dump({'pa': pa}, f)
with open('test.pkl', 'rb') as f:
loaded = dill.load(f)
loaded['pa']
# AttributeError: 'ParameterArray' object has no attribute 'names'
A diversion
Saving the object on its own works:
with open('test.pkl', 'wb') as f:
dill.dump(pa, f)
with open('test.pkl', 'rb') as f:
loaded = dill.load(f)
loaded
# ('d',): [(0.5,) (0.6,) (0.7,) (0.8,) (0.9,)]
Second attempt
And now saving the object in a dictionary works:
with open('test.pkl', 'wb') as f:
dill.dump({'pa': pa}, f)
with open('test.pkl', 'rb') as f:
loaded = dill.load(f)
loaded['pa']
# ('d',): [(0.5,) (0.6,) (0.7,) (0.8,) (0.9,)]
Note that running the "diversion" before ever running the dictionary pickle works too!
Generally this shouldn't be the case. However, after a quick look at your code, I see that a ParameterArray
is (1) a subclass of a numpy.ndarray
, (2) uses __new__
, and (3) adds the names
attribute a bit more dynamically, with __array_finalize__
instead of during the usual __init__
call. All of these things can potentially cause issues for dill
during serialization.
Just as an order of business, can you tell me which version of dill
,python
, and numpy
you are seeing this with? Is it only seen when in a jupyter
notebook? Does it occur only with dump/load
, as opposed to dumps/loads
? Also, I'd like to see if what you are experiencing can be reproduced using a dummy child class of a numpy.ndarray
...
For example, how about the following? (this is also a slightly different case, as it's a class defined in __main__
as opposed to in a file).
Python 3.10.6 (main, Aug 3 2022, 04:36:46) [Clang 10.0.1 (clang-1001.0.46.4)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import dill
>>> import numpy as np
>>> dill.__version__
'0.3.6.dev0'
>>> np.__version__
'1.23.1'
>>> class Foo(np.ndarray):
... def __array_finalize__(self, *args, **kwds):
... self.names = 'foo!'
...
>>> f = Foo([1,2,3])
>>> d = dict(f=f)
>>> g = dill.copy(d)
>>> g['f'].names
'foo!'
>>> g = dill.copy(f)
>>> g.names
'foo!'
NOTE: the above also seems to work in a jupyter
notebook.
Also, can you show the traceback for your failure and success cases when you turn on pickle tracing (i.e. dill.detect.trace(True)
before calling dump
)?
Looks like I'm using slightly older dill and numpy versions than you, but your Foo
example works for me as well (see updated Colab), so it's not quite the same problem. Versions and tracebacks below:
Versions:
dill: 0.3.5.1
numpy: 1.21.6
gumbi: 0.1.11
Full traceback:
First attempt:
D2: <dict object at 0x7fee7f2c4690>
INFO:dill:D2: <dict object at 0x7fee7f2c4690>
T4: <class 'gumbi.arrays.ParameterArray'>
INFO:dill:T4: <class 'gumbi.arrays.ParameterArray'>
# T4
INFO:dill:# T4
T4: <class 'numpy.dtype'>
INFO:dill:T4: <class 'numpy.dtype'>
# T4
INFO:dill:# T4
D2: <dict object at 0x7fee7f2c3f50>
INFO:dill:D2: <dict object at 0x7fee7f2c3f50>
# D2
INFO:dill:# D2
# D2
INFO:dill:# D2
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
/usr/local/lib/python3.7/dist-packages/IPython/core/formatters.py in __call__(self, obj)
697 type_pprinters=self.type_printers,
698 deferred_pprinters=self.deferred_printers)
--> 699 printer.pretty(obj)
700 printer.flush()
701 return stream.getvalue()
3 frames
/usr/local/lib/python3.7/dist-packages/gumbi/arrays.py in __repr__(self)
245
246 def __repr__(self):
--> 247 return f'{tuple(self.names)}: {np.asarray(self)}'
248
249 def get(self, name, default=None):
AttributeError: 'ParameterArray' object has no attribute 'names'
A diversion
Nu: ((5,), [('d', '<f8')])
INFO:dill:Nu: ((5,), [('d', '<f8')])
F2: <function _create_array at 0x7fee9b9dc440>
INFO:dill:F2: <function _create_array at 0x7fee9b9dc440>
# F2
INFO:dill:# F2
T4: <class 'gumbi.arrays.ParameterArray'>
INFO:dill:T4: <class 'gumbi.arrays.ParameterArray'>
# T4
INFO:dill:# T4
T4: <class 'numpy.dtype'>
INFO:dill:T4: <class 'numpy.dtype'>
# T4
INFO:dill:# T4
D2: <dict object at 0x7fee7f2c3f50>
INFO:dill:D2: <dict object at 0x7fee7f2c3f50>
# D2
INFO:dill:# D2
D2: <dict object at 0x7fee7f2c3fa0>
INFO:dill:D2: <dict object at 0x7fee7f2c3fa0>
T4: <class 'gumbi.aggregation.Standardizer'>
INFO:dill:T4: <class 'gumbi.aggregation.Standardizer'>
# T4
INFO:dill:# T4
D2: <dict object at 0x7fee7f2c3c80>
INFO:dill:D2: <dict object at 0x7fee7f2c3c80>
# D2
INFO:dill:# D2
D2: <dict object at 0x7fee7f2c3e10>
INFO:dill:D2: <dict object at 0x7fee7f2c3e10>
D2: <dict object at 0x7fee7f2c3f00>
INFO:dill:D2: <dict object at 0x7fee7f2c3f00>
# D2
INFO:dill:# D2
# D2
INFO:dill:# D2
# D2
INFO:dill:# D2
# Nu
INFO:dill:# Nu
('d',): [(0.5,) (0.6,) (0.7,) (0.8,) (0.9,)]
Second attempt:
D2: <dict object at 0x7fee7f2ac0f0>
INFO:dill:D2: <dict object at 0x7fee7f2ac0f0>
Nu: ((5,), [('d', '<f8')])
INFO:dill:Nu: ((5,), [('d', '<f8')])
F2: <function _create_array at 0x7fee9b9dc440>
INFO:dill:F2: <function _create_array at 0x7fee9b9dc440>
# F2
INFO:dill:# F2
T4: <class 'gumbi.arrays.ParameterArray'>
INFO:dill:T4: <class 'gumbi.arrays.ParameterArray'>
# T4
INFO:dill:# T4
T4: <class 'numpy.dtype'>
INFO:dill:T4: <class 'numpy.dtype'>
# T4
INFO:dill:# T4
D2: <dict object at 0x7fee7f2c3f50>
INFO:dill:D2: <dict object at 0x7fee7f2c3f50>
# D2
INFO:dill:# D2
D2: <dict object at 0x7fee7f2c3fa0>
INFO:dill:D2: <dict object at 0x7fee7f2c3fa0>
T4: <class 'gumbi.aggregation.Standardizer'>
INFO:dill:T4: <class 'gumbi.aggregation.Standardizer'>
# T4
INFO:dill:# T4
D2: <dict object at 0x7fee7f2c3c80>
INFO:dill:D2: <dict object at 0x7fee7f2c3c80>
# D2
INFO:dill:# D2
D2: <dict object at 0x7fee7f2c3e10>
INFO:dill:D2: <dict object at 0x7fee7f2c3e10>
D2: <dict object at 0x7fee7f2c3f00>
INFO:dill:D2: <dict object at 0x7fee7f2c3f00>
# D2
INFO:dill:# D2
# D2
INFO:dill:# D2
# D2
INFO:dill:# D2
# Nu
INFO:dill:# Nu
# D2
INFO:dill:# D2
('d',): [(0.5,) (0.6,) (0.7,) (0.8,) (0.9,)]
@leogama: (unrelated to the rest of the issue) I noticed in the above traceback that there are duplicate entries within the pickle logging trace. Have you seen this behavior before? I haven't seen it previously.
@JohnGoertz: you can see from the traceback that there's a key difference in the pickling trace in the first attempt, and afterward.
The 'correct' trace identities a numpy
subclass instance ('Nu: (obj.shape, obj.dtype)'), and stores the _create_array
function:
D2: <dict object at 0x7fee7f2ac0f0>
Nu: ((5,), [('d', '<f8')])
F2: <function _create_array at 0x7fee9b9dc440>
# F2
T4: <class 'gumbi.arrays.ParameterArray'>
# T4
T4: <class 'numpy.dtype'>
# T4
...
while the 'incorrect' trace misses the same identification, and just goes directly to the ParameterArray
:
D2: <dict object at 0x7fee7f2c4690>
T4: <class 'gumbi.arrays.ParameterArray'>
# T4
T4: <class 'numpy.dtype'>
# T4
...
So, I am going to assume this is a bug in identifying subclasses of numpy
arrays, as dill
has some "special handling" for that (i.e. it's slightly hackish compared to the handling on other python subclasses). [NOTE: I've deleted the 'INFO' lines from my snippets of your traces, for clarity] My first suspect is this function ndarraysubclassinstance(obj)
.
@leogama: (unrelated to the rest of the issue) I noticed in the above traceback that there are duplicate entries within the pickle logging trace. Have you seen this behavior before? I haven't seen it previously.
This happens when multiple logging.Handler
s are handling the logs to stdout/stderr. The OP likely has added a custom StreamHandler
to the root logger (logging.root
) and the tracing messages from dill are propagating ("bubbling") up to it. I've been thinking of using a separate logger, disconnected from the logging manager, for the pickling trace, and using the logging.getLogger('dill')
only for legitimate messages to the user, like reporting which objects were saved by reference because they failed pickling with the refonfail
option. But to fix this by now it's sufficient to set logger.propagate = False
.
I see, interesting. It makes sense that it could fail that way, I think the weirder thing is that the "diversion" fixes it downstream!
That's probably because, currently, the custom save functions for numpy
classes are only registered if the object is pickled directly with the Pickler.dump()
method, which is called by the dump()
, dumps()
and copy()
functions, but not if the object is part of another object. Also, the methods __reduce_ex_()
and __reduce__()
from numpy.ndarray
(and from gumbi.arrays.*Array
for that matter) don't preserve the array's __dict__
state.
This explains why, only after you call dill.dump(pa, f)
, the then registered dill's save_numpy_array()
starts to be used for serializing other ParameterArray
objects. The relevant code in Piclker.dump()
should probably be moved to an overriding Pickler.save()
method, which is called for every object pickled.
@JohnGoertz: Try your code after updating to master (with the latest commit: bf0f4fa), which should now trigger the registering of the numpy
array subclass inside the dict
.
>>> import numpy as np
>>> import dill
>>>
>>> class Foo(np.ndarray):
... def __array_finalize__(self, *args, **kwds):
... self.names = 'foo!'
...
>>> f = Foo([1,2,3])
>>> dill.detect.trace(True)
>>> d = dict(f=f)
>>> g = dill.copy(d)
┬ D2: <dict object at 0x0112e5af00>
├┬ Nu: ((1, 2, 3), float64)
│├┬ F2: <function _create_array at 0x112da63b0>
││└ # F2 [28 B]
│├┬ T2: <class '__main__.Foo'>
││├┬ F2: <function _create_type at 0x112d9f5f0>
│││└ # F2 [27 B]
││├┬ T1: <class 'type'>
│││├┬ F2: <function _load_type at 0x112d9f4d0>
││││└ # F2 [25 B]
│││└ # T1 [42 B]
││├┬ T4: <class 'numpy.ndarray'>
│││└ # T4 [17 B]
││├┬ D2: <dict object at 0x01089fc5f0>
│││├┬ F1: <function Foo.__array_finalize__ at 0x112e5cdd0>
││││├┬ F2: <function _create_function at 0x112d9f560>
│││││└ # F2 [31 B]
││││├┬ Co: <code object __array_finalize__ at 0x1087e94b0, file "<stdin>", line 2>
│││││├┬ F2: <function _create_code at 0x112d9f680>
││││││└ # F2 [27 B]
│││││└ # Co [150 B]
││││├┬ D1: <dict object at 0x0108763c80>
│││││└ # D1 [22 B]
││││├┬ D2: <dict object at 0x0112c3a2d0>
│││││└ # D2 [3 B]
││││├┬ D2: <dict object at 0x0112e5a960>
│││││├┬ D2: <dict object at 0x0112e5a9b0>
││││││└ # D2 [3 B]
│││││└ # D2 [78 B]
││││└ # F1 [299 B]
│││└ # D2 [376 B]
││└ # T2 [482 B]
│├┬ Dt: <class 'numpy.dtype'>
││├┬ F2: <function _create_dtypemeta at 0x112da6440>
│││└ # F2 [32 B]
││└ # Dt [39 B]
│├┬ D2: <dict object at 0x0112c33c30>
││└ # D2 [8 B]
│└ # Nu [730 B]
└ # D2 [742 B]
>>> g['f'].names
'foo!'
That seems to have done the trick! It now pickles the object correctly within the dictionary on the first try! I should also note that prior to this commit, the same issue arose when the object was contained in a list, but that is fixed now too.
!pip install git+https://github.com/uqfoundation/dill.git@bf0f4fa48358fac0a9246a3f12575767d1d14b6f
!pip install gumbi
import dill
import numpy as np
import gumbi as gmb
print('dill: ', dill.__version__)
print('numpy: ', np.__version__)
print('gumbi: ', gmb.__version__)
dill.detect.trace(True)
# dill: 0.3.6.dev0
# numpy: 1.21.6
# gumbi: 0.1.11
stdzr = gmb.Standardizer(d={'μ': -0.307, 'σ': 0.158}, log_vars=['d'])
pa = gmb.parray(d=np.arange(5,10)/10, stdzr=stdzr)
pa
with open('test.pkl', 'wb') as f:
dill.dump({'pa': pa}, f)
with open('test.pkl', 'rb') as f:
loaded = dill.load(f)
loaded['pa']
Output:
┬ D2: <dict object at 0x7fa92bce5050>
├┬ Nu: ((5,), [('d', '<f8')])
│├┬ F2: <function _create_array at 0x7fa9483a9ef0>
││└ # F2 [28 B]
│├┬ T4: <class 'gumbi.arrays.ParameterArray'>
││└ # T4 [31 B]
│├┬ Dt: <class 'numpy.dtype'>
││├┬ F2: <function _create_dtypemeta at 0x7fa9483a9f80>
│││└ # F2 [32 B]
││└ # Dt [39 B]
│├┬ D2: <dict object at 0x7fa92bce95a0>
││└ # D2 [60 B]
│├┬ D2: <dict object at 0x7fa92bce95f0>
││├┬ T4: <class 'gumbi.aggregation.Standardizer'>
│││└ # T4 [34 B]
││├┬ D2: <dict object at 0x7fa92bce9320>
│││└ # D2 [42 B]
││├┬ D2: <dict object at 0x7fa92bce9460>
│││├┬ D2: <dict object at 0x7fa92bce9550>
││││├┬ Nu: <ufunc 'log'>
│││││└ # Nu [13 B]
││││├┬ Nu: <ufunc 'exp'>
│││││└ # Nu [13 B]
││││└ # D2 [37 B]
│││└ # D2 [103 B]
││└ # D2 [222 B]
│└ # Nu [544 B]
└ # D2 [557 B]
('d',): [(0.5,) (0.6,) (0.7,) (0.8,) (0.9,)]