Pickling class instance in dict fails unless you try pickling the object directly first

Question

Pickling class instance in dict fails unless you try pickling the object directly first

Closed this issue 2 years ago · 10 comments

I've found a very bizarre issue where, if I try to pickle an object (with various instance attributes) in a dictionary using dill, it fails on the first attempt. When un-pickling the object, it appears none of the instance attributes have been pickled along with the object. No matter how many times I re-run that cell (in a jupyter notebook), it fails. If I then try to pickle the same object directly, it works, and a subsequent attempt at pickling it in a dictionary now works too.

I've attached reproducible code below, and you can run it in this Colab notebook.

Note, I am the author of gumbi, so please feel free to let me know if this is some issue with gumbi itself!

Imports and setup

!pip install gumbi

import dill
import numpy as np
import gumbi as gmb

stdzr = gmb.Standardizer(d={'μ': -0.307, 'σ': 0.158}, log_vars=['d'])
pa = gmb.parray(d=np.arange(5,10)/10, stdzr=stdzr)
pa
# ('d',): [(0.5,) (0.6,) (0.7,) (0.8,) (0.9,)]

First attempt

Saving the object in a dictionary fails, the instance attributes seemingly don't get pickled:

with open('test.pkl', 'wb') as f:
    dill.dump({'pa': pa}, f)
    
with open('test.pkl', 'rb') as f:
    loaded = dill.load(f)
    
loaded['pa']
# AttributeError: 'ParameterArray' object has no attribute 'names'

A diversion

Saving the object on its own works:

with open('test.pkl', 'wb') as f:
    dill.dump(pa, f)
    
with open('test.pkl', 'rb') as f:
    loaded = dill.load(f)
    
loaded
# ('d',): [(0.5,) (0.6,) (0.7,) (0.8,) (0.9,)]

Second attempt

And now saving the object in a dictionary works:

with open('test.pkl', 'wb') as f:
    dill.dump({'pa': pa}, f)
    
with open('test.pkl', 'rb') as f:
    loaded = dill.load(f)
    
loaded['pa']
# ('d',): [(0.5,) (0.6,) (0.7,) (0.8,) (0.9,)]

Note that running the "diversion" before ever running the dictionary pickle works too!

Answer 1 · 2022-08-10T23:14:59.000Z

Generally this shouldn't be the case. However, after a quick look at your code, I see that a ParameterArray is (1) a subclass of a numpy.ndarray, (2) uses __new__, and (3) adds the names attribute a bit more dynamically, with __array_finalize__ instead of during the usual __init__ call. All of these things can potentially cause issues for dill during serialization.

Just as an order of business, can you tell me which version of dill,python, and numpy you are seeing this with? Is it only seen when in a jupyter notebook? Does it occur only with dump/load, as opposed to dumps/loads? Also, I'd like to see if what you are experiencing can be reproduced using a dummy child class of a numpy.ndarray...

Answer 2 · 2022-08-10T23:21:06.000Z

For example, how about the following? (this is also a slightly different case, as it's a class defined in __main__ as opposed to in a file).

Python 3.10.6 (main, Aug  3 2022, 04:36:46) [Clang 10.0.1 (clang-1001.0.46.4)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import dill
>>> import numpy as np
>>> dill.__version__
'0.3.6.dev0'
>>> np.__version__
'1.23.1'
>>> class Foo(np.ndarray):
...   def __array_finalize__(self, *args, **kwds):
...     self.names = 'foo!'
... 
>>> f = Foo([1,2,3])
>>> d = dict(f=f)
>>> g = dill.copy(d)
>>> g['f'].names
'foo!'
>>> g = dill.copy(f)
>>> g.names
'foo!'

NOTE: the above also seems to work in a jupyter notebook.

Also, can you show the traceback for your failure and success cases when you turn on pickle tracing (i.e. dill.detect.trace(True) before calling dump)?

Answer 3 · 2022-08-11T00:48:04.000Z

Looks like I'm using slightly older dill and numpy versions than you, but your Foo example works for me as well (see updated Colab), so it's not quite the same problem. Versions and tracebacks below:

Versions:

dill:  0.3.5.1
numpy:  1.21.6
gumbi:  0.1.11

Full traceback:

First attempt:

D2: <dict object at 0x7fee7f2c4690>
INFO:dill:D2: <dict object at 0x7fee7f2c4690>
T4: <class 'gumbi.arrays.ParameterArray'>
INFO:dill:T4: <class 'gumbi.arrays.ParameterArray'>
# T4
INFO:dill:# T4
T4: <class 'numpy.dtype'>
INFO:dill:T4: <class 'numpy.dtype'>
# T4
INFO:dill:# T4
D2: <dict object at 0x7fee7f2c3f50>
INFO:dill:D2: <dict object at 0x7fee7f2c3f50>
# D2
INFO:dill:# D2
# D2
INFO:dill:# D2

---------------------------------------------------------------------------

AttributeError                            Traceback (most recent call last)

/usr/local/lib/python3.7/dist-packages/IPython/core/formatters.py in __call__(self, obj)
    697                 type_pprinters=self.type_printers,
    698                 deferred_pprinters=self.deferred_printers)
--> 699             printer.pretty(obj)
    700             printer.flush()
    701             return stream.getvalue()

3 frames

/usr/local/lib/python3.7/dist-packages/gumbi/arrays.py in __repr__(self)
    245 
    246     def __repr__(self):
--> 247         return f'{tuple(self.names)}: {np.asarray(self)}'
    248 
    249     def get(self, name, default=None):

AttributeError: 'ParameterArray' object has no attribute 'names'

A diversion

Nu: ((5,), [('d', '<f8')])
INFO:dill:Nu: ((5,), [('d', '<f8')])
F2: <function _create_array at 0x7fee9b9dc440>
INFO:dill:F2: <function _create_array at 0x7fee9b9dc440>
# F2
INFO:dill:# F2
T4: <class 'gumbi.arrays.ParameterArray'>
INFO:dill:T4: <class 'gumbi.arrays.ParameterArray'>
# T4
INFO:dill:# T4
T4: <class 'numpy.dtype'>
INFO:dill:T4: <class 'numpy.dtype'>
# T4
INFO:dill:# T4
D2: <dict object at 0x7fee7f2c3f50>
INFO:dill:D2: <dict object at 0x7fee7f2c3f50>
# D2
INFO:dill:# D2
D2: <dict object at 0x7fee7f2c3fa0>
INFO:dill:D2: <dict object at 0x7fee7f2c3fa0>
T4: <class 'gumbi.aggregation.Standardizer'>
INFO:dill:T4: <class 'gumbi.aggregation.Standardizer'>
# T4
INFO:dill:# T4
D2: <dict object at 0x7fee7f2c3c80>
INFO:dill:D2: <dict object at 0x7fee7f2c3c80>
# D2
INFO:dill:# D2
D2: <dict object at 0x7fee7f2c3e10>
INFO:dill:D2: <dict object at 0x7fee7f2c3e10>
D2: <dict object at 0x7fee7f2c3f00>
INFO:dill:D2: <dict object at 0x7fee7f2c3f00>
# D2
INFO:dill:# D2
# D2
INFO:dill:# D2
# D2
INFO:dill:# D2
# Nu
INFO:dill:# Nu

('d',): [(0.5,) (0.6,) (0.7,) (0.8,) (0.9,)]

Second attempt:

D2: <dict object at 0x7fee7f2ac0f0>
INFO:dill:D2: <dict object at 0x7fee7f2ac0f0>
Nu: ((5,), [('d', '<f8')])
INFO:dill:Nu: ((5,), [('d', '<f8')])
F2: <function _create_array at 0x7fee9b9dc440>
INFO:dill:F2: <function _create_array at 0x7fee9b9dc440>
# F2
INFO:dill:# F2
T4: <class 'gumbi.arrays.ParameterArray'>
INFO:dill:T4: <class 'gumbi.arrays.ParameterArray'>
# T4
INFO:dill:# T4
T4: <class 'numpy.dtype'>
INFO:dill:T4: <class 'numpy.dtype'>
# T4
INFO:dill:# T4
D2: <dict object at 0x7fee7f2c3f50>
INFO:dill:D2: <dict object at 0x7fee7f2c3f50>
# D2
INFO:dill:# D2
D2: <dict object at 0x7fee7f2c3fa0>
INFO:dill:D2: <dict object at 0x7fee7f2c3fa0>
T4: <class 'gumbi.aggregation.Standardizer'>
INFO:dill:T4: <class 'gumbi.aggregation.Standardizer'>
# T4
INFO:dill:# T4
D2: <dict object at 0x7fee7f2c3c80>
INFO:dill:D2: <dict object at 0x7fee7f2c3c80>
# D2
INFO:dill:# D2
D2: <dict object at 0x7fee7f2c3e10>
INFO:dill:D2: <dict object at 0x7fee7f2c3e10>
D2: <dict object at 0x7fee7f2c3f00>
INFO:dill:D2: <dict object at 0x7fee7f2c3f00>
# D2
INFO:dill:# D2
# D2
INFO:dill:# D2
# D2
INFO:dill:# D2
# Nu
INFO:dill:# Nu
# D2
INFO:dill:# D2

('d',): [(0.5,) (0.6,) (0.7,) (0.8,) (0.9,)]

Answer 4 · 2022-08-11T11:00:54.000Z

@leogama: (unrelated to the rest of the issue) I noticed in the above traceback that there are duplicate entries within the pickle logging trace. Have you seen this behavior before? I haven't seen it previously.

Answer 5 · 2022-08-11T11:14:40.000Z

@JohnGoertz: you can see from the traceback that there's a key difference in the pickling trace in the first attempt, and afterward.

The 'correct' trace identities a numpy subclass instance ('Nu: (obj.shape, obj.dtype)'), and stores the _create_array function:

D2: <dict object at 0x7fee7f2ac0f0>
Nu: ((5,), [('d', '<f8')])
F2: <function _create_array at 0x7fee9b9dc440>
# F2
T4: <class 'gumbi.arrays.ParameterArray'>
# T4
T4: <class 'numpy.dtype'>
# T4
...

while the 'incorrect' trace misses the same identification, and just goes directly to the ParameterArray:

D2: <dict object at 0x7fee7f2c4690>
T4: <class 'gumbi.arrays.ParameterArray'>
# T4
T4: <class 'numpy.dtype'>
# T4
...

So, I am going to assume this is a bug in identifying subclasses of numpy arrays, as dill has some "special handling" for that (i.e. it's slightly hackish compared to the handling on other python subclasses). [NOTE: I've deleted the 'INFO' lines from my snippets of your traces, for clarity] My first suspect is this function ndarraysubclassinstance(obj).

Answer 6 · 2022-08-11T12:04:03.000Z

@leogama: (unrelated to the rest of the issue) I noticed in the above traceback that there are duplicate entries within the pickle logging trace. Have you seen this behavior before? I haven't seen it previously.

This happens when multiple logging.Handlers are handling the logs to stdout/stderr. The OP likely has added a custom StreamHandler to the root logger (logging.root) and the tracing messages from dill are propagating ("bubbling") up to it. I've been thinking of using a separate logger, disconnected from the logging manager, for the pickling trace, and using the logging.getLogger('dill') only for legitimate messages to the user, like reporting which objects were saved by reference because they failed pickling with the refonfail option. But to fix this by now it's sufficient to set logger.propagate = False.

Answer 7 · 2022-08-11T14:02:45.000Z

I see, interesting. It makes sense that it could fail that way, I think the weirder thing is that the "diversion" fixes it downstream!

Answer 8 · 2022-08-11T16:00:35.000Z

That's probably because, currently, the custom save functions for numpy classes are only registered if the object is pickled directly with the Pickler.dump() method, which is called by the dump(), dumps() and copy() functions, but not if the object is part of another object. Also, the methods __reduce_ex_() and __reduce__() from numpy.ndarray (and from gumbi.arrays.*Array for that matter) don't preserve the array's __dict__ state.

This explains why, only after you call dill.dump(pa, f), the then registered dill's save_numpy_array() starts to be used for serializing other ParameterArray objects. The relevant code in Piclker.dump() should probably be moved to an overriding Pickler.save() method, which is called for every object pickled.

Answer 9 · 2022-08-11T17:35:21.000Z

@JohnGoertz: Try your code after updating to master (with the latest commit: bf0f4fa), which should now trigger the registering of the numpy array subclass inside the dict.

>>> import numpy as np
>>> import dill 
>>> 
>>> class Foo(np.ndarray):
...   def __array_finalize__(self, *args, **kwds):
...     self.names = 'foo!'
... 
>>> f = Foo([1,2,3])
>>> dill.detect.trace(True)
>>> d = dict(f=f)
>>> g = dill.copy(d)
┬ D2: <dict object at 0x0112e5af00>
├┬ Nu: ((1, 2, 3), float64)
│├┬ F2: <function _create_array at 0x112da63b0>
││└ # F2 [28 B]
│├┬ T2: <class '__main__.Foo'>
││├┬ F2: <function _create_type at 0x112d9f5f0>
│││└ # F2 [27 B]
││├┬ T1: <class 'type'>
│││├┬ F2: <function _load_type at 0x112d9f4d0>
││││└ # F2 [25 B]
│││└ # T1 [42 B]
││├┬ T4: <class 'numpy.ndarray'>
│││└ # T4 [17 B]
││├┬ D2: <dict object at 0x01089fc5f0>
│││├┬ F1: <function Foo.__array_finalize__ at 0x112e5cdd0>
││││├┬ F2: <function _create_function at 0x112d9f560>
│││││└ # F2 [31 B]
││││├┬ Co: <code object __array_finalize__ at 0x1087e94b0, file "<stdin>", line 2>
│││││├┬ F2: <function _create_code at 0x112d9f680>
││││││└ # F2 [27 B]
│││││└ # Co [150 B]
││││├┬ D1: <dict object at 0x0108763c80>
│││││└ # D1 [22 B]
││││├┬ D2: <dict object at 0x0112c3a2d0>
│││││└ # D2 [3 B]
││││├┬ D2: <dict object at 0x0112e5a960>
│││││├┬ D2: <dict object at 0x0112e5a9b0>
││││││└ # D2 [3 B]
│││││└ # D2 [78 B]
││││└ # F1 [299 B]
│││└ # D2 [376 B]
││└ # T2 [482 B]
│├┬ Dt: <class 'numpy.dtype'>
││├┬ F2: <function _create_dtypemeta at 0x112da6440>
│││└ # F2 [32 B]
││└ # Dt [39 B]
│├┬ D2: <dict object at 0x0112c33c30>
││└ # D2 [8 B]
│└ # Nu [730 B]
└ # D2 [742 B]
>>> g['f'].names
'foo!'

Answer 10 · 2022-08-11T23:01:08.000Z

That seems to have done the trick! It now pickles the object correctly within the dictionary on the first try! I should also note that prior to this commit, the same issue arose when the object was contained in a list, but that is fixed now too.

!pip install git+https://github.com/uqfoundation/dill.git@bf0f4fa48358fac0a9246a3f12575767d1d14b6f
!pip install gumbi

import dill
import numpy as np
import gumbi as gmb

print('dill: ', dill.__version__) 
print('numpy: ', np.__version__) 
print('gumbi: ', gmb.__version__) 
dill.detect.trace(True)

# dill:  0.3.6.dev0
# numpy:  1.21.6
# gumbi:  0.1.11

stdzr = gmb.Standardizer(d={'μ': -0.307, 'σ': 0.158}, log_vars=['d'])
pa = gmb.parray(d=np.arange(5,10)/10, stdzr=stdzr)
pa

with open('test.pkl', 'wb') as f:
    dill.dump({'pa': pa}, f)
    
with open('test.pkl', 'rb') as f:
    loaded = dill.load(f)
    
loaded['pa']

Output:

┬ D2: <dict object at 0x7fa92bce5050>
├┬ Nu: ((5,), [('d', '<f8')])
│├┬ F2: <function _create_array at 0x7fa9483a9ef0>
││└ # F2 [28 B]
│├┬ T4: <class 'gumbi.arrays.ParameterArray'>
││└ # T4 [31 B]
│├┬ Dt: <class 'numpy.dtype'>
││├┬ F2: <function _create_dtypemeta at 0x7fa9483a9f80>
│││└ # F2 [32 B]
││└ # Dt [39 B]
│├┬ D2: <dict object at 0x7fa92bce95a0>
││└ # D2 [60 B]
│├┬ D2: <dict object at 0x7fa92bce95f0>
││├┬ T4: <class 'gumbi.aggregation.Standardizer'>
│││└ # T4 [34 B]
││├┬ D2: <dict object at 0x7fa92bce9320>
│││└ # D2 [42 B]
││├┬ D2: <dict object at 0x7fa92bce9460>
│││├┬ D2: <dict object at 0x7fa92bce9550>
││││├┬ Nu: <ufunc 'log'>
│││││└ # Nu [13 B]
││││├┬ Nu: <ufunc 'exp'>
│││││└ # Nu [13 B]
││││└ # D2 [37 B]
│││└ # D2 [103 B]
││└ # D2 [222 B]
│└ # Nu [544 B]
└ # D2 [557 B]

('d',): [(0.5,) (0.6,) (0.7,) (0.8,) (0.9,)]