uqfoundation/dill

Inconsistent restoration of types and class overwriting side effects

Opened this issue · 5 comments

This is not strictly a bug, but may not be the expected behavior. There are 3 possibilities when unpickling a type object (here named "atype"):

  1. There is no class named atype.__name__ in the module atype.__module__*;
  2. There is a class with the same name and it is different from atype.
  3. There is a class with the same name and it is identical to atype.

(*) or the module can't be found, or the type object has no __module__ attribute, like enums and namedtuples can be.

The first and second cases should be straightforward to deal with by just creating or overwriting the class, respectively. In the third case, I think it's open for debate which solutions make more sense in different situations. The issue is that dill currently doesn't have a single orientation in this regard, with mixed outcomes.

The example below is based on the last commit from #450 and includes an Enum, a namedtuple —and I don't know if dataclass is another special case—, a custom class that doesn't check the objects' classes when testing for equality and a custom class that does it.

class_demo.py

from enum import Enum
Status = Enum('Status', 'OK ERROR', module=__name__)

from collections import namedtuple
Point = namedtuple('Point', 'x y', module=__name__)

class Triangle(frozenset):
    def __new__(cls, a, b, c):
        return super().__new__(cls, (a, b, c))

from weakref import WeakSet
class TypedWeakSet(WeakSet):
    def __init__(self, type, data=None):
        self.type = type
        self._error_msg = "items must be of type %r" % self.type
        if data is not None and not all(isinstance(x, self.type) for x in data):
            raise ValueError(self._error_msg)
        super().__init__(data)
    def add(self, item):
        if not isinstance(item, self.type):
            raise ValueError(self._error_msg)
        super().add(item)

The code below shows what happens when comparing "identical" objects created before and after the state of the module containing its classes is saved and restored.

import dill, class_demo as mod

ok1 = mod.Status.OK
p11, p12, p13 = mod.Point(0,0), mod.Point(4,0), mod.Point(0,3)
t1 = mod.Triangle(p11, p12, p13)
s1 = mod.TypedWeakSet(mod.Triangle)
s1.add(t1)

dill.dump_session(main=mod)
dill.load_session(main=mod)

ok2 = mod.Status.OK
p21, p22, p23 = mod.Point(0,0), mod.Point(4,0), mod.Point(0,3)
t2 = mod.Triangle(p21, p22, p23)
s2 = mod.TypedWeakSet(mod.Triangle)
s2.add(t2)

print(f"{type(ok1).__name__:12}  {(ok1 == ok2) = !s:5};  {type(ok1) is type(ok2) = }")
print(f"{type(p11).__name__:12}  {(p11 == p21) = !s:5};  {type(p11) is type(p21) = }")
print(f"{type(t1).__name__:12}  {(t1 == t2)   = !s:5};  {type(t1)  is type(t2)  = }")
print(f"{type(s1).__name__:12}  {(s1 == s2)   = !s:5};  {type(s1)  is type(s2)  = }")

Output:

Status        (ok1 == ok2) = False;  type(ok1) is type(ok2) = False
Point         (p11 == p21) = True ;  type(p11) is type(p21) = True
Triangle      (t1 == t2)   = True ;  type(t1)  is type(t2)  = False
TypedWeakSet  (s1 == s2)   = False;  type(s1)  is type(s2)  = False

Therefore, as of today dill always prefer the existing namedtuple type definition, even if the one being unpickled is different. On the other hand, it always prefer the unpickled type object in all other cases, overwriting the existing one even if they are identical (in the sense that they would generate identical pickle streams), which may or may not affect operations between instances of different versions of the same class.


I would prefer not to think about this kind of stuff, but I'm unable to concentrate in anything else before I write it down and share with you...

The type(p11) is type(p21) = True is most certainly a bug, unless namedtuple does something strange that I am unaware of at this time. The current semantics are that, upon unpickling, all main classes and functions are reconstructed by default. This awkwardness related to object equality is the reason that the byref setting exists.

Perhaps some special care needs to be taken to handle byref during session loading. Didn't look into the details.

Yeah, I have to agree. The namedtuple behavior is not what should be expected. In terms of settings, byref (and most other settings) is intended to be active during the dump, with only ignore being active during the load. The ignore settings provides somewhat related behavior:

    If *ignore=False* then objects whose class is defined in the module
    *__main__* are updated to reference the existing class in *__main__*,
    otherwise they are left to refer to the reconstructed type, which may
    be different.

So, there's potential in using the settings to control the desired load behavior, if alternate load behavior is desirable.

The type(p11) is type(p21) = True is most certainly a bug, unless namedtuple does something strange that I am unaware of at this time.

It's here:

dill/dill/_dill.py

Lines 1238 to 1241 in 13e3f80

def _create_namedtuple(name, fieldnames, modulename, defaults=None):
class_ = _import_module(modulename + '.' + name, safe=True)
if class_ is not None:
return class_

Perhaps some special care needs to be taken to handle byref during session loading. Didn't look into the details.

byref in session saving doesn't apply to objects only present in the module being saved (and not present in other modules of sys.modules). By the way, I just realized that maybe the module __main__ should be excluded from the modmap when saving other modules...

I guess that named tuple is fine then. What should have happened is that the check to see if the named tuple was accessible should have been done in save_type and not here. Now namedtuple pickling is a little inconsistent, but changing it would break compatibility. Maybe we should put the check in both places, so older pickles still load but new ones are consistent?

Nope. It still seems that byref chooses which types in the main module should be pickled and which ones should be referenced as a global. @mmckerns, should this be extended to functions too?

if not _byref and not obj_recursive and incorrectly_named: # not a function, but the name was held over

@leogama: A while ago, I posed the same question, and I think the answer is that it is probably a useful thing to do. See: #90, #102, and #128. We should think about changing namedtuple pickling to be consistent, as long as the old pickled nametuples can get loaded. dill has only been able to handle a nametuple for a short time, and it's a behavioral change, but not a breaking change.