uqfoundation/dill

Importing dill changes the python pickler behavior

Opened this issue · 8 comments

import io
import pickle
from collections import namedtuple

SomeTuple = namedtuple('SomeTuple', 'a b')

thing = SomeTuple(1, 2)

f = io.BytesIO()
pickle._Pickler(f).dump(thing)
before_dill = f.getvalue()

import dill

f = io.BytesIO()
pickle._Pickler(f).dump(thing)
after_dill = f.getvalue()

assert before_dill == after_dill

The following shows the issue pretty clearly.

In [1]: import pickle; len(pickle._Pickler.dispatch)
Out[1]: 15

In [2]: import dill; len(pickle._Pickler.dispatch)
Out[2]: 63

In [3]: 

Dill is somehow not actually taking a copy of the pickler dispatch table - despite attempting to.

This is the intended behavior. It's partially due to historical choices with python 2.x, where just by an import dill one could then extend the registry so that any python module that used pickle could utilize dill serialization without being modified or worked around. If this behavior is not what you want, then there's a function that will remove the injected types from the pickle registry:

Python 3.8.19 (default, Mar 22 2024, 01:55:18) 
[Clang 13.1.6 (clang-1316.0.21.2.5)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import io
>>> import pickle
>>> from collections import namedtuple
>>> SomeTuple = namedtuple('SomeTuple', 'a b')
>>> thing = SomeTuple(1, 2)
>>> f = io.BytesIO()
>>> pickle._Pickler(f).dump(thing)
>>> before_dill = f.getvalue()
>>> len(pickle._Pickler.dispatch)
15
>>> 
>>> import dill
>>> dill.extend(False)
>>> f = io.BytesIO()
>>> pickle._Pickler(f).dump(thing)
>>> after_dill = f.getvalue()
>>> assert before_dill == after_dill
>>> len(pickle._Pickler.dispatch)
15

This is an incredibly unfortunate default and I think it should be reconsidered.

As a user I do not control the behavior of third party libraries (dill and otherwise).

I may have neither a direct dependency on dill or pickle and be bitten by this behavior.

Consider the following:

  • I use a third party library that uses pickle._Pickler internally
  • I use another third party library that imports dill
  • I am not directly use either pickle or dill in my own code
  • I use this third party library to send an object to another process which doesn't have dill available in its environment
  • Things explode

I agree it's an unfortunate default. We looked at the impact of changing a poor decision made over a decade ago in a different ecosystem, and it'd also be very disruptive to reverse it. Hence, the addition of the "extend" function, even it has an unfortunate default.

This behavior in dill breaks our software, despite our software not using dill.

We get dill transitively imported in some applications but not others depending on whether another third party library depends on it.

Given it seems this behavior won't be changed in dill, the only path forward for users in my position is to try and convince other library authors to avoid using dill themselves?

Thoughts on something like the following in dill/__init__.py

try:
    import dill_noextend
except ImportError:
    extend()

I thought about an environment variable, but I think the above is more desirable, since it allows users to make the decision once at environment creation time.

That's an interesting idea.

I don't understand your particular case, so maybe that would help as well

In general, importing dill then using extend(False) should globally drop the dill extensions to the registry, even if your code doesn't use dll otherwise. If another library is using dill for serialization, it also is using a modified Pickler and Unpickler (i.e. dill subclasses those of pickle)... so if an object is serialized with dill, one should also have dill installed when you unpickle the object.

Some thought needs to go into what the impact of a dependency-based global switch is. While it would seem that the right thing to do in the abstract is to not extend the registry by default... an even better solution is to not extend the pickle registry at all (unfortunately, that's not feasible). Practically, regardless of what you do when dill is imported, either inject into the registry by default or not, modifying the global pickle registry can have an impact on other codes. It's not really a problem specific to dill -- any code that modifies the global pickle registry by adding a new type is going to cause the same kind of issue, I believe.... so the most conservative thing I can think of is to in your own code, explicitly set a switch to extend the registry when needed, and to explicitly 'clear' the registry when that's needed. dill is the only code that I know of that gives that as an option (although an improved behavior would be to return the extended types (from dill or elsewhere) as a dict, then extend take True for dill, or a dict (returned earlier).

I don't understand your particular case, so maybe that would help as well

The scenario is something like this:

  • we have a large active codebase with many hundreds of scripts, with new scripts being created frequently
  • nothing in our own codebase uses dill
  • we have a component that uses pickle._Pickler and depends on consistent pickle output between different scripts in our codebase
  • some scripts depend on third party libraries that transitively import dill

This results in a scenario where sometimes dill is imported and sometimes it is not, which changes the pickle output between scripts.

I don't want to import dill and unregister in the component that uses pickle._Pickler because that component is also used in environments that don't have dill installed.

So my options are vaguely:

  • completely ban dill from the environment and try to get it and any third party libraries that use it uninstalled
  • go fix all scripts that transitively import dill to call unregister and hope more don't pop up
  • add some try/except dill import to our _Pickler using component

None of those are very great - especially the last, since I really don't want to continue propagating the python antipattern of "try to import some thing and silently swallow the exception if it isn't there" within our codebase since it is already a great source of pain.

A better option for us would be to disable this behavior in dill outright. The existence of an empty file in our python environment called dill_noextend.py that signals to dill "if you are imported, please don't screw with the python pickler" would do exactly that for us.


and now for something completely different

I'm not really sure why this default cannot be changed - it already seems to be not functioning as intended?

import io
import pickle
from collections import namedtuple

SomeTuple = namedtuple('SomeTuple', 'a b')

thing = SomeTuple(1, 2)

f = io.BytesIO()
pickle._Pickler(f).dump(thing)
before_dill = f.getvalue()

import dill

f = io.BytesIO()
pickle.Pickler(f).dump(thing)
after_dill_native = f.getvalue()
assert before_dill == after_dill_native

f = io.BytesIO()
pickle._Pickler(f).dump(thing)
after_dill_python = f.getvalue()
assert before_dill != after_dill_python

dill only extends the pure-python pickler, not the native pickler implementation, which is contrary to how this is described in the docs.

Given that you have to go very much out of your way to use the python pickle implementation, and most people will have the native implementation, I can't imagine very many people are relying on this extend behavior anymore.

I believe all serialization packages, dill included, globally edit the pickle registry by adding new types. python gives the functionality to do that with copy_reg. If any of the third party codes use something that edits the pickle registry breaks your usage model, then it would be helpful to understand what your exact issue is. If that's an unreasonable ask, my suggestion can only be that you take action on your end (i.e. change your model, or ban any third party code from editing the pickle registry). However, if you can clarify what your exact issue is, and why it's an issue, there might be potential to come up with a solution that could go into dill.

Auto-extending the registry originated in python 2.x, and has never extended the native registry. When python switched pickle.Pickler to use the native pickler, and moved the python pickler to pickle._Pickler, ultimately from our perspective that was good in that it minimized the impact of any serialization package on the default global registry (i.e. the native one in python 3.x). I agree, it's a corner case to depend on the integrity of the registry by using pickle._Pickler... so it would help to understand why exactly you are doing that, as to help inform the most appropriate mitigation. I'll reopen this issue, as it would seem there is more to discuss here.