uqfoundation/dill

Cannot use callable that was pickled within pytest

dionhaefner opened this issue · 14 comments

I am running tests that serialize callables with dill and try to load them in a subprocess to make sure everything worked correctly. I am getting a cryptic error when trying to load the callable from the subprocess, presumably because dill is failing to load the test module.

Example:

# save as dill_test.py
import sys
import tempfile
from textwrap import dedent

def foo():
    pass

def test_dill():
    import subprocess
    import dill

    with tempfile.TemporaryDirectory() as tmpdir:
        picklefile = f"{tmpdir}/foo.pickle"

        with open(picklefile, "wb") as f:
            f.write(dill.dumps(foo))

        test_script = dedent(f"""
        import dill
        with open("{picklefile}", "rb") as f:
            func = dill.load(f)
        func()
        """)

        subprocess.run([sys.executable, "-c", test_script], check=True)

if __name__ == "__main__":
    test_dill()
    print("ok")

Calling through pytest gives this error:

$ pytest dill_test.py
E               subprocess.CalledProcessError: Command '['/Users/dion/.virtualenvs/py312/bin/python', '-c', '\nimport dill\nwith open("/var/folders/fk/g5ssrkz179z1mjmvqn1j3q1m0000gn/T/tmphuyt802o/foo.pickle", "rb") as f:\n    func = dill.load(f)\nfunc()\n']' returned non-zero exit status 1.

/opt/homebrew/Cellar/python@3.12/3.12.0/Frameworks/Python.framework/Versions/3.12/lib/python3.12/subprocess.py:571: CalledProcessError
-------------------------------------------------------------------------- Captured stderr call ---------------------------------------------------------------------------
Traceback (most recent call last):
  File "<string>", line 4, in <module>
  File "/Users/dion/.virtualenvs/py312/lib/python3.12/site-packages/dill/_dill.py", line 287, in load
    return Unpickler(file, ignore=ignore, **kwds).load()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dion/.virtualenvs/py312/lib/python3.12/site-packages/dill/_dill.py", line 442, in load
    obj = StockUnpickler.load(self)
          ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dion/.virtualenvs/py312/lib/python3.12/site-packages/dill/_dill.py", line 432, in find_class
    return StockUnpickler.find_class(self, module, name)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ModuleNotFoundError: No module named 'dill_test'
========================================================================= short test summary info =========================================================================
FAILED tests/dill_test.py::test_dill - subprocess.CalledProcessError: Command '['/Users/dion/.virtualenvs/py312/bin/python', '-c', '\nimport dill\nwith open("/var/folders/fk/g5ssrkz179z1mjmvqn1j3q1m0000gn/...

Calling it directly works:

$ python dill_test.py
ok

Funnily enough, it works when I do this before pickling:

foo.__globals__.pop(foo.__name__)

I want to make sure I understand this correctly: running your script directly works, but running it under pytest (with the subprocess) throws the error above. Is that correct? If so, I'd be interested to see the output with dill.detect.trace(True) enabled.

That's what I thought, but I've now realized this is actually a path issue.

$ python tests/dill_test.py
ok

$ cd tests
$ pytest dill_test.py
ok

$ pytest tests/dill_test.py
NOT OK

So in the last case, dill.load tries to import the dill_test module but fails because the directory containing dill_test.py is not on sys.path. It is fixed by changing the load script to this:

        # needs "import os" at the top of dill_test.py
        test_script = dedent(f"""
        import dill
        import sys
        sys.path.append("{os.path.dirname(__file__)}")
        with open("{picklefile}", "rb") as f:
            func = dill.load(f)
        func()
        """)

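The path dependence described above can be reproduced without pytest or dill. The sketch below is an illustration only: it uses the standard library's pickle (which stores the function by reference, as in the failing case), and the module name dill_issue_demo_mod, the LOADER script, and the run_loader helper are all made up for this example.

```python
import importlib
import os
import pickle
import subprocess
import sys
import tempfile

# Child-process script: unpickle the function named in argv[1] and call it.
LOADER = (
    "import pickle, sys\n"
    "with open(sys.argv[1], 'rb') as f:\n"
    "    func = pickle.load(f)\n"
    "print(func())\n"
)


def run_loader(pickle_path, extra_path=None):
    """Run LOADER in a fresh interpreter, optionally extending PYTHONPATH."""
    env = dict(os.environ)
    if extra_path is not None:
        old = env.get("PYTHONPATH")
        env["PYTHONPATH"] = extra_path if not old else extra_path + os.pathsep + old
    return subprocess.run(
        [sys.executable, "-c", LOADER, pickle_path],
        env=env, capture_output=True, text=True,
    )


with tempfile.TemporaryDirectory() as tmpdir:
    # Hypothetical module living in a directory the child process does not know about.
    with open(os.path.join(tmpdir, "dill_issue_demo_mod.py"), "w") as f:
        f.write("def foo():\n    return 'ok'\n")

    # Import it in this process only, then take the directory off sys.path again.
    sys.path.insert(0, tmpdir)
    try:
        importlib.invalidate_caches()
        demo_mod = importlib.import_module("dill_issue_demo_mod")
    finally:
        sys.path.remove(tmpdir)

    pickle_path = os.path.join(tmpdir, "foo.pickle")
    with open(pickle_path, "wb") as f:
        pickle.dump(demo_mod.foo, f)  # stores the module and function names, not the code

    missing = run_loader(pickle_path)                   # module directory not importable
    found = run_loader(pickle_path, extra_path=tmpdir)  # module directory importable

print(missing.returncode != 0, "ModuleNotFoundError" in missing.stderr)
print(found.stdout.strip())
```

The first child fails for the same reason as the pytest run from the parent directory: the pickle only names the module, and the child cannot import it. The second child succeeds because the module's directory is importable there.
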
Is there a way to pickle a function so it can be executed even if the original module isn't available when unpickling?

Generally, dill assumes that module dependencies are installed. It does provide different approaches for tracing dependencies in the global scope, but what you might be able to do in any case is dump the module along with the function. You'd then load the module first and the function second. Something like this is only needed for "uninstalled" modules. It is fine for saving state, but not really suited to parallel computing.
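A rough sketch of the "load the module, then the function" order, using only the standard library rather than dill's own module-dumping helpers. Everything here is an assumption made for illustration: the module name uninstalled_demo_mod, shipping the module as source text, and the install_module helper.

```python
import pickle
import sys
import types

MODULE_NAME = "uninstalled_demo_mod"  # hypothetical name, not a real package
MODULE_SOURCE = "def foo():\n    return 'ok'\n"


def install_module(name, source):
    """Build a module object from source text and register it in sys.modules."""
    module = types.ModuleType(name)
    exec(source, module.__dict__)
    sys.modules[name] = module
    return module


# Dump side: the function is pickled as a reference to MODULE_NAME.
module = install_module(MODULE_NAME, MODULE_SOURCE)
payload = pickle.dumps(module.foo)

# Simulate a process in which the module is not available.
del sys.modules[MODULE_NAME]
try:
    pickle.loads(payload)
    load_without_module = "loaded"
except ModuleNotFoundError:
    load_without_module = "ModuleNotFoundError"

# Load side: restore the module first, then the function.
install_module(MODULE_NAME, MODULE_SOURCE)
func = pickle.loads(payload)
print(load_without_module, func())
```

The ordering is the point: once the module is registered, the by-reference pickle resolves normally.
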

> Generally, dill assumes that module dependencies are installed.

But why is this module a dependency in the first place? The function doesn't access any globals.

The global dict is required to create a function object.

Python 3.8.18 (default, Aug 25 2023, 04:23:37) 
[Clang 13.1.6 (clang-1316.0.21.2.5)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import types
>>> print(types.FunctionType.__doc__)
Create a function object.

  code
    a code object
  globals
    the globals dictionary
  name
    a string that overrides the name from the code object
  argdefs
    a tuple that specifies the default argument values
  closure
    a tuple that supplies the bindings for free variables
>>> 

However, dill has different settings that modify how the global dict is handled. So, you can try dill.settings['recurse'] = True, which will only pickle items in the global dict that are pointed to by the function, and otherwise stores a dummy global dict.
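To make the role of the globals dict concrete, here is a small standard-library sketch. It is not dill's implementation of the recurse setting, only an approximation of the idea: the referenced_globals helper is hypothetical, and it ignores nested functions and any attribute name that happens to match a global.

```python
import types

GREETING = "hello"
UNUSED = "never read by greet()"


def greet():
    return GREETING


def referenced_globals(func):
    """Return the entries of func.__globals__ named in the function's code object."""
    names = set(func.__code__.co_names)
    return {k: v for k, v in func.__globals__.items() if k in names}


# An empty globals dict is enough to build the function, but the lookup fails at call time.
bare = types.FunctionType(greet.__code__, {}, "greet_bare")
try:
    bare()
    bare_result = "returned"
except NameError:
    bare_result = "NameError"

# A trimmed dict holding only what the code refers to is sufficient.
slim_globals = referenced_globals(greet)
slim = types.FunctionType(greet.__code__, slim_globals, "greet_slim")

print(bare_result)           # NameError
print(sorted(slim_globals))  # ['GREETING']
print(slim())                # hello
```

A function like foo in the original report, which reads no globals at all, would work even with the empty dict.
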

Thanks, I think I understand the problem now. recurse=True doesn't work but I guess that's due to some modifications done to the callable by pytest.

You can often see what's going on with dill.detect.trace(True).

Okay here goes nothing.

This is the case that works:

$ python tests/dill_test.py
┬ F1: <function foo at 0x102580040>
├┬ F2: <function _create_function at 0x102fb32e0>
│└ # F2 [34 B]
├┬ Co: <code object foo at 0x102755b00, file "/private/tmp/tests/dill_test.py", line 6>
│├┬ F2: <function _create_code at 0x102fb3370>
││└ # F2 [19 B]
│└ # Co [102 B]
├┬ D2: <dict object at 0x0102fc49c0>
│└ # D2 [25 B]
├┬ D2: <dict object at 0x0102956a00>
│└ # D2 [2 B]
├┬ D2: <dict object at 0x0102fc4b80>
│├┬ D2: <dict object at 0x0102938ac0>
││└ # D2 [2 B]
│└ # D2 [23 B]
└ # F1 [198 B]

This is the one that doesn't:

$ pytest tests/dill_test.py
┬ F2: <function foo at 0x104473be0>
└ # F2 [20 B]

So if pytest is involved, dill doesn't even try to pickle any of the function's attributes...?

Essentially, yes. "F2" is passing the function off to pickle. The key is that there's an internal function called _locate_function. If that returns False (probably, in this case, because _import_module does not find the module), then it punts to pickle, which gives up.

Isn't it the other way around? According to https://github.com/uqfoundation/dill/blob/master/dill/_dill.py#L1881C12-L1881C12, dill uses the stock pickler when _locate_function returns True. But this is not what I want, since I want to dump the function object itself, not a reference to it.

Yes, you are correct. I missed the "not" in the if statement.
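The reference-versus-object distinction discussed here can be inspected with the stock pickler alone. A minimal sketch, using textwrap.dedent only because it is an importable module-level function:

```python
import pickle
import pickletools
from textwrap import dedent  # any importable module-level function would do

payload = pickle.dumps(dedent)

# Collect every string argument stored in the pickle's opcode stream.
strings = [arg for _op, arg, _pos in pickletools.genops(payload) if isinstance(arg, str)]
print(len(payload), strings)

# Unpickling imports the module again and looks the name up in it.
print(pickle.loads(payload) is dedent)  # True
```

The payload holds only the module and function names, with no code object, which is why loading it depends on the module being importable in the receiving process.
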

Could you imagine having a flag similar to byref for modules that forces dill to pickle the function object instead of a reference to it? I think this would get us a lot closer to what we want to achieve.

Yes, there is a mostly finished PR that handles a bunch of module serialization variants. Work on it seems to have stalled a bit, though.