uqfoundation/dill

RecursionError after upgrade to version 0.3.6

Closed · 6 comments

After upgrading dill to version 0.3.6, we occasionally get a RecursionError.

We use dill in the context of a scientific simulation, where the entire simulation is stored as a Python object. We regularly write this object to file in order to be able to resume a simulation from a previous state in case of an unexpected crash.

with open(path_to_file, "wb") as f:
    dill.dump(obj, f)

Since version 0.3.6 (0.3.5 and 0.3.5.1 work fine), dumping the object to disk can raise a RecursionError that seems to bounce back and forth between pickle and dill. Here are the last few lines of the traceback:

File ~/anaconda3/envs/j/lib/python3.11/site-packages/dill/_dill.py:381, in Pickler.save.<locals>.save_numpy_array(pickler, obj)
    379 npdict = getattr(obj, '__dict__', None)
    380 f, args, state = obj.__reduce__()
--> 381 pickler.save_reduce(_create_array, (f,args,state,npdict), obj=obj)
    382 logger.trace(pickler, "# Nu")
    383 return

File ~/anaconda3/envs/j/lib/python3.11/pickle.py:692, in _Pickler.save_reduce(self, func, args, state, listitems, dictitems, state_setter, obj)
    690 else:
    691     save(func)
--> 692     save(args)
    693     write(REDUCE)
    695 if obj is not None:
    696     # If the object is already in the memo, this means it is
    697     # recursive. In this case, throw away everything we put on the
    698     # stack, and fetch the object back from the memo.

File ~/anaconda3/envs/j/lib/python3.11/site-packages/dill/_dill.py:388, in Pickler.save(self, obj, save_persistent_id)
    386     msg = "Can't pickle %s: attribute lookup builtins.generator failed" % GeneratorType
    387     raise PicklingError(msg)
--> 388 StockPickler.save(self, obj, save_persistent_id)

File ~/anaconda3/envs/j/lib/python3.11/pickle.py:560, in _Pickler.save(self, obj, save_persistent_id)
    558 f = self.dispatch.get(t)
    559 if f is not None:
--> 560     f(self, obj)  # Call unbound method with explicit self
    561     return
    563 # Check private dispatch table if any, or else
    564 # copyreg.dispatch_table

File ~/anaconda3/envs/j/lib/python3.11/pickle.py:902, in _Pickler.save_tuple(self, obj)
    900 write(MARK)
    901 for element in obj:
--> 902     save(element)
    904 if id(obj) in memo:
    905     # Subtle.  d was not in memo when we entered save_tuple(), so
    906     # the process of saving the tuple's elements must have saved
   (...)
    910     # could have been done in the "for element" loop instead, but
    911     # recursive tuples are a rare thing.
    912     get = self.get(memo[id(obj)][0])

File ~/anaconda3/envs/j/lib/python3.11/site-packages/dill/_dill.py:348, in Pickler.save(self, obj, save_persistent_id)
    346 obj_type = type(obj)
    347 if NumpyArrayType and not (obj_type is type or obj_type in Pickler.dispatch):
--> 348     if NumpyUfuncType and numpyufunc(obj_type):
    349         @register(obj_type)
    350         def save_numpy_ufunc(pickler, obj):
    351             logger.trace(pickler, "Nu: %s", obj)

File ~/anaconda3/envs/j/lib/python3.11/site-packages/dill/_dill.py:105, in numpyufunc(obj_type)
    104 def numpyufunc(obj_type):
--> 105     return any((c.__module__, c.__name__) == ('numpy', 'ufunc') for c in obj_type.__mro__)

RecursionError: maximum recursion depth exceeded

The simulation object has a tree structure, and some recursion can occur within the object because attributes reference each other. The RecursionError occurs with both recurse=True and recurse=False.

What confuses us is that it only happens sometimes and we are not able to figure out what exactly causes it.

For example: Our module has test cases. Here is a notebook to test the gas evolution in a protoplanetary disk:
https://github.com/stammler/dustpy/blob/master/examples/test_gas_evolution.ipynb
The notebook runs four small simulations with only minor changes between them. The first two work without a problem, while the last two raise a RecursionError when writing the dump file at some point in the simulation, though always at the same point.

Increasing the recursion limit would help:

import sys
sys.setrecursionlimit(4000)
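
If the higher limit is only needed while dumping, it could be scoped to that call rather than set globally. The following is a minimal sketch using only the standard library; the helper name recursion_limit is hypothetical and not part of dill or pickle.

```python
import sys
from contextlib import contextmanager


@contextmanager
def recursion_limit(limit):
    """Temporarily raise the recursion limit and restore it afterwards.

    Hypothetical helper for illustration, not part of dill.
    """
    previous = sys.getrecursionlimit()
    # Only ever raise the limit; never lower it below the current value.
    sys.setrecursionlimit(max(previous, limit))
    try:
        yield
    finally:
        # Restore the original limit even if the body raised.
        sys.setrecursionlimit(previous)


# Usage (assumes dill, obj and path_to_file as in the snippet above):
#
# with recursion_limit(4000):
#     with open(path_to_file, "wb") as f:
#         dill.dump(obj, f)
```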

We are also considering pinning dill==0.3.5.1 in the requirements, or wrapping the dump in try: ... except: ..., since a failed dump is not critical enough to abort the simulation.
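
A rough sketch of the try/except idea, under some assumptions: the helper name try_dump is hypothetical, and it defaults to the standard library's pickle.dump so that it runs without dill installed (dill.dump could be passed as the dump argument instead). It writes to a temporary file first so that a failed dump does not overwrite the last good one.

```python
import os
import pickle
import tempfile


def try_dump(obj, path, dump=pickle.dump):
    """Write obj to path; return True on success, False on RecursionError.

    Hypothetical helper. Pass dump=dill.dump to serialize with dill.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # Write next to the target so os.replace stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            dump(obj, f)
        # Only replace the previous dump once the new one is complete.
        os.replace(tmp_path, path)
        return True
    except RecursionError:
        # The dump is optional, so report the failure instead of raising.
        return False
    finally:
        # Remove the partial temporary file if it is still around.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
```

Catching only RecursionError, rather than using a bare except, keeps unrelated errors such as a full disk visible.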

However, we would like to understand what is going on and possibly fix it. Is this a problem on our side? Or is this an issue with dill?
What could cause such a behavior?

I have a similar issue. Are there any updates on this?

I am trying to provide a Minimal Reproducible Example, but I haven't managed to yet...

Increasing the recursion limit is enough:
sys.setrecursionlimit(5000)

@eosjust: did you test that increasing the recursion limit mitigates the issue, or are you just suggesting it as a potential fix?

Increasing the recursion limit does mitigate the problem. But the larger recursion limit can still be hit.
I personally would see this more as a workaround than a fix.

Apparently, there was an issue that was fixed for version 0.3.6 that was short-circuiting the long recursion trace. So, I assume the thing to do would be to set a really large recursion limit, and then look at the pickling trace (i.e. dill.detect.trace(True)) and the actual serialized object that is created... and then attempt to create a short-circuit that yields the same result.
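
The suggestion above might look roughly like this. It is only a sketch: the helper name dump_with_trace and the default limit of 20000 are assumptions, and a very large limit can crash the interpreter if the C stack is exhausted. dill.detect.trace(True) switches on dill's pickling trace, which is logged to stderr.

```python
import sys

import dill
import dill.detect


def dump_with_trace(obj, limit=20_000):
    """Serialize obj with dill while the pickling trace is enabled.

    Hypothetical helper; the default limit is an arbitrary choice.
    """
    previous = sys.getrecursionlimit()
    sys.setrecursionlimit(max(previous, limit))
    dill.detect.trace(True)  # log each pickling step to stderr
    try:
        return dill.dumps(obj)
    finally:
        dill.detect.trace(False)  # switch the trace off again
        sys.setrecursionlimit(previous)
```

Long runs of the same nested entries in the trace are a hint at which attribute chain is responsible for the depth.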

Thank you for the hint with dill.detect.trace(True).

We found a recursion in our object, which was storing a copy of itself instead of a reference, and these copies could accumulate over time. After fixing this, the recursion error was gone. We are still not quite sure why it only happened after version 0.3.6, only sometimes, and only with dill, not with pickle.
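
To illustrate the kind of pattern described here, the following is a hypothetical reconstruction, not our actual code: the class Node and the helper names are made up. Storing a deep copy on every step nests each snapshot inside the next, so the depth a pickler has to descend grows with the number of steps, while storing a reference keeps it constant.

```python
import copy


class Node:
    """Hypothetical stand-in for an object that remembers a previous state."""

    def __init__(self):
        self.previous = None


def chain_length(node):
    """Count distinct objects reachable by following .previous."""
    seen = set()
    while node is not None and id(node) not in seen:
        seen.add(id(node))
        node = node.previous
    return len(seen)


def step_with_copy(node):
    # Problematic pattern: the copy also contains all earlier copies.
    node.previous = copy.deepcopy(node)


def step_with_reference(node):
    # A plain reference forms a cycle, which picklers handle via the memo.
    node.previous = node


copied, referenced = Node(), Node()
for _ in range(20):
    step_with_copy(copied)
    step_with_reference(referenced)

print(chain_length(copied), chain_length(referenced))
```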

From our side this issue could be closed, but @bashirmindee reported a similar problem.