insitro/redun

Overflow Error with Pickle Protocol 3


I am building a data pipeline with redun in which large numpy.ndarrays are passed between task functions. However, I can very easily hit this overflow exception:

OverflowError: serializing a bytes object larger than 4 GiB requires pickle protocol 4 or higher

The cause of this issue seems straightforward: redun pins the pickle protocol to version 3

PICKLE_PROTOCOL = 3

which cannot serialize objects beyond 4 GiB in size. Whenever redun tries to serialize a decent-sized numpy array, either the pickle_dumps or pickle_dump function fails on the object. This serialization also appears to be used in value hashing, so any use of a large ndarray breaks redun even when the arrays are not cached. I can provide code to reproduce this, but it's pretty easy to do.
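For reference, a minimal reproduction looks something like the following (a sketch only: the task names are illustrative, and running it allocates roughly 4.3 GB of memory):

import numpy as np
from redun import task, Scheduler

redun_namespace = "repro"

@task()
def make_array() -> np.ndarray:
    # ~4.3e9 bytes, just over the 4 GiB (2**32 bytes) limit of pickle protocol 3.
    return np.zeros(4_300_000_000, dtype=np.uint8)

@task()
def array_nbytes(arr: np.ndarray) -> int:
    return arr.nbytes

@task()
def main() -> int:
    # Passing the large array into a downstream task forces redun to
    # hash/serialize it, raising OverflowError under pickle protocol 3.
    return array_nbytes(make_array())

if __name__ == "__main__":
    scheduler = Scheduler()
    print(scheduler.run(main()))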

Are there any plans to lift this constraint? Passing arrays/tensors larger than 4 GiB between tasks is very common, and this seems like a blind spot in redun that users would hit frequently. It's certainly the largest blocker to our adopting redun, which we would love to do given how elegant and straightforward the library is so far (and the lovely script tasks!).

Some ideas for solutions:

  • A quick solution would probably be to bump the protocol version from 3 to 4. Protocol 4 was introduced in Python 3.4, which is well within the currently supported Python versions. However, I think this would also break all previous cache values produced by redun, which is probably unacceptable.
  • Customizing Pickler/Unpickler instances to use a higher protocol in certain situations (a rough sketch follows this list)
  • Switching to a more robust serialization tool like dill or cloudpickle (same problems as before with cache breakage)
  • Perhaps redun can wrap large objects such that only their hashes are included in the upstream expressions (which would require Value-level serialization beyond pickle 3).
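To make the second idea concrete, here is a rough sketch of what it could look like; adaptive_dumps is a hypothetical helper, not part of redun:

import pickle

PICKLE_PROTOCOL = 3  # redun's current default

def adaptive_dumps(obj) -> bytes:
    """Serialize with protocol 3 when possible, fall back to protocol 4."""
    try:
        return pickle.dumps(obj, protocol=PICKLE_PROTOCOL)
    except OverflowError:
        # Only values too large for protocol 3 take this path. Such values
        # could never be pickled (or hashed) before, so no existing cache
        # entries would be invalidated by the fallback.
        return pickle.dumps(obj, protocol=4)

If that reasoning holds, small values would keep their existing hashes and only the previously-impossible large values would use the newer protocol.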

I'm not really an expert in this, so perhaps these ideas are not workable in redun.

Another, easier workaround would be to use file-based objects, but this would lead to tons of file I/O, and all arrays would have to be stored on disk, even ones that don't need to be cached. Not to mention that the constant "read file -> perform operation -> save file" loop is very clunky and makes code unreadable and inflexible. I've also tried making a custom ProxyValue class that did custom serialization, but I just kept running in circles between the type registry, the argument preprocessing, and the differences between serialization, deserialization, __setstate__, __getstate__, etc.
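For anyone else who lands here, the file-based workaround looks roughly like this (paths and task names are illustrative; the File value itself stays small because redun tracks the underlying file rather than pickling the array's contents):

import numpy as np
from redun import File, task

redun_namespace = "workaround"

@task()
def save_array(path: str = "arr.npy") -> File:
    arr = np.zeros(4_300_000_000, dtype=np.uint8)
    np.save(path, arr)   # payload lives on disk and never passes through pickle
    return File(path)    # only the small File value flows through the graph

@task()
def use_array(f: File) -> int:
    arr = np.load(f.path)  # every consumer repeats the read-modify-write loop
    return int(arr.nbytes)

@task()
def main() -> int:
    return use_array(save_array())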


I don't think my software version matters, but just to be thorough:

Python - 3.10.5
redun - 0.8.16
OS - Ubuntu Linux 20.04.1 x86_64 kernel 5.15.0

Thanks for sharing this @TylerSpears.

We originally picked pickle protocol 3 for backwards compatibility, but as you point out, it has this limitation on data size.

As you correctly point out, changing the pickle protocol is a breaking change, since it changes the hashes of many values.

We have gathered several design improvements that would break hashes and are considering whether some future redun version should adopt them all in one go, but we haven't settled on a plan just yet.

Thanks for sharing some suggestions.

Switching to a more robust serialization tool like dill or cloudpickle (same problems as before with cache breakage)

My understanding of cloudpickle is that it is appropriate for over-the-wire serialization, but problematic for persistence (to a database) because it requires the same Python interpreter version for serialization and deserialization.

Perhaps redun can wrap large objects such that only their hashes are included in the upstream expressions (which would require Value-level serialization beyond pickle 3).

This is a good idea that we have considered for other benefits as well. Basically, we have thought through a ValueRef-like object that references the large value. The large value is fetched lazily as needed, so expressions containing the ValueRef can remain small.
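Roughly, the shape we have in mind is something like the following sketch (ValueRef is not an existing redun class, and the storage backend is left abstract):

from dataclasses import dataclass, field
from typing import Any, Callable, Optional

@dataclass
class ValueRef:
    """Small stand-in for a large value; only the hash travels in expressions."""
    value_hash: str                  # stable hash of the referenced value
    fetch: Callable[[str], Any]      # e.g. a lookup in an object store by hash
    _cached: Optional[Any] = field(default=None, repr=False, compare=False)

    def resolve(self) -> Any:
        # Fetch the large payload lazily, on first access only.
        if self._cached is None:
            self._cached = self.fetch(self.value_hash)
        return self._cached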

I've also tried making a custom ProxyValue

Subclassing ProxyValue is the official way to customize serialization, hashing, etc. We have one builtin class, FileCache, which stores the serialization in a File. That may work for you or inspire a solution. Looking at that code again, I see it needs a get_hash() method added that avoids calling pickle_dumps, but that is an easy addition. Let me know if this class gives you a sufficient workaround until the pickle version is upgraded.
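As a starting point, a ProxyValue subclass might look roughly like this. This is a sketch only: the import path, constructor, and type_name registration are assumptions, so check the FileCache source for the exact hooks.

import hashlib
import numpy as np
from redun.value import ProxyValue  # import path is an assumption

class ArrayOnDisk(ProxyValue):
    type_name = "example.ArrayOnDisk"  # type-registry name (illustrative)

    def __init__(self, array: np.ndarray):
        self.array = array

    def get_hash(self, data: bytes = None) -> str:
        # Hash the raw buffer directly so pickle_dumps is never called
        # on the multi-GiB payload.
        return hashlib.sha256(self.array.tobytes()).hexdigest()

    def __getstate__(self) -> dict:
        # Spill the payload to disk; only the small path dict is pickled.
        path = f"/tmp/{self.get_hash()}.npy"
        np.save(path, self.array)
        return {"path": path}

    def __setstate__(self, state: dict) -> None:
        self.array = np.load(state["path"])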

Just chiming in to say my team would also benefit from a ValueRef type of object, both for large objects and for objects with class dependencies not installed in the environment where the main redun scheduler is running.