aiidateam/disk-objectstore

`PackedObjectReader` does not implement `readline` - fails unpickling!

Opened this issue · 2 comments

In a use case where the output of a calculation (a `GeneralData` type of an aiida-workgraph task) is automatically pickled and used as an input by a subsequent calculation, `pickle` fails to unpickle the pickled output because the file handle, which is of type `disk_objectstore.utils.PackedObjectReader`, is missing a `readline` method.

After chatting with @sphuber, it is understood that due to an optimization step by disk-objectstore, the file handle changes from `io.BufferedReader` to `disk_objectstore.utils.PackedObjectReader`. Unfortunately, since I'm unclear on when this optimization takes place, I am unable to reproduce the issue at the moment.

In any case, though `disk_objectstore.utils.PackedObjectReader` implements `read`, it does not implement `readline`; both are required by `pickle` for unpickling. As a result, `pickle` fails to unpickle the output.
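The requirement can be demonstrated with the standard library alone. The following sketch uses a hypothetical `ReadOnlyStream` class as a stand-in for a stream that, like `PackedObjectReader`, only implements `read`:

```python
import io
import pickle


class ReadOnlyStream:
    """Hypothetical stand-in for a stream that only implements read()."""

    def __init__(self, data: bytes):
        self._buffer = io.BytesIO(data)

    def read(self, size=-1):
        return self._buffer.read(size)


data = pickle.dumps('some-string')

try:
    pickle.load(ReadOnlyStream(data))
except TypeError as exc:
    # Matches the error reported below: pickle's unpickler refuses any
    # file object that lacks either a `read` or a `readline` attribute.
    print(exc)  # file must have 'read' and 'readline' attributes
```

This is not the actual disk-objectstore code path, but it reproduces the same `TypeError` because `pickle.Unpickler` checks for both attributes up front.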

@sphuber suggests that `disk_objectstore.utils.PackedObjectReader` be revisited to ensure it can be treated on an equal footing with an `io.BufferedReader`.


The crux is that the bug only appears when the pickled object is part of a pack. If it is a loose file, the API returns a normal file handle when opening it. However, if it is part of a pack, a `PackedObjectReader` is used instead, which doesn't implement `readline`.

The following script provides a minimal working example:

```python
#!/usr/bin/env python
import pickle
import tempfile

import disk_objectstore

with tempfile.TemporaryDirectory() as dirpath:

    # Initialize a disk-objectstore container
    container = disk_objectstore.Container(dirpath)
    container.init_container()

    # Pickle an arbitrary object (a random string in this example)
    pickled_data = pickle.dumps('some-string')

    # Toggle the conditional to write either directly to a pack file or to a loose file
    if True:
        key = container.add_objects_to_pack([pickled_data])[0]
    else:
        key = container.add_object(pickled_data)

    # Retrieve the pickled byte stream and try to unpickle it
    with container.get_object_stream(key) as handle:
        pickle.load(handle)
```

If the conditional is `True`, the pickled object is written directly to a pack, and the `pickle.load` call raises:

```
Traceback (most recent call last):
  File "scripts/run_pickle_dos.py", line 21, in <module>
    pickle.load(handle)
TypeError: file must have 'read' and 'readline' attributes
```

If we switch it to `False`, such that the pickled object is written to a loose file instead, the code works just fine.

So I think we just need to implement `readline` on the `PackedObjectReader` class and we should be golden.

Thanks @sphuber. I'll open a PR draft soon and give it a go.