msgpack/msgpack-python

unpackb(): object_hook function doesn't recurse into maps or lists properly

juliannguyen4 opened this issue · 14 comments

When using msgpack.unpackb() to decode bytes to Python native types, the object_hook function isn't recursing into objects that contain other nested objects. For example, object_hook will be called for maps but not the keys and values inside the map. And it doesn't get called for lists or list items at all:

>>> import msgpack
>>> b = msgpack.packb({"a": 1})
>>> print(b)
b'\x81\xa1a\x01' # fixmap object containing 2 objects
>>> def object_hook(obj):
...     print("Object hook:", obj)
...     return obj
... 
>>> msgpack.unpackb(b, object_hook=object_hook)
Object hook: {'a': 1}
# Expected missing output:
# Object hook: 'a'
# Object hook: 1
{'a': 1}
>>> l = msgpack.packb(["a", "b"])
>>> print(l)
b'\x92\xa1a\xa1b' # fixarray object containing 2 objects
>>> msgpack.unpackb(l, object_hook=object_hook)
# Expected missing output:
# Object hook: ['a', 'b']
# Object hook: 'a'
# Object hook: 'b'
['a', 'b']

It's intended behavior. That design is compatible with standard JSON library.

Like JSON, you can use higher layer converter like pydantic to serialize&deserialize user defined types.

Hi @methane, I'm still confused why object_hook can't reach those objects in the example map and list. object_hook isn't called on the string 'a' and number 1 in the map, and it's not called on the list and the strings 'a' and 'b' inside that list. Aren't strings, numbers, and lists all JSON types?

Aren't strings, numbers, and lists all JSON types?

They are "Primitives", not "Object" in JavaScript.

https://developer.mozilla.org/en-US/docs/Web/JavaScript/Data_structures#objects

jfolz commented

https://msgpack-python.readthedocs.io/en/latest/api.html#msgpack.Unpacker

object_hook (callable) – When specified, it should be callable. Unpacker calls it with a dict argument after unpacking msgpack map. (See also simplejson)

khaf commented

Hi @methane , we need this feature to be able to differentiate between blobs, strings and GeoJson values in Aerospike database. We have an extension that prefixes STR values with a byte that determines their values. Without this feature, we'd have to iterate on all maps after unpacking them and redo the typing and remove that header byte. That iteration is a huge handicap for a database driver.

Do you see a workaround to this problem to avoid that final iteration?

Do we have to manually parse the string values with this class? https://msgpack-python.readthedocs.io/en/latest/api.html#msgpack.Unpacker

msgpack distinguish bytes and str. And there is Ext type.

khaf commented

That wasn't the case more than a decade ago when we began using msgpack, and we still have to support petabytes of data in already existing databases. Any way we could get to what we need using this library?

Use object_hook (and list_hook if necessary) and iterate values.
I won't add str_hook and bytes_hook only for few users.

jfolz commented

That wasn't the case more than a decade ago when we began using msgpack

I obviously don't know what you were doing 10 years ago, but the behavior of object_hook has been exactly the same since it was added in 2010.

In the Unpacker class, would it be possible to call the read_array_header() function recursively so we can parse objects in nested lists inside lists? For example, take a list such as:

[["\x03\x62", "\x03\x63"]]

And I want to process each string in a list recursively by removing the \x03 character, so the list would become:

[["\x62", "\x63"]]

The problem is after calling read_array_header() the intended use is to call unpack() n times, but unpack() doesn't allow us to parse nested lists properly.

Hi @methane, if I open a pull request to add support for str_hook (I don't believe we need bytes_hook), could you accept it?

No.
I won't add str_hook and bytes_hook only for few users.