lebedov/msgpack-numpy

msgpack_numpy shouldn't support np.int type!

weiyinfu opened this issue · 9 comments

msgpack's strengths are smaller output and faster serialization. But if I use msgpack_numpy, the space used can be larger than JSON. The reason is that msgpack-numpy supports np.ndarray, np.bool, np.int, and so on.
msgpack_numpy serializes an np.int scalar to a dict with four entries.

import numpy as np
import msgpack_numpy as msgpack

x = np.array([1, 2, 5], dtype=np.int32)
xx = [i for i in x]  # a list of numpy integer scalars

y = msgpack.packb(xx)             # each scalar becomes a type-tagged dict
print(y)
print(msgpack.packb(x.tolist()))  # plain Python ints pack compactly
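To see the actual size difference, compare the lengths directly (a rough check; exact byte counts can vary with msgpack and numpy versions):

import json
import numpy as np
import msgpack_numpy as msgpack

x = np.array([1, 2, 5], dtype=np.int32)
print(len(msgpack.packb([i for i in x])))  # on the order of 100 bytes: three type-tagged dicts
print(len(msgpack.packb(x.tolist())))      # 4 bytes: array header plus three fixints
print(len(json.dumps(x.tolist())))         # 9 characters: '[1, 2, 5]'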

So I think msgpack_numpy shouldn't do a job it cannot handle well. It's misleading to other people. You just need to encode np.ndarray, not np.int.

In order to faithfully serialize/deserialize numpy's special numeric types, there is a need to encode the corresponding numeric data type info; without the data type info, it isn't possible to unambiguously reconstruct the serialized data with the original type. This will necessarily cause serialization of scalar instances of such types (such as those contained in a list) to include the type info at the expense of the length of the resulting serialization. Some data structures (such as [np.int32(1), np.int32(2), np.int32(3)]) may be more efficiently encoded with JSON than with msgpack_numpy, but at the expense of the original data's numeric type. If you need to serialize arrays of integers with msgpack and don't care about the data type, I recommend converting them to lists of Python ints before serialization.
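For example, here is a minimal round trip showing the trade-off, using msgpack_numpy's default packb/unpackb:

import numpy as np
import msgpack_numpy as msgpack

xx = [np.int32(1), np.int32(2), np.int32(3)]

# With msgpack_numpy, the dtype survives the round trip, at the cost of size:
yy = msgpack.unpackb(msgpack.packb(xx))
print(type(yy[0]))  # <class 'numpy.int32'>

# Converting to Python ints first is compact but discards the dtype:
zz = msgpack.unpackb(msgpack.packb([int(i) for i in xx]))
print(type(zz[0]))  # <class 'int'>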

msgpack_numpy can certainly serialize many types such as np.int and np.bool.
But msgpack_numpy is not good at serializing np.int and np.bool.
So I think the better approach is to remove the ability to serialize np.int.
The current code makes it easy to make mistakes.
To keep users from making mistakes, msgpack_numpy should only do things it can handle well, and leave the hard part to the user, rather than letting users think they have solved a problem properly when they have actually written bad code.

np.int and np.bool are Python built-in types, not numpy types; msgpack_numpy just passes them to msgpack for serialization; preventing this would break serialization for many data structures.
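For instance, plain msgpack (with no msgpack_numpy involvement) already handles built-in ints and bools natively:

import msgpack

packed = msgpack.packb([1, True, 3])
print(msgpack.unpackb(packed))  # [1, True, 3]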

The design goal of msgpack_numpy is to enable msgpack serialization/deserialization of numpy arrays and numpy types in a way that preserves type. I'm open to considering more efficient ways of doing the serialization while preserving type, but preventing serialization in certain scenarios (such as data structures containing scalar instances of numpy types, which seem to be the ones troubling you) would adversely affect other users who depend upon preservation of data type, and hence is a change I'm not willing to make.

I'm still not sure exactly what use case you want to improve serialization efficiency for, but if it doesn't require preservation of numpy types (e.g., because you assume, say, that all ints are 64-bit), you can write a custom encoder/decoder function pair that encodes your assumptions and pass it to the Python msgpack classes (as msgpack_numpy does).
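A minimal sketch of such a pair, assuming every numpy integer can safely be collapsed to a Python int (encode_lossy is an illustrative name, not part of msgpack_numpy):

import msgpack
import numpy as np

def encode_lossy(obj):
    # Hypothetical encoder: drop numpy dtypes instead of preserving them.
    if isinstance(obj, np.ndarray):
        return obj.tolist()  # arrays become plain lists
    if isinstance(obj, np.integer):
        return int(obj)      # numpy ints become plain ints; dtype is lost
    if isinstance(obj, np.bool_):
        return bool(obj)
    raise TypeError('cannot serialize %r' % (obj,))

xx = [np.int32(1), np.int32(2), np.int32(5)]
packed = msgpack.packb(xx, default=encode_lossy)
assert msgpack.unpackb(packed) == [1, 2, 5]

Because everything is collapsed to native msgpack types, no custom decoder is needed; the cost is that the original dtypes cannot be reconstructed.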

numpy is not a Python built-in package.
So how can np.int be a Python built-in type?
You must run import numpy as np before you can use np.int.

I'm still not sure exactly what use case you want to improve serialization efficiency for,

I described my use case at the start. Maybe you didn't pay attention to it.

import numpy as np
import msgpack_numpy as msgpack

x = np.array([1, 2, 5], dtype=np.int32)
xx = [i for i in x]  # xx is List[np.int]

y = msgpack.packb(xx)  # y is bytes, and it's much larger than the result below
print(msgpack.packb(x.tolist()))

My point is this: I trusted msgpack_numpy, so I didn't check the data I was serializing. But I was disappointed because the serialized result is too large (compared with JSON). So the job wasn't done well, and msgpack_numpy's implementation is to blame.

numpy is not a Python built-in package.
So how can np.int be a Python built-in type?
You must run import numpy as np before you can use np.int.

numpy imports several built-in types into its namespace; np.int is not the same as np.int32 or np.int64.
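A quick check (in numpy versions where np.int still exists; it was deprecated in numpy 1.20 and later removed):

import numpy as np

print(np.int is int)       # True: np.int is just Python's built-in int
print(np.int32 is np.int)  # False: np.int32 is a genuine numpy scalar type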

I'm still not sure exactly what use case you want to improve serialization efficiency for,

I described my use case at the start. Maybe you didn't pay attention to it.

import numpy as np
import msgpack_numpy as msgpack

x = np.array([1, 2, 5], dtype=np.int32)
xx = [i for i in x]  # xx is List[np.int]

y = msgpack.packb(xx)  # y is bytes, and it's much larger than the result below
print(msgpack.packb(x.tolist()))

In your example, xx is a list of np.int32 instances, not Python built-in int instances (you can confirm this by running type(xx[0])); x.tolist() converts the contents of x into Python built-in int.
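Running that confirmation directly:

import numpy as np

x = np.array([1, 2, 5], dtype=np.int32)
xx = [i for i in x]

print(type(xx[0]))          # <class 'numpy.int32'>
print(type(x.tolist()[0]))  # <class 'int'>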

My point is this: I trusted msgpack_numpy, so I didn't check the data I was serializing. But I was disappointed because the serialized result is too large (compared with JSON). So the job wasn't done well, and msgpack_numpy's implementation is to blame.

msgpack-numpy unfortunately cannot accommodate all use cases simultaneously given that some have conflicting design implications; as previously mentioned, there are several ways to use msgpack more efficiently in your particular use case. That said, I have added a note in the README file indicating the project's design goal so as to reduce future confusion.