Send boolean values as boolean rather than integers

Question

Closed this issue 2 years ago · 7 comments

Hi again 👋

The new enum metadata works great for boolean datasets/attributes 👍

However, values are returned as an `Int8Array` for arrays and as integers for scalars. The conversion therefore has to be done by the consumer, which causes problems for nD datasets/attributes.

Do you think it would be possible to return arrays of booleans (or simply booleans for scalars) instead?

bmaranville commented 3 years ago

see #26

Answer 1 · 2022-05-12T11:36:04.000Z

What h5py does in handling boolean datasets (and complex datasets for that matter) is a little "magic", given that boolean datatypes are not implemented directly in the HDF5 specification. I would prefer more explicit mechanisms, but if there is overwhelming benefit it could of course be done.

Answer 2 · 2022-05-12T12:09:18.000Z

Continuing discussion from silx-kit/h5web#1112 (comment)

There are performance (and other) issues with using nested arrays instead of typed arrays. I still don't understand the scope of the request - is it preferred in your use case that all dataset and attribute values be converted to nested arrays, or just attribute values, or just "boolean" attribute values?

Ideally, I would like boolean values (regardless of whether they come from a dataset or an attribute) to be returned as nested arrays of booleans.

Examples:

| Original dataset/attribute in h5py (Py) | Ideal returned value (JS) | Actual returned value (JS) |
| --- | --- | --- |
| `True` | `true` | `1` |
| `[True, False]` | `[true, false]` | `Int8Array(2) [ 1, 0 ]` |
| `[ [True, False], [True, False] ]` | `[ [true, false], [true, false] ]` | `Int8Array(4) [1, 0, 1, 0]` |

I agree that it is a bit of "magic", as you said, which is why I wanted to discuss it with you first. I also think the performance issues are mitigated, since nested arrays would only be used for booleans and I don't expect huge boolean datasets/attributes.
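For reference, the conversion a consumer currently has to do to get from the "Actual" column to the "Ideal" column can be sketched roughly as follows. The helper name `toNestedBooleans` is hypothetical and not part of the library, and the sketch assumes the shape is available separately as an array of dimension sizes (`[]` for a scalar) and that the flat buffer is in row-major order.

```javascript
// Hypothetical consumer-side helper; not part of the library.
// `value` is what is currently returned (a number or an Int8Array).
// `shape` is assumed to be an array of dimension sizes, e.g. [] or [2, 2].
function toNestedBooleans(value, shape) {
  // Scalars come back as plain integers.
  if (typeof value === "number") {
    return value !== 0;
  }
  // Array.from is used because Int8Array.prototype.map would coerce
  // the booleans back to 0/1 inside a new Int8Array.
  const flat = Array.from(value, (v) => v !== 0);
  if (shape.length <= 1) {
    return flat;
  }
  // Split the flat (assumed row-major) buffer into nested arrays,
  // outermost dimension first.
  const nest = (items, dims) => {
    if (dims.length === 1) return items;
    const [first, ...rest] = dims;
    const chunk = items.length / first;
    return Array.from({ length: first }, (_, i) =>
      nest(items.slice(i * chunk, (i + 1) * chunk), rest)
    );
  };
  return nest(flat, shape);
}
```

With the rows of the table above, `toNestedBooleans(new Int8Array([1, 0, 1, 0]), [2, 2])` yields `[[true, false], [true, false]]`. The need to carry the shape alongside the flat buffer is the nD issue mentioned earlier.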

Answer 3 · 2022-05-12T17:23:04.000Z

What is shown for h5py in the table above is really the result of two levels of special handling. First, h5py follows a convention that any enum with members {"TRUE", "FALSE"} should trigger the creation of a numpy array with dtype 'bool' (it is still passing a 1D buffer of bytes into that numpy array on construction). Second, numpy has special `__repr__` and `tolist` serialization methods for bool arrays. The result of `tolist` is what is shown above.

I would consider adding a tolist method to datasets and attributes, since we don't have a nice intermediate container library in javascript that corresponds to the role of numpy.ndarray with h5py.

I would also agree that we could add special handling for the case of enum {"TRUE", "FALSE"}, where we could say that this is "in agreement with the h5py convention" for creating boolean arrays in HDF5, and return (flat, 1D) arrays of JS boolean values from `.value` (and possibly nested values from `.tolist()`, as above).

I don't think it would make sense to return nested arrays from Dataset.value or Attribute.value for just this one particular instance of the enum datatype, when none of the other datatypes (float, int, enum...) would be handled this way.

EDIT: I just realized that the enum has to be {"FALSE", "TRUE"} instead of the other way around, but that doesn't affect the above discussion 😐
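A minimal sketch of what that special case could look like, for discussion only. The function names and the `members` object shape (`{ NAME: integer }`) are assumptions made for this illustration and are not the library's actual API. The check assumes the h5py convention of `FALSE = 0`, `TRUE = 1`.

```javascript
// Hypothetical helpers; the `members` layout ({ NAME: integer }) is an
// assumption for this sketch, not a confirmed library structure.
function isBooleanEnum(members) {
  const names = Object.keys(members);
  // Exactly two members, matching the h5py convention FALSE = 0, TRUE = 1.
  return names.length === 2 && members.FALSE === 0 && members.TRUE === 1;
}

function decodeEnumValues(raw, members) {
  // Every other enum is returned untouched, so only the h5py-style
  // boolean case changes behaviour.
  if (!isBooleanEnum(members)) return raw;
  // Flat, 1D array of JS booleans, same length and order as the raw buffer.
  return Array.from(raw, (v) => v !== 0);
}
```

Keeping the result flat matches how the other datatypes are returned; nesting would remain a separate, opt-in step.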

Answer 4 · 2022-05-13T06:32:16.000Z

Thanks, #26 is already a great improvement!

Given your concerns, having a `tolist` method would indeed be nice. This way, the default behaviour of returning typed arrays would not change, and I could opt in to nested arrays by calling this method.

Answer 5 · 2022-05-18T18:13:01.000Z

An implementation of `to_array` is in #27 if you want to comment on it. I think it roughly does what `tolist` does in numpy.

Answer 6 · 2022-06-08T12:26:17.000Z

v0.4.4 fits the bill for me!