HDFGroup/hsds

Support fixed-length strings with UTF-8 character set

mattjala opened this issue · 7 comments

HSDS currently does not support these (see hdf5dtype.py:617)

Are they supported in the library?

Yes. A string's encoding and the number of bytes reserved for its storage are decoupled.

Yep, see here for an example of fixed-length unicode strings being used in datasets/attributes - the native VOL passes both of these tests.

The question may be more related to how h5py treats HDF5 strings, where this combination is not really supported: any fixed-length string is treated as a bytes object, not a Unicode string.

A fixed-width Unicode encoding would be UTF-32, but as @ajelenak says, it's not explicitly supported by the library (or by HSDS).

I think there's some confusion in terminology here. The request is not for a Unicode character encoding in which every character has a fixed width in bytes (e.g. UTF-32). It is for string datatypes that have a fixed total length in bytes (fixed-length strings) AND use the UTF-8 character set/encoding, in which a given character does not have a fixed number of bytes associated with it.

I've updated the title of this issue to make that clearer. The library does support fixed-length strings in UTF-8 (see the tests I linked above).
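To make the distinction concrete, here is a minimal sketch in plain Python (no HDF5 or HSDS involved). It compares the per-string byte counts of UTF-8 and UTF-32, then stores each string in a field with a fixed total size in bytes. The helpers `to_fixed_utf8` and `from_fixed_utf8` and the 8-byte `FIELD_SIZE` are hypothetical names made up for illustration, and null padding is an assumption chosen for simplicity; this is not how the library or HSDS implements it.

```python
def to_fixed_utf8(text: str, size: int) -> bytes:
    """Encode text as UTF-8 and null-pad it to exactly `size` bytes.

    Hypothetical helper: raises instead of truncating, because cutting a
    UTF-8 byte sequence at an arbitrary offset can split a character.
    """
    raw = text.encode("utf-8")
    if len(raw) > size:
        raise ValueError(f"{text!r} needs {len(raw)} bytes, field holds {size}")
    return raw.ljust(size, b"\x00")


def from_fixed_utf8(field: bytes) -> str:
    """Strip the null padding and decode the remaining bytes as UTF-8."""
    return field.rstrip(b"\x00").decode("utf-8")


FIELD_SIZE = 8  # fixed total length of the field, in bytes (assumed value)

for word in ("hello", "héllo", "日本"):
    field = to_fixed_utf8(word, FIELD_SIZE)
    print(
        f"{word}: {len(word)} chars, "
        f"{len(word.encode('utf-8'))} UTF-8 bytes, "
        f"{len(word.encode('utf-32-le'))} UTF-32 bytes, "
        f"stored in {len(field)} bytes"
    )
```

Every field is 8 bytes long regardless of content, while the number of characters that fit depends on how many bytes each character takes in UTF-8. In UTF-32 every character takes 4 bytes, which is the "fixed width" sense the request is not about.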

Implemented in #278