Add specific data type
EventDescriptors specify the data type of the data in Events and StreamResources in the key dtype, using jsonschema data types:
event-model/event_model/schemas/event_descriptor.json, lines 47 to 53 in 1451187
This is the tragedy of "You must define the core of your software at the beginning, when you understand the problem the least." At the time (2015) we were focused on MongoDB (BSON) and Python applications, where data types can be coarsely defined. We now view this as a mistake. The array option in particular does not make sense: we have shape for that. We should have given specific types.
How should we add them now?
Decision: New key or expand dtype enum?
If we expand the dtype enum to optionally specify a specific data type instead of the jsonschema types, this could break downstream consumers (some in code that we do not know about) that have been able to expect the jsonschema types for the last ~8 years. It seems safest to add a new key sitting beside dtype. In "Bluesky 2.0" this could be cleaned up / consolidated and documented as a backward-incompatible change.
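A minimal sketch of why a sibling key is the safer route. Everything here is illustrative: the data key contents are made up, the set of legacy values is assumed from the jsonschema type names, and the new key is spelled dtype_numpy only because that is one of the candidate names discussed in this issue.

```python
# Hypothetical data key entry from an EventDescriptor.
data_key = {
    "source": "PV:EXAMPLE:DET1",  # made-up source string
    "dtype": "array",             # legacy jsonschema-style type, left unchanged
    "shape": [512, 512],
    "dtype_numpy": "<u2",         # new, specific type sitting beside dtype
}

# Assumed legacy values (the jsonschema type names).
LEGACY_DTYPES = {"string", "number", "array", "boolean", "integer"}


def legacy_consumer(key):
    """Mimic an existing consumer that only understands the coarse types."""
    if key["dtype"] not in LEGACY_DTYPES:
        raise ValueError(f"unexpected dtype: {key['dtype']!r}")
    return key["dtype"]


def new_consumer(key):
    """Prefer the specific type when present, else fall back to dtype."""
    return key.get("dtype_numpy", key["dtype"])
```

The legacy consumer never looks at the new key, so it keeps working. Had "<u2" been written into dtype itself, the same consumer would have raised.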
Decision: How to spell the data type?
Three ideas have been proposed:
- Use the NumPy array protocol type string (typestr) format, e.g. "<f8", ">i4", "|b1". There is precedent for using this as a way of encoding NumPy data types in JSON: the Zarr v2 spec does so.
- Use the newer Zarr v3 specification, which opts for a more constrained set of supported types with more human-readable names, e.g. float64, int8, bool. Types are little-endian. Big-endianness is handled as a property of the encoding (a codec).
- Use Arrow, which supports a superset of these types. However, Arrow has no officially supported JSON encoding. Its schema is binary; it would have to be base64-encoded or similar, which is not human-readable. For that reason, I think it is easy to reject this option.
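To make the first two spellings concrete, here is a rough sketch that splits a typestr into its parts and derives a Zarr v3-style name from it. The function name describe_typestr is hypothetical, and only the boolean, integer, and float kinds are handled; it is not a full implementation of either format.

```python
import re

# typestr layout: byte-order character, kind character, size in bytes.
_TYPESTR = re.compile(r"([<>|])([biuf])(\d+)")

# Assumed subset of kinds, mapped to human-readable name prefixes.
_KIND_NAMES = {"i": "int", "u": "uint", "f": "float"}
_BYTE_ORDERS = {"<": "little", ">": "big", "|": "not applicable"}


def describe_typestr(typestr):
    """Return (readable_name, byte_order) for a simple NumPy typestr."""
    match = _TYPESTR.fullmatch(typestr)
    if match is None:
        raise ValueError(f"unsupported typestr: {typestr!r}")
    byte_order, kind, size = match.group(1), match.group(2), int(match.group(3))
    if kind == "b":
        name = "bool"
    else:
        # The size is in bytes; the readable names count bits.
        name = f"{_KIND_NAMES[kind]}{size * 8}"
    return name, _BYTE_ORDERS[byte_order]
```

For example, describe_typestr(">i4") returns ("int32", "big"). The typestr carries the byte order inline, whereas the Zarr v3 approach would keep only the name and move the byte order into a codec.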
As of May 9, Zarr v3 is still just a specification, with a Python implementation still in progress, so it feels a bit early to hitch our wagon to that standard. The NumPy strings, while not exactly a "specification", have been around a long time and are unlikely to change. My (loosely held) view is that we should use NumPy strings but leave open the possibility of adopting something different, hopefully something formally specified, in the future.
Decision: What to call the new key?
Ideas proposed:
- dtype_str
- dtype_numpy
- dtype_zarr2
- datatype
I think having both dtype (jsonschema legacy) and datatype (new thing) together would be confusing. (I would be in favor of consolidating on something like datatype in Bluesky 2.0.)
One advantage of something specific like dtype_numpy is that it would let us add Zarr v3 or something else in the future unambiguously.
Status Quo
On the floor at NSLS-II, we have been using the key dtype_str and the NumPy typestr spellings. This solves the practical problem that Tiled needs to know the real data types in order to inform clients so they can pre-allocate NumPy or Dask arrays to download chunks of data into.
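As a rough illustration of the pre-allocation point, the sketch below sizes a raw buffer from a typestr and a shape using only the standard library. The helper name preallocate is hypothetical and this is not Tiled's code; a real client would more likely create a NumPy or Dask array with that dtype and shape.

```python
import math


def preallocate(typestr, shape):
    """Allocate a zeroed buffer big enough for an array of this type and shape."""
    # Everything after the byte-order and kind characters is the item size
    # in bytes, e.g. "<f8" -> 8. Assumes a simple typestr of that form.
    itemsize = int(typestr[2:])
    return bytearray(itemsize * math.prod(shape))


# A 10 x 20 array of 8-byte floats needs 1600 bytes.
buffer = preallocate("<f8", [10, 20])
```

With only the coarse dtype (number or array) there is no item size to read, so the client cannot know how much memory to reserve before the data arrives.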
But dtype_str was never added to event-model or formally decided. The goal of this issue is to make a decision and add something to the event-model schema.
Suggest dtype_numpy, with string as the JSON schema type. This means that if you put garbage in, the JSON schema will say everything is fine; something downstream might then fall over with no early warning, but that is probably OK.
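The trade-off in that suggestion can be sketched as follows. Both function names are hypothetical, schema_accepts is only a stand-in for a schema check of type string (not a call into a real validator), and the pattern is a rough approximation of simple typestr values.

```python
import re

# Rough shape of a simple typestr: byte order, kind letter, size in bytes.
_TYPESTR = re.compile(r"[<>|][a-zA-Z]\d+")


def schema_accepts(value):
    """Stand-in for a JSON schema check of {"type": "string"}."""
    return isinstance(value, str)


def looks_like_typestr(value):
    """What a downstream consumer might effectively require."""
    return _TYPESTR.fullmatch(value) is not None


for candidate in ["<f8", "garbage"]:
    print(candidate, schema_accepts(candidate), looks_like_typestr(candidate))
```

Both strings pass the schema-level check, but only "<f8" would survive a consumer that tries to interpret the value, which is the "no early warning" case described above.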