Should the standard describe document-level schemas?
eslavich opened this issue · 8 comments
The current ASDF Standard has a lot to say about schemas for individual tagged objects, but so far we don't offer any guidance on schemas that describe the ASDF file as a whole. For the sake of discussion I'm going to refer to these top-level schemas as "document" schemas.
When reading and writing this file:
#ASDF 1.0.0
#ASDF_STANDARD 1.5.0
%YAML 1.1
%TAG ! tag:stsci.edu:asdf/
--- !core/asdf-1.1.0
# metadata omitted for clarity
some_array: !core/ndarray-1.0.0
  source: 0
  datatype: int64
  byteorder: little
  shape: [3]
...
The libraries can confirm that some_array is a correctly structured ndarray-1.0.0, but how do we validate that some_array is present and isn't set to some other tagged object?
The ASDF Python library has a feature that enables a second validation pass across the whole ASDF file using a document schema (this is the custom_schema argument to asdf.open), but that feature seems to have been a bit of an afterthought, and the fact that a custom schema was used isn't recorded anywhere in the file. The custom schema also has to be permissive enough to allow the ASDF metadata objects (or include refs to them), which limits its utility.
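To make the gap concrete, here is a minimal sketch of the kind of document-level check that per-tag validation does not cover: is a key present, and does it carry the expected tag? The TaggedNode class and check_document function are hypothetical stand-ins written for this illustration; they are not part of the asdf library.

```python
from dataclasses import dataclass, field


@dataclass
class TaggedNode:
    """Hypothetical stand-in for a deserialized tagged YAML node."""
    tag: str
    data: dict = field(default_factory=dict)


def check_document(tree, expected):
    """Return a list of problems: keys that are missing or carry the wrong tag."""
    problems = []
    for key, tag in expected.items():
        node = tree.get(key)
        if node is None:
            problems.append(f"missing required key: {key}")
        elif getattr(node, "tag", None) != tag:
            found = getattr(node, "tag", None)
            problems.append(f"{key}: expected tag {tag}, got {found}")
    return problems


NDARRAY_TAG = "tag:stsci.edu:asdf/core/ndarray-1.0.0"
EXPECTED = {"some_array": NDARRAY_TAG}

# Each of these trees could pass per-tag validation, but only the first
# matches what the document as a whole is supposed to contain.
good = {"some_array": TaggedNode(NDARRAY_TAG, {"shape": [3]})}
missing = {}
wrong = {"some_array": TaggedNode("tag:stsci.edu:asdf/core/complex-1.0.0")}

for name, tree in [("good", good), ("missing", missing), ("wrong", wrong)]:
    print(name, check_document(tree, EXPECTED))
```

A real document schema would express the same constraints declaratively; the point here is only that they live at the level of the whole tree rather than any single tagged object.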
I wonder if we ought to nest the user data one level deeper in the YAML:
#ASDF 1.0.0
#ASDF_STANDARD 1.5.0
%YAML 1.1
%TAG ! tag:stsci.edu:asdf/
--- !core/asdf-1.1.0
# metadata omitted for clarity
tree: !<http://stsci.edu/schemas/jwst_datamodel/ramp.schema>
  some_array: !core/ndarray-1.0.0
    source: 0
    datatype: int64
    byteorder: little
    shape: [3]
...
Then validate that tree against any document schema. Maybe by convention the Python library should always deserialize the tree node into a simple dict.
In general I think document schemas are a good idea, since otherwise you have no idea what you're getting when you open up a given ASDF file. Users of languages like Java will want to define a class ahead of time that matches the structure of the document, and it would be ideal to be able to follow a schema to do that.
@jdavies-st @perrygreenfield @embray particularly interested to hear your thoughts about this, if you can spare the time...
Yeah, we use document-level schemas in jwst.datamodels for exactly this purpose. Of course ours are actually schema fragments, as they don't describe any particular ASDF object or tag. But yeah, they are very useful, and the current way to use them in the Python asdf library is very clunky.
How about a metadata item in the file that provides a tag for the tree through a different syntactic mechanism (e.g., special comment string or some such)?
> How about a metadata item in the file that provides a tag for the tree through a different syntactic mechanism (e.g., special comment string or some such)?
I think it's a good idea to use a different mechanism, since it would be confusing if that one tagged node behaved differently from the rest. I'd prefer to use another metadata field in the YAML itself rather than a comment string.
What do you think about the idea to nest the user data into a new node? I think that would be helpful for two reasons. One is that a document schema could specify additionalProperties: false and not have to include the metadata fields. The other benefit is that users wouldn't have to reckon with the metadata in their tree -- they wouldn't see them in the tree dict in Python, and wouldn't have to avoid property names reserved by asdf.
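A rough sketch of the first benefit, using a hand-rolled validator that understands only two JSON Schema keywords (required and additionalProperties: false). The validate_object function and the example documents are assumptions made for illustration; asdf_library and history stand in for the metadata keys that ASDF writes at the top level.

```python
def validate_object(instance, schema):
    """Check a dict against a tiny subset of JSON Schema.

    Only 'required' and 'additionalProperties: false' are handled.
    Returns a list of error strings (empty when the instance passes).
    """
    errors = []
    allowed = set(schema.get("properties", {}))
    for key in schema.get("required", []):
        if key not in instance:
            errors.append(f"missing required property: {key}")
    if schema.get("additionalProperties", True) is False:
        for key in sorted(set(instance) - allowed):
            errors.append(f"unexpected property: {key}")
    return errors


# A strict document schema that knows nothing about ASDF metadata.
DOCUMENT_SCHEMA = {
    "type": "object",
    "properties": {"some_array": {}},
    "required": ["some_array"],
    "additionalProperties": False,
}

# Current layout: user data sits next to the metadata keys.
flat = {"asdf_library": {}, "history": {}, "some_array": [1, 2, 3]}

# Proposed layout: user data is nested one level deeper.
nested = {"asdf_library": {}, "history": {}, "tree": {"some_array": [1, 2, 3]}}

print(validate_object(flat, DOCUMENT_SCHEMA))            # metadata keys are rejected
print(validate_object(nested["tree"], DOCUMENT_SCHEMA))  # only user data is checked
```

With the flat layout the strict schema rejects the metadata keys unless it is loosened to allow them; with the nested layout the same schema applies cleanly to the tree node alone.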
Maybe we even push all the metadata into a single node too:
#ASDF 2.0.0
%YAML 1.1
%TAG ! tag:stsci.edu:asdf/
--- !core/asdf-1.1.0
metadata:
  # ...
  schema: http://stsci.edu/schemas/jwst_datamodel/ramp.schema
tree:
  # ...
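For illustration, a reader could handle that layout roughly as follows: look up the URI stored under metadata.schema and validate only the tree node against it. This is a sketch under stated assumptions, not a proposal for the library's API; the SCHEMAS registry, the validate_document function, and the contents of the ramp schema are all made up for the example.

```python
SCHEMA_URI = "http://stsci.edu/schemas/jwst_datamodel/ramp.schema"

# Hypothetical in-memory registry. A real reader would resolve the URI to a
# schema file; the schema body here is invented for the example.
SCHEMAS = {SCHEMA_URI: {"required": ["some_array"]}}


def validate_document(document, schemas):
    """Validate document['tree'] against the schema named in document['metadata'].

    Returns a list of error strings. Only the 'required' keyword is checked.
    """
    metadata = document.get("metadata", {})
    tree = document.get("tree", {})
    uri = metadata.get("schema")
    if uri is None:
        return []  # no document schema declared, nothing to check
    if uri not in schemas:
        raise LookupError(f"unknown document schema: {uri}")
    required = schemas[uri].get("required", [])
    return [f"missing required property: {key}" for key in required if key not in tree]


document = {
    "metadata": {"schema": SCHEMA_URI},
    "tree": {"some_array": [1, 2, 3]},
}
print(validate_document(document, SCHEMAS))
```

Because the schema URI is stored in the file itself, a reader can discover which document schema applies without being told out of band.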
That seems reasonable. Let's see what others think.
Yeah, I wouldn't use a tag for this, as in your first example:
tree: !<http://stsci.edu/schemas/jwst_datamodel/ramp.schema>
but something more like your second example:
metadata:
  # ...
  schema: http://stsci.edu/schemas/jwst_datamodel/ramp.schema
There has been some discussion (e.g. here) about the role of the $schema property in schemas. In a schema, IIRC, it basically designates what meta-schema the schema conforms to.
But there's also no reason the $schema property couldn't be used in a data document. In this case there is no standard for how this is meant to be interpreted (though I think elsewhere there has been discussion about having a standard for this, but I can't recall where I saw that). Point being, we are in principle free to write into the standard that $schema in a data document indicates a JSON Schema the document should be validated against (I would use the spelling $schema instead of just schema, since there's precedent for using a dollar sign to indicate properties that have a special meaning w.r.t. how the document is structured).
It is also pointed out in the issue I linked to that there's prior art for this. Microsoft Intellisense's JSON editor uses $schema in data documents in the same way.
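As a small sketch of why the dollar-sign spelling helps in a flat data document: a reader can pull out the reserved $schema key without touching a user property that happens to be named schema. The split_schema_reference function and the example document are hypothetical and only illustrate the naming convention.

```python
def split_schema_reference(document):
    """Separate a top-level "$schema" key from the rest of a data document.

    Returns (schema_uri_or_None, remaining_data). The input is not modified.
    """
    data = dict(document)  # shallow copy so the caller's dict is untouched
    uri = data.pop("$schema", None)
    return uri, data


document = {
    "$schema": "http://stsci.edu/schemas/jwst_datamodel/ramp.schema",
    "schema": "a user property that merely shares the name",
    "some_array": [1, 2, 3],
}

uri, data = split_schema_reference(document)
print(uri)
print(sorted(data))
```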
Thanks @embray that's tremendously helpful.
Depending on how we end up structuring our YAML we may not need the dollar sign -- if the user's "document" is stored in a nested node then there would be no chance of a name collision with schema. It may in fact be misleading to use $schema since the schema wouldn't actually apply to the whole document, only the user data node.
@Cadair I noticed a comment from you on an unrelated PR:
> I am using a top level schema to ensure the user is loading a DKIST asdf as expected.
Is the feature we're describing here something that you would be able to use?
Listing and supporting document schemas in the file (regardless of implementation) would be an excellent addition to the standard in my opinion.