Should the standard describe document-level schemas?

Question

eslavich opened this issue 4 years ago · 8 comments

The current ASDF Standard has a lot to say about schemas for individual tagged objects, but so far we don't offer any guidance on schemas that describe the ASDF file as a whole. For the sake of discussion I'm going to refer to these top-level schemas as "document" schemas.

When reading and writing this file:

#ASDF 1.0.0
#ASDF_STANDARD 1.5.0
%YAML 1.1
%TAG ! tag:stsci.edu:asdf/
--- !core/asdf-1.1.0
# metadata omitted for clarity
some_array: !core/ndarray-1.0.0
  source: 0
  datatype: int64
  byteorder: little
  shape: [3]
...

The libraries can confirm that some_array is a correctly structured ndarray-1.0.0, but how do we validate that some_array is present and isn't set to some other tagged object?
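
To make the gap concrete, here is a minimal sketch of the kind of document-level check I have in mind. It is illustration only: `TaggedNode` and `check_document` are hypothetical names, not part of the asdf library, and a real implementation would work from a schema rather than a hard-coded mapping.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class TaggedNode:
    """Hypothetical stand-in for a deserialized YAML node that remembers its tag."""
    tag: str
    value: Any


def check_document(tree: Dict[str, Any], expected_tags: Dict[str, str]) -> List[str]:
    """Return a list of problems; an empty list means the check passed."""
    problems = []
    for key, expected in expected_tags.items():
        if key not in tree:
            # Per-tag schemas cannot catch this: nothing is there to validate.
            problems.append(f"missing required property: {key}")
            continue
        actual = getattr(tree[key], "tag", None)
        if actual != expected:
            # The node may be valid for its own tag, just not the one we wanted.
            problems.append(f"{key}: expected tag {expected}, found {actual}")
    return problems


NDARRAY_TAG = "tag:stsci.edu:asdf/core/ndarray-1.0.0"

tree = {
    "some_array": TaggedNode(
        NDARRAY_TAG,
        {"source": 0, "datatype": "int64", "byteorder": "little", "shape": [3]},
    )
}

print(check_document(tree, {"some_array": NDARRAY_TAG}))
print(check_document({}, {"some_array": NDARRAY_TAG}))
```

The first call reports no problems; the second reports the missing property, which is exactly the case that tag-level validation alone cannot see.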

The ASDF Python library can run a second validation pass across the whole ASDF file using a document schema (this is the custom_schema argument to asdf.open). That feature seems to have been a bit of an afterthought, though, and the fact that a custom schema was used isn't recorded anywhere in the file. The custom schema also has to be permissive enough to allow the ASDF metadata objects (or include refs to them), which limits its utility.

I wonder if we ought to nest the user data one level deeper in the YAML:

#ASDF 1.0.0
#ASDF_STANDARD 1.5.0
%YAML 1.1
%TAG ! tag:stsci.edu:asdf/
--- !core/asdf-1.1.0
# metadata omitted for clarity
tree: !<http://stsci.edu/schemas/jwst_datamodel/ramp.schema>
  some_array: !core/ndarray-1.0.0
    source: 0
    datatype: int64
    byteorder: little
    shape: [3]
...

We could then validate that tree node against any document schema. Maybe by convention the Python library should always deserialize the tree node into a simple dict.

In general I think document schemas are a good idea, since otherwise you have no idea what you're getting when you open up a given ASDF file. Users of languages like Java will want to define a class ahead of time that matches the structure of the document, and it would be ideal to be able to follow a schema to do that.

@jdavies-st @perrygreenfield @embray particularly interested to hear your thoughts about this, if you can spare the time...

Answer 1 · 2020-06-23T18:51:37.000Z

Yeah, we use document-level schemas in jwst.datamodels for exactly this purpose. Of course ours are actually schema fragments, as they don't describe any particular ASDF object or tag. But yeah, they are very useful, and the current way to use them in the Python asdf library is very clunky.

Answer 2 · 2020-06-23T19:30:33.000Z

How about a metadata item in the file that provides a tag for the tree through a different syntactic mechanism (e.g., special comment string or some such)?

Answer 3 · 2020-06-23T20:38:59.000Z

How about a metadata item in the file that provides a tag for the tree through a different syntactic mechanism (e.g., special comment string or some such)?

I think it's a good idea to use a different mechanism, since it would be confusing if that one tagged node behaved differently from the rest. I'd prefer to use another metadata field in the YAML itself rather than a comment string.

What do you think about the idea of nesting the user data in a new node? I think that would be helpful for two reasons. One is that a document schema could specify additionalProperties: false and not have to include the metadata fields. The other is that users wouldn't have to reckon with the metadata in their tree -- they wouldn't see the metadata fields in the tree dict in Python, and wouldn't have to avoid property names reserved by asdf.
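
A rough sketch of the first point, assuming a toy validator that understands only `required` and `additionalProperties: false` (it is not a full JSON Schema implementation, and `validate_object` is a hypothetical helper, not an asdf API). The metadata keys `asdf_library` and `history` stand in for whatever the library writes at the top level.

```python
from typing import Any, Dict, List


def validate_object(instance: Dict[str, Any], schema: Dict[str, Any]) -> List[str]:
    """Check `required` and `additionalProperties: false` only (toy validator)."""
    errors = []
    for name in schema.get("required", []):
        if name not in instance:
            errors.append(f"missing required property: {name}")
    if schema.get("additionalProperties", True) is False:
        allowed = set(schema.get("properties", {}))
        for name in sorted(instance):
            if name not in allowed:
                errors.append(f"unexpected property: {name}")
    return errors


# A strict document schema that knows nothing about ASDF metadata.
strict_schema = {
    "type": "object",
    "properties": {"some_array": {}},
    "required": ["some_array"],
    "additionalProperties": False,
}

# Current layout: user data shares the top level with library metadata.
flat_document = {"asdf_library": {}, "history": {}, "some_array": [1, 2, 3]}

# Proposed layout: user data lives under its own `tree` node.
nested_document = {"asdf_library": {}, "history": {}, "tree": {"some_array": [1, 2, 3]}}

flat_errors = validate_object(flat_document, strict_schema)
nested_errors = validate_object(nested_document["tree"], strict_schema)

print(flat_errors)
print(nested_errors)
```

With the flat layout the strict schema rejects the metadata keys, so the schema author has to list them or give up on `additionalProperties: false`. With the nested layout the same schema passes unchanged.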

Maybe we even push all the metadata into a single node too:

#ASDF 2.0.0
%YAML 1.1
%TAG ! tag:stsci.edu:asdf/
--- !core/asdf-1.1.0
metadata:
  # ...
  schema: http://stsci.edu/schemas/jwst_datamodel/ramp.schema
tree:
  # ...

Answer 4 · 2020-06-23T20:45:07.000Z

That seems reasonable. Let's see what others think.

Answer 5 · 2020-06-24T10:01:44.000Z

Yeah, I wouldn't use a tag for this, as in your first example:

tree: !<http://stsci.edu/schemas/jwst_datamodel/ramp.schema>

but something more like your second example:

metadata:
  # ...
  schema: http://stsci.edu/schemas/jwst_datamodel/ramp.schema

There has been some discussion (e.g. here) about the role of the $schema property in schemas. In a schema, IIRC, it basically designates what meta-schema the schema conforms to.

But there's also no reason the $schema property couldn't be used in a data document. In this case there is no standard for how it is meant to be interpreted (though I think there has been discussion elsewhere about having a standard for this, I can't recall where I saw that). Point being, we are in principle free to write into the standard that $schema in a data document indicates a JSON Schema the document should be validated against. I would use the spelling $schema instead of just schema, since there's precedent for using a dollar sign to indicate properties that have a special meaning w.r.t. how the document is structured.

It is also pointed out in the issue I linked to that there's prior art for this. Microsoft Intellisense's JSON editor uses $schema in data documents in the same way.
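
As a sketch of what "validate against the schema the document declares" could look like, assuming a reader that maps schema URIs to locally known validators (no network access). Everything here is hypothetical: `SCHEMA_REGISTRY`, `validate_by_declared_schema`, and the toy `require_some_array` check are made up for illustration, and only the URI comes from the examples above.

```python
from typing import Any, Callable, Dict, List

Validator = Callable[[Dict[str, Any]], List[str]]

# URI taken from the examples in this thread; it is not resolved over the network.
RAMP_URI = "http://stsci.edu/schemas/jwst_datamodel/ramp.schema"


def require_some_array(document: Dict[str, Any]) -> List[str]:
    """Toy stand-in for a real schema: only insists that some_array exists."""
    return [] if "some_array" in document else ["missing required property: some_array"]


# Hypothetical registry of document schemas the reader already knows about.
SCHEMA_REGISTRY: Dict[str, Validator] = {RAMP_URI: require_some_array}


def validate_by_declared_schema(
    document: Dict[str, Any],
    registry: Dict[str, Validator],
    key: str = "$schema",
) -> List[str]:
    """Validate a data document against the schema it names under `key`."""
    uri = document.get(key)
    if uri is None:
        return []  # No document schema declared, so nothing to check.
    validator = registry.get(uri)
    if validator is None:
        return [f"unknown document schema: {uri}"]
    return validator(document)


good = {"$schema": RAMP_URI, "some_array": [1, 2, 3]}
bad = {"$schema": RAMP_URI}

print(validate_by_declared_schema(good, SCHEMA_REGISTRY))
print(validate_by_declared_schema(bad, SCHEMA_REGISTRY))
```

The `key` parameter is there because the spelling (`$schema` versus `schema`) is still an open question; the lookup itself is the same either way.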

Answer 6 · 2020-06-24T13:29:29.000Z

Thanks @embray that's tremendously helpful.

Depending on how we end up structuring our YAML we may not need the dollar sign -- if the user's "document" is stored in a nested node then there would be no chance of a name collision with schema. It may in fact be misleading to use $schema since the schema wouldn't actually apply to the whole document, only the user data node.

Answer 7 · 2020-06-24T13:41:13.000Z

@Cadair I noticed a comment from you on an unrelated PR:

I am using a top level schema to ensure the user is loading a DKIST asdf as expected.

Is the feature we're describing here something that you would be able to use?

Answer 8 · 2020-06-30T10:52:05.000Z

Listing and supporting document schemas in the file (regardless of implementation) would be an excellent addition to the standard in my opinion.