casangi/xradio

Extend schema to check explicit literal values in quantities and measures

Closed this issue · 6 comments

Add literals to check for the exact value of units, types, frames, and other fields of quantities and measures, wherever there is a mandatory value or a list of allowed values. The current schema checks the presence and types of these fields but doesn't enforce for example units of 'Hz for frequencies or spectral_coords.

  • Quantities: specify units, as literals. Rather than generic "str" type have data variable/attribute specific units such as ["Hz"], ["m"], ["K"], etc. Also specify type as literal "quantity"

  • Measures: add also lists of allowed values, as literals, in measures fields such as spectral_coord/frame, time/scale, etc.

  • Also, the type attribute of schema classes that are not measures ("weather", "phase_calibration', "visiblity", "pointing", etc.)

  • Split the main xds class in the schema and rename it from CorrectedDataXds => VisibilityXds, SpectrumXds

My proposal would be to allow a _validate function in schema definitions that returns SchemaIssues. That can then be used to check arbitrary properties.

Though I am not entirely sure we would want to lock down units down specifically... There might be use-cases for units that we have not anticipated? Not sure.

There are units that I think we'll have to leave open to any str, like the units of VISIBILITY (and WEIGHT, by propagation), unless there is some better idea of a bounded list of acceptable options. But there are units that in our spreadsheet (and the MSv2) are fixed to a very concrete value (to SI values in general I think). For example times have units = ["s"], frequencies have units = ["Hz"], weather_xds/TEMPERATURE has units = ["K"], etc. If we want this to be checked and, importantly, documented in the sphinx docs, these have to be added to the schema. The way of doing this that I've been checking is along the lines of replacing the str types with Literal, lists of Literal (coordinates), Unions of list of Literal (when we have alternative lists of units), etc. Then for things like the fields of measures that have a list of accepted values (spectral_coord/frame) the idea is to replace the str type with a Literal["gcrs", "icrs", ...]. This seems to be working so far (just needed some code added to schema_table.py to print the "ors" and lists of quoted values for such combinations of types). That is the idea of the work being done in this branch. @scpmw , @Jan-Willem please let me know if the plans are different.

@FedeMPouzols that sounds good to me. The observatories can choose the VISIBILITY and WEIGHT units. We might want to check that WEIGHT units are the inverse squared of the VISIBILITY units.

Right, had not realised that typeguard supports this for typing.Literal out of the box. Sounds good!

Just please don't skip setting the default as well so that the constructor works correctly.

This branch now has all the units and measures fields such as frame/coordianate_system that have either a fixed value or a limited set of allowed values. With this the sphinx doc should show all the information on units and measures that we have in the spreadsheet.
It also renames/splits the "main" schema class to VisibilityXds and SpectrumXds.

For frames, etc. fields of measures that have a pre-defined list of allowed values: added the list that we have in the spreadsheet. In some cases there are additions/changes (the motivation sometimes is to have tests and checks passing without changes in the code that I wouldn't be sure about. In other cases like UNIX=>unix I just did the changes needed for the conversion code to match the spreadsheet). There are some related comments in the spreadsheet.

About measures and allowed values:

  • location: to the frames listed in the spreadsheet I added "NA", which we use in cases like field_and_source_xds/SUB_OBSERVER_DIRECTION,SUB_SOLAR_POSITION
  • time: the spreadsheet does not have the full list from astropy but has a few ones other than unix/utc. I added the ones from the spreadsheet. But I'm wondering if for example for time measures we should just have scale="utc" and format="unix"? (note that in casatesdata we seem to have some instances of "tai" so we might have to live with the longer list or at least that one.)
  • spectral_coord/frame: the ones that the code is generating are the casacore frames ("REST", "LSRK"..."TOPO"... "CMB", "Undefined") . The spreadsheet lists the astropy ones + "topo". For now I've added the two lists, but I presume this needs to be pruned. I'm not sure avoid simply translating the names, as the details and calculations tend to differ. (for example I do not see any clearly direct translation of ("GALACTO": Galacto centric (with rotation of 220 km/s in direction l,b = [90,0] deg.) to astropy frames. Can we for example translate casacore's "GEO" to astropy's "gcrs"?
  • sky_coord/frame: I added all the astropy ones (some of them really really esoteric) - https://docs.astropy.org/en/stable/coordinates/index.html#built-in-frame-classes. From the conversion code, what we do about this is to translate the MSv2/casacore "J2000" to "FK5". But we would need to decide what to do with things like "AZELGEO" - would it be safe to translate to AltAz (casacore has all the AZELGEO, AZELSWGEO, and AZELNEGEO? For now the list of allowed values has "AZELGEO", as it is used virtually always in the pointing table. We will also have issues with VLA datasets that have " B1950_VLA" frame in the field subtable.
  • doppler/type: I think the only place where this is used (main_xds.frequency.attrs['doppler']) is optional. And we don't do anything about this in the converter. I added the frames from the spreadsheet. The only small question here is that these are the ones from casacore but in lowercase (was that intentional? I'd guess to better match the astropy conventions), but I think we are not translating the values from MSv2s.

I forgot to mention one point I'm not totally happy about: the repetition of QuantityArray classes for different units. It works well for documentation purposes but it looks very repetitive in the schema file. There might be better ways of defining the unit-specific quantities. I experimented a bit with generic Quantity classes (using Generic, as it seems we have to wait until the times of Python >=3.12 to use the [T] syntax) but that was producing errors in dataclass.