A cf-xarray compliance checker?
kthyng opened this issue · 9 comments
Would something like this be in scope for cf-xarray? It would need to be fairly loosely defined, but maybe a minimum would be that a Dataset has its axes and coordinates all defined? Variables would need standard_names? Though some variables, like maybe "angle" on a ROMS grid, don't usually have standard names.
A number of these exist:
so I don't think we should reinvent it. It would be nice if we could run the checker on a Dataset using ds.cf.check(checker="ioos"), for example.
cc @ocefpaf
For another project I was looking at CF checkers last week, and it looks like all the options are mostly command-line tools meant to check NetCDF files.
It would be great if cf-xarray allowed checking any format supported by xarray, as well as datasets that have not been written to disk. I also think it would be great to use other checkers in the backend, but it looks like changes are needed in compliance-checker and cf-checker before that can happen (i.e., the checkers only accept paths right now; they would have to accept xarray datasets as well).
It'd be nice to build an API connection, but worst case we can write a tiny dataset with all attributes to /tmp/check.nc, run the checker on that, and print the output to screen.
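The worst-case workaround described here could look roughly like the sketch below. It assumes only that the dataset object has a to_netcdf(path) method (as xarray.Dataset does) and that some command-line checker accepting a file path is installed. The helper name check_via_tempfile and the default command are illustrative assumptions, not part of cf-xarray.

```python
import os
import subprocess
import tempfile


def check_via_tempfile(ds, command=("compliance-checker", "--test=cf")):
    """Write `ds` to a temporary NetCDF file and run a path-based checker on it.

    `ds` is anything with a `to_netcdf(path)` method, e.g. an xarray.Dataset.
    `command` is an assumption: replace it with whichever checker CLI is installed.
    Returns the checker's exit code and its combined stdout/stderr as text.
    """
    with tempfile.TemporaryDirectory() as tmpdir:
        path = os.path.join(tmpdir, "check.nc")
        ds.to_netcdf(path)
        result = subprocess.run(
            [*command, path],
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # merge stderr into the captured output
            text=True,
            check=False,  # a failed check is a result, not an exception
        )
    return result.returncode, result.stdout
```

Since only the metadata matters for most checks, the dataset could probably be shrunk first, for example with ds.isel({dim: slice(0, 1) for dim in ds.dims}), so the temporary file stays tiny.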
I have mixed feelings. While I don't want to overload cf-xarray with functionality that exists elsewhere, this could be a nice idea because:
- what @malmans2 said above
- compliance-checker is super verbose and sometimes you don't want a full CF check, just a bare bones "what is missing so I can plot this automatically, or load this data into analysis X." In a way, iris used to be like that but has become more and more restrictive with time.
I guess that, instead of becoming a compliance checker, cf-xarray could have a "verbose mode" where all the compliance issues are printed when loading a dataset.
> "what is missing so I can plot this automatically, or load this data into analysis X."
This is hard to define!
> This is hard to define!
Indeed! That is why cc is super verbose, kind of all or nothing. However, @kthyng's suggestion above looks like a nice start:
- axes and coordinates
- valid standard_names
- enough variables defined to compute, say, z
More than that and we would get into the weeds of CF, but those 3 checks cover almost everything needed for plotting with labels.
I wrote some tests for a package: https://github.com/NOAA-ORR-ERD/model_catalogs/blob/main/model_catalogs/tests/test_catalogs.py#L326-L369
When the models are read in with the package, they should be usable by cf-xarray in a basic way. I am finding I need this functionality again, which is when I thought it could be useful in cf-xarray itself. It could warn a user if no axes or coordinates are known for a Dataset/DataArray, and report which data_vars do not have standard_names. I also like the connection @ocefpaf made to being able to calculate z.
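A minimal sketch of the kind of warning described here is shown below. To stay self-contained it works on plain dictionaries of attributes rather than on xarray objects; with xarray these could presumably be built as {k: dict(v.attrs) for k, v in ds.coords.items()} and likewise for ds.data_vars. The function names are hypothetical, and the sketch only looks at the CF axis attribute, which is a simplification of how cf-xarray identifies axes.

```python
import warnings

# The four axis identifiers allowed by the CF `axis` attribute.
AXES = ("X", "Y", "Z", "T")


def basic_cf_report(coord_attrs, data_var_attrs):
    """Report missing axes and data variables that lack a standard_name.

    Both arguments map variable names to attribute dictionaries.
    """
    found_axes = {attrs.get("axis") for attrs in coord_attrs.values()}
    return {
        "missing_axes": [axis for axis in AXES if axis not in found_axes],
        "no_standard_name": sorted(
            name
            for name, attrs in data_var_attrs.items()
            if "standard_name" not in attrs
        ),
    }


def warn_about(report):
    """Turn a report from basic_cf_report into user-facing warnings."""
    for axis in report["missing_axes"]:
        warnings.warn(f"no coordinate is marked as the {axis} axis")
    for name in report["no_standard_name"]:
        warnings.warn(f"data variable {name!r} has no standard_name")


# Illustrative attributes only, not taken from a real model output.
coords = {"x": {"axis": "X"}, "y": {"axis": "Y"}, "time": {"axis": "T"}}
data_vars = {
    "temp": {"standard_name": "sea_water_temperature"},
    "angle": {"long_name": "grid angle"},
}
report = basic_cf_report(coords, data_vars)
```

Calling warn_about(report) on this example would warn about the missing Z axis and about "angle" having no standard_name.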
NASA-specific compliance checker: https://github.com/eugenegesdisc/diwg-data-compliance-test
> > This is hard to define!
> Indeed! That is why cc is super verbose, kind of all or nothing. However, @kthyng's suggestion above looks like a nice start:
> - axes and coordinates
> - valid standard_names
I'd suggest allowing long_names as an option, for those variables that aren't in the standard name table yet. You can add a warning pointing to the forum for adding standard names if you want to discourage long_name without standard_name.
> - enough variables defined to compute, say, z
Everything mentioned in formula_terms or similar, at a guess? Or do you want enough information to convert from the model vertical coordinate to a geometric vertical coordinate?
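One possible reading of "everything mentioned in formula_terms" is sketched below: parse the CF formula_terms attribute, which is a string of "term: variable" pairs, and report any referenced variable that is not in the dataset. The function names are hypothetical, and the variable names in the example are merely ROMS-style placeholders.

```python
def parse_formula_terms(formula_terms):
    """Parse a CF formula_terms string such as 's: s_rho C: Cs_r' into a dict."""
    tokens = formula_terms.split()
    if len(tokens) % 2:
        raise ValueError(f"unpaired token in formula_terms: {formula_terms!r}")
    terms = {}
    for key, variable in zip(tokens[::2], tokens[1::2]):
        if not key.endswith(":"):
            raise ValueError(f"expected 'term:' but got {key!r}")
        terms[key[:-1]] = variable
    return terms


def missing_formula_variables(formula_terms, available):
    """Return {term: variable} for every referenced variable not in `available`."""
    terms = parse_formula_terms(formula_terms)
    return {term: var for term, var in terms.items() if var not in available}


# Example: one of the five referenced variables is absent.
missing = missing_formula_variables(
    "s: s_rho C: Cs_r eta: zeta depth: h depth_c: hc",
    available={"s_rho", "Cs_r", "zeta", "h"},
)
```

An empty result would suggest that the inputs needed to compute z are at least present, though it says nothing about whether they are correct.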
> More than that and we would get into the weeds of CF, but those 3 checks cover almost everything needed for plotting with labels.
I'd suggest a fourth check for units: it's possible to guess from values, but I like having that explicitly.
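The long_name fallback and the units check suggested in this comment could be combined into a per-variable check along these lines. This is only a sketch: the function name and the message wording are made up, and it flags every variable without units, which may be stricter than CF requires for dimensionless quantities.

```python
def variable_issues(attrs):
    """Return a list of human-readable issues for one variable's attributes."""
    issues = []
    if "standard_name" not in attrs:
        if "long_name" in attrs:
            # Accepted, but worth nudging towards the standard name table.
            issues.append(
                "long_name without standard_name; consider proposing a standard name"
            )
        else:
            issues.append("neither standard_name nor long_name")
    if "units" not in attrs:
        issues.append("no units")
    return issues


# Illustrative attributes only.
ok = variable_issues({"standard_name": "sea_water_temperature", "units": "degree_C"})
fallback = variable_issues({"long_name": "grid angle", "units": "radians"})
bare = variable_issues({})
```

Here ok is empty, fallback holds a single long_name notice, and bare reports both the missing name and the missing units.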