xarray-contrib/cf-xarray

A cf-xarray compliance checker?

kthyng opened this issue · 9 comments

Would something like this be in scope for cf-xarray? It would need to be fairly loosely defined, but maybe a minimum would be that a Dataset would have axes and coordinates all defined? Variables would need standard_names? Though some variables don't usually have standard names like maybe "angle" on a ROMS grid.

A number of these exist:

so I don't think we should reinvent it. It would be nice if we could run the checker on a Dataset using ds.cf.check(checker="ioos"), for example.

cc @ocefpaf

For another project I was looking at CF checkers last week, and it looks like all the options are mostly command-line tools meant to check NetCDF files.

It would be great if cf-xarray could check any format supported by xarray, as well as datasets that have not been written to disk. I also think it would be great to use other checkers in the backend, but it looks like changes are needed in compliance-checker and cf-checker before that is possible (i.e., the checkers only accept paths right now; they would have to accept xarray datasets as well).

It'd be nice to build an API connection, but worst case we can write a tiny dataset with all attributes to /tmp/check.nc and run that, and print the output to screen.
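The "worst case" route could look roughly like the sketch below. It is only an illustration: `check_via_tempfile` is a hypothetical helper name, `ds` is assumed to be anything with a `to_netcdf(path)` method (such as an `xarray.Dataset`), and the checker command is passed in by the caller rather than hard-coded, since the exact invocation depends on which checker is installed. A temporary directory is used in place of a fixed /tmp/check.nc so concurrent runs do not collide.

```python
import os
import subprocess
import tempfile


def check_via_tempfile(ds, command):
    """Write ``ds`` to a temporary NetCDF file and run ``command`` on it.

    ``ds`` only needs a ``to_netcdf(path)`` method (e.g. an xarray.Dataset).
    ``command`` is the checker invocation as a list of arguments; the path
    of the temporary file is appended as the last argument.
    """
    with tempfile.TemporaryDirectory() as tmpdir:
        path = os.path.join(tmpdir, "check.nc")
        ds.to_netcdf(path)
        # Run the external checker while the temporary file still exists.
        result = subprocess.run(
            [*command, path], capture_output=True, text=True, check=False
        )
    # Print the checker's report to the screen.
    print(result.stdout, end="")
    return result
```

With compliance-checker installed, `command` might be something like `["compliance-checker", "--test=cf"]`; treat that as an assumption and confirm the invocation against the tool's own documentation.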

I have mixed feelings. While I don't want to overload cf-xarray with functionality that exists elsewhere, this could be a nice idea because:

  1. what @malmans2 said above
  2. compliance-checker is super verbose and sometimes you don't want a full CF check, just a bare bones "what is missing so I can plot this automatically, or load this data into analysis X." In a way, iris used to be like that but has become more and more restrictive with time.

I guess that, instead of becoming a compliance checker, cf-xarray could have a "verbose mode" where all the compliance issues would be printed when loading a dataset.

"what is missing so I can plot this automatically, or load this data into analysis X."

This is hard to define!

This is hard to define!

Indeed! That is why cc is super verbose, kind of all or nothing. However, @kthyng's suggestion above looks like a nice start:

  1. axes and coordinates
  2. valid standard_names
  3. enough variables defined to compute say z for example

Beyond that we would get into the weeds of CF, but those 3 items cover almost everything needed for plotting with labels.

I wrote some tests for a package: https://github.com/NOAA-ORR-ERD/model_catalogs/blob/main/model_catalogs/tests/test_catalogs.py#L326-L369

When the models are read in with the package, they should be usable by cf-xarray in a basic way. I am finding I need this functionality again, which is when I thought it could be useful in cf-xarray itself. It could warn a user if no axes or coordinates are known for a Dataset/DataArray, and list which data_vars do not have standard_names. I also like the connection @ocefpaf made to being able to calculate z.
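A bare-bones version of that warning could be sketched as below. This is not cf-xarray's API: `basic_report` is a hypothetical function, it works on plain attribute dictionaries so it stays self-contained, and the criteria (an `axis` attribute, or a `standard_name` from a short hard-coded list) are a deliberate simplification of the richer rules cf-xarray actually uses.

```python
# Hypothetical, simplified criteria; cf-xarray's real detection is richer.
AXES = {"X", "Y", "Z", "T"}
COORDINATE_STANDARD_NAMES = {
    "longitude", "latitude", "time", "depth", "height", "altitude",
}


def basic_report(coord_attrs, data_var_attrs):
    """Summarise which axes/coordinates are identifiable and which data
    variables lack a standard_name.

    Both arguments map variable names to their attribute dictionaries.
    """
    # Axes declared through the CF ``axis`` attribute.
    axes = sorted({attrs.get("axis") for attrs in coord_attrs.values()} & AXES)
    # Coordinates recognisable from their standard_name.
    coordinates = sorted(
        name
        for name, attrs in coord_attrs.items()
        if attrs.get("standard_name") in COORDINATE_STANDARD_NAMES
    )
    missing = sorted(
        name for name, attrs in data_var_attrs.items()
        if "standard_name" not in attrs
    )
    return {
        "axes": axes,
        "coordinates": coordinates,
        "missing_standard_name": missing,
    }


coords = {
    "lon_rho": {"standard_name": "longitude", "units": "degrees_east"},
    "lat_rho": {"standard_name": "latitude", "units": "degrees_north"},
    "ocean_time": {"axis": "T", "standard_name": "time"},
}
data_vars = {
    "temp": {"standard_name": "sea_water_potential_temperature"},
    "angle": {"long_name": "angle between XI-axis and EAST"},
}
report = basic_report(coords, data_vars)
```

For an xarray Dataset `ds`, the two mappings could be built with `{k: v.attrs for k, v in ds.coords.items()}` and `{k: v.attrs for k, v in ds.data_vars.items()}`.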

This is hard to define!

Indeed! That is why cc is super verbose, kind of all or nothing. However, @kthyng's suggestion above looks like a nice start:

  1. axes and coordinates
  2. valid standard_names

I'd suggest allowing long_name as an option, for those variables that aren't in the standard name table yet. You can add a warning pointing to the forum for proposing standard names if you want to discourage long_name without standard_name.
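One way the long_name option could behave is sketched here. The function name `check_names` and the `allow_long_name` flag are hypothetical, and the input is again a plain mapping of variable names to attribute dictionaries rather than an xarray object.

```python
import warnings


def check_names(data_var_attrs, allow_long_name=True):
    """Return the names of variables that fail the naming check.

    A variable passes if it has a standard_name. If ``allow_long_name`` is
    true, a long_name alone also passes, but a warning is issued.
    """
    failures = []
    for name, attrs in data_var_attrs.items():
        if "standard_name" in attrs:
            continue
        if allow_long_name and "long_name" in attrs:
            warnings.warn(
                f"{name!r} has a long_name but no standard_name; "
                "consider proposing a standard name to the CF community.",
                UserWarning,
                stacklevel=2,
            )
            continue
        failures.append(name)
    return failures
```

Calling `check_names(attrs, allow_long_name=False)` gives the stricter behaviour, where only a standard_name counts.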

  3. enough variables defined to compute, say, z

Everything mentioned in formula_terms or similar, at a guess? Or do you want enough information to convert from the model vertical coordinate to a geometric vertical coordinate?
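The first reading (everything mentioned in formula_terms must be present) is straightforward to sketch. In CF, the formula_terms attribute is a whitespace-separated string of "term: variable" pairs; the helper names below are hypothetical, and the example string is an illustrative ROMS-style value rather than one taken from this thread.

```python
import re


def parse_formula_terms(formula_terms):
    """Parse a CF formula_terms string into a {term: variable_name} dict."""
    return dict(re.findall(r"(\w+)\s*:\s*(\w+)", formula_terms))


def missing_formula_variables(formula_terms, available):
    """Return the {term: variable_name} pairs whose variable is absent
    from ``available`` (an iterable of variable names in the dataset)."""
    available = set(available)
    terms = parse_formula_terms(formula_terms)
    return {term: var for term, var in terms.items() if var not in available}


# Illustrative formula_terms value for an ocean s-coordinate.
example = "s: s_rho C: Cs_r eta: zeta depth: h depth_c: hc"
missing = missing_formula_variables(example, ["s_rho", "Cs_r", "zeta", "h"])
```

Here `missing` reports that the `depth_c` term points at `hc`, which is not among the available variables. Whether the present variables are enough to convert to a geometric vertical coordinate is a separate, harder question that this does not answer.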

Beyond that we would get into the weeds of CF, but those 3 items cover almost everything needed for plotting with labels.

I'd suggest a fourth check for units: it's possible to guess them from the values, but I like having them stated explicitly.
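A units check along those lines could be as small as the following sketch. `missing_units` is a hypothetical name, and treating an empty or whitespace-only string as missing is an assumption of this example, not something specified in the thread.

```python
def missing_units(var_attrs):
    """Return sorted names of variables with no usable ``units`` attribute.

    ``var_attrs`` maps variable names to their attribute dictionaries.
    An absent, empty, or whitespace-only ``units`` counts as missing.
    """
    return sorted(
        name
        for name, attrs in var_attrs.items()
        if not str(attrs.get("units", "")).strip()
    )


attrs = {
    "temp": {"standard_name": "sea_water_temperature", "units": "degree_C"},
    "salt": {"standard_name": "sea_water_practical_salinity", "units": "1"},
    "zeta": {"standard_name": "sea_surface_height"},
    "mask": {"units": " "},
}
no_units = missing_units(attrs)
```

In this example `zeta` and `mask` are flagged, while the dimensionless `"1"` on `salt` counts as explicit units.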