NOAA-GFDL/MDTF-diagnostics

How to make one settings.jsonc that works for NCAR, GFDL, and CMIP fields?

emaroon opened this issue · 9 comments

What problem will this feature solve?
I've been working with CESM output and am able to read it in and manipulate it in the MDTF framework, which is great. However, at some point in the near future, I will want my POD to also work on GFDL and CMIP fields. Unfortunately, the variable names in the three fieldlists for a single variable are all different. One example of this is northward wind: va in the CMIP fieldlist, vcomp in the GFDL fieldlist, and V in the NCAR field list. All three do have the same standard name (northward_wind). Does the framework use the standard name as the key? I'm not using northward_wind, but I am starting to add ocean fields to the CESM fieldlist to get my PODs working. I'd like to do it in a way that is compatible with past PODs while also minimizing preprocessing before the preprocessor.

Another example is the variable areacello in the NCAR_fieldlist. It looks like this is the same variable as TAREA, the default CESM name for it, and that the tropical_pacific_sealevel POD may have manually renamed TAREA to areacello to get things working. Should I do the same?

Do I need to rename all CESM variables/dimensions prior to loading them into the MDTF framework? This would make the framework more cumbersome to set up. With three different fieldlists, it seems like it should be possible to have one settings.jsonc that works for all three file formats? I have a feeling that I'm missing something key about how to put together a settings.jsonc...

Describe the solution you'd like
Ideally, I would use one variable name (or other key) in my settings.jsonc that works across all three file formats, without also needing to preprocess any variable/dimension names in the model output files.

Describe alternatives you've considered
I have a feeling that this feature already exists and that I just don't know how to implement it correctly. Please accept my apologies in advance for my likely ignorance!

Additional context
Here's my current fork/branch where I'm trying to read, in case it is helpful. I've been making additions to fieldlist_NCAR to make it work:
https://github.com/emaroon/MDTF-diagnostics/tree/allfiles

@emaroon You can use whichever of the 3 standards your POD expects. You are correct that the standard_name must correspond to the entry in the fieldlist that the POD data convention expects (moving forward, POD settings files will specify additional information like axes, coordinates, and the expected data standard to improve the translation capabilities). If the fieldlist corresponding to the dataset standard is missing variables, then we can update as necessary (or you can do it yourself as in your current branch and submit a PR).

If you don't want to preprocess OR translate the data, the only solutions at this time are to set data_manager to "no_pp" in your runtime configuration file and run the POD on data that matches the convention that your POD expects (I believe this was proposed in a prior issue when you or a grad student was simply trying to run the POD on raw data), or to change the POD settings file to match the convention of the incoming data. If you just want the translation and the cropdaterange function (which grabs the subset of data in the desired date range), and no other preprocessing, then add the line "disable_preprocessor": "true" to the runtime configuration file.
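
As a rough illustration of the two options described above, the sketch below renders each one as a JSON fragment. Only the keys data_manager and disable_preprocessor and their values come from this comment; the helper name as_config_fragment is hypothetical, and the rest of a real runtime configuration file is not shown.

```python
import json

# Option A: skip preprocessing and translation entirely.
# The input data must already match the convention the POD expects.
no_pp_options = {"data_manager": "no_pp"}

# Option B: keep translation and date-range cropping, skip other preprocessing.
# The value is the string "true", as quoted in the comment above.
translate_only_options = {"disable_preprocessor": "true"}


def as_config_fragment(options):
    """Render options as a JSON fragment (hypothetical helper, not framework code)."""
    return json.dumps(options, indent=2)


print(as_config_fragment(no_pp_options))
print(as_config_fragment(translate_only_options))
```

The two fragments are alternatives; the thread does not say how the framework behaves if both are set together.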

Hi @wrongkindofdoctor, Thanks for the quick response! So if I understand correctly, that means that if the standard_name matches for the same field in all 3 field lists, then the POD will work interchangeably on all three data types regardless of which convention I use in my settings.jsonc? Cool!

Unless I run out of options, I'd like to preprocess using the MDTF rather than no_pp mode. I think this should be possible, or if not, we should make it possible so that NCAR/CESM ocean folks can make better use of the MDTF PODs.

Ok- I've started updating the CESM fieldlist for the variables that my POD needs, but will probably need your and/or @jkrasting's help with adding those variables to the other 2 fieldlists. You can see the variables that I've added in the branch referenced above. Only one variable still needs to be added, which is TAREA/areacello. That variable probably needs some attention: areacello is not the CESM standard name for cell area, TAREA and UAREA are.

That then raises a related issue, however, which is related to my previous issue about TLAT/TLON. What if the variable in CESM vs CMIP has a very different format? For example, latitude in CESM is TLAT, a 2d coordinate that has 2 index dimensions of nlat and nlon, while latitude in CMIP is a 1-dimensional coordinate. How does one reference latitude for both of these types? FWIW, for reading in CESM variables, it would also be fine to use nlat/nlon in the settings.jsonc as dimensions rather than TLAT/TLONG. I tried to get it to read in the variables using nlat/nlon rather than TLAT/TLONG following what you'd done previously for TLAT/TLONG, but it really only wants those.

@emaroon Variable names can be different, but the variable standard_name (or 'long_name' in the file if standard_name is not defined) needs to match for different conventions for the translation to work in the current framework. TLAT/TLONG will not map to 1D lat/lon in CMIP/GFDL because the standard_names differ, AND there are no lat/lon variables with equivalent standard_names in either of these conventions. I will think about how to handle this situation in the rewrite, but for now, your POD can ONLY handle CESM data if it expects TLAT/TLONG.
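
The matching rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the framework's actual translation code: the FIELDLISTS dictionary is a hypothetical, trimmed-down stand-in for the three fieldlist files, and the TLAT standard_name is borrowed from the alternates proposal later in this thread.

```python
# Hypothetical, trimmed-down fieldlists: variable name -> metadata.
FIELDLISTS = {
    "CMIP": {
        "va": {"standard_name": "northward_wind"},
        "lat": {"standard_name": "latitude"},
    },
    "GFDL": {"vcomp": {"standard_name": "northward_wind"}},
    "NCAR": {
        "V": {"standard_name": "northward_wind"},
        # Assumed standard_name, for illustration only.
        "TLAT": {"standard_name": "array_of_t_grid_coordinates"},
    },
}


def translate(var_name, from_convention, to_convention):
    """Map a variable name between conventions via its standard_name."""
    standard_name = FIELDLISTS[from_convention][var_name]["standard_name"]
    for name, meta in FIELDLISTS[to_convention].items():
        if meta["standard_name"] == standard_name:
            return name
    raise KeyError(
        f"no variable with standard_name {standard_name!r} in {to_convention}"
    )


print(translate("va", "CMIP", "NCAR"))  # V
print(translate("V", "NCAR", "GFDL"))   # vcomp
```

With these entries, translate("TLAT", "NCAR", "CMIP") raises KeyError, which mirrors the TLAT/TLONG limitation: no CMIP entry shares that standard_name.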

One idea is to implement alternate lat/lon coordinates, in the vein of alternate variables, for when the primary ones are not available. For example:

"tlat": {"standard_name": "array_of_t_grid_coordinates",
         "axis": "Y",
         "units": "degrees_north",
         "alternates": ["lat"]},
"lat":  {"standard_name": "latitude",
         "axis": "Y",
         "units": "degrees_north",
         "requirement": "alternate"},
...
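
One way the proposed fallback could be resolved is sketched below. This is an assumption about how such a feature might behave, not existing framework code: the COORDS dictionary, the resolve_coordinate helper, and the idea of passing the set of names present in a dataset are all hypothetical.

```python
# Hypothetical coordinate entries following the proposal above.
COORDS = {
    "tlat": {
        "standard_name": "array_of_t_grid_coordinates",
        "alternates": ["lat"],
    },
    "lat": {"standard_name": "latitude", "requirement": "alternate"},
}


def resolve_coordinate(name, coords, available):
    """Return `name` if the dataset has it, else its first available alternate."""
    if name in available:
        return name
    for alternate in coords[name].get("alternates", []):
        if alternate in available:
            return alternate
    raise KeyError(f"neither {name!r} nor any of its alternates is available")


# A CESM-like dataset provides the primary coordinate.
print(resolve_coordinate("tlat", COORDS, {"tlat", "tlong"}))  # tlat
# A CMIP-like dataset only provides the alternate.
print(resolve_coordinate("tlat", COORDS, {"lat", "lon"}))     # lat
```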

The "alternate" snippet you provided above will be helpful - thank you! I see where I can use that.

Actually though, I don't really need TLONG/TLAT as dimensions, only as variables. It would be more useful for my POD, though not necessary, to read in the variables with nlon/nlat as the dimensions (because I can slice along those) and then read TLONG/TLAT in as static variables. However, even when I add nlon/nlat to the fieldlist and replicate exactly what you did previously with TLONG/TLAT, the framework refuses to read in the variables that way. (But I probably did something wrong...)

I think GFDL ocean output should also have the same dimension vs 2-d coordinate issue as CESM output. lat/lon for the GFDL ocean grid are 2-d variables called geolon and geolat (probably with _t and _u following) and then i,j are the dimensions (indices equivalent to CESM's nlon/nlat), I think. @jkrasting is this right? Maybe we need to add in ocean realm lat/lon dimensions and coordinates that differ from atmos realm lat/lon? Giving TLAT/TLONG the same standard name as geolon/geolat could be a solution.

Thanks again!

Indeed, native GFDL ocean model output will face the same issue. MOM6 is on an Arakawa C grid; the 1D dimensions for the tracer cell centers are yh and xh, and those for the corners are yq and xq. The true 2D coordinates geolon, geolat, and their variants for the u, v, and corner points should be used for calculations and plotting.

@emaroon - not sure if CESM is the same, but indexing/slicing using the dimensions is ill-advised in the GFDL models. The values are "nominal coordinates" and appear to have a real-world meaning, but they can differ a lot from the true coordinates. In fact, for OM5, we are debating changing yh, yq, xh, and xq to monotonic 1D arrays of integers (i.e. 1, 2, 3, ..., 1078, 1079, 1080).

This past month, I started mocking up a class that standardizes these grid metrics and coordinates for the GFDL models. This solves a slightly different problem in that we have two different places at GFDL where the metrics are stored: in a file named ocean.static.nc and another file called ocean_hgrid.nc. The naming and conventions are slightly different, but the objective of this class is to generate a grid metrics object regardless of source:

https://github.com/jkrasting/momgrid/blob/main/src/momgrid/classes.py

# Method 1
gridobj = momgrid.MOMgrid("ocean.static.nc")

# Method 2
gridobj = momgrid.MOMgrid("ocean_hgrid.nc")

Once the grid object is loaded, the coordinates can then be referenced in a standard way:

plt.pcolormesh(gridobj.geolon, gridobj.geolat, vardata)

@wrongkindofdoctor - what do you think of something like this in the MDTF to standardize the coordinates? Regardless of the model source, the 2-dimensional grid info is stored in some standardized object/format.

@jkrasting @emaroon Yes, a Python grid object instantiated from a file name specified in the runtime config file for the GFDL cases, along with other metadata like grid type and coordinate names to identify it in the data catalog, should be straightforward to implement using your momgrid code as a template for GFDL cases. CESM and CMIP ocean data cases would not require the file name (unless there are datasets in these conventions that store grid information in separate files like GFDL). I can provide an API to access the information, and show the implementation in the example POD.
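
To make the idea concrete, here is a minimal sketch of a convention-independent grid holder. It is not the momgrid API or an MDTF API: the OceanGrid class, the COORD_NAMES table, and the from_variables constructor are hypothetical, and a plain dictionary stands in for an opened dataset. The GFDL and CESM coordinate names are the ones mentioned in this thread.

```python
from dataclasses import dataclass

# Hypothetical map from convention to the names of its 2-D
# (longitude, latitude) variables at tracer points.
COORD_NAMES = {
    "GFDL": ("geolon", "geolat"),
    "CESM": ("TLONG", "TLAT"),
}


@dataclass
class OceanGrid:
    """Holds 2-D coordinates under the same attribute names for any convention."""

    lon: list
    lat: list

    @classmethod
    def from_variables(cls, variables, convention):
        """Build a grid from a name -> array mapping for the given convention."""
        lon_name, lat_name = COORD_NAMES[convention]
        return cls(lon=variables[lon_name], lat=variables[lat_name])


# Nested lists stand in for 2-D arrays read from model output.
cesm_grid = OceanGrid.from_variables(
    {"TLONG": [[0.0, 1.0]], "TLAT": [[10.0, 10.0]]}, "CESM"
)
gfdl_grid = OceanGrid.from_variables(
    {"geolon": [[0.0, 1.0]], "geolat": [[10.0, 10.0]]}, "GFDL"
)
print(cesm_grid.lon == gfdl_grid.lon)  # True
```

A POD would then read grid.lon and grid.lat without knowing which convention the data came from.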

@jkrasting @wrongkindofdoctor I think an ocean grid object specified for both GFDL and CESM cases would be a very useful feature. Creating a grid object file for the standard CESM2-POP grid would be easy, and I suspect it would also be easy for the standard CESM3-MOM grid (maybe @gustavo-marques has thoughts?).

That said, all the grid information is already included for CESM output in the default time series format, so a grid object is not strictly necessary. Using the CESM time series output for some variable to read in the grid information, rather than a static grid object, would also cover the case if someone is using a non-standard CESM grid. Taydra and I are working with both low res and high res CESM output for our POD development, so that flexibility would be useful for us. Another solution could be asking the user to provide an ocean grid file for CESM if it differs from the default.

@jkrasting I've never been warned against using the nlon/nlat indices (1,2,3,4...), but the use of nlon/nlat is limited since they have no real-world meaning. If I don't already know the nlon/nlat indices for the box I want, I create masks with TLAT/TLONG.
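
The masking approach mentioned above might look like the following NumPy sketch. The arrays are tiny synthetic stand-ins for TLAT, TLONG, and a data field, and box_mask is a hypothetical helper; it does not handle boxes that wrap across the longitude seam.

```python
import numpy as np


def box_mask(lat2d, lon2d, lat_min, lat_max, lon_min, lon_max):
    """Boolean mask of grid cells whose true coordinates fall inside the box."""
    return (
        (lat2d >= lat_min) & (lat2d <= lat_max)
        & (lon2d >= lon_min) & (lon2d <= lon_max)
    )


# Synthetic 2-D coordinates shaped (nlat, nlon), standing in for TLAT/TLONG.
tlat = np.array([[-10.0, -10.0, -10.0], [0.0, 0.0, 0.0], [10.0, 10.0, 10.0]])
tlong = np.array([[100.0, 110.0, 120.0]] * 3)
sst = np.arange(9.0).reshape(3, 3)

mask = box_mask(tlat, tlong, -5.0, 15.0, 105.0, 125.0)
# Unweighted mean for brevity; a real calculation would weight by cell area.
box_mean = sst[mask].mean()
print(int(mask.sum()), box_mean)  # 4 6.0
```

Because the mask is built from the true 2-D coordinates, it does not depend on whether the index dimensions carry any real-world meaning.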

I'm in the process of trying to figure out how to back out the grid information from CMIP output for one of our POD pieces. Depending on how that goes, what we develop might be useful for the framework. This would allow the framework to have at least an estimate of CESM, GFDL, and CMIP grids. I'm 75% of the way there on this, will keep you updated.

Thanks @wrongkindofdoctor.

@emaroon - It's good the dimensions are already integers in CESM. The major problem comes when they have a quasi-real-world resemblance as they do now in OM4.

Adding @33de6maggie (Xinru) into the discussion.

Sorry for the late response.

In the CESM3 development, we started to include the grid information (e.g., the variables in the static file) in all files. Although this slightly increases the amount of storage (even though these variables are all 2D), it facilitates the creation of catalogs and postprocessing in general.