NOAA-GFDL/MDTF-diagnostics

Implement data frequency range specification in POD settings


What problem will this feature solve?
The upcoming GFDL model run to generate POD sample data (diag table info was added for documentation purposes in #269) will output data no more often than every 6 hours, to limit data volume. The two current high-frequency PODs, convective_transition_diag and precip_diurnal_cycle, can in principle do a valid analysis on data of this frequency, but are currently set to request data at 1hr and 3hr frequencies, respectively.

In order to run these PODs on the sample model data being generated, the POD settings file format needs to be extended to allow PODs to request data in a range of acceptable frequencies, and the data query logic needs to be extended to execute that query.

Describe the solution you'd like
The user-facing changes have been described in the docs for some time, but the feature hasn't been implemented in the framework's data query logic. Each varlist entry in the POD settings file can have optional min_frequency and max_frequency attributes to specify a range of acceptable data frequencies, as an alternative to the currently recognized frequency attribute.
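As a rough illustration of these attributes and of the ordering check they imply, here is a minimal sketch. It assumes frequency labels of the form `1hr`, `3hr`, `6hr` and `day`, and compares them as sampling rates so that "higher frequency" means more samples per day. The helper names `samples_per_day` and `validate_frequency_range` are hypothetical and are not part of the framework, which has its own frequency handling.

```python
# Hypothetical helpers for illustration; not the framework's frequency type.
_UNIT_HOURS = {"hr": 1, "day": 24}  # assumed set of recognized units


def samples_per_day(freq: str) -> float:
    """Convert a label such as '1hr', '6hr' or 'day' to a sampling rate."""
    for unit, hours in _UNIT_HOURS.items():
        if freq.endswith(unit):
            count = freq[: -len(unit)] or "1"  # bare 'day' means one day
            return 24.0 / (int(count) * hours)
    raise ValueError(f"unrecognized frequency label: {freq!r}")


def validate_frequency_range(entry: dict) -> None:
    """Raise ValueError unless min_frequency <= frequency <= max_frequency."""
    lo = samples_per_day(entry["min_frequency"]) if "min_frequency" in entry else 0.0
    hi = samples_per_day(entry["max_frequency"]) if "max_frequency" in entry else float("inf")
    if lo > hi:
        raise ValueError("min_frequency is higher than max_frequency")
    if "frequency" in entry:
        preferred = samples_per_day(entry["frequency"])
        if not lo <= preferred <= hi:
            raise ValueError("frequency lies outside the min/max range")


# A varlist entry's frequency attributes, written as a Python dict:
# prefer hourly data, but accept anything from 6-hourly up to hourly.
entry = {"frequency": "1hr", "min_frequency": "6hr", "max_frequency": "1hr"}
validate_frequency_range(entry)  # passes silently
```

Comparing rates rather than intervals keeps the inequality in the same direction as the attribute names, which avoids the easy mistake of treating `6hr` as "greater than" `1hr`.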

  • Input parsing: I believe the code to parse these settings from the json file is already functional.
  • Input validation: verify min_frequency <= frequency <= max_frequency for each varlist entry.
  • Query rewriting: We would like PODs to be able to use frequency to identify a preferred data frequency, with the min_frequency-max_frequency range defining a fallback if data at the preferred frequency is not available. The general mechanism for doing so is specifying alternate VarlistEntries, via the edit_request() method on the preprocessor. For VarlistEntries with both frequency and min_frequency-max_frequency specified, this would require inserting an alternate with the min_frequency-max_frequency range after every alternate in the linked list of alternates. The insertion would happen after edit_request() is called, since it is preprocessor-independent.
  • Query logic: querying on the min_frequency-max_frequency range has been implemented but not tested.
  • Query tiebreaker logic: this is necessary to handle the case in which the query finds multiple variables with frequency in the min_frequency-max_frequency range. This should be done by defining a base class for the ExperimentSelectionMixin classes in data_sources.py, and defining a resolve_var_expt() method acting on the DataFrame of data catalog entries to select the row with the desired frequency (presumably the highest available within the range).
  • POD compatibility: The code for convective_transition_diag and precip_diurnal_cycle should be checked to verify that these PODs properly deal with data at different frequencies -- the claim above is based on the PODs' documentation only and hasn't been substantiated.
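
The query rewriting and tiebreaker items above could be sketched roughly as follows. This is only an illustration under several assumptions: `Request` is a hypothetical stand-in for a VarlistEntry, the chain of alternates is modeled as a plain list rather than the framework's linked list, catalog rows are plain dicts rather than DataFrame rows, and the `HOURS` table of frequency labels is assumed. The function named `resolve_var_expt` here only mirrors the proposed method name; it is not the framework's implementation.

```python
from dataclasses import dataclass, replace
from typing import Optional

# Assumed frequency labels, mapped to the sampling interval in hours.
HOURS = {"1hr": 1, "3hr": 3, "6hr": 6, "day": 24}


@dataclass(frozen=True)
class Request:
    """Hypothetical stand-in for the frequency settings of a VarlistEntry."""
    name: str
    frequency: Optional[str] = None
    min_frequency: Optional[str] = None
    max_frequency: Optional[str] = None


def add_range_fallbacks(alternates: list) -> list:
    """After each request that sets both a preferred frequency and a range,
    insert a copy that queries on the range alone."""
    out = []
    for req in alternates:
        out.append(req)
        if req.frequency and req.min_frequency and req.max_frequency:
            out.append(replace(req, frequency=None))
    return out


def resolve_var_expt(rows: list, req: Request) -> dict:
    """Pick the highest-frequency catalog row inside the requested range."""
    shortest = HOURS[req.max_frequency]  # highest frequency = shortest interval
    longest = HOURS[req.min_frequency]
    in_range = [r for r in rows if shortest <= HOURS[r["frequency"]] <= longest]
    if not in_range:
        raise LookupError(f"no data for {req.name} in the requested range")
    return min(in_range, key=lambda r: HOURS[r["frequency"]])


# Usage: the preferred 1hr request is followed by a range-only fallback,
# and the fallback is resolved against a toy catalog with no 1hr data.
chain = add_range_fallbacks([Request("var_a", "1hr", "6hr", "1hr")])
catalog = [{"frequency": "day"}, {"frequency": "6hr"}, {"frequency": "3hr"}]
best = resolve_var_expt(catalog, chain[1])
```

Choosing the highest available frequency follows the "presumably" in the tiebreaker item; a POD that prefers the coarsest acceptable data would need the opposite rule.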

Describe alternatives you've considered
N/A

Additional context