Implement data frequency range specification in POD settings
What problem will this feature solve?
The upcoming GFDL model run to generate POD sample data (diag table info was added for documentation purposes in #269) will output data no more frequently than 6-hourly, to limit data volume. The two current high-frequency PODs, convective_transition_diag and precip_diurnal_cycle, can in principle perform a valid analysis on data at this frequency, but are currently set to request data at 1hr and 3hr frequencies, respectively.
In order to run these PODs on the sample model data being generated, the POD settings file format needs to be extended to allow PODs to request data in a range of acceptable frequencies, and the data query logic needs to be extended to execute that query.
Describe the solution you'd like
The user-facing changes have been described in the docs for some time, but the feature hasn't been implemented in the framework's data query logic. Each varlist entry in the POD settings file can have optional min_frequency and max_frequency attributes to specify a range of acceptable data frequencies, as an alternative to the currently recognized frequency attribute.
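For illustration, here is a minimal sketch of what a varlist entry carrying all three attributes might look like, and how they could be pulled out of the settings file. Only the three attribute names come from this issue; the variable name pr, the surrounding JSON layout, and the helper frequency_attrs() are assumptions made for the example and may not match the framework's actual schema or parsing code.

```python
import json

# Hypothetical settings fragment: only the three frequency attribute names
# are taken from the issue; the variable name and layout are illustrative.
SETTINGS_JSON = """
{
  "varlist": {
    "pr": {
      "frequency": "3hr",
      "min_frequency": "6hr",
      "max_frequency": "1hr"
    }
  }
}
"""


def frequency_attrs(settings_text):
    """Return {variable name: {attribute: value}} for the frequency attributes."""
    settings = json.loads(settings_text)
    keys = ("frequency", "min_frequency", "max_frequency")
    return {
        name: {k: entry[k] for k in keys if k in entry}
        for name, entry in settings["varlist"].items()
    }


attrs = frequency_attrs(SETTINGS_JSON)
print(attrs["pr"])
```

Note that in this sketch min_frequency is "6hr" and max_frequency is "1hr": a lower frequency corresponds to a longer interval between samples, so the range bounds look reversed when read as intervals.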
- Input parsing: I believe the code to parse these settings from the JSON file is already functional.
- Input validation: verify min_frequency <= frequency <= max_frequency for each varlist entry.
- Query rewriting: We would like PODs to be able to specify frequency to identify a preferred frequency for data, with the min_frequency-max_frequency range defining a fallback option if data at frequency is not available. The general mechanism for doing so is specifying alternate VarlistEntries, via the edit_request() method on the preprocessor. For VarlistEntries with both frequency and min_frequency-max_frequency specified, this would need to insert an alternate with the min_frequency-max_frequency range after every alternate in the linked list of alternates. This would happen after edit_request() is called, since it's preprocessor-independent.
- Query logic: querying on the min_frequency-max_frequency range has been implemented but not tested.
- Query tiebreaker logic: this is necessary to handle the case in which the query finds multiple variables with frequency in the min_frequency-max_frequency range. This should be done by defining a base class for the ExperimentSelectionMixin classes in data_sources.py, and defining a resolve_var_expt() method acting on the DataFrame of data catalog entries to select the row with the desired frequency (presumably the highest available within the range).
- POD compatibility: The code for convective_transition_diag and precip_diurnal_cycle should be checked to verify that these PODs properly deal with data at different frequencies -- the claim above is based on the PODs' documentation only and hasn't been substantiated.
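The validation, query rewriting, and tiebreaker steps above could be sketched roughly as follows. This is a standalone illustration and not the framework's code: plain dicts and lists stand in for VarlistEntries, the linked list of alternates, and the DataFrame of catalog rows, and the helpers samples_per_day(), validate_entry(), with_range_fallbacks(), and pick_highest_frequency() are hypothetical names. The "<N>hr" / "day" string format is also an assumption.

```python
import re

# Assumed unit table for the example; the framework may accept other units.
_UNIT_HOURS = {"hr": 1.0, "day": 24.0}


def samples_per_day(freq):
    """Convert e.g. '6hr' or 'day' to samples per day (larger = higher frequency)."""
    m = re.fullmatch(r"(\d*)(hr|day)", freq)
    if m is None:
        raise ValueError(f"unrecognized frequency string: {freq!r}")
    count = int(m.group(1) or 1)
    return 24.0 / (count * _UNIT_HOURS[m.group(2)])


def validate_entry(entry):
    """Check min_frequency <= frequency <= max_frequency for the attributes present."""
    keys = ("min_frequency", "frequency", "max_frequency")
    rates = [samples_per_day(entry[k]) for k in keys if k in entry]
    if rates != sorted(rates):
        raise ValueError(f"inconsistent frequency attributes: {entry}")


def with_range_fallbacks(alternates):
    """After each alternate that has both a preferred frequency and a range,
    insert a copy that drops the preferred frequency and keeps only the range."""
    out = []
    for alt in alternates:
        out.append(alt)
        if all(k in alt for k in ("frequency", "min_frequency", "max_frequency")):
            out.append({k: v for k, v in alt.items() if k != "frequency"})
    return out


def pick_highest_frequency(rows, entry):
    """Tiebreaker: among catalog rows inside the range, return the one with the
    highest frequency, or None if no row is in range."""
    lo = samples_per_day(entry["min_frequency"])
    hi = samples_per_day(entry["max_frequency"])
    in_range = [r for r in rows if lo <= samples_per_day(r["frequency"]) <= hi]
    if not in_range:
        return None
    return max(in_range, key=lambda r: samples_per_day(r["frequency"]))


entry = {"name": "pr", "frequency": "1hr", "min_frequency": "6hr", "max_frequency": "1hr"}
validate_entry(entry)
alternates = with_range_fallbacks([entry])
rows = [{"frequency": "day"}, {"frequency": "6hr"}, {"frequency": "3hr"}]
print(pick_highest_frequency(rows, alternates[1]))
```

Comparing samples per day rather than interval lengths keeps the inequality in the same direction as the attribute names, which avoids the easy mistake of treating "6hr" as greater than "1hr".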
Describe alternatives you've considered
N/A
Additional context