Create new Pair-Stat tool to compute statistics for already paired forecast and observation data
Describe the New Feature
Create a new statistics tool named Pair-Stat to compute statistics for already paired forecast and observation data. The initial version of this tool should support the following input datasets, although additional ones can be added in the future:
- IODA NetCDF files from the JEDI data assimilation system
- the ASCII MPR line type written by the Point-Stat tool
- Python embedding to supply MPR data
This new tool is driven primarily by the need to compute statistics for the already paired data in IODA files. Also supporting the MPR line type makes the functionality of this tool intersect with Stat-Analysis, which can already derive statistics from MPR data. The goal is to make the configuration of this tool more user-friendly instead of requiring users to wade through the details of defining many, many Stat-Analysis jobs.
The functionality of this tool overlaps substantially with Point-Stat, although Pair-Stat will do no interpolation and no matching to message types. However, care should be given to support filtering the input data:
- vertically by model level... separately or aggregating multiple levels together
- spatially by defining geographic masking regions and/or computing stats separately for each station
- temporally since data for multiple times can be passed as input
In the configuration file, let users define a list of variable names to be processed, or allow for an empty list to process all variables found in the input.
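The filtering described above could be sketched as follows. This is only an illustration in Python, not Pair-Stat code: the `Pair` record, the `select_pairs` function, and its argument names are all hypothetical, and the sample values are made up. It assumes an empty variable list means "keep every variable", as proposed above.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Pair:
    """One already-paired forecast/observation value (hypothetical layout)."""
    var: str
    level: float      # e.g. pressure in hPa or a model level index
    valid: datetime   # valid time of the pair
    fcst: float
    obs: float


def select_pairs(pairs, var_names=(), level_range=None, time_range=None):
    """Filter pairs by variable name, vertical level, and valid time.

    An empty var_names keeps every variable. level_range and time_range
    are inclusive (min, max) tuples, or None to skip that filter.
    """
    keep = []
    for p in pairs:
        if var_names and p.var not in var_names:
            continue
        if level_range and not (level_range[0] <= p.level <= level_range[1]):
            continue
        if time_range and not (time_range[0] <= p.valid <= time_range[1]):
            continue
        keep.append(p)
    return keep


# Made-up sample pairs for illustration only.
pairs = [
    Pair("TMP", 850.0, datetime(2024, 12, 1, 0), 280.1, 279.8),
    Pair("TMP", 500.0, datetime(2024, 12, 1, 12), 255.0, 254.2),
    Pair("SPFH", 850.0, datetime(2024, 12, 1, 0), 0.004, 0.005),
]

all_vars = select_pairs(pairs)                       # empty list: all variables
tmp_only = select_pairs(pairs, var_names=["TMP"])
low_levels = select_pairs(pairs, level_range=(800.0, 900.0))
```

Spatial filtering by masking region or station would add one more predicate of the same shape.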
Remember to add a new chapter to the MET User's Guide for the new Pair-Stat tool.
List of questions to be considered:
- Should external climatology data be supported?
- Should sample data percentile thresholds be supported?
Acceptance Testing
List input data types and sources.
Describe tests required for new functionality.
Time Estimate
Estimate the amount of work required here.
Issues should represent approximately 1 to 3 days of work.
Sub-Issues
Consider breaking the new feature down into sub-issues.
- Add a checkbox for each sub-issue here.
Relevant Deadlines
Work described in this issue should be completed by 12/30/2024
Funding Source
NRL METplus 7730022
Define the Metadata
Assignee
- Select engineer(s) or no engineer required
- Select scientist(s) or no scientist required
Labels
- Review default alert labels
- Select component(s)
- Select priority
- Select requestor(s)
Milestone and Projects
- Select Milestone as a MET-X.Y.Z version, Consider for Next Release, or Backlog of Development Ideas
- For a MET-X.Y.Z version, select the MET-X.Y.Z Development project
Define Related Issue(s)
Consider the impact to the other METplus components.
- METplus, MET, METdataio, METviewer, METexpress, METcalcpy, METplotpy
- Will need a new METplus wrapper to support the configuration of this new tool.
- No downstream METplus Analysis tools impacts since this tool will write to existing line types.
New Feature Checklist
See the METplus Workflow for details.
- Complete the issue definition above, including the Time Estimate and Funding source.
- Fork this repository or create a branch of develop.
Branch name: feature_<Issue Number>_<Description>
- Complete the development and test your changes.
- Add/update log messages for easier debugging.
- Add/update unit tests.
- Add/update documentation.
- Push local changes to GitHub.
- Submit a pull request to merge into develop.
Pull request: feature <Issue Number> <Description>
- Define the pull request metadata, as permissions allow.
Select: Reviewer(s) and Development issue
Select: Milestone as the next official version
Select: MET-X.Y.Z Development project for development toward the next official release
- Iterate until the reviewer(s) accept and merge your changes.
- Delete your fork or branch.
- Close this issue.
Funding source added and deadline added.
Work in #3007 will support IODA files with Pair-Stat.
@willmayfield I'm wondering about the use of a grid within the Pair-Stat tool.
One of the first things done in the other MET statistics tools (e.g. Point-Stat, Grid-Stat, Series-Analysis, MODE, ...) is deciding on a common grid to be used for the verification. That can be defined as the "forecast" grid, "observation" grid, or some other grid, defined by its name, grid specification string, or the path to a gridded data file. All gridded data is regridded to the common vx grid prior to being used and that includes:
- gridded forecast data
- gridded observation data, when applicable
- gridded climo data
- land/sea mask data
- topography data
- gridded masking regions created by Gen-Vx-Mask
Since Pair-Stat won't use gridded forecast/observation data, defining a verification grid is NOT REQUIRED. Instead, when extracting data from climo, land/sea mask, topography, and gridded masks, we could just use whatever grid that data happens to be defined on and interpolate to the (lat, lon) location of the pair.
The advantage is that avoiding those regridding steps will be a little faster and will introduce less "interpolation error".
The disadvantage is that it'll be less consistent with the logic of the other MET statistics tools.
Shall I proceed WITHOUT defining a common "verification grid"?
Or should I use one to maintain more consistency with the logic of other tools?
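To make the "no common grid" option concrete, here is a minimal Python sketch of interpolating a gridded field directly to a pair's (lat, lon) location on whatever grid the field is defined on. It is not MET code. It assumes a regular lat/lon grid with strictly increasing coordinates, uses bilinear interpolation as one possible method, and ignores longitude wrap-around. The function name `bilinear_at_point` and the sample values are hypothetical.

```python
from bisect import bisect_right


def bilinear_at_point(lats, lons, field, lat, lon):
    """Bilinearly interpolate field[i][j] to the point (lat, lon).

    lats and lons are strictly increasing 1-D coordinate lists with at
    least two entries each; field is indexed as field[lat_index][lon_index].
    Returns None when the point lies outside the grid.
    """
    if not (lats[0] <= lat <= lats[-1] and lons[0] <= lon <= lons[-1]):
        return None
    # Index of the grid cell's lower-left corner, clamped so i+1 and j+1 exist.
    i = min(bisect_right(lats, lat) - 1, len(lats) - 2)
    j = min(bisect_right(lons, lon) - 1, len(lons) - 2)
    # Fractional position of the point inside the cell.
    ty = (lat - lats[i]) / (lats[i + 1] - lats[i])
    tx = (lon - lons[j]) / (lons[j + 1] - lons[j])
    # Interpolate along longitude on both cell edges, then along latitude.
    south = field[i][j] * (1 - tx) + field[i][j + 1] * tx
    north = field[i + 1][j] * (1 - tx) + field[i + 1][j + 1] * tx
    return south * (1 - ty) + north * ty


# Made-up 2x2 climatology field on its own native grid.
climo_lats = [30.0, 40.0]
climo_lons = [-110.0, -100.0]
climo = [[1.0, 2.0],
         [3.0, 4.0]]

center = bilinear_at_point(climo_lats, climo_lons, climo, 35.0, -105.0)
outside = bilinear_at_point(climo_lats, climo_lons, climo, 50.0, -105.0)
```

Each gridded input (climo, land/sea mask, topography, masks) would be sampled this way on its own grid, which is where the speed and accuracy advantage over regridding first comes from.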
As discussed on Nov 22, 2024 with @DanielAdriaansen and @willmayfield, recommend NOT using a common verification grid since not doing so seems to be the simpler approach. If adding back in this functionality is requested in the future, it can be added at that time.
As discussed on Dec 4, 2024 with @georgemccabe, for setting up config options to filter input paired data, recommend:
- Reusing the existing mpr_column and mpr_thresh config options from Point-Stat and Grid-Stat to filter numeric columns (or differences or abs value of differences) from MPR data.
- Adding new mpr_str_inc and mpr_str_exc config options to filter input paired data by string matching inclusion and exclusion. These are arrays of dictionaries with name and value entries:
mpr_str_inc = [ { name = "DESC"; value = "NA"; } ];
mpr_str_exc = [ { name = "VX_MASK"; value = "CONUS"; } ];
Note that this introduces some inconsistency since mpr_str_inc/exc are arrays of dictionaries while mpr_column/thresh are arrays of strings and thresholds. However, we agree that this is a preferable design and users will set these via METplus Wrappers anyway.
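As a rough illustration of how the mpr_str_inc and mpr_str_exc entries might be applied to each input row, here is a Python sketch. The matching semantics are an assumption, not something decided in this issue: inclusion entries for the same column are OR'ed, different columns are AND'ed, and any exclusion match rejects the row. The function name `passes_str_filters` and the sample rows are hypothetical.

```python
def passes_str_filters(row, str_inc=(), str_exc=()):
    """Return True if an MPR row (dict of column name -> string) is kept.

    Assumed semantics: inclusion values for the same column are OR'ed,
    different columns are AND'ed, and any exclusion match drops the row.
    """
    # Group the allowed values by column name.
    allowed = {}
    for entry in str_inc:
        allowed.setdefault(entry["name"], set()).add(entry["value"])
    for name, values in allowed.items():
        if row.get(name) not in values:
            return False
    for entry in str_exc:
        if row.get(entry["name"]) == entry["value"]:
            return False
    return True


# Python equivalents of the example config entries above.
mpr_str_inc = [{"name": "DESC", "value": "NA"}]
mpr_str_exc = [{"name": "VX_MASK", "value": "CONUS"}]

# Made-up rows for illustration only.
rows = [
    {"DESC": "NA", "VX_MASK": "FULL"},
    {"DESC": "NA", "VX_MASK": "CONUS"},
    {"DESC": "RUN1", "VX_MASK": "FULL"},
]
kept = [r for r in rows if passes_str_filters(r, mpr_str_inc, mpr_str_exc)]
```

Keeping name and value together in one dictionary ties each value to its column explicitly, instead of relying on two parallel arrays staying aligned.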
As discussed on Dec 6, 2024 (see meeting notes), add a new group_name config option to specify the group name from which the variable name should be extracted.
@JohnHalleyGotway After our discussion on Friday, I dug into some of the files in https://github.com/JCSDA-internal/ufo-data/tree/develop/testinput_tier_1.
An instructive file might be amsua_n19_hofxnm_2018041500_m_rttovcpp.nc4.
This file has one variable, brightness_temperature, with observation group ObsValue, possible "forecast" groups HofX and MPASJEDIHofX, and dimensions "Location" (size 100) and "Channel" (size 15), the latter of which users may want to specify for the verification task. Channel takes values in the MetaData group along with coordinates of height, latitude, longitude, and datetime.
There are several other MetaData variables available, such as sensorZenithAngle(Location) and sensorPolarizationDirection(Channel), but I am not sure whether they would be desirable for use in, for example, a filter job. That may need to be left to the user to perform independently.
For a very simple file with a more traditional variable, you could look at sondes_q_obs_2020121500_singular.nc4.
This file has the variable specificHumidity, with groups ObsValue, hofx, GsiHofx, etc., and within MetaData there are variables datetime, latitude, longitude, and possible vertical coordinates height, pressure, and stationElevation. There is also MetaData information, for example in stationIdentification, which again might be useful in a filter job, but I'm not sure if that's within our immediate scope of capabilities.
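To illustrate the group layout described above, here is a small Python sketch of pulling forecast/observation pairs out of such a file. Plain nested dictionaries stand in for the NetCDF groups so the example runs without a NetCDF library. The function name `extract_pairs`, the (Location, Channel) index order, and all sample values are assumptions for illustration and are not read from the files named above.

```python
def extract_pairs(groups, var_name, obs_group="ObsValue",
                  fcst_group="HofX", channel=None):
    """Pair forecast and observation values for one variable.

    groups maps group name -> {variable name -> values}. Values are a list
    over Location, or a list of lists indexed [Location][Channel] when the
    variable also has a Channel dimension (assumed ordering).
    channel is an index into the Channel dimension, or None for 1-D data.
    Returns a list of (lat, lon, fcst, obs) tuples.
    """
    obs = groups[obs_group][var_name]
    fcst = groups[fcst_group][var_name]
    lats = groups["MetaData"]["latitude"]
    lons = groups["MetaData"]["longitude"]
    pairs = []
    for k, (lat, lon) in enumerate(zip(lats, lons)):
        f = fcst[k] if channel is None else fcst[k][channel]
        o = obs[k] if channel is None else obs[k][channel]
        pairs.append((lat, lon, f, o))
    return pairs


# Made-up stand-in for a file with 2 locations and 2 channels.
groups = {
    "MetaData": {"latitude": [10.0, 20.0], "longitude": [100.0, 110.0]},
    "ObsValue": {"brightness_temperature": [[250.0, 260.0], [251.0, 261.0]]},
    "HofX": {"brightness_temperature": [[249.5, 260.5], [251.5, 260.0]]},
}

chan1_pairs = extract_pairs(groups, "brightness_temperature", channel=1)
```

Switching fcst_group (for example to MPASJEDIHofX or hofx) selects a different "forecast" without changing the pairing logic.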
Please let me know if you have any questions or would like to discuss (I'll find a meeting time in the next few days either way).