Enhance MET to calculate weighted contingency table counts and statistics

Question

Enhance MET to calculate weighted contingency table counts and statistics

Closed this issue 2 months ago · 7 comments

Answer 1 · 2024-07-25T21:29:41.000Z

On 7/25/24, a meeting was held to discuss specifics of this issue and gather any unknown requirements.

The results of that meeting are collected in the meeting notes.

To accurately reflect what was discussed in the meeting, the contents of this issue will also be updated.

Answer 2 · 2024-10-03T16:46:09.000Z

@JohnHalleyGotway and @j-opatz discussed implementation details via Slack on 10/3/24. Here's a summary of why we decided to use the existing grid_weight_flag option rather than adding a new option to provide finer configuration control.

From @JohnHalleyGotway:
I’m looking for some advice on implementation details of MET #2887. Grid-Stat has an existing grid_weight_flag config option (see description). I need a way to enable/disable weights being used for contingency table counts and stats.

The simplest choice is using the existing config option. If enabled, weights are applied to continuous stats (existing) and also contingency tables (new). However, that would NOT allow for weighted continuous stats and un-weighted categorical stats… which is what we currently get. In that way it’s not entirely backward compatible. But it sure is simple.

A less simple choice would be adding a new config option specifically for contingency table counts and stats. But IMHO that’s more confusing… but it would enable the changes to be more backward compatible.

From @j-opatz:
If it's advantageous for us to use the existing config option, I'd wager that most users who want weighted statistics, be they continuous or categorical, will want them all to be weighted. Even if they wanted one weighted and not the other, that's why we allow instance names in METplus wrappers: users could have one instance where all CTS and CNT output is weighted (via the grid_weight_flag COS_LAT or AREA settings) and another instance where none is. It costs more output and run time, but it would allow us to use the existing logic.

Answer 3 · 2024-10-04T16:24:56.000Z

Running the full set of unit tests after updating Grid-Stat to apply weights to contingency tables produces many diffs. Here's some updated output, comparing the CTC line to the SL1L2 line for the same data:

V12.0.0 GFS   NA   240000    20120410_000000 20120410_000000 000000   20120410_000000 20120410_000000 WIND      m/s        P850     WIND      m/s       P850    GFSANL FULL          NEAREST     1           >OCDP90     >OCDP90          NA         NA    CTC       18333.60223 1523.51141   643.90538     692.14901 15474.03643      0.5
V12.0.0 GFS   NA   240000    20120410_000000 20120410_000000 000000   20120410_000000 20120410_000000 WIND      m/s        P850     WIND      m/s       P850    GFSANL FULL_BIN_MEAN NEAREST     1           NA          NA               NA         NA    SL1L2     29030          8.16125     8.23317       0         100.47851    102.08706   1.30328

Note that the values in the TOTAL column differ: 29030 != 18333.60223
The SL1L2 TOTAL column reports the number of matched pairs used, while the CTC TOTAL column reports the sum of the weights of the 2x2 cells. I don't like this. It'd be better for the CTC TOTAL column to continue reporting the number of matched pairs, with the understanding that the sum of the cells will no longer equal the TOTAL column when non-1.0 weights are used. Recommend updating the contingency table classes accordingly.
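The recommendation above can be illustrated with a minimal Python sketch. The `WeightedCTC` class and its method names are hypothetical and are not MET's actual contingency table classes; the point is only that TOTAL counts matched pairs while each cell accumulates weights.

```python
from dataclasses import dataclass


@dataclass
class WeightedCTC:
    """Hypothetical 2x2 table; NOT MET's actual contingency table class."""
    total: int = 0       # number of matched pairs, always an integer
    fy_oy: float = 0.0   # each cell accumulates the weights of its pairs
    fy_on: float = 0.0
    fn_oy: float = 0.0
    fn_on: float = 0.0

    def add(self, fcst_event: bool, obs_event: bool, weight: float = 1.0) -> None:
        """Count one matched pair and add its weight to the matching cell."""
        self.total += 1
        if fcst_event and obs_event:
            self.fy_oy += weight
        elif fcst_event:
            self.fy_on += weight
        elif obs_event:
            self.fn_oy += weight
        else:
            self.fn_on += weight

    def weight_sum(self) -> float:
        """Sum of the four cells, which differs from total for non-1.0 weights."""
        return self.fy_oy + self.fy_on + self.fn_oy + self.fn_on


# Made-up pairs: (forecast event, observed event, weight)
pairs = [(True, True, 0.5), (True, False, 0.25), (False, True, 0.75), (False, False, 1.0)]
table = WeightedCTC()
for fcst_event, obs_event, weight in pairs:
    table.add(fcst_event, obs_event, weight)

print("TOTAL:", table.total, "sum of cells:", table.weight_sum())
```

With the default weight of 1.0 the two quantities coincide, which is why the distinction only appears once grid weighting is enabled.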

Note also that if TOTAL reports the integer number of matched pairs rather than the sum of the weights, then the conversion from FHO to CTC is no longer well-defined for weights other than the default of 1.0. Specifically, the sum of the weights is not reported in the existing FHO line type. Consider adding a new WEIGHT_SUM column to the end of the line type so that CTC can be derived from FHO.
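A short Python sketch of why the conversion breaks down. It assumes the FHO rates are fractions of the summed cells (F_RATE for forecast events, H_RATE for hits, O_RATE for observed events); the function names are hypothetical, and the numbers are the COS_LAT_WEIGHT row reported later in this thread.

```python
def fho_from_cells(fy_oy, fy_on, fn_oy, fn_on):
    """Return (F_RATE, H_RATE, O_RATE) as fractions of the summed cells."""
    cell_sum = fy_oy + fy_on + fn_oy + fn_on
    return (fy_oy + fy_on) / cell_sum, fy_oy / cell_sum, (fy_oy + fn_oy) / cell_sum


def ctc_from_fho(scale, f_rate, h_rate, o_rate):
    """Rebuild the four cells from the rates and a scale factor."""
    fy_oy = h_rate * scale
    fy_on = (f_rate - h_rate) * scale
    fn_oy = (o_rate - h_rate) * scale
    fn_on = (1.0 - f_rate - o_rate + h_rate) * scale
    return fy_oy, fy_on, fn_oy, fn_on


# COS_LAT_WEIGHT row from this thread: TOTAL is the matched-pair count.
total = 15113
cells = (5439.37053, 82.98455, 74.59037, 5778.8472)
weight_sum = sum(cells)

rates = fho_from_cells(*cells)
wrong = ctc_from_fho(total, *rates)           # scaled by the pair count
recovered = ctc_from_fho(weight_sum, *rates)  # scaled by the sum of weights

print("original FY_OY:       ", cells[0])
print("from TOTAL:           ", wrong[0])
print("from sum of weights:  ", recovered[0])
```

Only the version scaled by the sum of the weights reproduces the original cells, and that sum is exactly the quantity FHO does not report.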

Answer 4 · 2024-10-08T17:11:28.000Z

A discussion about adding a new column to the FHO line type was held in today's METplus wrappers meeting. After hearing the pros and cons, as well as the general reception from the rest of the METplus team members present, it seems the best path forward is to create a hard break: if a user wants area-weighted contingency table statistics, FHO is not a valid line type to request.

I am against raising an error if the user requests FHO while enabling area-weighted stats. Instead, the user should receive an empty FHO file, which indicates that MET received the request for FHO but will not fill it with meaningless values. If there is a desire to add a debug message letting the user know that FHO output files were not filled in because of the area-weighted settings, that would be acceptable.

Answer 5 · 2024-10-08T19:10:59.000Z

As coordinated over Slack, adding a check to disable FHO output if grid weighting is requested. Here's the corresponding warning message:

WARNING: 
WARNING: GridStatConfInfo::process_config() -> Disabling FHO output that is not compatible with grid weighting. Set "grid_weight_flag = NONE" to write FHO output.
WARNING:
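The check described above can be sketched in Python as follows. This is illustrative only: MET's actual check lives in its C++ GridStatConfInfo::process_config(), and the `apply_grid_weight_rules` function and the dictionary representation of the output flags are assumptions made for this example.

```python
def apply_grid_weight_rules(grid_weight_flag, output_flag):
    """Return (updated output flags, warning message or None).

    Hypothetical helper: turns off FHO when any grid weighting is requested.
    """
    updated = dict(output_flag)  # copy so the caller's settings are untouched
    warning = None
    if grid_weight_flag != "NONE" and updated.get("fho", "NONE") != "NONE":
        updated["fho"] = "NONE"
        warning = ('Disabling FHO output that is not compatible with grid '
                   'weighting. Set "grid_weight_flag = NONE" to write FHO output.')
    return updated, warning


flags, msg = apply_grid_weight_rules("COS_LAT", {"fho": "BOTH", "ctc": "BOTH"})
print(flags)
if msg:
    print("WARNING:", msg)
```

With grid_weight_flag set to NONE the flags pass through unchanged and no warning is produced.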

Answer 6 · 2024-10-08T19:45:51.000Z

@j-opatz FYI, I updated the existing tests in unit_grid_weight.xml to request the CTC/CTS/MCTC/MCTS output since it is now impacted by the grid_weight_flag setting. Listed below are CTC counts for the same data but with different grid weighting options applied:

DESC           ... LINE_TYPE TOTAL FY_OY          FY_ON        FN_OY        FN_ON          EC_VALUE
NO_WEIGHT      ... CTC       15113 6163           102          91           8757           0.5
COS_LAT_WEIGHT ... CTC       15113 5439.37053     82.98455     74.59037     5778.8472      0.5
AREA_WEIGHT    ... CTC       15113 16813231.10054 256296.93192 230381.69187 17815176.73687 0.5

Note that the NO_WEIGHT line has un-weighted integer counts, which is equivalent to using a constant weight of 1.0.

The COS_LAT_WEIGHT line contains sums of weights, where the weights are defined by the absolute value of the cosine of the latitude... so bigger weights at the equator and smaller at the poles.

The AREA_WEIGHT line contains sums of weights, where the weights are defined by the true grid box area in square kilometers. The result is very similar to the COS_LAT weighting for a lat/lon grid.
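The similarity between the two weightings on a lat/lon grid can be sketched in Python. This assumes a spherical Earth with a mean radius of 6371 km and a hypothetical 0.5 degree grid spacing; the function names are illustrative, and MET's own area computation may differ.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius for a spherical model


def cos_lat_weight(lat_deg):
    """COS_LAT-style weight: absolute value of the cosine of the latitude."""
    return abs(math.cos(math.radians(lat_deg)))


def box_area_km2(lat_deg, dlat_deg, dlon_deg):
    """Area in km^2 of a lat/lon box centered at lat_deg on a sphere."""
    lat_s = math.radians(lat_deg - dlat_deg / 2)
    lat_n = math.radians(lat_deg + dlat_deg / 2)
    return (EARTH_RADIUS_KM ** 2 * math.radians(dlon_deg)
            * abs(math.sin(lat_n) - math.sin(lat_s)))


# Ratio of area weight to cos(lat) weight at several latitudes
ratios = {lat: box_area_km2(lat, 0.5, 0.5) / cos_lat_weight(lat)
          for lat in (0.0, 30.0, 60.0)}
for lat, ratio in ratios.items():
    print(lat, ratio)
```

On a regular lat/lon grid the ratio is the same at every latitude, so under these assumptions the two schemes differ only by a constant factor, and any statistic built from ratios of cells is unaffected by the choice.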

I notice that the sums of areas produce very large numbers, and I worry a bit about variable overflow: multiplying a big number by another big number yields an even bigger number. However, double-precision variables support extremely large numbers (up to ~1.79769e+308), so we should be fine.
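A quick Python check of the headroom, using the largest AREA_WEIGHT cell shown above. Python floats are IEEE 754 doubles on common platforms, so `sys.float_info.max` is the same limit quoted here; squaring the cell is just a stand-in for the kind of product that appears in derived statistics.

```python
import math
import sys

largest_cell = 17815176.73687          # FN_ON from the AREA_WEIGHT row
product = largest_cell * largest_cell  # stand-in for a product of two cells

print("double max:", sys.float_info.max)
print("product:   ", product)
print("finite:    ", math.isfinite(product))
```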

I'm wondering if we should leave these numbers as-is. Or should we go through and normalize the weights to put them much closer to a value of 1.0? Is it better to report the weights exactly as-defined? Or is it better to normalize to make them more comparable to the un-weighted counts?
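One possible normalization, sketched in Python purely for comparison: rescale each row so that its cells sum to the matched-pair count, which is the same as dividing every weight by the mean weight. This is an assumption for illustration, not a scheme the thread settled on, and `normalize_cells` is a hypothetical helper.

```python
def normalize_cells(total, cells):
    """Rescale weighted cells so that they sum to the matched-pair count."""
    scale = total / sum(cells)
    return [c * scale for c in cells]


TOTAL = 15113
# FY_OY, FY_ON, FN_OY, FN_ON from the three CTC rows above
rows = {
    "NO_WEIGHT": [6163, 102, 91, 8757],
    "COS_LAT_WEIGHT": [5439.37053, 82.98455, 74.59037, 5778.8472],
    "AREA_WEIGHT": [16813231.10054, 256296.93192, 230381.69187, 17815176.73687],
}

normalized = {name: normalize_cells(TOTAL, cells) for name, cells in rows.items()}
for name, cells in normalized.items():
    print(name, [round(c, 2) for c in cells])
```

After rescaling, all three rows are on the scale of the un-weighted counts, at the cost of no longer reporting the weights exactly as defined.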

Answer 7 · 2024-10-08T20:08:28.000Z

Is it better to report the weights exactly as-defined? Or is it better to normalize to make them more comparable to the un-weighted counts?

This is a great question. While my first reaction was to support normalizing the data to avoid returning extreme values, I'm concerned that doing so would create a "firewall" for users who want the actual values. As with the FHO line type, I'd like to avoid an outcome where users would have been better served by the raw values, even absurdly large ones, than by what we give them, which compresses the actual values in the name of cleanliness.
