E3SM-Project/e3sm_diags

[Bug]: `cosp_histogram_standardize()` conditional for adding bounds is incorrect and default bounds length might not align with coordinates

Opened this issue · 5 comments

What happened?

In PR #748, I found a silent bug in a conditional statement for handling bounds in cosp_histogram_standardize().

prs_bounds = getattr(prs, "bounds")
if prs_bounds is None:
cloud_prs_bounds = np.array(
[
[1000.0, 800.0],
[800.0, 680.0],
[680.0, 560.0],
[560.0, 440.0],
[440.0, 310.0],
[310.0, 180.0],
[180.0, 0.0],
]
) # length 7
prs.setBounds(np.array(cloud_prs_bounds, dtype=np.float32))

The conditional follows this flow:

  1. Check if the axis has a "bounds" attr
  2. If "bounds" attr exists, don't add bounds
    - Issue: bounds are never added for the coordinates that don't actually have bounds
  3. If "bounds" attr does not exist, add bounds
    - Issue: the default bounds length (cloud_prs_bounds) might not align with the length of the coordinates, resulting in this error: ValueError: conflicting sizes for dimension 'misr_cth': length 7 on the data but length 16 on coordinate 'misr_cth'

What did you expect to happen? Are there are possible answers you came across?

In my version of the code, I get the "bounds" attr AND check if the bounds actually exist in the dataset.

If bounds don't exist in the dataset, bounds should be dynamically generated to align with the length of the coordinates instead of using a fixed bounds array (I think?).

def _add_missing_cloud_bnds(
    ds: xr.Dataset, axis_var: xr.DataArray, axis: CloudAxis
) -> xr.Dataset:
    bnds_key = axis_var.attrs.get("bounds")
    if bnds_key is None or bnds_key not in ds.data_vars.keys():
        # Should align with the length of the data.
        bnds_data = `_get_missing_cloud_bnds()`
        default_bnds = xr.DataArray(
            name=f"{axis_var.name}_bnds",
            data=bnds_data,
            dims=list(axis_var.dims) + ["bnds"],
            coords=axis_var.coords,
        )
        ds[default_bnds.name] = default_bnds

    return ds

Minimal Complete Verifiable Example (MVCE)

import cdms2

# Make sure bounds are not automatically generated
cdms2.axis.setAutoBounds(0)
ds = cdms2.open(
    "tests/integration/integration_test_data/CLDMISR_ERA-Interim_ANN_198001_201401_climo.nc"
)

# Get the prs axis.
prs = ds["CLD_MISR"].getAxis(1)

# Get the prs "bounds" attr.
prs.bounds
# 'misr_cth_bnds'

# Check if  "misr_cth_bnds" exists in the dataset.
prs.getBounds()
# None

# This logic is in `cosp_histogram_standardize()`, which is incorrect because `prs.getBounds()` is `None`.
if prs.bounds is None:
    print("Add bounds here")

# Print out the variables. Notice how there is no "misr_cth_bnds"
ds.variables
# {'misr_tau_bnds': <cdms2.fvariable.FileVariable at 0x7f3dbdc8aec0>,
#  'lat_bounds': <cdms2.fvariable.FileVariable at 0x7f3953145db0>,
#  'lon_bounds': <cdms2.fvariable.FileVariable at 0x7f3953146650>,
#  'CLMISR': <cdms2.fvariable.FileVariable at 0x7f39531462f0>}

Relevant log output

No response

Anything else we need to know?

No response

Environment

Latest main version of e3sm_diags and all previous versions that use the cosp_histogram_standarize() function.

Hi @chengzhuzhang, I believe I found a bug in the old version of cosp_histogram_standardize(). Let me know if you agree with my solution, which includes adding missing bounds dynamically.

I might need help understanding the logic for the function that dynamically generates bounds (e.g., use coordinates as midpoints?).

Apologize for missing out this bug report! It is an interesting silence bug. And it almost suggests that the bounds are never used in the downstream calculation? I will have to look more.

Apologize for missing out this bug report! It is an interesting silence bug. And it almost suggests that the bounds are never used in the downstream calculation? I will have to look more.

As far as I can tell, the old CDAT version of cosp_histogram_standardize() and cosp_histogram_driver.py don't reference prs or tau bounds for any calculations. cosp_histogram_standardize() subsets the variable using the low and high values of prs and/or tau, not prs bounds and/or tau bounds.

Hi Tom. I agree with your assessment that the cosp_histogram_standardize() are never actually used in the cdat version of the code. The conditions such as prs_bounds is None is never meet, because prs_bounds always has a string type, if the bound attribute exist. And since the bounds are not referenced in calculation, we can perhaps remove this function?

Thanks for investigating @chengzhuzhang.

We can simplify the codebase by removing this function since it doesn't seem necessary to keep. Bounds are not used for calculations and you confirmed that newer model data and reference data have bounds (your comment).