ClimateImpactLab/climate_toolbox

Check nan handling in aggregation

delgadom opened this issue · 4 comments

Proposed spec:

For a given region to be aggregated:

  • If there are no NaNs in a region, you're good. Just aggregate the data.
  • If a NaN is present in a region but the region is not entirely NaN, make sure not to drop any weight: aggregate the data in the region using the weights of the non-NaN pixels only (i.e., renormalize the weighted average over the valid pixels)
  • If an entire region is NaN, the output should be NaN

Note that no interpolation is done in time, space, or any other dimension with these tools.

This behavior should be tested on the output.
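
To make the cases concrete, here's a minimal numeric sketch of the intended behavior (the helper name and values are invented for illustration, not taken from the codebase):

import numpy as np

def spec_weighted_mean(values, weights):
    # Weighted mean over non-NaN pixels; NaN if the whole region is missing
    valid = ~np.isnan(values)
    if not valid.any():
        return np.nan
    return np.average(values[valid], weights=weights[valid])

weights = np.array([0.5, 0.3, 0.2])
spec_weighted_mean(np.array([2.0, 3.0, 4.0]), weights)     # no NaNs -> 2.7
spec_weighted_mean(np.array([2.0, np.nan, 4.0]), weights)  # reweighted -> 1.8/0.7 ~ 2.571
spec_weighted_mean(np.array([np.nan, np.nan, np.nan]), weights)  # all NaN -> NaN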

@atrisovic @andyhultgren here is the proposed NaN handling spec. Does this seem right?

@delgadom @atrisovic Yes this is right. Here is the full set of NaN conditions:

    1) defaults to area weights when user-defined weights are
        missing/zero for an entire adm polygon (see the sketch after this list)
    2) maintains NaN gridcell values (in climate data) instead of defaulting 
        to zero
    3) reweights around NaN gridcells in climate data when aggregating spatially
        (only outputs a NaN value per region-day if all pixels are missing)
    4) assigns NaN values in climate data to a given unit of temporal aggregation
        (month or year) if any day has a NaN value
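
Condition (1) might look something like the following sketch (column names like popwt and areawt are hypothetical, not the ones in the codebase):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'adm': ['A', 'A', 'B', 'B'],
    'areawt': [1.0, 3.0, 2.0, 2.0],
    'popwt': [0.6, 0.4, 0.0, np.nan],  # polygon B has no usable user weights
})

# Total user-defined weight per polygon, treating NaN as zero
totals = df.groupby('adm')['popwt'].transform(lambda s: s.fillna(0).sum())

# Fall back to area weights where the user weights are missing/zero for
# the entire polygon
df['weight'] = np.where(totals > 0, df['popwt'].fillna(0), df['areawt'])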

For example, our transforms of the climate data are structured to pass NaNs through like this:
import numpy as np

# term_val (the polynomial exponent) and raw_data are defined upstream
my_poly = lambda x: x if np.isnan(x) else pow(x, term_val)
vfunc = np.vectorize(my_poly)
transdata = vfunc(raw_data)

And then when we spatially collapse gridcells to adm units, we take the weighted average while ignoring missing values (unless they are all missing for a given adm unit, in which case NaN is returned):
# pandas' skipna=True default drops NaN cells from both sums
num = grp._get_numeric_data()
return (num.multiply(num[weight_col], axis=0).sum()
        / num.notnull().astype('int').multiply(num[weight_col], axis=0).sum())
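
For anyone following along, here is a self-contained sketch of how a per-group function like that gets applied (the data and column names are invented, and select_dtypes(include='number') stands in for the private _get_numeric_data):

import numpy as np
import pandas as pd

def weighted_mean(grp, weight_col='weight'):
    num = grp.select_dtypes(include='number')
    return (num.multiply(num[weight_col], axis=0).sum()
            / num.notnull().astype('int').multiply(num[weight_col], axis=0).sum())

df = pd.DataFrame({
    'adm': ['A', 'A', 'A', 'B', 'B'],
    'tas': [2.0, np.nan, 4.0, np.nan, np.nan],
    'weight': [0.5, 0.3, 0.2, 0.5, 0.5],
})

df.groupby('adm')[['tas', 'weight']].apply(weighted_mean)
# A: tas = (2*0.5 + 4*0.2) / (0.5 + 0.2) ~ 2.571 (reweighted around the NaN)
# B: tas = NaN (every pixel missing -> 0/0)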

However when aggregating over time (for an adm unit) if e.g. one day has missing data for the entire adm unit, then the temporally-aggregated output (say a month containing that day) is also reported as missing. As in:
return grp.sum(skipna=False, axis=1)
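
A quick illustration of that behavior, assuming a layout with days as columns (which the axis=1 implies):

import numpy as np
import pandas as pd

days = pd.DataFrame(
    {'day1': [1.0, 2.0], 'day2': [np.nan, 3.0], 'day3': [2.0, 4.0]},
    index=['adm_A', 'adm_B'])

days.sum(skipna=False, axis=1)
# adm_A    NaN   <- one missing day marks the whole period missing
# adm_B    9.0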

All of these code snippets are from the master branch of the merge_transform_average.py script.

@andyhultgren thanks! You're absolutely right - @atrisovic the aggregation math in aggregations.py line 92 does need to be changed:

    weighted = xr.Dataset({
        variable: (
            (
                (ds[variable]*ds[aggwt])
                .groupby(agglev)
                .sum(dim='reshape_index')) /
            (
                ds[aggwt]
                .groupby(agglev)
                .sum(dim='reshape_index')))})

should be

    weighted = xr.Dataset({
        variable: (
            (
                (ds[variable]*ds[aggwt])
                .groupby(agglev)
                .sum(dim='reshape_index')) /
            (
                ds[aggwt]
                # only count weight from pixels where the variable is valid
                .where(ds[variable].notnull())
                .groupby(agglev)
                .sum(dim='reshape_index')))})
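
As a sanity check, here's a toy comparison of the two denominators (tas, areawt, and hierid are stand-in names for variable, aggwt, and agglev):

    import numpy as np
    import xarray as xr

    ds = xr.Dataset(
        {'tas': ('reshape_index', [2.0, np.nan, 4.0]),
         'areawt': ('reshape_index', [0.5, 0.3, 0.2])},
        coords={'hierid': ('reshape_index', ['A', 'A', 'A'])})

    num = (ds['tas'] * ds['areawt']).groupby('hierid').sum(dim='reshape_index')
    old = num / ds['areawt'].groupby('hierid').sum(dim='reshape_index')
    new = num / (ds['areawt'].where(ds['tas'].notnull())
                 .groupby('hierid').sum(dim='reshape_index'))
    # old: 1.8 / 1.0 = 1.8   (the NaN pixel's weight dilutes the mean)
    # new: 1.8 / 0.7 ~ 2.571 (reweighted over the valid pixels only)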

also @andyhultgren, sorry for the nitpicky off-topic point, but just so you know, np.vectorize is super slow compared with actual vectorized math operations. It's essentially just a for loop, so this is a pure-Python cell-wise operation:

my_poly = lambda x: x if np.isnan(x) else pow(x, term_val)
vfunc = np.vectorize(my_poly)
transdata = vfunc(raw_data)

You'd see major speedups with something like this:

transdata = np.where(np.isnan(raw_data), raw_data, raw_data ** term_val)

Although exponentiation already preserves NaNs, so the mask doesn't actually do anything here and this could simply be written:

transdata = raw_data ** term_val

Oh, and to respond completely: the aggregation code only aggregates in space; aggregation in time is defined in each transformation. Mortality polynomials preserve daily resolution, ag GDD/KDDs sum to the month, and energy is annual. We'll make sure the transformations fit the current spec, but just so you know, it will ultimately be up to the user to make sure the transformations are done right.