JuliaClimate/ClimateBase.jl

Aggregation functions that skip missing values

pkeil7 opened this issue · 8 comments

Currently, aggregation functions like mean() return missing if there is even a single missing value in the field. However, for many applications you would want the missing values to be skipped, e.g. if you only have land surface temperature and have missing values over water. The skipmissing() function should be able to do this: https://docs.julialang.org/en/v1/manual/missing/#Skipping-Missing-Values . You could even argue that this should be the default behaviour?

No, this absolutely should not be the default behavior. In software design you want to minimize assumptions you make for the user. The proper mathematical value of the average/median/whatever of an array containing missing values is missing, as dictated by missing's logic.

Since a function skipmissing exists, users can use it. It isn't a big deal to write the extra 10 characters necessary. Did you try to use it and it didn't work? If so, please report the error message.
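
For a plain reduction it is literally a one-liner, e.g. (a minimal sketch using only the standard library):

julia> using Statistics

julia> x = [1.0, missing, 3.0];

julia> mean(x)
missing

julia> mean(skipmissing(x))
2.0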

BTW, if your array has missing values in the time domain (e.g. some days or months are not covered) then you can consider the function sinusoidal_continuation that we offer.

Well, it doesn't seem to work out of the box: zonalmean(skipmissing(A)) returns a MethodError:

MethodError: no method matching zonalmean(::Base.SkipMissing{ClimArray{Float64, 3, Tuple{Coord{Vector{SVector{2, Float64}}, DimensionalData.CoordMode{Tuple{UnionAll, UnionAll}}, DimensionalData.NoMetadata}, Ti{Vector{Float64}, DimensionalData.Sampled{DimensionalData.Ordered{DimensionalData.ForwardIndex, DimensionalData.ForwardArray, DimensionalData.ForwardRelation}, DimensionalData.Irregular{Tuple{Nothing, Nothing}}, DimensionalData.Points}, Dict{String, String}}, Pre{Vector{Float64}, DimensionalData.Sampled{DimensionalData.Ordered{DimensionalData.ReverseIndex, DimensionalData.ForwardArray, DimensionalData.ForwardRelation}, DimensionalData.Irregular{DimensionalData.AutoBounds}, DimensionalData.Points}, Dict{Any, Any}}}, Tuple{}, Array{Float64, 3}, Dict{String, Any}}})

I get your point about the proper mathematical value, but this issue arises quite often in climate science, I think. Other tools, like cdo and Python's xarray, therefore also skip missing values by default. So I would argue that in climate science the default assumption is that missing values are skipped.

What does it mean to do zonalmean(skipmissing(A))? Doing mean(skipmissing(A)) is valid, because the dimensional structure of A doesn't matter, only its contained numbers, and mean(skipmissing(A)) is also a number. But what is supposed to happen for the zonal mean?

Should you weight every latitude by the number of non-missing points? Does cdo do that? From a statistics perspective this question is of fundamental importance, yet it is undefined when you write zonalmean(skipmissing(A)). I believe that you are interested in the case where each latitude is not weighted by the number of non-missing points. Notice that in this scenario the integral of the zonal mean output is physically incorrect. The same is true for the average of the zonal mean. Both of these values are incorrect if you do not weight the zonal mean output by the number of non-missing points in each zone. This weighting must happen in addition to the weighting by area (the multiplication by cos(θ)).
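
To make the caveat concrete, here is a toy sketch with plain arrays (not ClimateBase code): two latitude bands with two grid points each, one point missing.

using Statistics

lats  = [0.0, 60.0]                  # latitudes of the two zonal bands (degrees)
band1 = [1.0, missing]               # grid point values at latitude 0
band2 = [3.0, 5.0]                   # grid point values at latitude 60
w     = cosd.(lats)                  # area weight of each band, cos(θ)

# zonal means with missing values skipped
zm = [mean(skipmissing(band1)), mean(band2)]      # [1.0, 4.0]

# averaging the zonal means with only the cos(θ) weight:
sum(zm .* w) / sum(w)                             # = 2.0, physically incorrect

# additionally weighting each band by its number of non-missing points
# recovers the true area-weighted mean over all non-missing points:
n = [count(!ismissing, band1), count(!ismissing, band2)]
sum(zm .* w .* n) / sum(w .* n)                   # = 2.5, correct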

But in any case, I digress. A researcher should know this caveat and properly do their statistics. It is really easy to modify the zonalmean function to skip missing values. Try to do a PR for it, and if you need help let me know and I'll guide you. If you can't do it, then I'll do it.


BTW, I have the exact same situation as you, but I follow a very different approach. Instead of putting missing in the values of A, I have a second array W which are statistical weights. They have the value 0 over land and 1 over ocean, and values in between wherever the ocean area fraction isn't exactly 100%. Then I do a statistically weighted zonal mean, zonalmean(A, W).

I now realized that I haven't put these functions in ClimateBase.jl though :D haha sorry! Here is the source code:

# Allow zonal mean with weights (uses `mean` and `weights` from StatsBase)
ClimateBase.zonalmean(A::ClimArray, W) = zonalmean(spacestructure(A), A, W)
function ClimateBase.zonalmean(::UnstructuredGrid, A::ClimArray, W)
    @assert size(A) == size(W)
    idxs, lats = uniquelats(A)
    other = otherdims(A, Coord())
    r0 = zeros(eltype(A), (length(lats), size.(Ref(A), other)...))
    R = ClimArray(r0, (Lat(lats), other...); name=A.name, attrib=A.attrib)
    for (i, r) in enumerate(idxs)
        for j in otheridxs(A, Coord())
            R[Lat(i), j...] = mean(view(A, Coord(r), j...), weights(view(W, Coord(r), j...)))
        end
    end
    return R
end
function ClimateBase.zonalmean(::UnstructuredGrid, A::ClimArray{T, 1}, W) where {T}
    idxs, lats = uniquelats(A)
    res = zeros(T, length(lats))
    for (i, r) in enumerate(idxs)
        res[i] = mean(view(A.data, r), weights(view(W.data, r)))
    end
    return ClimArray(res, (Lat(lats),); name=A.name, attrib=A.attrib)
end
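
A minimal usage sketch, assuming oceanfrac is a plain array of ocean area fractions with the same size as A (oceanfrac and the name are hypothetical; dims comes from DimensionalData):

W = ClimArray(oceanfrac, dims(A); name = "ocean_fraction")  # 0 over land, 1 over ocean
zonalmean(A, W)  # statistically weighted zonal mean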

This is in general the approach I have followed so far: instead of using "land masks", use statistical weights. I find it mathematically better because it allows weights that are neither 1 (value exists in A) nor 0 (missing value in A), which is all the missing approach gives you.


@Balinus by the way, what's your take on this issue? Should we try to support missing as much as possible? Should we wrap the data with a skipmissing for literally every function we have, and make whatever assumptions we need to make it happen? This will require a huge effort, because of course every single function must be modified. And our "weighted" versions also need to be modified.

I'm not sure I understand all of your points. For zonalmean it should be straightforward, no? I would only use those points that are not missing, just as the skipmissing function does. So for one latitude: if you have 20 points over land with missing values and 2 points over the ocean with values 5 and 6, then the zonal mean would be 5.5. I'm not sure I understand your example with the integral, but yes, if you for example do a meridional mean of the zonal mean, then you have lost the weights and it would be incorrect. But that is not what zonalmean is for, is it? Instead, in that case you should do a spacemean, which would also have to be modified.

In your code example, one could add a parameter ignore_missing which, if true, constructs a weight array that has zeros where the missing values are.

Without going into the technical details already discussed, my take on this is that missing values are so common in climate sciences that we need something for non-finite values. For example, I do not know of a single existing weather station that would not have a missing value somewhere in its time series. Hence, as soon as we deal with weather stations, you would need to add those famous 10 characters each time you are doing statistics. The same applies to climate simulations, but there it's more about corner cases, as mentioned in this thread.

So, by experience, I always have an option to ignore missing values. Note that I prefer to have this behaviour set to false by default. I want my functions to return a missing/NaN by default and make a conscious effort to add the appropriate flag when needed. Otherwise, you might miss a bug somewhere in your pipeline. For example, imagine you have incorrectly dealt with one of your functions and it returns all NaNs except for one finite value: the rest of the pipeline will not properly account for this bug and you might not see it.

What I did in ClimateTools was to use the algorithms in Images.jl

For example, see this function
https://github.com/JuliaImages/Images.jl/blob/a80031eb21662911f7217c6cad7ef0b676002913/src/algorithms.jl#L12

Yes, you are right, zonalmean's purpose is not to consider the weighting for missings.

In your code example, one could add a parameter ignore_missing which, if true, constructs a weight array that has zeros where the missing values are.

That would be wonderful if possible, but that's not how missing values behave. 0*missing is still missing, by definition.

Which means that

julia> x = [5, 6, missing]
3-element Vector{Union{Missing, Int64}}:
 5
 6
  missing

julia> w = StatsBase.weights([1, 1, 0])
3-element Weights{Int64, Int64, Vector{Int64}}:
 1
 1
 0

julia> sum(x .* w)
missing

Regardless, I agree that missing stuff should be supported more rigorously, but this will need more manpower here. At the moment the weights approach that I follow covers every single case of missing data. You can modify your array to assign the value 0 to every index that has missing, and also make statistical weights that are 1 where there are non-missing values and 0 where there are missing ones.

But you have to modify the original array nevertheless.

Actually I have a simple design idea: create a function missingweights(A, val = 0.0). This function takes an array A and returns two arrays B, W: B is the array A converted to have value val wherever A has missing, and W is an array that has value 1 for non-missing entries and 0 for missing ones.

We can offer this simple function to users and they can use the output with all existing functions, no problemo.
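
A minimal sketch of the idea for plain arrays (missingweights does not exist yet, this is just the proposal; for a ClimArray the outputs would additionally need to keep its dimensions):

function missingweights(A, val = 0.0)
    B = coalesce.(A, val)                      # A with every missing replaced by val
    W = map(x -> ismissing(x) ? 0.0 : 1.0, A)  # weight 0 where missing, 1 elsewhere
    return B, W
end

# afterwards every existing (weighted) function works on the output, e.g.
# B, W = missingweights(A)
# zonalmean(B, W)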