JuliaClimate/ClimateBase.jl

replace AxisArrays with GeoData / DimensionalData

Closed this issue · 5 comments

I've started using DimensionalData (or its derivative GeoData) which seems to be good and also provides strong high-level selection functionality. Shouldn't you consider to switch ClimGrid to GeoData and also simplify the core struct in the process? I've started contributing to DimensionalData so I have a basic familiarity of the package and I can help with this.

OR at least establish a convertion ClimGrid that takes a GeoArray and makes it a ClimGrid.

By the way, I am curious, why is it that

  longrid::AbstractArray{N,2} where N # the longitude grid
  latgrid::AbstractArray{N,2} where N # the latitude grid

are 2 dimensional, when clearly both latitude and longitude are one dimensional vectors?

Yes, I think this is centainly one option for the future of ClimateBase. When I played with DimensionalData (not a lot and it was like 3 or 4 months ago), I wasn't able to reproduce the provided examples so I wasn't sure how mature it was. Perhaps I played with GeoData though, I'm not 100% sure it was DimensionalData. At the time, documentation was sparse so I wasn't too sure how to use the package correctly.

Now, I see that the package is registered and perhaps this is a good time to re-explore the option. I think that if we go with DimensionalData, it's probably a "simple" replacement of AxisArrays, whereas if we use GeoData, a lot of functionalities would be baked-in (multiple file formats, georeferenced fields, etc...).

Anyway, I'm all ears for a replacement proposition! :) I,ll have to check how metadata is handled by GeoData. Right now, the objective is to be able to write CF-compliant (or close to) netCDF files by using ClimateTools (e.g. load data, do computations, bias-corrections, subset data, etc... then write the results to a netCDF file).

By the way, I am curious, why is it that

  longrid::AbstractArray{N,2} where N # the longitude grid
  latgrid::AbstractArray{N,2} where N # the latitude grid

are 2 dimensional, when clearly both latitude and longitude are one dimensional vectors?

latgrid and longrid are not the dimension vectors of the datasets (lon_raw and lat_raw are). This is a workaround to easily subset data spatially with polygons. For example, Regional Climate Models are often on non-regular grids (e.g. Polar stereographic grids, etc..). With non-regular grids, most netCDF datasets provide lat-lon arrays.

The idea is that I wanted to simply throw in polygons defined in lat-lon coordinates and subset the data. There were 2 options: convert the polygons to native grid dimensions (and building clever functions for all those transformations or using GDAL) or build latitude and longitude arrays mimicking the dimensions of the data arrays. I chose the latter: much simpler! It also provided an easy solution for surface maps. I had no need to transform the datasets from one projection to another.

When there is a regular lat-lon grids (most Global Climate Models), I build latgrid and longrid arrays from the regular dimensions of the dataset.

Hope my explanation are clear!

GeoData interfaces NCDatasets.jl and takes care of loading a .nc file and converting directly to a DimensionalArray-like structure. Unfortunately its documentation is really basic at the moment, same as DimensionalData... Hope the owner there considers moving to an org at some point.

The metadata functionality of GeoData.jl is also inferior to that of NCDatasets.jl at the moment, so maybe its worth waiting: https://github.com/rafaqz/GeoData.jl/issues/22

Thanks for the information!

I guess that ClimGrid could be a struct with a GeoArray as the data and keep the metadata as it is right now. (I should take the time to clean up the struct also as some fields are available elsewhere).

struct ClimGrid
  data::GeoArray
  longrid::AbstractArray{N,2} where N # the longitude grid
  latgrid::AbstractArray{N,2} where N # the latitude grid
  msk::Array{N, 2} where N
  grid_mapping::Dict # bindings of native grid
  dimension_dict::Dict
  model::String
  frequency::String
  experiment::String
  run::String
  project::String # CORDEX, CMIP5, etc.
  institute::String
  filename::String
  dataunits::String
  latunits::String # of the coordinate variable
  lonunits::String # of the coordinate variable
  variable::String # Type of variable (i.e. can be the same as "var", but it is changed when calculating indices)
  typeofvar::String # Variable type (e.g. tasmax, tasmin, pr)
  typeofcal::String # Calendar type
  timeattrib::Dict # Time attributes
  varattribs::Dict # Variable attributes
  globalattribs::Dict # Global attributes
end

I played last night with GeoData and ESDL.

As of January 2020, ESDL is easier to use, with lots of example. Some examples does not work though. What I liked with ESDL is the out-of-core capabilities (you can do calculations with datasets much larger than available RAM). There are also plotting capabilities with ESDLPlots and some notebooks here.

With GeoData, I had less success when I followed the examples: the public API does not seem stable right now. The example script shows NCstack but that throws an error. It seems to have changed to NCDstack. I also had problems with other commands in the example script.

Both library are not ready yet for a direct replacement of AxisArrays. The fact that they are not yet registered is a good indication that the authors are still heavily developing the functionalities.

I'll have a look at DimensionalData when I have time. I understand that the package is more mature, which is logical since this will be the backbone of GeoData.

If you have some examples/scripts/notebooks where DimensionalData is used, I'll be happy to look at them!

edit - NamedDims seems to also be a solution perhaps?

I don't think NamedDims is an option, because it doesn't bundle the dimensions into the array, but only the names. Having the dimensions and data in one structure is extremely useful, at least for me personally, but I also imagine in general. Besides convenience it allows It also allows sensible implementations of selecting a dimension by the values of that dimension, that both AxisArrays and DimensionalData have. For my line of work this is really the thing I care about: being able to select by dimension value. Maybe https://github.com/invenia/IndexedDims.jl is worth considering but it is also without documentation.

You are right, GeoData is unstable. And both it and the core DimensionalData are almost entirely undocumented. Another unfortunate thing (at least for me) is that they are not in an organization, but in the name of an individual, who will keep them in their own name. I am not sure how impactful this will be for building a community around climate in the future... But the owner has made it clear several times that they don't have time to write docs, and I am always skeptical about what would be accepted as a PR and what not, as this is not clearly stated anywhere and thus falls to the personal whims of the owner.

The only thing I use from GeoData is the conversion from a file .nc on my computer to DimensionalArray. After this point I only use the core DimensionalArray functionality.

The reason I started using DimensionalData instead of AxisArrays is that it has less verbose syntax. The "extandability" claims they mention aren't anything notable for me, one can dispatch on symbol parameters in Julia anyway. However, the other thing that is relevant is also that all functions like mean, sum, etc. that take in dimensions are implemented for DimensionalData.

Perhaps we don't need GeoData at all, and we can make our own convertion to DimensionalArray directly. This is also safer as we have one less dependency to worry about (and in fact I was already considering doing it for my personal project).

As far as code is concerned, here is what I do:

TOA = NCDstack(datadir("CERES", "CERES_EBAF_TOA.nc"))
times = TOA["time"]
times = Date.(times.data)  # only keep numeric data as dates, discard metadata
months = monthname.(times[1:12])
lat = TOA["lat"]
cosθ = @. cos(deg2rad(lat))
weights = cosθ/sum(cosθ)

S = TOA["solar_mon"]
drop(mean, S; dims = Lon)

where for convenience I have defined

drop(f, A, dims) = dropdims(f(A; dims = dims); dims = dims)

here both mean as well as dropdims work with S by specifying dimension as the type.

So this code would do a zonal average of S.