Coordination with Rasters.jl?
rafaqz opened this issue · 6 comments
Looking through this codebase we seem to be duplicating a lot of work! Both building on on DimensionalData.jl and NCDatasets.jl.
You have some cool functions for filling missing values, and nice GeoMakie plots. Rasters.jl has fast rasterization/mask from polygons, read/write between formats, Plots.jl plots, and lots of other things.
Of course there are also a lot of climate specific things here too. But it seems like a shame to be duplicating so much effort.
Yeah, I agree completely and have thought of this many times. But I simply do not have the capacity to learn Rasters.jl and hence do the necessary work to remove the code duplication code here and instead use Rasters.jl :( At least not at the moment. The good thing is that the basic API is rather trivial: ncread(...) -> ClimArray
. So no matter how massive the source code change is, as long as this function still exists and is exported, I don't mind.
Here are some things I've coded that may, or may not, be in Rasters.jl and you can take them:
- The whole writing functionality, i.e., take an in-memory
ClimArray
(it is justDimensionalArray
, I only extended it to not commit type piracy as I wanted to extend some basic methods likeshow
). Then save theClimArray
(or a vector of them), into an .nc file. That took quite some effort to get right, including all missing + metadata business. Also the fact that it automatically adds the metadata that are mandatory according to NetCDF standard. - The functionality that transforms the dimensions from vectors to ranges if they are equidistantly spaced. This may be the most important step for performance.
- Automatic detection of the "coordinate grid" (
Coord
dimension) and correct loading of it. This was particularly hard and unfortunately depends strongly on how the data are initially saved :(
GeoMakie plots are very, very nice, and I've automated a large part of them and made them automatically updatable via the Observables pipeline so that you can simply pass them an Observable{ClimArray}
. I've tested all this stuff locally but I haven't pushed the documentation yet. Because..... There is a huge problem: GeoMakie itself is broken currently. So I'm waiting on that.
Is there something else I've missed? As far as I can tell everything else is already in Rasters.jl. (Of course, Rasters.jl has many other things that aren't here, but I'm not mentioning those)
No worries with the timing of doing this! Just good to know we can aim for that in future sometime. There is a lot here I wont have time to write either.
The vector to range conversion is interesting. Rasters doesnt actually do that currently. It does check that the step size is regular... but floating point error from an actual range conversion can also be a problem.
I've wanted GeoMakie/Makie plots for ages, but that instability and my lack of time is the reason Rasters doesnt yet. Hopefully it gets fixed. But your code would be great in Rasters.jl, and looks like it might drop in directly.
Rasters will write to netcdf, but currently doesnt add all necessary metadata. That would also be good to improve.
And hopefully Rasters is getting easier to understand than it used to be.
but floating point error from an actual range conversion can also be a problem.
In Julia ranges are represented with dual precision. Which means that the numbers actually have quadruple precision from the typical vectors of NetCDF, as the data there are always saved as Flaot32. So you get Float32 -> Float64 -> dual numbers in range. So this should cover most precision concerns? Maybe there are some edge cases, but also maybe they are not worth worrying about...?
TwicePrecision is pretty slow for some things, and you can get errors when you slice a section of a range, because the ends are still stored returned as Float64, and they don't exactly represent whole numbers. I guess you can carefully avoid it.
Edit: wrong
and they don't exactly represent whole numbers.
I don't follow this point. If the original data are integers, the range will also be integer. Endpoints will always be integers there. If the original data are Float32, then the range must also be with floats, and then the original data does not represent whole numbers either.
TwicePrecision is pretty slow for some things,
That is interesting, I didn't know that. Can you tell me in which scenarios a range would be slower than a vector explicitly containing all elements of the range?
In the case of DateTimes though, there things are certainly exact. And as a final step I explicitly first ask julia if range_data == vector_data
. Only if true
I proceed to use the range for the dimension values.
https://github.com/JuliaClimate/ClimateBase.jl/blob/master/src/core/nc_io.jl#L280-L288
When twice precision leaks into other math it can be slow. But I cant remember where it was when I benchmarked it, and I stopped caring about that really anyway.
But the idea was to not change the index from the original values unless you have to. Then load/save of a netcdf is an identical round trip. Slicing things up and joining them again also produces identical results with a vector index.
It may not really matter.