nc_axis row order
dblodgett-usgs opened this issue · 4 comments
Based on this: https://github.com/hypertidy/ncmeta/blob/master/R/nc_axes.R#L46 the row order of the response from nc_axis is in the order of the RNetCDF::var.inq.nc()$dimid
vector. Is this by design? I ask because that $dimid
for a variable determines what order you put dimensions in start and count for RNetCDF::var.get.nc()
and it's pretty important to get right!!
I wonder if adding a list column to the nc_var() response that would contain the $dimid vector would be safer? The fact that the row order corresponds dimension order is being used here: https://github.com/r-spatial/stars/blob/master/R/ncdf.R#L58 already so at the least there should be some testing to make sure that the row order doesn't get mucked up?
var <- "ppt"
h_nc <- RNetCDF::open.nc("https://cida.usgs.gov/thredds/dodsC/prism_v2")
dplyr::filter(nc_axes(h), variable == var)$dimension
# [1] 1 0 3
RNetCDF::var.inq.nc(h_nc, var)$dimids
# [1] 1 0 3
I think it's correct this way. I'm not sure about by design, but I definitely checked these things early on.
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(ncmeta)
h_nc <- RNetCDF::open.nc("https://cida.usgs.gov/thredds/dodsC/prism_v2")
nc_axes(h_nc, variables = "ppt") %>%
dplyr::left_join(nc_dims(h_nc), c("dimension" = "id"))
#> # A tibble: 3 x 6
#> axis variable dimension name length unlim
#> <int> <chr> <dbl> <chr> <dbl> <lgl>
#> 1 1 ppt 1 lon 1405 FALSE
#> 2 2 ppt 0 lat 621 FALSE
#> 3 3 ppt 3 time 1476 FALSE
Created on 2019-03-20 by the reprex package (v0.2.1)
Is row order of the nc_axes tibble enough? I'm cool if we want to go with that convention. It should be advertised in docs and tested at the least. I can take a crack at that when I look at nc_coord_var() too.
This the logic of axes:
f <- "https://cida.usgs.gov/thredds/dodsC/prism_v2"
x <- tidync(f)
x
Data Source (1): prism_v2 ...
Grids (5) <dimension family> : <associated variables>
[1] D1,D0,D3 : ppt, tmx, tmn **ACTIVE GRID** ( 1287817380 values per variable)
[2] D2,D3 : time_bnds
[3] D0 : lat
[4] D1 : lon
[5] D3 : time
Dimensions 4 (3 active):
dim name length min max start count dmin dmax unlim coord_dim
<chr> <chr> <dbl> <dbl> <dbl> <int> <int> <dbl> <dbl> <lgl> <lgl>
1 D0 lat 621 24.1 49.9 1 621 24.1 49.9 FALSE TRUE
2 D1 lon 1405 -125. -66.5 1 1405 -125. -66.5 FALSE TRUE
3 D3 time 1476 9131 54025 1 1476 9131 54025 FALSE TRUE
Inactive dimensions:
dim name length min max unlim coord_dim
<chr> <chr> <dbl> <dbl> <dbl> <lgl> <lgl>
1 D2 tbnd 2 1 2 FALSE FALSE
See how the variables are D1, D0, D3, but the dimensions are in order of their ID. (I think that table of active dimensions should be in D1, D0, D3 order too - something to fix, but I want it to not be all cat/print anyway).
From the perspective of ncmeta,
we get all dimensions in source order
ncmeta::nc_dims(f)
# A tibble: 4 x 4
id name length unlim
<dbl> <chr> <dbl> <lgl>
1 0 lat 621 FALSE
2 1 lon 1405 FALSE
3 2 tbnd 2 FALSE
4 3 time 1476 FALSE
We get axes (instances of dimensions) for variables in coordinate order, so axis
here is the 1, 2, 3 order.
ncmeta::nc_axes(f, "ppt")
# A tibble: 3 x 3
axis variable dimension
<int> <chr> <dbl>
1 1 ppt 1
2 2 ppt 0
3 3 ppt 3
Loosely, the dimension
id matches id
in the dimension table - that's something to formalize, and I'm not sure how consistent I am. In places I use "name" on the entity table, not "variable" or "dimension", but there "axis" is not a name but an ordering, so ...
Frankly, I have gone around in circles a bit with this but somehow got it right last time. It takes me a while to get my head into it again - definitely needs to be pinned down with tests, so we can clean up the code somewhat.
There was a stage when I was unpacking the dimension strings "D1,D0,D3" but that's bad news, I really think axis is the right concept, and in context of a variable they are ordered (and labelled with an order attribute axis
).
It also holds for all axes, but here the axis order label is absolute (not sure if that's good or bad):
ncmeta::nc_axes(f)
# A tibble: 14 x 3
axis variable dimension
<int> <chr> <dbl>
1 1 lon 1
2 2 lat 0
3 3 time 3
4 4 time_bnds 2
5 5 time_bnds 3
6 6 ppt 1
7 7 ppt 0
8 8 ppt 3
9 9 tmx 1
10 10 tmx 0
11 11 tmx 3
12 12 tmn 1
13 13 tmn 0
14 14 tmn 3
Yeah -- I see what's going on now. You get the dimension order right in rows because of this:
Line 59 in cea1b14
This:
Browse[2]> out
$id
[1] 4
$name
[1] "ppt"
$type
[1] "NC_INT"
$ndims
[1] 3
$dimids
[1] 1 0 3
$natts
[1] 6
becomes:
Browse[2]> out
$id
[1] 4 4 4
$name
[1] "ppt" "ppt" "ppt"
$type
[1] "NC_INT" "NC_INT" "NC_INT"
$ndims
[1] 3 3 3
$dimids
[1] 1 0 3
$natts
[1] 6 6 6
Since dimids is the only part of the var.inq.nc
that varies, the row order of the resulting tibble has the right order. It always will too.
My feeling is that it would be as good or better to put the dimid vector into a list column for each variable and drop the whole axis thing -- but maybe I'm missing the utility of the axis construct?
Happy to start writing some code around this way of doing things and just be aware of things being sorted in "axis" order too.
> s <- "https://cida.usgs.gov/thredds/dodsC/stageiv_combined"
> nc_axes(s)
# A tibble: 10 x 3
axis variable dimension
<int> <chr> <dbl>
1 1 Total_precipitation_surface_1_Hour_Accumulation 2
2 2 Total_precipitation_surface_1_Hour_Accumulation 3
3 3 Total_precipitation_surface_1_Hour_Accumulation 0
4 4 time_bounds 1
5 5 time_bounds 0
6 6 lon 3
7 7 lon 2
8 8 lat 3
9 9 lat 2
10 10 time 0