r-spatial/stars

netcdf dimensions get messed up if they are out of order and if axis attributes are available

Opened this issue ยท 12 comments

Function .get_dims() sorts dimensions according to the canonical axis order if "axis" attributes are provided by the netcdf.

However, this doesn't seem to work as expected if the dimensions in the netcdf are not already sorted. This seems to be caused by line 521 of file R/ncdf.R (https://github.com/r-spatial/stars/blob/f0156c9266ba16a8c4efc8da9702f053e98c871a/R/ncdf.R#L521C5-L521C48) which uses

  `dims` <- dims[match(dims$id, dim_matcher), ]`

instead of

  `dims <- dims[match(dim_matcher, dims$id, nomatch = 0L), ]`

(or similar).

It would be great if stars could read such netcdfs, thanks!

packageVersion("stars")
#> [1] '0.6.5'
packageVersion("ncmeta")
#> [1] '0.4.0'
packageVersion("sf")
#> [1] '1.0.16'

# Demonstration
dim_matcher <- c(1L, 0L, 3L)

dims <- data.frame(
  id = c(3L, 1L, 0L),
  name = c("time", "lon", "lat"),
  length = c(9, 2, 3),
  unlim = c(FALSE, FALSE, FALSE),
  coord_dim = c(TRUE, TRUE, TRUE),
  coord_var = c("time", "lon", "lat"),
  axis = c("T", "X", "Y")
)

# Existing implementation: axis order ends up as Y - T - X
dims[match(dims$id, dim_matcher), ]
#>   id name length unlim coord_dim coord_var axis
#> 3  0  lat      3 FALSE      TRUE       lat    Y
#> 1  3 time      9 FALSE      TRUE      time    T
#> 2  1  lon      2 FALSE      TRUE       lon    X

# Suggested fix: axis order is now X - Y - T
dims[match(dim_matcher, dims$id, nomatch = 0L), ]
#>   id name length unlim coord_dim coord_var axis
#> 2  1  lon      2 FALSE      TRUE       lon    X
#> 3  0  lat      3 FALSE      TRUE       lat    Y
#> 1  3 time      9 FALSE      TRUE      time    T


# Example with two netCDFs
tas <- array(
  data = rowSums(
    expand.grid(seq_len(9), 10 * (seq_len(2) - 1), 100 * (seq_len(3) - 1))
  ),
  dim = c(9, 2, 3)
)

# File1 has no "axis" attributes
file1 <- tempfile("tas_example_", fileext = ".nc")
nc <- RNetCDF::create.nc(file1)

id_lat <- RNetCDF::dim.def.nc(nc, "lat", 3)
iv_lat <- RNetCDF::var.def.nc(nc, "lat", "NC_FLOAT", id_lat)
RNetCDF::var.put.nc(nc, "lat", c(40, 45, 50))

id_lon <- RNetCDF::dim.def.nc(nc, "lon", 2)
iv_lon <- RNetCDF::var.def.nc(nc, "lon", "NC_FLOAT", id_lon)
RNetCDF::var.put.nc(nc, "lon", c(-100, -95))

id_bnds <- RNetCDF::dim.def.nc(nc, "bnds", 2)

id_time <- RNetCDF::dim.def.nc(nc, "time", 9)
iv_time <- RNetCDF::var.def.nc(nc, "time", "NC_INT", id_time)
RNetCDF::var.put.nc(nc, "time", 1:9)

iv_tas <- RNetCDF::var.def.nc(nc, "temperature", "NC_FLOAT", c(id_time, id_lon, id_lat))
RNetCDF::var.put.nc(nc, "temperature", tas)

RNetCDF::close.nc(nc)


# File2 has X, Y, T axis attributes
file2 <- tempfile("tas_example_", fileext = ".nc")
file.copy(from = file1, to = file2)
#> [1] TRUE
nc <- RNetCDF::open.nc(file2, write = TRUE)

RNetCDF::att.put.nc(nc, "lon", "axis", "NC_CHAR", "X")
RNetCDF::att.put.nc(nc, "lat", "axis", "NC_CHAR", "Y")
RNetCDF::att.put.nc(nc, "time", "axis", "NC_CHAR", "T")

RNetCDF::close.nc(nc)


# Read example files with `read_ncdf()`
stars::read_ncdf(file1) # Works well (no "axis" attributes -- `dim_matcher` is all NA)
#> no 'var' specified, using temperature
#> other available variables:
#>  lat, lon, time
#> Will return stars object with 54 cells.
#> Warning in .get_nc_projection(meta$attribute, rep_var, cv): No projection
#> information found in nc file.
#> stars object with 3 dimensions and 1 attribute
#> attribute(s):
#>              Min. 1st Qu. Median Mean 3rd Qu. Max.
#> temperature     1   15.25    110  110  204.75  219
#> dimension(s):
#>      from to offset delta x/y
#> time    1  9    0.5     1 [x]
#> lon     1  2 -102.5     5 [y]
#> lat     1  3   37.5     5

stars::read_ncdf(file2) # Doesn't work as expected: dimensions are messed up
#> no 'var' specified, using temperature
#> other available variables:
#>  lat, lon, time
#> Will return stars object with 54 cells.
#> Warning in .get_nc_time(dimensions, make_time, coord_var, rep_var, meta):
#> ignoring units of time dimension, not able to interpret

#> Warning in .get_nc_time(dimensions, make_time, coord_var, rep_var, meta): No
#> projection information found in nc file.
#> stars object with 3 dimensions and 1 attribute
#> attribute(s):
#>              Min. 1st Qu. Median Mean 3rd Qu. Max.
#> temperature     1   15.25    110  110  204.75  219
#> dimension(s):
#>      from to offset delta x/y
#> lat     1  9   37.5     5 [x]
#> time    1  2      1     1 [y]
#> lon     1  3 -102.5     5

# Clean up
unlink(file1)
unlink(file2)

Created on 2024-04-17 with reprex v2.1.0

Nice ;-)

We also have:

(x1 = stars::read_ncdf(file2))
# no 'var' specified, using temperature
# other available variables:
#  lat, lon, time
# Will return stars object with 54 cells.
# stars object with 3 dimensions and 1 attribute
# attribute(s):
#              Min. 1st Qu. Median Mean 3rd Qu. Max.
# temperature     1   15.25    110  110  204.75  219
# dimension(s):
#      from to offset delta x/y
# lat     1  9   37.5     5 [x]
# time    1  2      1     1 [y]
# lon     1  3 -102.5     5    
# Warning messages:
# 1: In .get_nc_time(dimensions, make_time, coord_var, rep_var, meta) :
#   ignoring units of time dimension, not able to interpret
# 2: In .get_nc_projection(meta$attribute, rep_var, cv) :
#   No projection information found in nc file.
(x2 = stars::read_stars(file2))
# stars object with 3 dimensions and 1 attribute
# attribute(s):
#                               Min. 1st Qu. Median Mean 3rd Qu. Max.
# tas_example_5cbe857ee56c4.nc     1   15.25    110  110  204.75  219
# dimension(s):
#     from to offset delta x/y
# x      1  9      0     1 [x]
# y      1  2      2    -1 [y]
# lat    1  3     40     5    
(x3 = stars::read_mdim(file2))
# stars object with 3 dimensions and 1 attribute
# attribute(s):
#              Min. 1st Qu. Median Mean 3rd Qu. Max.
# temperature     1   15.25    110  110  204.75  219
# dimension(s):
#      from to offset delta x/y
# time    1  9    0.5     1 [x]
# lon     1  2 -102.5     5 [y]
# lat     1  3     40     5    
identical(as.vector(x1[[1]]), as.vector(x2[[1]]))
# [1] FALSE
identical(as.vector(x1[[1]]), as.vector(x3[[1]]))
# [1] TRUE

But also

identical(as.vector(x1[[1]]), as.vector(x2[[1]][,2:1,]))
# [1] TRUE

so that is because GDAL reverts the y axis direction; and

stars::read_mdim(file2, raster = c("lon", "lat"))
# stars object with 3 dimensions and 1 attribute
# attribute(s):
#              Min. 1st Qu. Median Mean 3rd Qu. Max.
# temperature     1   15.25    110  110  204.75  219
# dimension(s):
#      from to offset delta x/y
# time    1  9    0.5     1    
# lon     1  2 -102.5     5 [x]
# lat     1  3     40     5 [y]

where offset is ok for lon, but not for time and lat.

@mdsumner or @dblodgett-usgs could one of you take a look at the .get_dims() issue?

I think this is all good -- thanks for the thorough report @dschlaep !! I incorporated your reprex into a test and made the change.

Question for @edzer I'm looking at the print.dimensions method and tracked to here: https://github.com/r-spatial/stars/blob/main/R/dimensions.R#L658

For the base example (without the axis attribute) we have:

netcdf tas_example_4ed43eafb7d {
dimensions:
        lat = 3 ;
        lon = 2 ;
        bnds = 2 ;
        time = 9 ;
variables:
        float lat(lat) ;
        float lon(lon) ;
        int time(time) ;
        float temperature(lat, lon, time) ;
data:

 lat = 40, 45, 50 ;

 lon = -100, -95 ;

 time = 1, 2, 3, 4, 5, 6, 7, 8, 9 ;

 temperature =
  1, 2, 3, 4, 5, 6, 7, 8, 9,
  11, 12, 13, 14, 15, 16, 17, 18, 19,
  101, 102, 103, 104, 105, 106, 107, 108, 109,
  111, 112, 113, 114, 115, 116, 117, 118, 119,
  201, 202, 203, 204, 205, 206, 207, 208, 209,
  211, 212, 213, 214, 215, 216, 217, 218, 219 ;
}

In this example, there's nothing to go on to figure out that the 'lat' and 'lon' variables should be interpreted as latitude and longitude, so the [x] and [y] labels aren't right. If I add standard_name and units it still gets the axis labels wrong.

netcdf tas_example_4ed4569a6c6 {
dimensions:
        lat = 3 ;
        lon = 2 ;
        bnds = 2 ;
        time = 9 ;
variables:
        float lat(lat) ;
                lat:standard_name = "latidude" ;
                lat:units = "degrees" ;
        float lon(lon) ;
                lon:standard_name = "longitude" ;
                lon:units = "degrees" ;
        int time(time) ;
                time:standard_name = "time" ;
                time:units = "days since 1900-01-01" ;
        float temperature(lat, lon, time) ;
data:

 lat = 40, 45, 50 ;

 lon = -100, -95 ;

 time = 1, 2, 3, 4, 5, 6, 7, 8, 9 ;

 temperature =
  1, 2, 3, 4, 5, 6, 7, 8, 9,
  11, 12, 13, 14, 15, 16, 17, 18, 19,
  101, 102, 103, 104, 105, 106, 107, 108, 109,
  111, 112, 113, 114, 115, 116, 117, 118, 119,
  201, 202, 203, 204, 205, 206, 207, 208, 209,
  211, 212, 213, 214, 215, 216, 217, 218, 219 ;
}

I see:

stars object with 3 dimensions and 1 attribute
attribute(s):
             Min. 1st Qu. Median Mean 3rd Qu. Max.
temperature     1   15.25    110  110  204.75  219
dimension(s):
     from to         offset  delta x/y
time    1  9 1900-01-02 UTC 1 days [x]
lon     1  2         -102.5      5 [y]
lat     1  3           37.5      5    

Should the dimension order be returning in x, y, t order here since we know what the order is? I'm don't remember how many places in stars this assumption is being made vs doing a little more introspection to figure out what the x and y axes are.

@dblodgett-usgs thanks!

If I add standard_name and units it still gets the axis labels wrong.

If you change latidude into latitude, we get

> stars::read_ncdf("x.nc")
no 'var' specified, using temperature
other available variables:
 lat, lon, time
Will return stars object with 54 cells.
No projection information found in nc file. 
 Coordinate variable units found to be degrees, 
 assuming WGS84 Lat/Lon.
stars object with 3 dimensions and 1 attribute
attribute(s):
             Min. 1st Qu. Median Mean 3rd Qu. Max.
temperature     1   15.25    110  110  204.75  219
dimension(s):
     from to         offset  delta         refsys x/y
lon     1  3         -102.5      5 WGS 84 (CRS84) [x]
lat     1  9           37.5      5 WGS 84 (CRS84) [y]
time    1  2 1900-01-02 UTC 1 days        POSIXct    

which gets the names right, but the dimensions wrong (lon should be 2, lat 3, time 9).

This adds some dim name matching to read_mdim():

stars::read_mdim("x.nc")
# stars object with 3 dimensions and 1 attribute
# attribute(s):
#              Min. 1st Qu. Median Mean 3rd Qu. Max.
# temperature     1   15.25    110  110  204.75  219
# dimension(s):
#      from to     offset  delta         refsys x/y
# time    1  9 1900-01-02 1 days           Date    
# lon     1  2     -102.5      5 WGS 84 (CRS84) [x]
# lat     1  3         40      5 WGS 84 (CRS84) [y]
stars::read_mdim("x.nc") |> aperm(c(2,3,1))
# stars object with 3 dimensions and 1 attribute
# attribute(s):
#              Min. 1st Qu. Median Mean 3rd Qu. Max.
# temperature     1   15.25    110  110  204.75  219
# dimension(s):
#      from to     offset  delta         refsys x/y
# lon     1  2     -102.5      5 WGS 84 (CRS84) [x]
# lat     1  3         40      5 WGS 84 (CRS84) [y]
# time    1  9 1900-01-02 1 days           Date    

@edzer -- see one more minor PR updating the test.

Still, read_ncdf gets the dimensions wrong:

     from to         offset  delta         refsys x/y
lon     1  3         -102.5      5 WGS 84 (CRS84) [x]
lat     1  9           37.5      5 WGS 84 (CRS84) [y]
time    1  2 1900-01-02 UTC 1 days        POSIXct    

Sorry, what do you see that's wrong there?

time should range from 1 to 9, lat from 1 to 3, lon from 1 to 2.

(note to self: read_mdim() gets the lat offset wrong)

I see -- ok. I'll get in and see where that went wrong as soon as I can.