The goal of tidync is to ease exploring the contents of a NetCDF file and constructing efficient queries to extract arbitrary hyperslabs.
The data extracted can be used directly as an array, or in "long form" as a data frame for "tidy" analysis and visualization contexts.
These examples are for illustration only; see the vignettes for more details, and please try it on your own sources!
There are two main ways of using tidync.
Use tidync() and hyper_filter() to discern what variables and dimensions are available, and to craft axis-filtering expressions by value or by index. (Use the name of the variable on the LHS to target it; use its name to filter by value, and the special name index to filter it by its 'step' index.)
## discover the available entities, and the active grid's dimensions and variables
tidync(filename)
## activate a different grid
tidync(filename) %>% activate(grid_identifier)
## get a dimension-focus on the space occupied within a grid
tidync(filename) %>% hyper_filter()
## pass named expressions to subset dimension by value or index (step)
tidync(filename) %>% hyper_filter(lat = lat < -30, time = time == 20)
## select a single variable from a multi-variable file and store this in ggplot compatible format
tidync(filename) %>% hyper_tibble(select_vars = c("varname"))
A grid is a "virtual table" in the sense of a database source. It's possible to activate a grid via a variable within it, so all variables are available by default. Grids have identifiers based on which dimensions they are defined with, e.g. "D1,D0", and can otherwise be activated by their count identifier (starting at 1). The "D0" is an identifier; it matches the internal 0-based indexing and identity used by NetCDF itself.
Use what we learned interactively to extract the data, either in data frame or raw-array (hyper slice) form.
## we'll see a column for sst, lat, time, and whatever other dimensions sst has
## and whatever other variables the grid has
tidync(filename) %>% activate("sst") %>%
  hyper_filter(lat = lat < -30, time = time == 20) %>%
  hyper_tibble()
## raw array form, we'll see a (list of) R arrays with a dimension for each seen by tidync(filename) %>% activate("sst")
tidync(filename) %>% activate("sst") %>%
  hyper_filter(lat = lat < -30, time = time == 20) %>%
  hyper_slice()
It's important not to actually request the data extraction until the expressions above would result in an efficient size (don't try a data frame version of a 20 GB ROMS variable ...). Use the interactive modes to determine the likely size of the output you will receive.
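One way to sanity-check a request before extracting is to multiply the per-dimension counts by the bytes per value. The sketch below is plain R and not part of tidync: the helper name estimate_size is hypothetical, it assumes every value is held as an 8-byte double, and the long-form estimate ignores any extra variables on the grid.

```r
## Hypothetical helper (not a tidync function): rough in-memory size of a
## hyperslab, assuming each value is held as an 8-byte double.
estimate_size <- function(count, bytes_per_value = 8) {
  n_values <- prod(count)
  ## long form stores one column per dimension plus one for the variable
  n_columns <- length(count) + 1
  list(
    values = n_values,
    array_bytes = n_values * bytes_per_value,
    long_form_bytes = n_values * n_columns * bytes_per_value
  )
}

## counts for a full lon (4320) by lat (2160) grid
est <- estimate_size(c(4320, 2160))
est$values              # number of values in the slab
est$array_bytes / 2^20  # size of the raw array in MiB
```

The long-form figure grows with the number of dimensions, which is why a data frame of a large variable is so much more expensive than the raw array.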
There is another function, hyper_index, that builds the actual index values required by the NetCDF library. This can be used to debug the process or to define your own tools for the extraction. Currently each hyper_* function can take the filtering expressions, but it's not obvious whether this is a good idea.
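To give a feel for the kind of translation involved, here is a minimal base R sketch of turning a filter on one axis into the start and count that a slab request needs. It is not the hyper_index implementation: the helper name filter_to_slab and the coordinate values are made up for illustration, and it assumes that covering the selection with one contiguous run is acceptable.

```r
## Hypothetical helper: convert a logical selection along one axis into a
## 1-based start and a count covering every selected step.
filter_to_slab <- function(keep) {
  idx <- which(keep)
  if (length(idx) == 0) stop("filter selects no steps on this axis")
  c(start = min(idx), count = max(idx) - min(idx) + 1)
}

## made-up latitude coordinates: cell centres from -89.5 to 89.5
lat <- seq(-89.5, 89.5, by = 1)

## filter by value, as in lat = lat < -30
by_value <- filter_to_slab(lat < -30)

## filter by step index, as in lat = index <= 60
index <- seq_along(lat)
by_index <- filter_to_slab(index <= 60)

by_value
by_index
```

With these coordinates both filters select the same 60 steps, so they produce the same start and count.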
Wishlist items here. Submit your own to the Issues tab
- wrappers for returning various formats, like raster brick, simple features, rgl quad mesh forms, etc.
- delayed extraction to show the tibble you would get after collect()
- consider better function names like hyper_df, hyper_dbl and hyper_int (differentiate scaled and unscaled?)
Support for tbl_cube is in bare-bones form. See here for an example: http://rpubs.com/cyclemumner/281801
f <- "eclipse.ncdc.noaa.gov/pub/OI-daily-v2/NetCDF/1981/AVHRR/avhrr-only-v2.19810901.nc"
tidync(f) %>% hyper_tbl_cube(lat = lat > -30)
Source: local array [691,200 x 4]
D: lon [dbl, 1440]
D: lat [dbl, 480]
D: zlev [dbl, 1]
D: time [dbl, 1]
M: sst [dbl]
tidync(f) %>% activate("anom") %>% hyper_tbl_cube(lat = lat > -30)
Source: local array [691,200 x 4]
D: lon [dbl, 1440]
D: lat [dbl, 480]
D: zlev [dbl, 1]
D: time [dbl, 1]
M: anom [dbl]
You can install tidync from GitHub with:
# install.packages("devtools")
devtools::install_github("hypertidy/tidync", dependencies = TRUE)
This is a basic example which shows you how to connect to a file.
file <- system.file("extdata", "oceandata", "S20092742009304.L3m_MO_CHL_chlor_a_9km.nc", package = "tidync")
library(tidync)
tidync(file)
#>
#> Data Source (1): S20092742009304.L3m_MO_CHL_chlor_a_9km.nc ...
#>
#> Grids (4) <dimension family> : <associated variables>
#>
#> [1] D1,D0 : chlor_a **ACTIVE GRID** ( 9331200 values per variable)
#> [2] D3,D2 : palette
#> [3] D0 : lat
#> [4] D1 : lon
#>
#> Dimensions (4):
#>
#> dimension id name length unlim coord_dim
#> <chr> <dbl> <chr> <dbl> <lgl> <lgl>
#> 1 D0 0 lat 2160 FALSE TRUE
#> 2 D1 1 lon 4320 FALSE TRUE
#> 3 D2 2 rgb 3 FALSE FALSE
#> 4 D3 3 eightbitcolor 256 FALSE FALSE
See this article for more: https://hypertidy.github.io/tidync/articles/static-vignettes/tidync-examples.html
Please get in touch if you have specific workflows that tidync is not providing. There's a lot of room for improvement!
- we can't do "grouped filters" (i.e. polygon-overlay extraction), but it's in the works
- compound types are not supported, though see the "rhdf5" branch on GitHub
- NetCDF groups are not exposed (groups are like "files within a file", analogous to a file system directory)
Here I use the term "slab" as a generalized "array" (in the R sense) that may be read from a NetCDF file. We must provide the NetCDF API with a "slab index", i.e. a start vector and a count vector, each with one element per dimension of the array variable. That is the only way to read them.
In R terms a 3D array would be indexed like
arr[1:10, 2:12, 3:5]
and that would be analogous to
ncvar_get(con, start = c(1, 2, 3), count = c(10, 11, 3))
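The correspondence between index ranges and start/count can be checked without a NetCDF file. The following base R sketch uses a hypothetical helper, read_slab, that mimics a start/count request against an in-memory array; it stands in for the real read call only for the purpose of comparison.

```r
## Hypothetical stand-in for a start/count read, applied to an in-memory array.
read_slab <- function(arr, start, count) {
  stopifnot(length(start) == length(dim(arr)), length(count) == length(start))
  ## one index vector per dimension: start, start + 1, ..., start + count - 1
  idx <- Map(function(s, n) seq(s, length.out = n), start, count)
  do.call(`[`, c(list(arr), idx, list(drop = FALSE)))
}

arr <- array(seq_len(12 * 15 * 6), dim = c(12, 15, 6))

## each count is the length of the matching range, so 2:12 becomes count 11
slab <- read_slab(arr, start = c(1, 2, 3), count = c(10, 11, 3))
identical(slab, arr[1:10, 2:12, 3:5])
```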
If we only wanted a "sparse trace" through the array in R we can do
arr[cbind(c(2, 4), c(5, 6), c(3, 4))]
which would pull out 2 values from 2 arbitrary positions. The API doesn't allow that (at least not in an efficient way that I can understand).
We either have to get the whole "slab" that encompasses those 2 cells, or request a degenerate 1-cell slab for each:
ncvar_get(con, start = c(2, 5, 3), count = c(1, 1, 1))
ncvar_get(con, start = c(4, 6, 4), count = c(1, 1, 1))
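The other option, reading the enclosing slab once and then picking the cells out in R, can be sketched in base R as follows. The array here is in-memory stand-in data and the variable names are illustrative only.

```r
## stand-in data; a real workflow would read `slab` from the file instead
arr <- array(seq_len(8 * 9 * 6), dim = c(8, 9, 6))

## two arbitrary cells, one row per cell
cells <- cbind(c(2, 4), c(5, 6), c(3, 4))

## smallest slab that encloses every requested cell
start <- apply(cells, 2, min)
count <- apply(cells, 2, max) - start + 1

## the single slab request, expressed as ordinary R indexing
slab <- arr[start[1] + seq_len(count[1]) - 1,
            start[2] + seq_len(count[2]) - 1,
            start[3] + seq_len(count[3]) - 1, drop = FALSE]

## shift the cell positions so they are relative to the slab origin
relative <- sweep(cells, 2, start - 1)
slab[relative]
```

Here 12 values are read to obtain 2. Whether that beats two 1-cell requests depends on how far apart the cells are and how many of them there are.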
I've used the terms "hyperslab" and "slab" since I realized this basic limitation during earlier work. Unidata use the term, but it's not in the API as far as I know:
http://www.unidata.ucar.edu/software/netcdf/docs/netcdf_data_set_components.html
Another term like this is "shape" which is a particular set of dimensions used by 1 or more variables in a file. tidync aims to allow "activation" of a given shape, so that any subsequent extraction gets all the variables that live in that space/shape. (It's a database table interpretation of a set of variables).
In R we can determine those indexes really easily for any given query, tracks over time through XYZ, polygons, boxes and so on - but we are ultimately limited by the API to these "slab" requests, so you see a lot of disparate approaches across packages to optimizing this.
For tidync we decided to use the term "grid" rather than "shape", and so a grid is the space defined by a specific ordered set of dimensions. An "axis" is a particular instance of a dimension within a variable. At times we need to know what grids we have, what variables use those grids, and what axes belong to a variable and in what order. Currently all the facility for determining this information is in the package ncmeta.
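As an illustration of the idea (not of how ncmeta does it), variables can be grouped into grids by the ordered set of dimension ids they use. The dimension ids below are copied from the example output shown earlier in this README; the list structure itself is an assumption made for the sketch.

```r
## dimension ids used by each variable, in axis order (assumed structure)
var_dims <- list(
  chlor_a = c(1, 0),
  palette = c(3, 2),
  lat     = 0,
  lon     = 1
)

## a grid identifier is the ordered dimension ids, e.g. "D1,D0"
grid_id <- vapply(var_dims,
                  function(d) paste0("D", d, collapse = ","),
                  character(1))

## variables sharing an identifier live on the same grid
grids <- split(names(grid_id), grid_id)
grids
```

Because the identifier keeps the axis order, a variable defined on "D0,D1" would land on a different grid from one defined on "D1,D0".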
The ncmeta package uses a mix of the background packages ncdf4 and RNetCDF, because each has facilities and limitations and both are needed. For closer to full support we can also use rhdf5 or rgdal, which have independent wrappers around the core NetCDF library. Groups and compound types are particular challenges, and tidync and ncmeta don't yet exploit all of the currently available facilities. It would still be good to have a single authoritative wrapper, but the details of that aren't widely discussed.
Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.