
NNDM

Note: An updated version of NNDM with enhanced capabilities has been included in the CAST package (available on CRAN).
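For new projects, the CAST implementation can be used instead. A minimal sketch, assuming the current CAST interface; the argument names (tpoints, modeldomain, phi) are taken from our reading of the CAST documentation and should be checked against ?CAST::nndm:

```r
# Sketch only: CAST's nndm() mirrors this package's workflow.
# Argument names below are assumptions; verify with ?CAST::nndm.
library(CAST)
library(sf)

# tpoints: training points (sf), modeldomain: prediction area polygon (sf),
# phi: autocorrelation range in map units
# nndm_folds <- nndm(tpoints = meuse, modeldomain = meuse.area, phi = 900)
# The returned indices can then be passed to caret::trainControl,
# as done with this package's indices below.
```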

The goal of the NNDM package is to provide tools to perform Nearest Neighbour Distance Matching (NNDM) Leave-One-Out (LOO) Cross-Validation (CV) to estimate map accuracy for a given prediction task. In NNDM LOO CV, the nearest neighbour distance distribution function between test and training data during the LOO CV process is matched to the nearest neighbour distance distribution function between the prediction and training points, for the distances at which spatial autocorrelation is present. NNDM LOO CV can be used for various prediction areas (geographic interpolation/extrapolation), sampling patterns (regular/random/clustered), and landscape autocorrelation ranges. The package also includes tools to compute buffered-LOO (bLOO) CV.
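To make the matching idea concrete, the two nearest neighbour distance distributions can be computed empirically with sf. This is an illustrative sketch, not part of the package API; train_pts and pred_pts are assumed to be sf point objects in a projected CRS:

```r
library(sf)
library(units)

# Nearest neighbour distances as seen during LOO CV: for each training
# point, the distance to its nearest *other* training point.
nnd_loo <- function(train_pts) {
  d <- drop_units(st_distance(train_pts))
  diag(d) <- Inf                # exclude self-distances
  apply(d, 1, min)
}

# Nearest neighbour distances as seen during prediction: for each
# prediction point, the distance to its nearest training point.
nnd_pred <- function(pred_pts, train_pts) {
  d <- drop_units(st_distance(pred_pts, train_pts))
  apply(d, 1, min)
}

# The ECDFs of these two vectors are the empirical G functions that
# NNDM matches: it excludes training points in each LOO iteration until
# the CV distribution resembles the prediction distribution over the
# autocorrelated distance range.
```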

Reference:

C. Milà, J. Mateu, E. Pebesma, and H. Meyer, ‘Nearest Neighbour Distance Matching Leave-One-Out Cross-Validation for map validation’, Methods in Ecology and Evolution (2022).

Installation

You can install the development version from GitHub with:

# install.packages("devtools")  # if devtools is not yet installed
devtools::install_github("carlesmila/NNDM")

Example

In this example we will use LOO, bLOO and NNDM LOO CV for map validation in a spatial interpolation task to illustrate the typical use of the package. We will use the ranger implementation of the Random Forest (RF) algorithm wrapped in the caret package, but the CV indices generated by NNDM can also be used with other machine learning packages.

The meuse dataset

We will use the meuse dataset included in the sp package, which includes samples of topsoil zinc concentrations in a flood plain of the river Meuse, in the Netherlands. More information on the dataset can be obtained by using ?meuse.

library("NNDM")
library("caret")
library("sp")
library("sf")
library("knitr")
library("gstat")
library("gridExtra")

# Sample data
data("meuse")
meuse <- st_as_sf(meuse, coords = c("x", "y"), crs = 28992, remove = FALSE)

# AOI polygon
data("meuse.area")
meuse.area <- SpatialPolygons(list(Polygons(list(Polygon(meuse.area)), "area")))
meuse.area <- st_as_sf(meuse.area)
st_crs(meuse.area) <- 28992

# Prediction grid
data("meuse.grid")
meuse.grid <- st_as_sf(meuse.grid, coords = c("x", "y"), crs = 28992, remove = FALSE)

LOO CV

We start by fitting an RF model with the selected predictors (no hyperparameter tuning is done in this example) and validating it with LOO CV, which is readily available in caret.

# Fit model
trainControl_LOO <- trainControl(method = "LOOCV", savePredictions = TRUE)
paramGrid <-  data.frame(mtry = 2, min.node.size = 5, splitrule = "variance")
mod_LOO <- train(zinc ~ x + y + dist + ffreq + soil,
                 method = "ranger",
                 trControl = trainControl_LOO,
                 tuneGrid = paramGrid, 
                 data = meuse, 
                 seed=12345)

Estimating autocorrelation ranges

Both bLOO and NNDM LOO CV require an autocorrelation range parameter to be estimated. There are many proposals on how to estimate the radius for bLOO/NNDM LOO CV, the most frequent being the residual or outcome autocorrelation ranges (see our article for more details). Here we estimate the residual autocorrelation range:

# Estimate variogram on the residual and return range
meuse$res <- meuse$zinc - predict(mod_LOO, newdata=meuse)
empvar <- variogram(res~1, data = meuse)
fitvar <- fit.variogram(empvar, vgm(model = "Sph", nugget = TRUE))
plot(empvar, fitvar, cutoff=1500, main = "Residual semi-variogram estimation")

(resrange <- fitvar$range[2]) # Residual autocorrelation range in m
#> [1] 842.0515
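As an alternative to the residual range, the outcome autocorrelation range mentioned above can be estimated directly on the zinc concentrations. A sketch following the same gstat workflow as the residual variogram; the choice between the two options is discussed in the article:

```r
library(gstat)

# Variogram of the outcome itself rather than of the model residuals
empvar_out <- variogram(zinc ~ 1, data = meuse)
fitvar_out <- fit.variogram(empvar_out, vgm(model = "Sph", nugget = TRUE))
(outrange <- fitvar_out$range[2])  # outcome autocorrelation range in m
# This range could be passed to bLOO()/nndm() in place of resrange.
```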

bLOO CV

With the estimated range, we can now compute the bLOO CV indices using the bLOO function. Printing the object shows that, with that radius, the average training set size in each bLOO iteration is roughly 110 points. We can also plot the object for a given observation and explore which observations are used for training, which for testing, and which are excluded. We then use these indices in a caret model to perform bLOO CV.

# Compute bLOO indices
(bLOO_indices <- bLOO(meuse, resrange, min_train = 0.5))
#> bLOO object
#> Total number of points: 155
#> Mean number of training points: 110.01
#> Minimum number of training points: 92
#> Mean buffer radius: 842.05
# Plot for one CV iteration
plot(bLOO_indices, 53) +
  theme(axis.text = element_blank(), axis.ticks = element_blank())

# Evaluate RF model using bLOO CV
trainControl_bLOO <- trainControl(method = "cv", savePredictions = TRUE,
                                  index = bLOO_indices$indx_train,
                                  indexOut = bLOO_indices$indx_test)
mod_bLOO <- train(zinc ~ x + y + dist + ffreq + soil,
                  method = "ranger",
                  trControl = trainControl_bLOO,
                  tuneGrid = paramGrid, 
                  data = meuse, 
                  seed=12345)
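For users wiring these indices into frameworks other than caret, the fold structure can be inspected directly. A sketch using the indx_train and indx_test list components passed to trainControl above:

```r
# Each CV iteration has one test point and a buffer-reduced training set
length(bLOO_indices$indx_train)      # one fold per observation
length(bLOO_indices$indx_test[[1]])  # a single test point per iteration

# Observations excluded in iteration 1: inside the buffer,
# so in neither the training nor the test set
excluded <- setdiff(seq_len(nrow(meuse)),
                    c(bLOO_indices$indx_train[[1]],
                      bLOO_indices$indx_test[[1]]))
length(excluded)
```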

NNDM LOO CV

Similarly, we compute the NNDM LOO CV indices using the function nndm. When plotting the object, the estimated $G$ functions are shown. Here, the nearest neighbour distance distribution function during LOO CV, $\hat{G}_j(r)$, is smaller than the nearest neighbour distance distribution function during prediction, $\hat{G}_{i,j}(r)$, for short distances, and slightly larger for long distances (see our related paper for definitions of the $G$ functions and more detail on their interpretation). $\hat{G}_j^{*}(r, \mathbf{L})$, i.e. the nearest neighbour distance distribution function during NNDM LOO CV, overlaps with $\hat{G}_j(r)$ for short distances but matches $\hat{G}_{i,j}(r)$ for long distances. Printing the object shows that the difference between LOO CV and NNDM LOO CV is minimal in this case, as on average 153.9 of the 154 available points are used for training in each NNDM LOO CV iteration.

# Compute NNDM indices
(NNDM_indices <- nndm(meuse, meuse.grid, resrange, min_train = 0.5))
#> nndm object
#> Total number of points: 155
#> Mean number of training points: 153.88
#> Minimum number of training points: 150
# Plot NNDM functions
plot(NNDM_indices)

# Evaluate RF model using NNDM LOO CV
trainControl_NNDM <- trainControl(method = "cv", savePredictions = TRUE,
                                  index = NNDM_indices$indx_train,
                                  indexOut = NNDM_indices$indx_test)
mod_NNDM <- train(zinc ~ x + y + dist + ffreq + soil,
                  method = "ranger",
                  trControl = trainControl_NNDM,
                  tuneGrid = paramGrid, 
                  data = meuse, 
                  seed=12345)

Comparing results

We compare the different CV results, manually computing the statistics for bLOO and NNDM LOO CV (caret's aggregated results are not appropriate for customised folds):

# Compute and return results
stats_LOO <- data.frame(validation="LOO CV",
                        RMSE=mod_LOO$results$RMSE, 
                        R2=mod_LOO$results$Rsquared)
stats_bLOO <- data.frame(validation="bLOO CV",
                         RMSE=with(mod_bLOO$pred, sqrt(mean((obs-pred)^2))), 
                         R2=with(mod_bLOO$pred, cor(obs, pred)^2))
stats_NNDM <- data.frame(validation="NNDM LOO CV",
                         RMSE=with(mod_NNDM$pred, sqrt(mean((obs-pred)^2))), 
                         R2=with(mod_NNDM$pred, cor(obs, pred)^2))
| validation  | RMSE   | R2   |
|-------------|--------|------|
| LOO CV      | 202.13 | 0.71 |
| bLOO CV     | 288.36 | 0.42 |
| NNDM LOO CV | 203.43 | 0.70 |

In this example, we see that LOO and NNDM LOO CV return very similar results, while bLOO CV yields a much lower estimated map accuracy.