`aoa()` appears to return incorrect thresholds (different from Meyer & Pebesma 2021)

Hi all,

Adapting some code from the MEE-AOA repo, I believe I can calculate an AOA like this:

set.seed(123)

library(CAST)
library(caret)
library(virtualspecies)

npoints <- 50
meansPCA <- c(3, -1)
sdPCA <- c(2, 2)
simulateResponse <- c("bio2","bio5","bio10", "bio13", "bio14","bio19")
studyarea <- c(-15, 65, 30, 75)
predictors_global <- raster::brick(
  system.file(
    "extdata/bioclim_global.grd", 
    package = "CAST"
  )
)

predictors <- crop(predictors_global, extent(studyarea))
mask <- predictors[[1]]
values(mask)[!is.na(values(mask))] <- 1
response_vs <- generateSpFromPCA(
  predictors[[simulateResponse]],
  means = meansPCA,
  sds = sdPCA, 
  plot = FALSE
)
response <- response_vs$suitab.raster
mask <- rasterToPolygons(mask,dissolve=TRUE)

samplepoints <- spsample(mask,npoints,"random")
trainDat <- extract(predictors,samplepoints,df=TRUE)
trainDat$response <- extract (response,samplepoints)
trainDat <- trainDat[complete.cases(trainDat),]

model <- train(trainDat[,names(predictors)],
               trainDat$response,
               method="rf",
               importance=TRUE,
               trControl = trainControl(method="none"))

AOA <- aoa(trainDat, model=model)

According to the 2021 paper, I believe the AOA threshold after this should be equal to "the 75-percentile plus 1.5 times the IQR of the DI values of the cross-validated training data". Calculating that using quantile and IQR gives us these results:

di <- attr(AOA$AOA, "TrainDI")

(threshold_quantile <- stats::quantile(di, 0.75))
#>       75% 
#> 0.3059488
(threshold_iqr <- (1.5 * stats::IQR(di)))
#> [1] 0.3392091
threshold_quantile + threshold_iqr
#>       75% 
#> 0.6451579

But the AOA threshold returned by aoa() doesn't match that calculation:

AOA$parameters$threshold
#> [1] 0.4770295

If I'm right and this is unexpected, it seems to be due to the use of boxplot.stats() here:

CAST/R/trainDI.R

Line 221 in afcba3f

thres <- grDevices::boxplot.stats(TrainDI)$stats[5]

That gives us the threshold that CAST returns:

grDevices::boxplot.stats(di)$stats[5]
#> [1] 0.4770295

But I'm not entirely sure what boxplot.stats() actually does. For instance, imagine that we cut off the last di value in our vector:

di[50]
#> [1] 0.2120274
di <- di[1:49]

Because it's a rather low number, both our 75% percentile and IQR increase:

(threshold_quantile <- stats::quantile(di, 0.75))
#>       75% 
#> 0.3101567
(threshold_iqr <- (1.5 * stats::IQR(di)))
#> [1] 0.3523555
threshold_quantile + threshold_iqr
#>       75% 
#> 0.6625121

But boxplot.stats() returns the same value as before:

grDevices::boxplot.stats(di)$stats[5]
#> [1] 0.4770295

^{Created on 2022-12-11 by the reprex package (v2.0.1)}

Apologies if I'm misunderstanding something here! The return here just didn't match my expectations.

Looking at the source of grDevices::boxplot.stats, I see:

> grDevices::boxplot.stats
function (x, coef = 1.5, do.conf = TRUE, do.out = TRUE) 
{
    if (coef < 0) 
        stop("'coef' must not be negative")
    nna <- !is.na(x)
    n <- sum(nna)
    stats <- stats::fivenum(x, na.rm = TRUE)
    iqr <- diff(stats[c(2, 4)])
    if (coef == 0) 
        do.out <- FALSE
    else {
        out <- if (!is.na(iqr)) {
            x < (stats[2L] - coef * iqr) | x > (stats[4L] + coef * 
                iqr)
        }
        else !is.finite(x)
        if (any(out[nna], na.rm = TRUE)) 
            stats[c(1, 5)] <- range(x[!out], na.rm = TRUE)
    }
    conf <- if (do.conf) 
        stats[3L] + c(-1.58, 1.58) * iqr/sqrt(n)
    list(stats = stats, n = n, conf = conf, out = if (do.out) x[out & 
        nna] else numeric())
}

The relevant lines here are:

out <- if (!is.na(iqr)) {
  x < (stats[2L] - coef * iqr) | x > (stats[4L] + coef * iqr)
}
#> [...]
stats[c(1, 5)] <- range(x[!out], na.rm = TRUE)

In context, that means that the AOA threshold winds up equaling:

max(di[!(di > (quantile(di, 0.75) + 1.5 * IQR(di)))])

Which means that the threshold is going to be the value in di closest to, but not more than quantile(di, 0.75) + (1.5 * IQR(di)), which, particularly for smaller data, may be a significantly different value. The returned value will always be lower than (or the same as) the 75th percentile plus 1.5 times the IQR. Is this expected behavior?

Thanks for finding that! I fixed it according to your suggestion.