f_diss and fDiss
AlexandreWadoux opened this issue · 1 comments
Since the recent change to the f_diss function, I obtain different results when running my code:
# compute Mahalanobis distance between scores centre and the scores of the spectra
wmahald <- f_diss(Xr = pcspectraA$scores,
Xu = pcspectraACentre,
diss_method = 'mahalanobis',
center = FALSE, scale = FALSE)
and the plot:
# plot the index of the spectra against the Mahalanobis distance
plot(wmahald,
pch = 16,
col = rgb(red = 0, green = 0.4, blue = 0.8, alpha = 0.5),
ylab = 'Mahalanobis distance')
# add a horizontal line to better visualize the spectra with Mahalanobis dissimilarity scores larger than 1 (arbitrary threshold)
abline(h = 1, col = 'red')
Now it gives me much larger distances:
The scale of the returned results now differs from that of the previous version.
This comes from the fix of a known bug in the scaling of the final results
(as reported in the NEWS file).
The distance ratios (between samples) were calculated correctly, but the final
scaling of the results was not. The distance between Xi and Xj
was scaled by taking the square root of the mean of the squared differences
and dividing it by the number of variables, i.e. sqrt(mean((Xi-Xj)^2))/ncol(Xi).
The correct calculation takes the mean of the squared
differences, divides it by the number of variables and then computes the square
root, i.e. sqrt(mean((Xi-Xj)^2)/ncol(Xi)). This bug had no effect on the
computation of the nearest neighbors.
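As a quick check of the two formulas above, here is a minimal base R sketch on synthetic numbers. It does not call f_diss. The names `p`, `sq_diff`, `old_scaling` and `new_scaling` are made up for illustration. It only shows that the two scalings differ by the constant factor sqrt(p), which is why the values are larger now while the neighbor ranking stays the same.

```r
# Illustration only: synthetic squared differences, not output from f_diss()
set.seed(1)
p <- 20                                     # assumed number of variables
sq_diff <- matrix(runif(5 * p), nrow = 5)   # each row stands for (Xi - Xj)^2 of one pair

# previous (buggy) scaling: sqrt(mean((Xi-Xj)^2)) / ncol(Xi)
old_scaling <- sqrt(rowMeans(sq_diff)) / p

# corrected scaling: sqrt(mean((Xi-Xj)^2) / ncol(Xi))
new_scaling <- sqrt(rowMeans(sq_diff) / p)

# the two differ only by the constant factor sqrt(p)
ratio <- new_scaling / old_scaling
stopifnot(isTRUE(all.equal(ratio, rep(sqrt(p), 5))))

# a constant positive factor leaves the ordering of the pairs unchanged
stopifnot(identical(order(old_scaling), order(new_scaling)))
```

With 20 variables the factor is sqrt(20), so distances from the new version should be roughly 4.5 times those from the old one.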
The following code might help to understand how the scaling is now done:
library(resemble)  # provides f_diss()
library(prospectr) # provides the NIRsoil data set
data(NIRsoil)
Xr <- NIRsoil$spc[as.logical(NIRsoil$train),]
# Mahalanobis distance computed on the first 20 spectral variables
n_variables <- 20
# resemble: f_diss()
md <- f_diss(
Xr[, 1:n_variables],
Xr[1, 1:n_variables, drop = FALSE],
"mahalanobis",
center = FALSE
)
# base R: stats::mahalanobis(), which returns squared distances
md_r <- mahalanobis(
Xr[, 1:n_variables],
center = Xr[1, 1:n_variables, drop = FALSE],
cov = cov(Xr[, 1:n_variables])
)
md_r <- sqrt((md_r)/n_variables) # scaling using the number of variables
plot(md, md_r)