Issue with Mahalanobis distance computation
Closed this issue · 6 comments
Hi there,
R returns the following when running ellipseParam
on a spectral dataset: Error in solve.default(cov, ...) : system is computationally singular: reciprocal condition number = 1.37309e-48
As I understood it, this is related to the value of the tol
parameter of the mahalanobis
function. My question is: do you have any suggestion on how to adjust this tol
parameter within ellipseParam
?
Thanks!
Julien
Hi @morelju
The error message you're encountering indicates that the covariance matrix is computationally singular, which means it's nearly impossible to invert accurately. This issue often arises, for instance, in spectral datasets due to their high dimensionality and multicollinearity. Let me break down the problem and offer some solutions:
(1) Cause of the error:
- The
ellipseParam()
function usesstats::mahalanobis()
, which in turn usessolve(stats::cov(x))
to invert the covariance matrix.x
being your input data. - The
solve()
function uses a tolerance value (tol
) to detect linear dependency in the input matrix. In your case, the defaulttol = .Machine$double.eps
, which is1.37309e-48
, has been reached, triggering the error message.
(2) Reasons for singularity:
- Multicollinearity in data (especially common in spectral datasets)
- Variables with near-zero variance
- Poorly conditioned matrix (number of variables significantly exceeds the number of observations)
(3) Recommended solutions:
Apply Principal Component Analysis (PCA) to your dataset before using ellipseParam()
. Consider other dimensionality reduction techniques (PLS, ICA, etc...) if appropriate for your project.
Hi @ChristianGoueguel
Thanks a lot for your explanations! I actually got the error message after applying the PCA - I tried to run the ellipseParam()
on the scores. I will try a couple of other dimensionality reduction and see if this solves my problem. I would close the comment once a solution is found, if that's ok with you.
Thanks again,
Julien
Hi again @ChristianGoueguel ,
In the end things work if I drop some PCs before using ellipseParam()
, which was a bit counter-intuitive for me as I would have thought that the k
parameter of ellipseParam()
would handle this. Not sure if I misunderstood it or if this is a bug in your code?
Hi @morelju,
As I mentioned in my previous comment, the error message originates from the covariance matrix inversion using the solve()
function. The resolution you found by dropping a few principal components before using ellipseParam()
confirms that the issue stemmed from principal components in your scores dataset with very low variance. Retaining only the principal components that effectively summarize your dataset is indeed a good practice.
To document and illustrate the error you encountered, I've conducted a few tests:
- Loading the spectral dataset:
data("specData", package = "HotellingEllipse")
- Running PCA on the spectral dataset:
pca_mod <- specData |>
dplyr::select(tidyselect::where(is.numeric)) |>
FactoMineR::PCA(scale.unit = FALSE, graph = FALSE)
- Retrieving the PCA scores:
pca_scores <- pca_mod |>
purrr::pluck("ind", "coord") |>
tibble::as_tibble()
print(pca_scores)
# A tibble: 100 × 5
Dim.1 Dim.2 Dim.3 Dim.4 Dim.5
<dbl> <dbl> <dbl> <dbl> <dbl>
1 25306. -10831. -1851. -83.4 -560.
2 -67.3 1137. -2946. 2495. -568.
3 -1822. -22.0 -2305. 1640. -409.
4 -1238. 3734. 4039. -2428. 379.
5 3299. 4727. -888. -1089. 262.
6 5006. -49.5 2534. 1917. -970.
7 -8325. -5607. 960. -3361. 103.
8 -4955. -1056. 2510. -397. -354.
9 -1610. 1271. -2556. 2268. -760.
10 19582. 2289. 886. -843. 1483.
# ℹ 90 more rows
# ℹ Use `print(n = ...)` to see more rows
As you can see, pca_scores
has 5 components and is used as an example in the package README
.
To demonstrate the error conditions, I've added a sixth column, Dim.6
, and conducted two tests:
- Test 1:
Dim.6
with zero variance.
n <- nrow(pca_scores)
pca_scores <- pca_scores |> dplyr::mutate(Dim.6 = rep(1, n))
HotellingEllipse::ellipseParam(pca_scores)
Error in solve.default(cov, ...) :
Lapack routine dgesv: system is exactly singular: U[6,6] = 0
- Test 2:
Dim.6
with near-zero variance.
tol <- .Machine$double.eps
x <- stats::rnorm(n, mean = 0, sd = sqrt(tol))
pca_scores <- pca_scores |> dplyr::mutate(Dim.6 = x * (sqrt(tol) / stats::sd(x)))
HotellingEllipse::ellipseParam(pca_scores)
Error in solve.default(cov, ...) :
system is computationally singular: reciprocal condition number = 5.2279e-24
Key points to note:
- The errors occur when dealing with zero or near-zero variance components.
- These tests simulate the conditions that likely caused your original error.
- The solution involves retaining only high-variance principal components, which highlights why your solution of dropping low-variance components was effective.
Potential future improvements:
Both tests were conducted using the default parameters, which means the Mahalanobis distance is computed before the selection of the number of component k
; leading to issues when low-variance components are present in the dataset. I could modify the ellipseParam
function so that users specify k
components before calculating the Mahalanobis distance and add a variance threshold or warning message to avoid the risk of including low-variance components that could lead to singularity issues.
If you have any questions or need further clarification, please don't hesitate to ask.
Otherwise, I'll proceed to close this issue.
Thank you for bringing this to my attention,
Hi @ChristianGoueguel
All good with me now, thanks again for your clarifications!
An updated version (v1.2.0) is available on CRAN.