Performance issue when using NIPALS with missing values
Closed this issue · 2 comments
The handling of missing values in the NIPALS algorithm is suboptimal, causing a roughly 4–5 times slowdown even when a single value is missing. For example:
```r
library(ropls)
data(sacurine)
attach(sacurine)
genderFc <- sampleMetadata[, "gender"]

start <- Sys.time()
sacurine.plsda <- opls(dataMatrix, genderFc)
end <- Sys.time()
time_no_missing <- end - start
time_no_missing
# Time difference of 0.6366529 secs

# add a single missing value
dataMatrix[1, 1] <- NA

start <- Sys.time()
sacurine.plsda <- opls(dataMatrix, genderFc)
end <- Sys.time()
time_missing <- end - start
time_missing
# Time difference of 2.936431 secs

as.numeric(time_missing) / as.numeric(time_no_missing)
# 4.61229420038677
```
I have profiled the code and determined that the slowdown comes from repeated calls to `complete.cases` on the rows and columns of the temporary matrices, most of which are unnecessary.
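The pattern is easy to reproduce outside ropls. Below is a toy sketch of my own (not the ropls code itself) contrasting a loop that re-runs `complete.cases` on every matrix slice with a single `na.rm` pass:

```r
# Toy illustration of the hotspot pattern (not ropls code): calling
# complete.cases() on a matrix slice inside a loop repeats the
# missingness check on every iteration.
set.seed(1)
x <- matrix(rnorm(1e4), nrow = 100)
x[1, 1] <- NA

slow_colsums <- function(x) {
  out <- numeric(ncol(x))
  for (j in seq_len(ncol(x))) {
    ok <- complete.cases(x[, j])   # re-computed for every column
    out[j] <- sum(x[ok, j])
  }
  out
}

fast_colsums <- function(x) colSums(x, na.rm = TRUE)  # single pass

stopifnot(all.equal(slow_colsums(x), fast_colsums(x)))
```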
Attempt 1: precompute the missingness status
I originally tried to precompute the missingness mask once, with `x_is_not_na <- unname(!is.na(xcvTraMN))`, and then replace calls like:

```r
comVl <- complete.cases(xcvTraMN[i, ])
```

with:

```r
comVl <- x_is_not_na[i, ]
```

This did not improve performance in my initial trials. The likely reason is that the algorithm operates on many matrices (weights, loadings, scores), not a single one: for X and the temporary X the precomputation does avoid some missingness checks, but for the other matrices it mainly increases memory use, which carries its own overhead.
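For reference, the two forms do agree element-wise; a minimal self-contained check (toy matrix of my own, not ropls data):

```r
# Check that a precomputed !is.na() mask reproduces per-row and
# per-column complete.cases() calls on a numeric matrix.
set.seed(2)
x <- matrix(rnorm(20), nrow = 4)
x[2, 3] <- NA
x_is_not_na <- unname(!is.na(x))

i <- 2; j <- 3
stopifnot(identical(complete.cases(x[i, ]), x_is_not_na[i, ]))
stopifnot(identical(complete.cases(x[, j]), x_is_not_na[, j]))
```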
Attempt 2: skip the missingness checks and compute the cross product manually
However, I also noticed that `crossprod` is not necessary in the loops that handle missing values: there the code no longer operates on the entire matrix but on individual rows or columns. This makes it possible to replace both the expensive subsetting used to drop missing entries and the `crossprod` calls with a plain multiplication and `sum()` with `na.rm=TRUE`. For example, in:
```r
wVn <- numeric(ncol(xcvTraMN))
for (j in 1:ncol(xcvTraMN)) {
  comVl <- complete.cases(xcvTraMN[, j]) & complete.cases(uOldVn)
  wVn[j] <- crossprod(xcvTraMN[comVl, j], uOldVn[comVl]) / drop(crossprod(uOldVn[comVl]))
}
```
the loop body can be replaced with:

```r
wVn[j] <- sum(xcvTraMN[, j] * uOldVn, na.rm = TRUE) / sum(uOldVn^2, na.rm = TRUE)
```
which is about 2.6 times faster for the case of a single missing value (measured over 100000 iterations).
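As a quick sanity check (a toy example of mine, not ropls data), the `sum()`-based update can be verified against the `crossprod` version. One caveat: `sum(uOldVn^2, na.rm = TRUE)` in the denominator also counts the `uOldVn` entries at rows dropped by `comVl`, so in the sketch below I use a masked denominator `sum(uOldVn^2 * ok)` to keep the two numerically identical while still avoiding matrix subsetting inside the loop:

```r
# Toy check: the sum()-based update reproduces the crossprod() form.
# A logical mask (ok) is used in the denominator so the result matches
# the original exactly even when column j has missing entries.
set.seed(3)
xcvTraMN <- matrix(rnorm(30), nrow = 6)
xcvTraMN[1, 1] <- NA
uOldVn <- rnorm(6)              # assumed free of NAs here

wOld <- wNew <- numeric(ncol(xcvTraMN))
for (j in seq_len(ncol(xcvTraMN))) {
  comVl <- complete.cases(xcvTraMN[, j]) & complete.cases(uOldVn)
  wOld[j] <- crossprod(xcvTraMN[comVl, j], uOldVn[comVl]) /
    drop(crossprod(uOldVn[comVl]))
  ok <- !is.na(xcvTraMN[, j])   # mask instead of subsetting
  wNew[j] <- sum(xcvTraMN[, j] * uOldVn, na.rm = TRUE) / sum(uOldVn^2 * ok)
}
stopifnot(all.equal(wOld, wNew))
```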
If applied throughout the code base, all methods can be accelerated, and the speedup persists regardless of the fraction of missing values.
@SamGG, @ethevenot is this project still maintained? If so, I would be happy to submit a pull request with a patch for this and several other issues I found, including:
- the captions of some figures in the vignette are swapped
- if any variable has too many missing values, the computation errors out because the cross-validation splitting does not check for missingness

Would you have time to review PRs addressing the above and to release a new version?
Hi,
Thanks for your feedback. You reached my repo because I wanted to have a look at this code and maybe tune it; I am not the developer in any way. Etienne still maintains this package, as it is on Bioconductor. I think he will appreciate your contribution, and I hope he will answer within a few days.
Best.
Closing because of no more activity.