SamGG/ropls

Performance issue when using NIPALS with missing values

Closed this issue · 2 comments

The handling of missing values in the NIPALS algorithm is suboptimal, causing a significant 4 times slow down. For example:

library(ropls)

data(sacurine)
attach(sacurine)
genderFc <- sampleMetadata[, "gender"]

start <- Sys.time()
sacurine.plsda <- opls(dataMatrix, genderFc)
end <- Sys.time()
time_no_missing = end - start
time_no_missing
# Time difference of 0.6366529 secs

# add a single missing value
dataMatrix[1, 1] = NA

start <- Sys.time()
sacurine.plsda <- opls(dataMatrix, genderFc)
end <- Sys.time()
time_missing = end - start
time_missing
# Time difference of 2.936431 secs

as.numeric(time_missing) / as.numeric(time_no_missing)
# 4.61229420038677

I have profiled the code and determined that this issue arises from the repetitive use of complete.cases on the temporary matrices' rows and columns, which is unnecessary.

Attempt 1: precompute the missingness status

I originally tried to precompute them only once with x_is_not_na = unname(!is.na(xcvTraMN)), and then replace calls like:

comVl <- complete.cases(xcvTraMN[i, ])

with:

comVl <- x_is_not_na[i, ]

but it did not improve performance during the initial trials; this is probably because there are many matrices (weights/loadings/scores) and not a single matrix; while for the X/temporary X it does help to avoid some missingness checks, for others it only increases the memory use which also has an overhead.

Attempt 2: do not check missingness, calculate cross product manually

However, I also noticed that the use of crossprod is not necessary in the loops used to handle missing values, because the code no longer operates on the entire matrix, but on individual rows instead; this allows to replace expensive subsetting of the matrix aimed at removing missingness and the crossprod calls with simple multiplication and sum() with na.rm=TRUE. For example, in:

wVn <- numeric(ncol(xcvTraMN))
for(j in 1:ncol(xcvTraMN)) {
    comVl <- complete.cases(xcvTraMN[, j]) & complete.cases(uOldVn)
    wVn[j] <- crossprod(xcvTraMN[comVl, j], uOldVn[comVl]) / drop(crossprod(uOldVn[comVl]))
}

the loop body can be replaced with:

   wVn[j] <- sum(xcvTraMN[, j] * uOldVn, na.rm=TRUE) / sum(uOldVn^2, na.rm=TRUE)

which is 2.6 times faster for the case of having a single missing value (tested on 100000 iterations).

If applied through the code base, all methods can be accelerated, and the effect persists regardless of the fraction of missing values:

image

@SamGG, @ethevenot is this project still maintained? If yes, I would be happy to submit a pull request with a patch for this and several other issues I found, including

  • captions of some figures in vignette are swapped
  • if there are too many missing values in any of the variables the computation errors due to cross-validation splitting not checking for missingness)

Would you have time to review any PRs addressing the above and to release a new version?

SamGG commented

Hi,
Thanks for your feedback. You reached my repo because I wanted to have a look at this code and maybe tune it. I am not the developer in any way. Etienne is still maintaining this package as it is on Bioconductor. I think he will appreciate your contribution. I hope he will answer within a few days.
Best.

SamGG commented

Closing because no more activiy.