bwlewis/irlba

sparse matrices and prcomp_irlba

simon-anders opened this issue · 5 comments

While 'irlba' can work with sparse matrices, 'prcomp_irlba' cannot. All sparse matrices get coerced to dense ones, due to the line if (!is.matrix(x)) x <- as.matrix(x)

Hence, maybe the man page should not state that A can be sparse.

BTW, would it be difficult to make prcomp_irlba leverage the support for sparse matrices of the irlba function?

Marking this a bug.

Hmmm. Still a bug--working on this. Here is a buggy example:

x <- spMatrix(10,20, i = c(1,3:8),  j = c(2,9,6:10),  x = 7 * (1:7))
prcomp(x)  # OK
prcomp_irlba(x)  # fails with:
## Error in W[, j_w] <- W[, j_w] - ds * drop(cross(dv, VJ)) * du : 
##  replacement has length zero
LTLA commented

The failure is caused by a NULL value for du. It seems like we should probably have:

    if (deflate && is.null(du)) du <- rep(1, nrow(A))

... before the start of the slow-path code, e.g., around lines 431-455. I'm assuming that its value is meant to be 1 - I'll admit that the purpose of this variable is not clear to me.

Just a comment for the record, here and also for issue #34.

  • prcomp_irlba is just a wrapper that sets up irlba with the required inputs to run PCA.
  • irlba should work with all sparse matrix classes, although the dgCMatrix class uses a fast code path
  • irlba never converts a sparse input matrix A into a dense one, except in one special case when A is very tiny.

Now, the output of irlba, and prcomp_irlba, is dense. So we know that the output will consume at least nu * nrow(A) * 8 + nv * nrow(A) * 8 + work * nrow(A) * 8 + nv * 8 bytes of system memory, where the last two terms are internal storage used by the algorithm. The value work defaults to nv + 7, but can be configured by the user. There is also some additional internal memory overhead, so this is a minimum amount of memory required to comptue the output.

Since we have at the start this good estimate of output memory required, nu * nrow(A) * 8 + nv * nrow(A) * 8 + work * nrow(A) * 8 + nv * 8, I think it's reasonable to add a memory check as requested by issue #34. As @LTLA points out, it's possible for R's garbage collector--or other programs for that matter--to free up memory over the course of the computation, so this check should probably be a warning. But a warning that is displayed immediately instead of the default deferred warning behavior.

I'll put something like that in there today.

Just a few follow up comments on this:

  1. I understated memory use, it's roughly twice what I said above.
  2. Determining available memory is not easy to do in R in a platform-independent way.

Because of 2 above, I'm not putting in a priori memory warnings just yet. However, as a consolation, nearly all memory is allocated at the start of the method, before computation begins, and all allocations use R's internal allocator. That means that, although the method will attempt to, say, allocate too much memory for the result it should at least fail gracefully.

But because of 1 above I think I need to add to the vignette on the aglorithm internals and how memory is used--and while I'm at it about the two code paths inside the package and which types of matrices work with each. I will prepare that offline for the next version.