RcppStatGen is an R/RcppEigen package with code snippets for Statistical Genetics tasks.
You can install the development version from GitHub with:
# install.packages("devtools")
devtools::install_github("variani/RcppStatGen")
library(RcppStatGen)
library(magrittr) # for %>%
eigen_version() # version of the Eigen library used by the package
Basic example of scaling a numeric matrix.
iris[, -5] %>% head %>% as.matrix %>% eigen_scale
#> [,1] [,2] [,3] [,4]
#> [1,] 0.5206576 0.3401105 -0.3627381 -0.4082483
#> [2,] -0.1735525 -1.1175060 -0.3627381 -0.4082483
#> [3,] -0.8677627 -0.5344594 -1.0882144 -0.4082483
#> [4,] -1.2148677 -0.8259827 0.3627381 -0.4082483
#> [5,] 0.1735525 0.6316338 -0.3627381 -0.4082483
#> [6,] 1.5619728 1.5062037 1.8136906 2.0412415
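For reference, the same per-column scaling (center to zero mean, scale to unit SD) can be sketched in base R; `scale_cols` below is a hypothetical helper for illustration, not part of the package:

```r
# Per-column scaling in base R: subtract column means, divide by
# column standard deviations (sd uses the n-1 denominator, as scale does)
scale_cols <- function(X) {
  means <- colMeans(X)
  sds <- apply(X, 2, sd)
  sweep(sweep(X, 2, means, "-"), 2, sds, "/")
}

X <- as.matrix(head(iris[, -5]))
all.equal(scale_cols(X), scale(X), check.attributes = FALSE)
#> [1] TRUE
```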
Example of scaling a matrix with missing data.
matrix(1:9, 3, 3) %>% eigen_scale_naive
#> [,1] [,2] [,3]
#> [1,] -1 -1 -1
#> [2,] 0 0 0
#> [3,] 1 1 1
matrix(c(NA, 2:9), 3, 3) %>% eigen_scale_naive
#> [,1] [,2] [,3]
#> [1,] NA -1 -1
#> [2,] NA 0 0
#> [3,] NA 1 1
matrix(c(NA, 2:9), 3, 3) %>% eigen_scale
#> [,1] [,2] [,3]
#> [1,] NA -1 -1
#> [2,] -0.7071068 0 0
#> [3,] 0.7071068 1 1
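The NA-aware behaviour above can be mimicked in base R by computing each column's mean and SD over the non-missing entries only; `scale_cols_na` is a hypothetical sketch, not the package's implementation:

```r
# NA-aware column scaling: statistics are computed on observed values
# only; missing entries stay missing
scale_cols_na <- function(X) {
  means <- colMeans(X, na.rm = TRUE)
  sds <- apply(X, 2, sd, na.rm = TRUE)
  sweep(sweep(X, 2, means, "-"), 2, sds, "/")
}

scale_cols_na(matrix(c(NA, 2:9), 3, 3))
#> [,1] [,2] [,3]
#> [1,] NA -1 -1
#> [2,] -0.7071068 0 0
#> [3,] 0.7071068 1 1
```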
The standard R function scale requires ~10x more RAM than the Eigen implementations, which are also several times faster.
library(bench)
fun_bench <- function(n = 1e4, p = 1e3)
{
X <- matrix(rnorm(n*p), n, p)
bench::mark(scale(X), eigen_scale_naive(X), eigen_scale(X), check = FALSE, relative = TRUE)
}
fun_bench()
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 3 x 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 scale(X) 6.44 6.43 1 10.5 2.40
#> 2 eigen_scale_naive(X) 1 1 6.28 1 1.01
#> 3 eigen_scale(X) 1.25 1.29 5.00 1 1
Scaling a matrix in-place requires almost no extra RAM (only the per-column means and standard deviations need to be stored). Thus, the memory footprint is reduced by a factor of roughly n = 1e4 (the number of rows).
This exercise was motivated by this SO question. In general, one expects a scaling function to take one input matrix of raw data (kept untouched) and create a new matrix of per-column scaled data.
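The accounting behind the in-place claim, as a back-of-the-envelope sketch (assuming 8-byte doubles): an out-of-place scaled copy needs n*p values, while in-place scaling only needs the p column means and p SDs:

```r
# Extra memory for an out-of-place scaled copy vs. in-place scaling
n <- 1e4; p <- 1e3
copy_Mb <- n * p * 8 / 2^20    # full n x p copy of doubles
stats_Kb <- 2 * p * 8 / 2^10   # p means + p sds only
round(c(copy_Mb = copy_Mb, stats_Kb = stats_Kb), 2)
#> copy_Mb stats_Kb
#>   76.29    15.62
```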
fun_bench_inplace <- function(n = 1e4, p = 1e3)
{
X <- matrix(rnorm(n*p), n, p)
bench::mark(eigen_scale(X), eigen_scale_inplace(X), check = FALSE)
}
fun_bench_inplace()
#> # A tibble: 2 x 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 eigen_scale(X) 224.4ms 224.4ms 4.46 76.3MB 8.91
#> 2 eigen_scale_inplace(X) 66.2ms 66.6ms 14.8 5.02KB 0
The extra RAM used by the eigen_scale function (compared to eigen_scale_inplace) is equal to the size of the matrix X (76.3 Mb).
n <- 1e4
p <- 1e3
X <- matrix(rnorm(n*p), n, p)
object.size(X) %>% print(units = "auto")
#> 76.3 Mb
rm(X)
- Getting the Q matrix from Eigen::HouseholderQR needs thinning. The code example A_ortho = A.householderQr().householderQ(); returns a matrix Q with a number of columns different from the matrix rank.
- QR decomposition results in the Eigen library differ from those in Matlab:
Note that the different underlying matrix packages (LAPACK and Eigen) most likely implement a different type of QR algorithm.
An illustration of different sets of columns selected by QR in R vs. QR in Eigen:
- QR implemented in R respects the column order;
- QR implemented in Eigen (QR Module) doesn’t care about the column order.
data(X16)
dim(X16)
#> [1] 1000 16
sel
#> [1] 4 16
unsel
#> [1] 1 2 3 5 6 7 8 9 10 11 12 13 14 15
# QR in R
out <- qr(X16)
with(out, head(pivot, rank))
#> [1] 1 3 4 7 15
# QR in Eigen
eigen_qr_keep(X16)
#> [1] 3 7 4 15 6
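The column-order behaviour of R’s QR can be checked with base R alone on a small rank-deficient matrix; the pivoted qr() keeps the earliest independent columns:

```r
# A rank-2 matrix whose third column duplicates the first
set.seed(1)
A <- matrix(rnorm(20), 10, 2)
A <- cbind(A, A[, 1])

out <- qr(A)               # default pivoted QR (LINPACK-style)
out$rank
#> [1] 2
head(out$pivot, out$rank)  # columns kept, in original order
#> [1] 1 2
```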
Given an example matrix with 16 columns, where two columns, 4 and 16, are pre-selected to be retained by QR, the two strategies perform differently.
# 1. QR with pre-selected columns: re-ordering columns fails
try(eigen_qri_keep(X16, sel))
#> Error in eigen_qri_keep(X16, sel) :
#> some selected columns were not kept by QR
# 2. QR with pre-selected columns: project selected columns from X
eigen_qrp_keep(X16, sel)
#> [1] 3 4 6 7 16
# As a reference, usual QR in R
qr(X16)$rank
#> [1] 5
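The projection strategy (strategy 2) can be sketched in base R: regress the remaining columns on the pre-selected ones, then run pivoted QR on the residuals. `qr_project_keep` is a hypothetical illustration of this idea, not the package’s eigen_qrp_keep implementation:

```r
# Keep pre-selected columns `sel`, then pick further independent columns
# among the residuals of the rest after projecting out `sel`
qr_project_keep <- function(X, sel) {
  rest <- setdiff(seq_len(ncol(X)), sel)
  fit <- qr(X[, sel, drop = FALSE])
  R <- qr.resid(fit, X[, rest, drop = FALSE])  # project sel out of the rest
  out <- qr(R)
  sort(c(sel, rest[head(out$pivot, out$rank)]))
}

# Toy check: 5 columns, col 5 = col 1 + col 2, pre-select column 5;
# the result keeps column 5 plus enough others to reach full rank (4)
set.seed(1)
X <- matrix(rnorm(40), 10, 4)
X <- cbind(X, X[, 1] + X[, 2])
qr_project_keep(X, sel = 5)
```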