RcppStatGen

RcppStatGen is an R/RcppEigen package with code snippets for Statistical Genetics tasks.

Installation

You can install the development version from GitHub with:

# install.packages("devtools")
devtools::install_github("variani/RcppStatGen")

Examples

library(RcppStatGen)
library(magrittr) # for %>%

eigen_version()

Basic example of scaling a numeric matrix.

iris[, -5]  %>% head %>% as.matrix %>% eigen_scale
#>            [,1]       [,2]       [,3]       [,4]
#> [1,]  0.5206576  0.3401105 -0.3627381 -0.4082483
#> [2,] -0.1735525 -1.1175060 -0.3627381 -0.4082483
#> [3,] -0.8677627 -0.5344594 -1.0882144 -0.4082483
#> [4,] -1.2148677 -0.8259827  0.3627381 -0.4082483
#> [5,]  0.1735525  0.6316338 -0.3627381 -0.4082483
#> [6,]  1.5619728  1.5062037  1.8136906  2.0412415

Example of scaling a matrix with missing data.

matrix(1:9, 3, 3) %>% eigen_scale_naive
#>      [,1] [,2] [,3]
#> [1,]   -1   -1   -1
#> [2,]    0    0    0
#> [3,]    1    1    1
matrix(c(NA, 2:9), 3, 3) %>% eigen_scale_naive
#>      [,1] [,2] [,3]
#> [1,]   NA   -1   -1
#> [2,]   NA    0    0
#> [3,]   NA    1    1
matrix(c(NA, 2:9), 3, 3) %>% eigen_scale
#>            [,1] [,2] [,3]
#> [1,]         NA   -1   -1
#> [2,] -0.7071068    0    0
#> [3,]  0.7071068    1    1

Benchmarks

Scale matrix

The standard R function scale requires ~10x more RAM than its Eigen implementations. Eigen implementations are also several times faster.

library(bench)

fun_bench <- function(n = 1e4, p = 1e3)
{
  X <- matrix(rnorm(n*p), n, p)
  bench::mark(scale(X), eigen_scale_naive(X), eigen_scale(X), check = FALSE, relative = TRUE)
}
fun_bench()
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 3 x 6
#>   expression             min median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>           <dbl>  <dbl>     <dbl>     <dbl>    <dbl>
#> 1 scale(X)              6.44   6.43      1         10.5     2.40
#> 2 eigen_scale_naive(X)  1      1         6.28       1       1.01
#> 3 eigen_scale(X)        1.25   1.29      5.00       1       1

Scale matrix in-place

Scaling matrix in-place requires almost no extra RAM (just needed to store means/sds for each column). Thus, the memory footprint is reduced by a factor of n = 1e4 (the number of rows).

This exercise was motivated by this SO question. In general, one expects from scaling function to take one input matrix of raw data (to keep untouched) and create another new matrix of per-column scaled data.

fun_bench_inplace<- function(n = 1e4, p = 1e3)
{
  X <- matrix(rnorm(n*p), n, p)
  bench::mark(eigen_scale(X), eigen_scale_inplace(X), check = FALSE)
}
fun_bench_inplace()
#> # A tibble: 2 x 6
#>   expression                  min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>             <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 eigen_scale(X)          224.4ms  224.4ms      4.46    76.3MB     8.91
#> 2 eigen_scale_inplace(X)   66.2ms   66.6ms     14.8     5.02KB     0

The extra RAM used by eigen_scale function (compared to eigen_scale_inplace) is equal to the size of matrix X (76Mb).

n <- 1e4
p <- 1e3
X <- matrix(rnorm(n*p), n, p)
object.size(X) %>% print(units = "auto")
#> 76.3 Mb
rm(X)

QR decomposition

Getting Q matrix from Eigen::HouseholderQR needs thinning. Code example. Code A_ortho = A.householderQr().householderQ(); link will return a matrix Q with the number of columns different from the matrix rank.
QR Decomposition results in eigen library differs from Matlab:

Note that the different underlying matrix packages (LAPACK and Eigen) most likely implement a different type of QR algorithm.

An illustration of different sets of columns selected by QR in R vs. QR in Eigen:

QR implemented in R respects the column order;
QR implemented in Eigen (QR Module) doesn’t care about the colum order.

data(X16)
dim(X16)
#> [1] 1000   16
sel
#> [1]  4 16
unsel
#>  [1]  1  2  3  5  6  7  8  9 10 11 12 13 14 15

# QR in R
out <- qr(X16)
with(out, head(pivot, rank))
#> [1]  1  3  4  7 15

# QR in Eigen
eigen_qr_keep(X16)
#> [1]  3  7  4 15  6

Given an example matrix with 16 columns and two columns, 4 and 16, selected to be retained by QR, two strategies perform differently.

# 1. QR with pre-selected columns: re-ordering columns fails
try(eigen_qri_keep(X16, sel))
#> Error in eigen_qri_keep(X16, sel) : 
#>   some selected columns were not kept by QR

# 2. QR with pre-selected columns: projeced seleced columns from X
eigen_qrp_keep(X16, sel)
#> [1]  3  4  6  7 16

# As a reference, usual QR in R
qr(X16)$rank
#> [1] 5