How to create a huge on-disk array directly from R
kokitsuyuzaki opened this issue · 4 comments
To analyze a huge multi-dimensional array, I checked some on-disk implementations such as DelayedArray, HDF5Array, and TileDBArray, but all of them seem to assume that the huge array is already stored in HDF5 or TileDB. If we want to create an on-disk array from within R, we can only create a small array that fits in memory.
For example, in the code below, small_arr can be created but large_arr cannot, because the huge array must first be created in memory before it is written to disk as an HDF5Array.
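library("HDF5Array")
# Works: the 10 x 20 x 30 array easily fits in memory.
small_arr <- HDF5Array::writeHDF5Array(
    array(runif(10*20*30), dim=c(10,20,30)))
# Fails: the full array (1e10 doubles, roughly 80 GB) must first exist in memory.
large_arr <- HDF5Array::writeHDF5Array(
    array(runif(10000*1000*1000), dim=c(10000,1000,1000)))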
Would it be possible to create an HDF5 file and later define the size and values of the arrays to be stored in it as follows?
large_arr <- HDF5ArraySeed(
    filepath = "seed.h5",
    name = "arr"
)
# size
dim(large_arr) <- c(10000,1000,1000)
# value
for(i in seq(10000)){
    large_arr[i,,] <- sample(1000*1000)
}
However, I found that HDF5Array (HDF5ArraySeed) assumes that the input object is an array:
HDF5Array/R/HDF5ArraySeed-class.R
Line 7 in 5289a84
IMHO this question is more relevant to rhdf5; I will mention TileDB at the end. You can initialize an HDF5 disk store with known or unlimited dimensions and then populate it piece by piece, in pieces as large as your system's memory allows. If it is not clear how to do this with instructions at, e.g., https://bioconductor.org/packages/release/bioc/vignettes/rhdf5/inst/doc/rhdf5.html#creating-an-hdf5-file-and-group-hierarchy, go as far as you can and then pose further questions at support.bioconductor.org, tagging with rhdf5. This probably should be an FAQ. Once you have the HDF5 resource on disk, you can apply DelayedArray methods. For TileDB, it might be effective to pose the question at https://forum.tiledb.com/, but support.bioconductor.org is also an option.
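A minimal sketch of the rhdf5 approach described above, assuming a dataset with known dimensions that is filled one slab at a time. The dimensions are deliberately tiny for illustration (not the 10000 x 1000 x 1000 from the question), and the temporary file path, the dataset name "arr", and the chunk layout are illustrative choices rather than recommendations.

```r
library(rhdf5)

# Illustrative file location and (small) dimensions; scale up as disk space allows.
h5file <- tempfile(fileext = ".h5")
dims <- c(100L, 50L, 40L)

# Create an empty file, then an empty dataset of the final size.
# Nothing of size prod(dims) is ever allocated in memory.
h5createFile(h5file)
h5createDataset(h5file, "arr", dims = dims,
                storage.mode = "double",
                chunk = c(1L, dims[2], dims[3]))  # one chunk per slab written below

# Populate the dataset slab by slab along the first dimension.
for (i in seq_len(dims[1])) {
    slab <- array(runif(dims[2] * dims[3]), dim = c(1L, dims[2], dims[3]))
    # NULL in 'index' selects all indices along that dimension.
    h5write(slab, file = h5file, name = "arr", index = list(i, NULL, NULL))
}
h5closeAll()
```

Once the file exists, something like HDF5Array::HDF5Array(h5file, "arr") should give a DelayedArray view of it without loading the data.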
As Vince said, writing your own arbitrary data to an HDF5 file doesn't need to involve DelayedArray objects and can be easily achieved with plain use of the rhdf5 package. However, the DelayedArray/HDF5Array framework provides RealizationSink objects and the write_block() function to make this more convenient, and to abstract away the details of the particular backend being used (e.g. HDF5 file or TileDB). This helps make the code simpler, easier to understand, and portable across backends.
See ?write_block in the DelayedArray package for more information.
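As a rough sketch of the RealizationSink + write_block() pattern, assuming the HDF5 backend and a recent DelayedArray/HDF5Array release: the dimensions are kept small for illustration, the grid spacing (one slab per index along the first dimension) is an arbitrary choice, and the sink writes to an automatically chosen file because no filepath is supplied.

```r
library(HDF5Array)  # also attaches DelayedArray

# Illustrative (small) dimensions; the full array is never held in memory.
dims <- c(100L, 50L, 40L)

# An on-disk sink of the final size, backed by an HDF5 file.
sink <- HDF5RealizationSink(dim = dims, type = "double")

# A grid of viewports over the sink: here one 1 x 50 x 40 slab per block.
grid <- RegularArrayGrid(dims, spacings = c(1L, dims[2], dims[3]))

for (bid in seq_along(grid)) {
    viewport <- grid[[bid]]
    # Generate only the data for this block, then write it to disk.
    block <- array(runif(prod(dim(viewport))), dim = dim(viewport))
    sink <- write_block(sink, viewport, block)
}

close(sink)
large_arr <- as(sink, "DelayedArray")
```

Swapping the sink constructor for another backend's RealizationSink should leave the loop unchanged, which is the portability benefit mentioned above.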
Please note that, whatever you use (rhdf5 package directly or RealizationSink object + write_block()), there's no requirement that the array must be small enough to fit in memory. You should be able to create on-disk arrays of arbitrary size as long as you have enough space on your hard drive.
H.
Great. I'm closing this. If you have further questions, please open a new issue (on the rhdf5 repo if the question is rhdf5 related), or ask on the Bioconductor support site. Thanks!