gdsfmt: R Interface to CoreArray Genomic Data Structure (GDS) files

GNU Lesser General Public License, LGPL-3

Features

This package provides a high-level R interface to CoreArray Genomic Data Structure (GDS) data files, which are portable across platforms with hierarchical structure to store multiple scalable array-oriented data sets with metadata information. It is suited for large-scale datasets, especially for data which are much larger than the available random-access memory. The gdsfmt package offers the efficient operations specifically designed for integers of less than 8 bits, since a diploid genotype, like single-nucleotide polymorphism (SNP), usually occupies fewer bits than a byte. Data compression and decompression are available with relatively efficient random access. It is also allowed to read a GDS file in parallel with multiple R processes supported by the package parallel.

Bioconductor:

Release Version: v1.40.1

http://www.bioconductor.org/packages/release/bioc/html/gdsfmt.html

Help Documents

News

Package Vignettes

http://bioconductor.org/packages/release/bioc/vignettes/gdsfmt/inst/doc/gdsfmt.html

Citations

Zheng X, Levine D, Shen J, Gogarten SM, Laurie C, Weir BS (2012). A High-performance Computing Toolset for Relatedness and Principal Component Analysis of SNP Data. Bioinformatics. DOI: 10.1093/bioinformatics/bts606.

Zheng X, Gogarten S, Lawrence M, Stilp A, Conomos M, Weir BS, Laurie C, Levine D (2017). SeqArray -- A storage-efficient high-performance data format for WGS variant calls. Bioinformatics. DOI: 10.1093/bioinformatics/btx145.

Package Maintainer

Dr. Xiuwen Zheng (zhengxwen@gmail.com)

URL

https://bioconductor.org/packages/gdsfmt

https://github.com/zhengxwen/gdsfmt

Installation

Bioconductor repository:

if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
BiocManager::install("gdsfmt")

Development version from Github (for developers/testers only):

library("devtools")
install_github("zhengxwen/gdsfmt")

The install_github() approach requires that you build from source, i.e. make and compilers must be installed on your system -- see the R FAQ for your operating system; you may also need to install dependencies manually.

Copyright Notice

CoreArray C++ library, LGPL-3 License, 2007-2021, Xiuwen Zheng
zlib, zlib License, 1995-2017, Jean-loup Gailly and Mark Adler
LZ4, BSD 2-clause License, 2011-2019, Yann Collet
liblzma, public domain, 2005-2018, Lasse Collin and other xz contributors
README

GDS Command-line Tools

In the R environment,

install.packages("getopt", repos="http://cran.r-project.org")
install.packages("optparse", repos="http://cran.r-project.org")
install.packages("crayon", repos="http://cran.r-project.org")

if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
BiocManager::install("gdsfmt")

viewgds

viewgds is a shell script written in R (viewgds.R), to view the contents of a GDS file. The R packages gdsfmt, getopt and optparse should be installed before running viewgds, and the package crayon is optional.

Usage: viewgds [options] file

Installation with command line,

curl -L https://raw.githubusercontent.com/zhengxwen/Documents/master/Program/viewgds.R > viewgds
chmod +x viewgds

## Or
wget -qO- --no-check-certificate https://raw.githubusercontent.com/zhengxwen/Documents/master/Program/viewgds.R > viewgds
chmod +x viewgds

diffgds

diffgds is a shell script written in R (diffgds.R), to compare two files GDS files. The R packages gdsfmt, getopt and optparse should be installed before running diffgds.

Usage: diffgds [options] file1 file2

Installation with command line,

curl -L https://raw.githubusercontent.com/zhengxwen/Documents/master/Program/diffgds.R > diffgds
chmod +x diffgds

## Or
wget -qO- --no-check-certificate https://raw.githubusercontent.com/zhengxwen/Documents/master/Program/diffgds.R > diffgds
chmod +x diffgds

Examples

library(gdsfmt)

# create a GDS file
f <- createfn.gds("test.gds")

add.gdsn(f, "int", val=1:10000)
add.gdsn(f, "double", val=seq(1, 1000, 0.4))
add.gdsn(f, "character", val=c("int", "double", "logical", "factor"))
add.gdsn(f, "logical", val=rep(c(TRUE, FALSE, NA), 50))
add.gdsn(f, "factor", val=as.factor(c(NA, "AA", "CC")))
add.gdsn(f, "bit2", val=sample(0:3, 1000, replace=TRUE), storage="bit2")

# list and data.frame
add.gdsn(f, "list", val=list(X=1:10, Y=seq(1, 10, 0.25)))
add.gdsn(f, "data.frame", val=data.frame(X=1:19, Y=seq(1, 10, 0.5)))

folder <- addfolder.gdsn(f, "folder")
add.gdsn(folder, "int", val=1:1000)
add.gdsn(folder, "double", val=seq(1, 100, 0.4))

# show the contents
f

# close the GDS file
closefn.gds(f)

File: test.gds (1.1K)
+    [  ]
|--+ int   { Int32 10000, 39.1K }
|--+ double   { Float64 2498, 19.5K }
|--+ character   { Str8 4, 26B }
|--+ logical   { Int32,logical 150, 600B } *
|--+ factor   { Int32,factor 3, 12B } *
|--+ bit2   { Bit2 1000, 250B }
|--+ list   [ list ] *
|  |--+ X   { Int32 10, 40B }
|  \--+ Y   { Float64 37, 296B }
|--+ data.frame   [ data.frame ] *
|  |--+ X   { Int32 19, 76B }
|  \--+ Y   { Float64 19, 152B }
\--+ folder   [  ]
   |--+ int   { Int32 1000, 3.9K }
   \--+ double   { Float64 248, 1.9K }

Also See

pygds: Python interface to CoreArray Genomic Data Structure (GDS) files

jugds.jl: Julia interface to CoreArray Genomic Data Structure (GDS) files