/terastructure

TeraStructure is a new algorithm to fit Bayesian models of genetic variation in human populations on tera-sample-sized data sets (10^12 observed genotypes, i.e., 1M individuals at 1M SNPs). This package provides a scalable, multi-threaded C++ implementation that can be run on a single computer.

Primary LanguageC++GNU General Public License v3.0GPL-3.0

This package implements a scalable, multi-threaded implementation of the TeraStructure algorithm for fitting a Bayesian model of genetic variationin human populations on tera-sample-sized data sets (10^12 observed genotypes, e.g., 1M individuals at 1M SNPs).

Installation

Required libraries: gsl, gslblas, pthread

On Linux/Unix run

./configure
make; make install

On Mac OS, the location of the required gsl, gslblas and pthread libraries may need to be specified:

./configure LDFLAGS="-L/opt/local/lib" CPPFLAGS="-I/opt/local/include"
make; make install

The binary terastructure will be installed in /usr/local/bin unless a different prefix is provided to configure. (See pkg/INSTALL.)

Citation

Fitting probabilistic models of genetic variation on millions of humans
P. Gopalan, W. Hao, D.M. Blei, J.D. Storey
bioRxiv link: http://biorxiv.org/content/early/2014/12/24/013227

Abstract

The goal of population genetics is to quantitatively understand variation of genetic polymorphisms among individuals. Researchers have developed sophisticated statistical methods to capture the complex population structure that underlies observed genotypes in humans. The number of humans that have been densely genotyped across the genome has grown significantly in recent years. In aggregate about 1M individuals have been densely genotyped to date, and if we could analyze this data then we would have a nearly complete picture of human genetic variation. Existing state-of-the-art methods, however, cannot scale to data of this size. To this end, we have developed TeraStructure.

TeraStructure is a new algorithm to fit Bayesian models of genetic variation in human populations on tera-sample-sized data sets (10^12 observed genotypes, e.g., 1M individuals at 1M SNPs). It is a principled approach to approximate Bayesian inference that iterates between subsampling locations of the genome and updating an estimate of the latent population structure. On real and simulated data sets of up to 10K individuals, TeraStructure is twice as fast as existing methods and recovers the latent population structure with equal accuracy. On genomic data simulated at the tera-sample-size scales, TeraStructure continues to be accurate and is the only method that can complete its analysis.

TeraStructure: Population genetics inference software

terastructure [OPTIONS]

    -help            usage
    -file <name>     location by individuals matrix of SNP values (0,1,2)
    -n <N>           number of individuals
    -l <L>           number of locations
    -k <K>           number of populations
    -batch           run batch variational inference
    -stochastic      run stochastic variational inference
    -label           descriptive tag for the output directory

OPTIONAL
-file-suffix	 save files with the corresponding iteration as suffix
    -force           overwrite existing output directory
    -rfreq <val>     checks for convergence and logs output every <val> iterations
    -idfile <name>	 file containing individual name/meta-data, one per line
-seed <val>	 value is a real number (read as "double")
      		 sets the seed for the GSL library