/ngsF-HMM

Estimation of per-individual inbreeding tracts under a probabilistic framework

Primary LanguageC++GNU General Public License v3.0GPL-3.0

ngsF-HMM

ngsF-HMM is a program to estimate per-individual inbreeding tracts using a two-state Hidden Markov Model (HMM). Furthermore, instead of using called genotypes, it uses a probabilistic framework that takes the uncertainty of genotype's assignation into account; making it specially suited for low-quality or low-coverage datasets.

Citation

ngsF-HMM was published in 2016 at Bioinformatics, so please cite it if you use it in your work:

Vieira FG, Albrechtsen A and Nielsen R
Estimating IBD tracts from low coverage NGS data
Bioinformatics (2016) 32: 2096-2102

Installation

ngsF-HMM can be easily installed but has some external dependencies:

  • Mandatory:
    • gcc: >= 4.9.2 tested on Debian 7.8 (wheezy)
    • zlib: v1.2.7 tested on Debian 7.8 (wheezy)
    • gsl : v1.15 tested on Debian 7.8 (wheezy)
  • Optional (only needed for testing or auxilliary scripts):
    • md5sum
    • Perl packages: Getopt::Long, and IO::Zlib
    • R packages: optparse, and plyr

To install the entire package just download the source code:

% git clone https://github.com/fgvieira/ngsF-HMM.git

and run:

% cd ngsF-HMM
% make

To run the tests (only if installed through ngsTools):

% make test

Executables are built into the main directory. If you wish to clean all binaries and intermediate files:

% make clean

Usage

% ./ngsF-HMM [options] --n_ind INT --n_sites INT --glf glf/in/file --out output/file

Parameters

  • --geno FILE: input file with genotypes, genotype likelihoods or genotype posterior probabilities.
  • --pos FILE: input file with site coordinates (one per line), where the 1st column stands for the chromosome/contig and the 2nd for the position (bp); remaining columns are ignored.
  • --lkl: input are genotype likelihoods (BEAGLE format).
  • --loglkl: input are genotype log-likelihoods.
  • --n_ind INT: sample size (number of individuals).
  • --n_sites INT: total number of sites.
  • --call_geno: call genotypes before running analyses.
  • --freq DOUBLE or CHAR: initial frequency values. Can be defined by user as a DOUBLE, (r)andom, (e)stimated or read from a FILE.
  • --freq_est INT: allele frequency estimation method: 0) fixed (no optimizatopn); 1) assuming independence of sites; and 2) through haplotype frequency.
  • --indF DOUBLE-DOUBLE or CHAR: initial inbreeding and transition parameter values. Can be defined by user as a DOUBLE-DOUBLE, (r)andom, or read from a FILE.
  • --indF_fixed: inbreeding parameter values are fixed (will not be optimized).
  • --alpha_fixed: transition parameter values are fixed (will not be optimized).
  • --out CHAR: prefix for output files.
  • --log INT: dump LOG file each INT iterations. [0]
  • --log_bin: dump LOG file in binary.
  • --min_iters INT: minimum number of EM iterations. [10]
  • --max_iters INT: maximum number of EM iterations. [100]
  • --min_epsilon FLOAT: maximum RMSD between iterations to assume convergence. [1e-5]
  • --n_threads INT: number of threads to use. [1]
  • --verbose INT: selects verbosity level. [1]
  • --seed INT: seed for random number generator.

Input data

As input, ngsF-HMM accepts both genotypes, genotype likelihoods (GP) or genotype posterior probabilities (GP). Genotypes must be input as gziped TSV with one row per site and one column per individual n_sites.n_ind and genotypes coded as [-1, 0, 1, 2]. The file can have a header and an arbitrary number of columns preceeding the actual data (that will all be ignored), much like the Beagle file format (link). As for GL and GP, ngsF-HMM accepts both gzipd TSV and binary formats, but with 3 columns per individual 3.n_sites.n_ind and, in the case of binary, the GL/GP coded as doubles.

Stopping Criteria

An issue on iterative algorithms is the stopping criteria. ngsF-HMM implements a dual condition threshold: relative difference in log-likelihood and estimates RMSD (F and freq). As for which threshold to use, simulations show that 1e-5 seems to be a reasonable value. However, if you're dealing with low coverage data (2x-3x), it might be worth to use lower thresholds (between 1e-6 and 1e-9).

To avoid convergence to local maxima, ngsF-HMM should be run several times from different starting points. To make this task easier, a script (ngsF-HMM.sh) is provided that can be called with the exact same parameters as ngsF-HMM.

Output files

ngsF-HMM will output several files, some depending on input options:

  • .indF: file similar to ngsF output format, where the first line stands for the final log-likelihood, followed by per individual (one per line) inbreeding coefficients (1st column) and transition rate parameters (2nd column) and, finally, the per-site minor allele frequency.
  • .ibd: file storing IBD tracts results, where the first line stores per-individual log-likelihoods. The following lines (one per individual) contain the most-probable inbreeding tracts coded as an INT per site (0: position is not IBD; 1: position is IBD), followed by the IBD posterior probabilities coded as a FLOAT per site (one line per individual).
  • .geno: binary file with genotype posterior probabilities (similar to ANGSD -doGeno 32).
  • .log.gz: if option -log INT is specified, a gziped log file similar to .ibd is printed every INT iterations.
  • .pdf: optionally, the scripts/ngsF-HMMplot.R script can be used to plot the IBD tracts.

Thread pool

The thread pool implementation was adapted from Mathias Brossard's and is freely available from: https://github.com/mbrossard/threadpool