/bsh-denovo

Sample genotypes from BAM alignments -- discover sites de novo

Primary LanguageCMIT LicenseMIT

[B]AM [S]ample [H]aplotypes - denovo

About

bsh-denovo aims to identify sites which show variation across a set of samples, and to randomly sample pseudo-haploid genotypes at these positions. Alignment data is supplied in a multi-sample BAM file, where samples are identified using read groups (RG-tags). Variable positions are identified by applying a set of filters within and across samples (see below). After identification of variable positions, a base is sampled randomly from each read group. The output is generated as a .map and .ped file pair (see here for file format descriptions) as used by PLINK.

In addition to fully random sampling, bsh-denovo also implements the Consensify method (see here).

Get it

To use bsh-denovo, you will need a Linux machine where the gcc compiler, as well as the libz, libm and libpthread system libraries are installed. To acquire bsh-denovo, first clone the repository recursively:

git clone --recursive https://github.com/clwgg/bsh-denovo

This will clone both the bsh-denovo code, as well as the htslib module, which is its only external dependency. After cloning, first compile the submodule, and then the bsh-denovo code:

cd bsh-denovo
make submodules
make

This will create the bsh-denovo binary, which you can copy or move anywhere for subsequent use.

Updating

When updating to the current version, please make sure to also update the submodules:

git pull origin master
git submodule update
make submodules
make

Usage

Usage: ./bsh-denovo [options] multi_RG.bam

Options:
	-d	Produce stats file for debugging and analytics.
	-o	Output file base (default: out)
	-q	Minimum mapping quality (default: 0)

    Position discovery (options here don't apply to the sampling of genotypes):
	-m	Data completeness required across samples.
	  	Can be specified as a fraction (0 to 1) or an absolute count (>1).
	  	(per position, default: 0.5)
	-a	Minimum in-sample frequency a base must have to be considered.
	  	(within sample, default: highest count - random if tied)
	-f	Minimum minor allele frequency (MAF) across samples.
	  	Can be specified as a fraction or an absolute count.
	  	When fraction: MAF at each site relative to non-missing individuals.
	  	When count:    Minor allele count relative to total individual count.
	  	(default: 1/n)
	-i	Report also invariant sites.
	  	This disables '-f', but '-a' and '-m' are still applied.
	  	(default: off)

    Base sampling:
	-c	Use Consensify method for base sampling.


Input BAM file

The input BAM file should contain mapped reads from multiple samples, merged into one file. Samples are distinguished by their RG-tag, which can be set as an option in most mapping tools. These single-sample BAMs labelled with an RG-tag are then merged into one file using e.g. samtools merge.

A note on site discovery vs. base sampling

Please note, that the filters -m, -a, and -f are only applied during position discovery. Restrictions on base sampling, such as the Consensify method, are applied subsequent to the position discovery, and may introduce additional missing data. Therefore, the final genotypes in the PED file may have more missing data, or a lower minimum minor allele frequency, than specified during position discovery. Similar filters on the final sampled genotypes can be applied for example using PLINK.

The stats file

The stats file is a TAB separated file for diagnostics and debugging. Please note that the file is written in plain text, and can get quite large for large sample sets. The stats file tracks the sequence ID and the position in that sequence for each identified site. At these sites, it reports the observed raw counts of A, C, G and T (without sampling), as well as the sampled base, for every sample (Read Group). The columns are:

SEQ\tPOS\tRG\tA\tC\tG\tT\tSM\n