plinktools: A Python repository from iainrb

Plinktools: Python code to process Plink genotyping files

Author: Iain Bancarz, ib5@sanger.ac.uk

1. INTRODUCTION

Plinktools is a Python package for processing Plink format genotyping data. It 
was developed for the following applications, which the standard Plink 
executable does not support.

1.1 Equivalence test

Plink does not directly support an equivalence test on two datasets. The 
--merge option does allow a diff on SNP calls, but this does not check whether 
the SNP and sample sets match. In addition, certain Plink operations (such as 
merge) may transpose the major and minor alleles in a SNP set, and such a 
transpose is regarded as a mismatch by the Plink diff.

Plinktools allows an equivalence test on pairs of binary or non-binary Plink 
files, with major/minor allele swaps regarded as matching.
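
As an illustration of what "allele swaps regarded as matching" means, here is 
a minimal sketch. It is not the code used by compare.py. The function name and 
the genotype coding (0 = homozygous for the first allele, 1 = heterozygous, 
2 = homozygous for the second allele, None = no call) are assumptions made for 
this example.

```python
# Hypothetical genotype coding for this sketch only:
#   0 = homozygous first allele, 1 = heterozygous,
#   2 = homozygous second allele, None = missing call.
SWAP = {0: 2, 1: 1, 2: 0, None: None}


def snp_calls_equivalent(alleles_a, calls_a, alleles_b, calls_b):
    """Return True if two SNPs carry the same calls, allowing an allele swap.

    alleles_a, alleles_b: (first_allele, second_allele) tuples.
    calls_a, calls_b: lists of genotype codes, one per sample, same order.
    """
    if len(calls_a) != len(calls_b):
        return False
    if alleles_a == alleles_b:
        # Same allele order: calls must match exactly.
        return list(calls_a) == list(calls_b)
    if alleles_a == tuple(reversed(alleles_b)):
        # Alleles transposed: homozygous codes flip, het and missing do not.
        return [SWAP[c] for c in calls_a] == list(calls_b)
    # Different alleles altogether: not equivalent.
    return False
```

With this coding, calls [0, 1, 2] on alleles (A, G) are treated as equivalent 
to calls [2, 1, 0] on alleles (G, A), whereas a plain diff would report two 
mismatches.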

1.2 Fast binary merge

The standard Plink merge function performs extended cross-checking and recoding 
of data. This is rather slow, and does not scale well with an increasing 
number of inputs.

To address this issue, Plinktools implements a fast binary merge for two 
important special cases: congruent SNP sets with disjoint samples, and 
congruent samples with disjoint SNPs. In both cases, input and output are in 
the default SNP-major format. In the case of congruent samples, the fast 
merge strips off headers and concatenates the input .bed files with no need 
for recoding. For congruent SNPs, a fast merge can be done if the number of 
samples in each input is divisible by 4: each .bed byte packs four genotypes, 
so the data for each SNP then ends on a byte boundary in every input. 
Otherwise recoding is necessary and the merge will be slowed, although it is 
still somewhat faster than Plink.
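
The two fast cases can be sketched as below. This is an illustrative 
in-memory version, not the implementation in merge_bed.py, and all function 
names are hypothetical. It assumes the standard SNP-major .bed layout: a 
3-byte header (0x6C 0x1B 0x01) followed by one block per SNP, each block 
holding four genotypes per byte. The accompanying .bim and .fam files would 
need to be merged separately.

```python
BED_HEADER = bytes([0x6C, 0x1B, 0x01])  # magic number + SNP-major flag


def bytes_per_snp(n_samples):
    """Size of one SNP block: four 2-bit genotypes per byte, rounded up."""
    return (n_samples + 3) // 4


def strip_header(bed):
    """Check the 3-byte header and return the genotype data after it."""
    if bed[:3] != BED_HEADER:
        raise ValueError("not a SNP-major .bed file")
    return bed[3:]


def merge_congruent_samples(beds):
    """Same samples, disjoint SNPs: SNP blocks are simply concatenated."""
    return BED_HEADER + b"".join(strip_header(bed) for bed in beds)


def merge_congruent_snps_aligned(beds, sample_counts, n_snps):
    """Same SNPs, disjoint samples, every sample count divisible by 4.

    Each input's SNP block ends on a byte boundary, so the merged block for
    a SNP is the inputs' blocks for that SNP joined end to end.
    """
    if any(n % 4 for n in sample_counts):
        raise ValueError("sample counts must be divisible by 4; recode instead")
    bodies = [strip_header(bed) for bed in beds]
    widths = [bytes_per_snp(n) for n in sample_counts]
    merged = bytearray(BED_HEADER)
    for snp in range(n_snps):
        for body, width in zip(bodies, widths):
            merged += body[snp * width:(snp + 1) * width]
    return bytes(merged)
```

If a sample count is not divisible by 4, the last byte of each SNP block is 
only partly filled, so the following input's genotypes would have to be 
shifted into it bit by bit. That is the recoding step mentioned above.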

1.3 Heterozygosity calculation by high/low MAF

As part of the Wellcome Trust Oxford SOP for exome chip QC, heterozygosity for 
each sample is calculated separately for SNPs with minor allele frequency 
above and below 1%. This test has been implemented in Plinktools.
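
The calculation can be sketched roughly as follows. This is not the code in 
het_by_maf.py. The function names and the genotype coding (0, 1 or 2 copies 
of one allele, None for a missing call) are assumptions for this example. 
So are the definition of heterozygosity as the fraction of a sample's 
non-missing calls that are heterozygous, and the choice to place a MAF of 
exactly 1% in the high class.

```python
def minor_allele_frequency(calls):
    """MAF of one SNP from its non-missing calls, or None if all are missing.

    calls: one code per sample, giving the number of copies (0, 1 or 2) of
    one of the two alleles, or None for a missing call.
    """
    observed = [c for c in calls if c is not None]
    if not observed:
        return None
    freq = sum(observed) / (2 * len(observed))
    return min(freq, 1 - freq)


def het_by_maf(genotypes, threshold=0.01):
    """Per-sample heterozygosity for SNPs with MAF above/below the threshold.

    genotypes: list of SNPs, each a list of per-sample calls (same order).
    Returns {"high": [...], "low": [...]}, one value per sample; a value is
    None if the sample has no non-missing calls in that class.
    """
    n_samples = len(genotypes[0]) if genotypes else 0
    # counts[cls][sample] = [heterozygous calls, non-missing calls]
    counts = {cls: [[0, 0] for _ in range(n_samples)]
              for cls in ("high", "low")}
    for calls in genotypes:
        maf = minor_allele_frequency(calls)
        if maf is None:
            continue  # SNP has no calls at all; skip it
        cls = "high" if maf >= threshold else "low"  # assumed tie rule
        for sample, call in enumerate(calls):
            if call is None:
                continue
            counts[cls][sample][1] += 1
            if call == 1:
                counts[cls][sample][0] += 1
    return {cls: [het / total if total else None for het, total in rows]
            for cls, rows in counts.items()}
```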

2. USAGE

Plinktools includes three front-end scripts:

compare.py to compare two Plink datasets (binary or non-binary)
merge_bed.py to merge two or more Plink binary datasets
het_by_maf.py to compute heterozygosity for high/low MAF

Run any script with --help for more information.

3. SEE ALSO

The Plink data format was created by Shaun Purcell et al. See: 
http://pngu.mgh.harvard.edu/~purcell/plink

Plinktools was created to support the WTSI genotyping pipeline: 
https://github.com/wtsi-npg/genotyping

Plinktools is a prerequisite for the WTSI extension of the zCall genotype 
caller: https://github.com/wtsi-npg/zCall