David Poznik
23andMe
October, 2016
yHaplo identifies the Y-chromosome haplogroup of each male in a sample of one to millions. It does not rely on any particular genotyping modality or platform, and it is robust to missing data, genotype errors, mutation recurrence, and other complications. Although full sequences yield the most granular haplogroup classifications, genotyping arrays can yield reliable calls, provided a reasonable number of phylogenetically informative variants has been assayed.
Briefly, haplogroup calling involves two steps. The program first builds an internal representation of the Y-chromosome phylogeny by reading its primary structure from (Newick-formatted) text and importing phylogenetically informative SNPs from the ISOGG database, affiliating each SNP with the appropriate node and growing the tree as necessary. It then traverses the tree for each individual, identifying for each the path of derived alleles leading to a haplogroup designation.
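The two steps can be illustrated with a toy sketch. It is not yHaplo's implementation: the Node class, the parse_newick, index_nodes, and call_haplogroup helpers, the four-haplogroup tree, and the SNP positions and alleles are all made up for illustration. The greedy walk also ignores the missing-data, genotype-error, and recurrence handling that the real traversal provides.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One branch of the toy tree (hypothetical, not yHaplo's Node class)."""
    label: str
    children: list = field(default_factory=list)
    snps: dict = field(default_factory=dict)  # position -> derived allele


def parse_newick(text):
    """Parse a label-only Newick string that ends in ';', e.g. "(A,(B,C)BT)Y;"."""
    pos = 0

    def parse_node():
        nonlocal pos
        children = []
        if text[pos] == "(":
            pos += 1  # consume "("
            children.append(parse_node())
            while text[pos] == ",":
                pos += 1  # consume ","
                children.append(parse_node())
            pos += 1  # consume ")"
        start = pos
        while text[pos] not in ",);":
            pos += 1
        return Node(text[start:pos], children)

    return parse_node()


def index_nodes(root):
    """Map each haplogroup label to its node."""
    nodes, stack = {}, [root]
    while stack:
        node = stack.pop()
        nodes[node.label] = node
        stack.extend(node.children)
    return nodes


def call_haplogroup(root, genotypes):
    """Greedy walk: step into the first child showing a derived allele."""
    node = root
    while True:
        for child in node.children:
            if any(genotypes.get(p) == a for p, a in child.snps.items()):
                node = child
                break
        else:
            return node.label  # no child carries a derived allele


# Step 1: build the primary tree, then attach (made-up) SNPs to nodes.
tree = parse_newick("(A,(B,C)BT)Y;")
nodes = index_nodes(tree)
for label, position, derived in [("A", 100, "T"), ("BT", 200, "G"),
                                 ("B", 300, "C"), ("C", 400, "A")]:
    nodes[label].snps[position] = derived

# Step 2: traverse the tree for one individual ("." marks a missing call).
genotypes = {100: "C", 200: "G", 300: ".", 400: "A"}
print(call_haplogroup(tree, genotypes))  # prints "C"
```

In this toy, an individual with no derived alleles stays at the root, so the call is only as deep as the most derived branch with supporting data.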
yHaplo is available for non-commercial use pursuant to the terms of the non-exclusive license agreement, LICENSE.txt. To learn more about the algorithm, please see our bioRxiv pre-print:
Poznik GD. 2016. Identifying Y-chromosome haplogroups in arbitrarily large samples of sequenced or genotyped men. bioRxiv doi: 10.1101/088716
And, to learn more about the software, please see the manual, yHaplo.manual.pdf.
Please note that yHaplo does not check for sex status; it assumes all samples are male.
input/
- y.tree.primary.DATE.nwk: primary structure of the Y-chromosome tree
- isogg.DATE.txt: phylogenetically informative SNPs
- isogg.correct.*.txt: corrections to ISOGG data
- isogg.omit.*.txt: SNPs to drop due to inconsistencies observed in test data
- isogg.multiallelic.txt: physical coordinates of multiallelic sites to be excluded
- representative.SNPs.*.txt: SNPs deemed representative of corresponding haplogroups
- .genos.txt: sample-major genotypes
  - row 1: physical coordinates
  - column 1: individual IDs
  - cell (i, j): genotype for individual i at position j, encoded as a single character from the set { A, C, G, T, . }, with "." representing an unobserved value
- .resid.txt: file with 23andMe research IDs in the first column
- .vcf, .vcf.gz: SNP-major VCF file
- .vcf4: SNP-major pseudo-VCF. Differences include:
  - no "#" in header row
  - fewer header columns
  - GT values recorded as { A, C, G, T, . } rather than { 0, 1, . }
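As a rough illustration of the .genos.txt layout described above, the following sketch reads a sample-major genotype table into a dictionary. It is not taken from yHaplo: the read_genos helper is hypothetical, and it assumes whitespace-delimited fields and a header row whose first field merely labels the ID column. The sample IDs, positions, and genotypes are made up.

```python
import io


def read_genos(handle):
    """Return (positions, {individual ID: {position: allele}})."""
    header = handle.readline().split()
    # Assumption: header[0] labels the ID column; the rest are coordinates.
    positions = [int(p) for p in header[1:]]
    genotypes = {}
    for line in handle:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        sample_id, calls = fields[0], fields[1:]
        if len(calls) != len(positions):
            raise ValueError(f"{sample_id}: expected {len(positions)} genotypes")
        # Drop unobserved values, which are encoded as "."
        genotypes[sample_id] = {
            pos: call for pos, call in zip(positions, calls) if call != "."
        }
    return positions, genotypes


# Made-up example: two individuals genotyped at three positions.
example = io.StringIO(
    "ID\t100\t200\t300\n"
    "s1\tT\tG\t.\n"
    "s2\tC\t.\tC\n"
)
positions, genotypes = read_genos(example)
print(genotypes["s1"])  # prints {100: 'T', 200: 'G'}
```

Dropping "." entries at read time keeps missing calls out of any later comparison against derived alleles.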
- callHaplogroups.py: for an overview of command-line options, issue the following command: callHaplogroups.py -h
- Tree: knows root, depth, haplogroup-to-node mappings, etc.; parses a Newick file to build primary tree; parses ISOGG table to add SNPs to nodes and grow tree; finds the derived path leading from the root to an individual
- Node: element of the tree. Knows parent, children, SNPs, etc. Represents the branch that leads to it
- SNP: knows position, ancestral and derived alleles, node, etc.
- PlatformSNP: knows position and ablock index
- Sample: knows an individual's genotypes and haplogroup
- Customer: (subclass of Sample) has 23andMe metadata and genotypes from ablocks
- Path: path through a tree; stores the next node to visit, a list of SNPs observed in the derived state, the most derived SNP observed, and the number of ancestral alleles encountered
- Page: 23andMe content page labels
- Config: container for parameters, command-line options, and filenames
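To make the Path description concrete, here is a minimal sketch of a record holding the four pieces of state listed above. The PathSketch name, its field names, and the record method are hypothetical and are not yHaplo's actual class. The sketch assumes observations arrive in root-to-tip order, so the last derived SNP recorded is the most derived one. The SNP names are placeholders.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class PathSketch:
    """Hypothetical stand-in for the state a Path is described as storing."""
    next_node: Optional[str] = None         # label of the next node to visit
    derived_snps: list = field(default_factory=list)
    most_derived_snp: Optional[str] = None  # last derived SNP seen so far
    num_ancestral: int = 0

    def record(self, snp_name, is_derived):
        """Log one observed SNP; assumes calls are made root to tip."""
        if is_derived:
            self.derived_snps.append(snp_name)
            self.most_derived_snp = snp_name
        else:
            self.num_ancestral += 1


# Placeholder SNP names: two derived observations and one ancestral.
path = PathSketch(next_node="BT")
path.record("snp1", True)
path.record("snp2", False)
path.record("snp3", True)
print(path.most_derived_snp, path.num_ancestral)  # prints "snp3 1"
```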
- utils.py: shared utility functions
- convert2genos.py: converts data to .genos.txt format
- plotTree.py: plots a Newick tree