/STRDenovoTools

Toolkit for calling and analyzing de novo STR mutations

Primary LanguageC++GNU General Public License v3.0GPL-3.0

Build Status DOI

MonSTR

MonSTR is a toolkit for identifying and analyzing de novo tandem repeat mutations. The core code for calling de novos and parsing VCF files is modified from that originally written by Thomas Willems in the HipSTR repository.

Authors:

Table of contents:

  1. Installation instructions
  2. Basic usage
  3. Preparing your input VCF file
  4. Detailed filtering options
  5. Specifying mutation models
  6. Output file formats
  7. Additional scripts
  8. Citing MonSTR

Install

You can install MonSTR using the following commands:

git clone https://github.com/gymreklab/STRDenovoTools  
cd STRDenovoTools
mkdir build
cd build
cmake ..
make
sudo make install

You will need to have a C-compiler and cmake installed. Type MonSTR --help to check that MonSTR installed successfully.

Basic usage

MonSTR takes as input a VCF file generated by HipSTR or GangSTR and a pedigree file in .fam format describing family relationships. It outputs text files describing mutation probabilities at each locus in each trio.

Important The VCF files must contain genotype likelihoods unless running in naive mode. The VCFs must have been run on all samples in a family jointly so that their genotype likelihood fields contain info for the same set of alleles. See "Preparing your VCF file" below. Sample IDs in the pedigree file must match exactly to those in the VCF header.

Below shows a basic MonSTR command:

MonSTR \
    --strvcf test.vcf.gz \
    --fam family.fam \
    [--gangstr] \
    --out test

By default, a HipSTR VCF is expected. If using a GangSTR VCF, use the option --gangstr. VCFs must be sorted and indexed using tabix. This can be accomplished by:

bgzip file.vcf
tabix -p vcf file.vcf.gz

The following example command can be run using test data available in the TestData folder of this repository:

MonSTR \
  --strvcf TestData/Q3.filtered.vcf.gz \
  --fam TestData/TestQuadFam.ped \
  --out test \
  --gangstr

Type MonSTR --help, or the documentation below, to see more detailed information about which options are available.

Preparing your VCF file

By default, MonSTR computes a joint likelihood of all possible genotype configurations at each locus in trio. To do so, it must have detailed genotype likelihood information specifying the likelihood of each possible diploid genotype, and the set of possible genotypes considered must be identical in each family member. Therefore, when generating your input VCF file with HipSTR or GangSTR, you must:

  • Run HipSTR or GangSTR jointly on all family members. If they are genotyped separately, each VCF file might contain different sets of alternate alleles, making GL fields incomparable across files. Note, you may include multiple families in a single run and output VCF file, as long as families are not split across VCFs.
  • If using HipSTR, use the --output-gls option to include the GL field in the VCF output. This is a standard VCF fielf for specifying genotype likelihoods (see VCF spec: https://samtools.github.io/hts-specs/VCFv4.2.pdf)
  • If using GangSTR, use the --include-ggl option to include the GGL field. This is a modified version of the GL field to accommodate the larger number of alternate alleles considered by GangSTR. It is described on the GangSTR webpage.

The sample IDs in the VCF output file must match those in the .fam file and should not be repeated across families.

Note, GangSTR only estimates the number of repeat units. It will not output sequence differences, or differences from the reference allele consisting of a non-integer number of repeats. HipSTR can detect sequence differences, or non-integer changes in repeat copy number. To collapse alleles or the same length, or round alleles to the nearest repeat copy number, use the following options:

  • --combine-alleles-by-length: Collapse alleles of the same length to one.
  • --round-alleles: Round allele lengths to nearest repeat unit.

Filtering options

MonSTR provides the following filtering options:

General filters:

  • --min-coverage <INT>: Discard calls with less than this much coverage, based on the VCF FORMAT:DP field. Default: 0.
  • --min-score <FLOAT>: Discard calls with less than this score, based on the VCF FORMAT:Q field. Default: 0.
  • --filter-hom: Filter calls where the child is homozygous for new allele.
  • --require-all-chilren: Discard loci in family where not all children have calls. For example, if processing a quad family, but only one child has a call, discard the locus.
  • --family <STR>: Restrict to analyzing this family (based on the family ID in the .fam file).
  • --require-num-children <INT>: Require family to have this many children. Default: 0.
  • --region <STR>: Restrict to loci in this region (chrom:start-end).
  • --period <INT>: Restrict to loci with this repeat unit length.
  • --max-num-alleles <INT>: Filter loci with more than this many alleles. Default: 25.

HipSTR-specific filters:

  • --min-span-coverage <INT>: Discard calls with less than this many spanning reads, based on the VCF FORMAT:MALLREADS field. Default: 0.
  • --min-supp-reads <INT>: Discard calls with an allele supported by less than this many reads, based on the VCF FORMAT:MALLREADS field. Default: 0.

GangSTR-specific filters:

  • --min-num-encl-child <INT>: Require this many enclosing reads supporting de novo allele. Based on the VCF FORMAT:ENCLREADS field. Default: 0.
  • --max-num-encl-parent <INT>: Discard if more than this many enclosing reads support the de novo allele in parent. Based on the VCF FORMAT:ENCLREADS field. Default: 1000.
  • --max-perc-encl-parent <FLOAT>: Discard if more than this percentage enclosing reads support de novo allele in parent. Based on the VCF FORMAT:ENCLREADS field. Default: 1.0.
  • --min-encl-match <FLOAT>: Discard if fewer than this percentage of enclosing reads match the call. Based on the VCF FORMAT:ENCLREADS and FORMAT:REPCN fields. Default: 0.0.
  • --min-total-encl <INT>: Require this many total enclosing reads in each sample. Based on the VCF FORMAT:ENCLREADS field. Default: 0.

MonSTR will only process a trio of all samples (mother, family, and child) pass all specified filters.

Specifying custom mutation models

By default, MonSTR uses a statistical model to compute a posterior probabilty of mutation at each TR in each child ("Model-based calling" below). You may also use a naive method ("Naive calling") to identify mutations.

Model-based calling

MonSTR computes a a joint likelihood across parent and child genotypes for all possible genotype configurations compatible with either Mendelian inheritance or a mutation occurring. It then uses these likelihoods to compute the posterior probability of a mutation. Besides genotype likelihoods, which are computed by GangSTR or HipSTR, there are three additional components to the mutation model used:

  • Prior genotype probabilities, which give the probability to see each possible genotype in the parents. By default, a uniform prior is used, assuming each genotype has equal probability. This not a very realistic prior, but is simple, and we have found using more complex genotype priors makes little to no difference in the results. To instead base prior genotype probabilities on allele frequencies computed from samples in your VCF, use the option --use-pop-priors. This option is only supported for HipSTR input.

  • Transition probabilities, which give the probability of mutating from one allele size to another. Here allele size always refers to the number of repeat units. If using GangSTR, or HipSTR sequence-based alleles, MonSTR will assume a uniform probability to mutate from one allele to each other possible alleles. If using HipSTR with the --combine-alleles-by-length option, you can additionally specify more complex transition parameters, based on mutation parameters described in Gymrek, et al. 2017.

    • GeomP describes the parameter for the geometric step size distribution. If this is equal to 1, all steps are a single unit. If it is less than 1, larger step sizes are allowed. --default-geomp <FLOAT> can be used to set a global GeomP value (default: 1.0).
    • Beta describes a length dependent directionality bias. If Beta=0, then there is no bias. If Beta > 0, long alleles (relative to a "central" allele) tend to contract and short alleles tend to expand. --default-beta <FLOAT> and --default-central <INT> can be used to specify a global Beta and central allele value. (defaults Beta=0.0, central allele=0). The central allele is given in terms of the number of repeats relative to the reference genome.

Alternatively you can set per-locus parameters as described below.

  • Prior mutation probabilities, which give the per locus per generation mutation rate of a TR. These are used to compute posterior probabilities of mutation. You can set a global log10 mutation rate to be used for all loci with --defult-prior <FLOAT> (default -3). Alternatively you can set per-locus parameters as described below.

In addition to providing global mutation parameters, you may also specify a file with per-locus parameters using the option --mutation-models <FILE>. This file should be whitespace delimited with a single line per locus and no header line. The file should have the following columns present in each line (in order): chromosome, TR start position, TR end position, repeat unit length (bp), log10 mutation rate, beta, geomp, central_allele. The start coordinate must match exactly to the VCF INFO:START field in the HipSTR VCF for each locus.

Naive calling

In addition to model-based mutation calling, MonSTR implements a naive mutation detection algorithm.

  • Specify the --naive option to simply check if calls follow Mendelian inheritance. In this case the posterior probabilities for mutations are always set to 1, else 0 for no mutation.

  • Specify the option --naive-expansions-frr <int1,int2> to implement a naive calling method to identify candidate expansions. This looks for <int1> FRR (fully repetitive reads) in the child and 0 in the parent. If there are no FRR reads present, look for loci with at least <int2> flanking reads in the child greater than the largest parent allele. This only works with GangSTR input. To see a full description of FRR and flanking reads, see the GangSTR documentation and publication. This option uses information from the VCF FORMAT:FLNKREADS and FORMAT:RC fields. Posteriors for candidate expansions are set to -1.

Chromosome X

If analyzing mutations on the X chromosome, specify the --chrX option. This assumes the entire VCF is for chromosome X. Sex information must be available for children. You can use either model-based or naive mutation calling on chromosome X.

If using --chrX, it is recommended that males have haploid calls. If you are using GangSTR, you can use --bam-samps and --samp-sex to perform sex-aware calling of sex chromosomes.

Output formats

MonSTR outputs files <OUTPREFIX>.all_mutations.tab and <OUTPREFIX>.locus_summary.tab (<OUTPREFIX> is specified by the --out option). The columns in these files are described below.

The following options also control which loci/mutations are output:

  • --posterior-threshold <FLOAT>: Posterior probability threshold to call a mutation. Default: 0.9.
  • --include-invariant: Output info for loci even if there are no alternate alleles present. By default, invariant loci are excluded.
  • --output-all-loci: Output all calls to the all_mutations output file, regardless of posterior score. This can be useful if you want to analyze properties as a function of posterior threshold, or want to decide later on an appropriate posterior threshold.

<OUTPREFIX>.all_mutations.tab

This file contains one line per TR per trio. It contains the following columns:

Column header Description
chrom The chromosome of the TR
pos The postion of the TR
period The length of the repeat unit (in bp)
prior The prior probability of mutation
family The family ID of the trio (taken from the .fam file)
child The sample ID of the child in the trio
phenotype The phenotype of the child, taken from the .fam file
posterior The posterior probability of mutation in this child at this TR locus. If using naive calling, this is always set to 1 for mutations, else 0. If using naive expansion detection, this is set to -1 for candidate expansions.
newallele The allele arising from a de novo mutation (number of repeat units). This field is only meaningful if there was a mutation inferred.
mutsize The size of the mutation (number of repeat units). This field is only meaningful if there was a mutation inferred.
inparents Set to 1 if the new allele is found in the parents, else 0. This field is only meaningful if there was a mutation inferred.
poocase Gives the inferred parent of origin of the mutation, if a mutation was inferred. 0: unknown; 1: Mendelian (no denovo); 2: New allele from father; 3: New allele from mother; 4: Unclear
isnew Set to 1 if the new alleles is not found in any control samples in the input VCF. Otherwise set to 0. This field is only meaningful if there was a mutation inferred.
case_count Count of the new allele in cases (phenotype=2). This field is only meaningful if there was a mutation inferred.
ctrl_count Count of the new allele in controls (phenotype=1). This field is only meaningful if there was a mutation inferred.
unk_count Count of the new allele in samples with unknown phenotype (phenotype=0). This field is only meaningful if there was a mutation inferred.
child_gt Genotype of the child sample.
mat_gt Genotype of the mother.
pat_gt Genotype of the father.
encl_child Number of enclosing reads of the new allele in the child. This field is only meaningful if there was a mutation inferred.
encl_mother Number of enclosing reads of the new allele in the mother. This field is only meaningful if there was a mutation inferred.
encl_father Number of enclosing reads of the new allele in the father. This field is only meaningful if there was a mutation inferred.
encl_parent Number of encloseing reads of the new allele in the parent the mutation came from (see poocase).
long_mother Indicates whether the longer vs. shorter allele was transmitted from the mother. This is only relevant if the mother is heterozygous. 0=unknown; 1=shorter; 2=longer.
long_father Indicates whether the longer vs. shorter allele was transmitted from the father. This is only relevant if the father is heterozygous. 0=unknown; 1=shorter; 2=longer.

<OUTPREFIX>.locus_summary.tab

This file contains a single line per TR. It contains the following columns:

Column header Description
chrom The chromosome of the TR
pos The postion of the TR
period The length of the repeat unit (in bp)
ref_allele_size The number of repeats in the reference genome
num_alleles_bylength The number of possible alleles at the locus, collapsing all alleles with the same length.
num_alleles_byseq The number of possible alleles at the locus. For GangSTR, this will always be the same as num_alleles_by_length since it only considers length differences.
het_by_seq Heterozygosity of the locus
total_children The total number of children (trios) analyzed at the locus.
total_mutations The total number of mutations called at the locus.
total_mutation_rate The estimated mutation rate at the locus, based on the number of observed mutations and transmissions.
affected_children The total number of affected children (trios) analyzed at the locus.
affected_mutations The total number of mutations called at the locus in affected samples.
affected_new_mutations The total number of mutations called at the locus in affected samples resulting in alleles unobserved in the rest of the cohort.
affected_mutation_rate The estimated mutation rate at the locus, based on the number of observed mutations and transmissions and considering only affected samples.
unaffected_children The total number of unaffected children (trios) analyzed at the locus.
unaffected_mutations The total number of mutations called at the locus in unaffected samples.
unaffected_new_mutations The total number of mutations called at the locus in unaffected samples resulting in alleles unobserved in the rest of the cohort.
unaffected_mutation_rate The estimated mutation rate at the locus, based on the number of observed mutations and transmissions and considering only unaffected samples.
p-value P-value based on a Fisher's exact test testing if mutations are enriched in affecteds vs. unaffecteds.
children_with_mutations Comma-separated list of children with mutations. Format of each child is famid:phenotype:newallele:controlcount,casecount,unknowncount.
aff_tdt Keep track of how often the long vs. short allele were transmitted from mother and father in affected children. Format is: counts for 0=unknown/1=short/2=long mother:father. This summarizes counts from long_mother and long_father in the all mutations file.
unaff_tdt Keep track of how often the long vs. short allele were transmitted from mother and father in unaffected children. Format is: counts for 0=unknown/1=short/2=long mother:father. This summarizes counts from long_mother and long_father in the all mutations file.

Additional scripts

See scripts/ for additional scripts:

  • qc_denovos.py is a script for further filtering the output of <OUTPREFIX>.all_mutations.tab, for example to remove loci, families or children with an outlier number of total mutations. Type python3 scripts/qc_denovos.py to see full usage information.

Cite

If you use this method for your research, please cite: Mitra, et al. 2020 https://www.biorxiv.org/content/10.1101/2020.03.04.974170v1