/HAFpipe-line

calculate haplotype-derived allele frequencies for pool-seq samples

Primary LanguageShell

HAF-pipe 🏂

HAF-pipe is a bash- and python-based pipeline to calculate haplotype-inferred allele frequencies from pool-seq data and founder SNP profiles.

Paper

This tool is described in:

"Accurate Allele Frequencies from Ultra-low Coverage Pool-Seq Samples in Evolve-and-Resequence Experiments."
Tilk S, Bergland A, Goodman A, Schmidt P, Petrov D, Greenblum S.
G3: Genes|Genomes|Genetics, 9(12), 4159 LP - 4168, 2019. doi:10.1534/g3.119.400755

This work heavily relies on HARP. Hence, please do not forget to cite this as well:

"Maximum Likelihood Estimation of Frequencies of Known Haplotypes from Pooled Sequence Data."
Kessner D, Turner T, Novembre J.
Molecular Biology and Evolution 30 (5): 1145–58, 2013. doi:10.1093/molbev/mst016

When using the "npute" imputation method, please also cite their work; see the reference in the NPUTEv1 directory here. For further references, please see above papers.

Dependencies

  1. HARP (download from https://bitbucket.org/dkessner/harp)
  2. Python 3 (including numpy for the 'npute' method of Task 2)
  3. tabix and bgzip (http://www.htslib.org/download/)

Tasks

  • 1 - make SNP table from VCF
  • 2 - impute SNP table, if needed
  • 3 - infer haplotype frequencies
  • 4 - calculate allele frequencies

Usage

HAFpipe_wrapper.sh
    [ -t --tasks ]          comma-separated list of tasks to run
    [ -l --logfile ]        name of file to write log to
                                - default: HAFpipe-log.`date +%Y-%m-%d.%H%M%S`
    [ -d --maindir   ]      directory in which HAF-pipe is located; tasks:1,2,3,4
                                - default: directory of the main script
    [ -o --outdir ]         output directory; tasks:1,3,4
                                - default: current directory
    [ -v --vcf ]            a multi-sample VCF file of each individual founder in the starting population; tasks:1
                                - will be converted to a snp table
    [ -c --chrom ]          name of chromosome to extract from vcf; tasks:1
    [ -s --snptable ]       snp table to use for calculating haplotype and allele frequencies; tasks:1,2,3,4
                                - columns are: position, ref allele, followed by genotypes of the each individual founder
                                - See 'simulations/99.clean.SNP.HARP.segregating.gz' included in this repository for examples
                                - will be overwritten if task 1 is run in conjunction with other tasks
    [ -k --keephets ]       flag to code heterozygous calls with an IUPAC ambiguous base code (instead of N); tasks:1
    [ -u --subsetlist ]     name of a file with a single column of founder names to keep in the snp table; tasks:1
                                - must match sample names in the VCF
    [ -m --mincalls ]       minimum number of called genotypes to include a site in the SNP table; tasks:1
                                - default: 2
    [ -i --impmethod ]      method to use for imputation in task 2 or file extension for task 3; tasks:2,3
                                - for task 2, method must be one of:
                                - 'simpute' (simple imputation)
                                - 'npute' (see Roberts et al., 2007 - doi:10.1093/bioinformatics/btm220 )
                                - for task 3, method can be:
                                - 'none' or not specified (default; the original [snptable] with potential missing calls will be used to infer haplotype frequencies)
                                - any other string (will look for the file [snptable].[method] to infer haplotype frequencies)
    [ -n --nsites ]         number of neighboring sites to use with 'npute' imputation; tasks:2
                                - default: 20
    [ -b --bamfile ]        name of bamfile generated by pooled sequencing of sampled chromosomes from an evolved population; tasks:3,4
    [ -r --refseq ]         reference sequence in FASTA format; tasks:3
    [ -e --encoding ]       base quality encoding in bam files; tasks:3
                                - 'illumina' (Phred+64; default)
                                - 'sanger' (Phred+33)
    [ -g --generations ]    number of generations of recombination; used to calculate window size for haplotype inference; tasks=3
    [ -a --recombrate ]     recombination rate used to calculate window size for haplotype inference; tasks=3
                                - default: .0000000239 (avg D. melanogaster recomb rate)
    [ -q --quantile ]       quantile of expected unrecombined segment distribution to use for determining haplotype inference window size; tasks:3
                                - default: 18
    [ -w --winsize ]        user-defined window size (in kb) for haplotype inference; tasks:3
                            #(overrides -g and -a)
    [ -h help ]             show this help screen

Example

Creates a SNP table from a VCF of founder genotypes and uses it to estimate HAFs from a pooled sequencing sample.
In this example, inferred haplotype frequencies will be written to exptData/sampledInds_cage2_gen15.mapped_r5.39.2L.freqs and HAFs will be written to exptData/sampledInds_cage2_gen15.mapped_r5.39.2L.afSite

HAFpipe_wrapper.sh -t 1,2,3,4 \
-v refData/founderGenotypes.vcf \
-c 2L \
-s refData/founderGenotypes.segregating.snpTable \  ## will be created
-i simpute \
-b exptData/sampledInds_cage2_gen15.mapped_r5.39.bam \
-e sanger \
-g 15 \
-r refData/dmel_ref_r5.39.fa \
-o exptData

Full Pipeline

HAF-pipe takes per-sample bam files as input. Typically however, analysis projects start from the raw fastq files produced by sequencing the samples. If you are looking for an easy way to process these into bam files, and then run HAF-pipe on them, please have a look at the grenepipe pipeline. It offers a specialized workflow that trims and maps the reads, optionally runs quality filters and quality control on the data, and finally infers allele frequencies with HAF-pipe.

Simulations

To simulate recombination and pooled sequencing data from a panel of sequenced founders, see the scripts HAFpipe-sim.run_forqs.sh and HAFpipe-sim.simulate_poolseq.sh in the simulations folder.

  • You can use the first script to simulate recombination for an experimental population of any size initiated from any sequenced founder set for any number of generations with any number of selected sites with any selection coefficient.
  • You can use the second script to 'sample' any number of individuals from the recombined population at any generation and simulate pooled sequencing (150bp paired-end reads) of these individuals at any coverage with any sequencing error rate. This script will also record the true allele frequencies of all segregating sites in the set of sampled individuals. Compare HAFs calculated from the simulated sequencing data to these true allele frequencies to assess effective coverage.

(Note that more generations, larger pool sizes, and higher coverage values will all increase run time)

Effective Coverage Calculator

To calculate the accuracy (effective coverage) of HAFs according to specified experimental design choices in E + R experiments, see our shiny app: https://ec-calculator.shinyapps.io/shinyapp/