Exome2Proteome

Converts exome sequencing data into predicted proteins in FASTA format. It does so while retaining *phase* (cis variation).


How to convert whole genomes (or exomes) into proteomes! The general flow of information is to take raw BAMs (mapped to the GRCh38 reference genome) and:

  • Call variants (currently with GATK4; there is a boatload of steps involved here, sketched after this list. We plan on moving to DeepVariant)
    • Process bams by
      • marking duplicates
      • recalibrating base qualities (BQSR)
    • Call SNPs (GATK's HaplotypeCaller)
      • Identify true variants
        • GATK's VQSR
  • Phase variants
    • Physically (WhatsHap)
      • This is important for SNP variants that are close to each other!
    • Statistically (shapeit4)
  • Predict haploid protein sequences (bcftools csq)
    • Produce fasta sequences from these predictions (custom)
    • And index these fasta files with a suffix array (protengine)
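
A rough sketch of what the variant-calling bullet boils down to, per sample (hedged: the real bin/bam2vcf.sh may drive the Picard steps via PicardCommandLine, and it adds joint genotyping and VQSR on top; filenames such as GRCh38.fa and dbsnp.vcf.gz are placeholders for files shipped in the data bundle):

  # Not the real script: just the shape of the GATK4 steps, shown for chr1
  gatk MarkDuplicates -I sample.bam -O sample.md.bam -M sample.md.metrics.txt
  gatk BaseRecalibrator -I sample.md.bam -R GRCh38.fa --known-sites dbsnp.vcf.gz -O sample.recal.table
  gatk ApplyBQSR -I sample.md.bam -R GRCh38.fa --bqsr-recal-file sample.recal.table -O sample.bqsr.bam
  gatk HaplotypeCaller -R GRCh38.fa -I sample.bqsr.bam -L chr1 -ERC GVCF -O sample.chr1.g.vcf.gz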

Installation (*nix only)

This involves downloading a boatload of genomic data (via wget), as well as a poor man's Make step to embed where the data live in the scripts.

git clone https://github.com/Ahhgust/Exome2Proteome.git && cd Exome2Proteome
wget --quiet -O data.tbz 'https://www.dropbox.com/s/jt9doz7ixmovhik/DataBundle_Genomes2Proteome.tbz?dl=1'
tar -xf data.tbz && rm data.tbz
(echo data=\"$PWD/data\"; echo bin=\"$PWD/bin\" ; tail -n +3 bin/bam2vcf.sh) > foo && mv -f foo bin/bam2vcf.sh
(echo data=\"$PWD/data\"; echo bin=\"$PWD/bin\" ; tail -n +3 bin/vcf2prot.sh) > foo && mv -f foo bin/vcf2prot.sh
chmod +x bin/*
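
If that worked, the first two lines of each script now record where the data and binaries live; a quick (optional) sanity check:

  head -n 2 bin/bam2vcf.sh bin/vcf2prot.sh   # should print data= and bin= lines pointing under your checkout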

Quick start

Dependency check

Run:

  bin/dependency_check.sh

This is a case of silence is golden: if the script prints nothing and exits cleanly, then all of the programs needed to convert exomes to proteomes have been installed. I leave it up to the user to ensure that the programs are reasonably up to date (hint hint: you may need to update bcftools).
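
For the curious, the check is nothing more exotic than looking at your $PATH; a minimal stand-in (not the actual script) looks something like:

  # Report anything that is not on the PATH
  for prog in gatk PicardCommandLine bcftools shapeit4 whatshap bwa samtools python3 R; do
      command -v "$prog" >/dev/null 2>&1 || echo "missing: $prog" >&2
  done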

Bams -> unphased VCFs

Convert BAMs to (unphased) VCFs, GATK4-style. Assumptions: you're in a directory with 1+ BAMs in it (and nothing else). Each BAM file must correspond to a single sample. Read groups must be set and valid. Each BAM must also be aligned to the GRCh38 reference genome; this means UCSC-style chromosome naming (e.g., chr1, not 1). Only the canonical autosomes (chr1-chr22) are used.
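
A quick way to spot-check those assumptions on any one BAM (the filename is a placeholder; samtools is a listed dependency):

  samtools view -H your.bam | grep '^@RG'          # at least one read group, with a sample (SM) tag
  samtools view -H your.bam | grep '^@SQ' | head   # sequence names should be UCSC-style (chr1, chr2, ...)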

For whole exome data:

  bin/bam2vcf.sh '*.bam' 23 &> variantcalling.outerr 

Uses up to 23 cores*

(actually that's a lie, Picard auto-magically uses way more than that, but it's not a parameter you can set)

For whole genome data:

  bin/bam2vcf.sh '*.bam' 26 G &> variantcalling.outerr 

Which (you guessed it) uses up to 26 cores.

Unphased VCFs should be in the directory vcfout/ and have names like E2P.1.recalibrated.vcf.gz (meaning Exome 2 Proteome, chromosome 1, VQSR-recalibrated).
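
Two hedged sanity checks on those outputs (bcftools is already a dependency):

  bcftools query -l vcfout/E2P.1.recalibrated.vcf.gz             # the expected sample names?
  bcftools stats vcfout/E2P.1.recalibrated.vcf.gz | grep '^SN'   # summary counts (records, SNPs, indels)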

Unphased VCFs -> phased VCFs

Does physical phasing (whatshap) followed by statistical phasing (shapeit4), then recovers sites dropped by shapeit4 in the statistical phasing step. Type:

  bin/vcf2phase.sh 23 &> phasing.outerr 

This uses up to 23 cores. The final VCFs are written to finalvcf/ (one VCF per chromosome). A directory called phasedvcf/ is also made; it has temporary files (each chromosome x each person) and is only necessary for debugging.
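
Under the hood, the per-chromosome work is roughly the following (a sketch only: the genetic-map filename and the --use-PS handoff of whatshap's phase sets into shapeit4 are assumptions about how the script wires the two tools together; filenames are placeholders):

  # physical phasing from the reads, then statistical phasing
  whatshap phase --reference GRCh38.fa -o phys.chr1.vcf vcfout/E2P.1.recalibrated.vcf.gz sample.bam
  bgzip phys.chr1.vcf && tabix -p vcf phys.chr1.vcf.gz
  shapeit4 --input phys.chr1.vcf.gz --map chr1.gmap.gz --region chr1 \
           --use-PS 0.0001 --output phased.chr1.vcf.gz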

Phased VCF -> Proteomes in fasta

  bin/phasedvcf2prot.sh 23 &> proteome.outerr 
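
The heavy lifting of the protein prediction is bcftools csq (haplotype-aware consequence calling). A hedged sketch of that single call, with placeholder file names; the custom FASTA-writing and protengine indexing steps then run on its output:

  bcftools csq -f GRCh38.fa -g Homo_sapiens.GRCh38.gff3.gz phased.chr1.vcf.gz -Oz -o chr1.csq.vcf.gz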

Dependencies (programs) (way too many! *nix only)

All dependencies must be in your $PATH (i.e., typing any of the following should just work):

  • gatk (v4; must have HaplotypeCaller)
  • PicardCommandLine (picard)
  • bcftools
  • shapeit4
  • whatshap
  • bwa
  • samtools
  • python3
  • R

Additionally, the following R packages are needed for generating an in silico trypsin digest:

  • tools
  • R.utils
  • optparse
  • cleaver
  • Peptides
  • stringr
  • stringi
  • seqinr
  • yaml
  • data.table
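
One (hedged) way to install those from the shell: tools ships with base R, cleaver comes from Bioconductor, and the rest are on CRAN:

  Rscript -e 'install.packages(c("R.utils","optparse","Peptides","stringr","stringi","seqinr","yaml","data.table"), repos="https://cloud.r-project.org")'
  Rscript -e 'if (!requireNamespace("BiocManager", quietly=TRUE)) install.packages("BiocManager", repos="https://cloud.r-project.org"); BiocManager::install("cleaver")'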