How to convert whole genomes (or exomes) into proteomes! The general flow of information is to take raw bams (Mapped to the GRCh38 reference genome) and
- Call variants (GATK4 currently. There is a boatload of steps involved here. We plan on moving to DeepVariant)
- process bams by
- marking duplicates
- recalibration base qualities (BQSR)
- Call snps (GATK's haplotypecaller)
- Identify true variants
- GATK's VQSR
- Identify true variants
- process bams by
- Phase variants
- Physically (WhatsHap)
- This is important for SNP variants that are close to each other!
- Statistically (shapeit4)
- Physically (WhatsHap)
- Predict haploid protein sequences (bcftools csq)
- Produce fasta sequences from these predictions (custom)
- An index these fasta files with a suffix array (protengine)
This involves downloading a boatload of genomic data (wget). As well as a poor-man's Make to embed where the data live.
git clone https://github.com/Ahhgust/Exome2Proteome.git && cd Exome2Proteome
wget --quiet -O data.tbz 'https://www.dropbox.com/s/jt9doz7ixmovhik/DataBundle_Genomes2Proteome.tbz?dl=1'
tar -xf data.tbz && rm data.tbz
(echo data=\"$PWD/data\"; echo bin=\"$PWD/bin\" ; tail -n +3 bin/bam2vcf.sh) > foo && mv -f foo bin/bam2vcf.sh
(echo data=\"$PWD/data\"; echo bin=\"$PWD/bin\" ; tail -n +3 bin/vcf2prot.sh) > foo && mv -f foo bin/vcf2prot.sh
chmod +x bin/*
Run:
bin/dependency_check.sh
This is a case of silence is golden-- if the script prints nothing and exits cleanly, then all of the programs needed to convert exomes to protomes have been installed. I leave it up to the user to ensure that the programs are reasonably up-to-date (hint hint you may need to update bcftools)
Convert bam to (unphased) VCF. GATK4-style. Assumptions: you're in a directory with 1+ bams in it (and nothing else). Each BAM file must correspond to a single sample. Read-groups must be set and valid Each BAM must also be aligned to the GRCh38 reference genome. This means UCSC-style chromosome naming (e.g., chr1, not 1). Only the canonical autosomes (chr1-chr22 are used)
For whole exome data:
bin/bam2vcf.sh '*.bam' 23 &> variantcalling.outerr
Uses up to 23 cores*
(actually that's a lie, Picard auto-magically uses way more than that, but it's not a parameter you can set)
For whole genome data:
bin/bam2vcf.sh '*.bam' 26 G &> variantcalling.outerr
Which (you guessed it) uses up to 26 cores.
Unphased VCFs should be in the directory vcfout/
And have names like E2P.1.recalibrated.vcf.gz
(meaning exome 2 proteome, chromosome 1, VQSR recalibrated)
Does physical phasing (whatshap) followed by statistical phasing (shapeit4). Recovers sites lost (dropped by shapeit4) in the statistical phasing step. Type:
bin/vcf2phase.sh 23 &> phasing.outerr
This uses up to 23 cores. The final vcfs are written to ``finalvcf/(one vcf per chromosome) A directory called
phasedvcf/``` is also made; it has temporary files (each chromosome x each person) and is only necessary for debugging.
bin/phasedvcf2prot.sh 23 &> proteome.outerr
All dependencies must be in your $PATH (ie, type any of the following , and it works)
- gatk (v4. Must have HaplotypeCaller))
- PicardCommandLine (picard)
- bcftools
- shapeit4
- whatshap
- bwa
- samtools
- python3
- R
Additionally, the following R packages are needed for generating an in silico trypsin digest:
- tools
- R.utils
- optparse
- cleaver
- Peptides
- stringr
- stringi
- seqinr
- yaml
- data.table