/SHAVE2

SHort-read Alignment Pipeline for VEctors - v.2 with GATK's HaplotypeCaller

Primary LanguageJupyter NotebookGNU Affero General Public License v3.0AGPL-3.0

SHAVE2: SHort-read Alignment pipeline for VEctor v.2 - with HaplotypeCaller variant caller - SLURM version

Developper Maintener MacOSX GNU/Linux Maintened Wiki Open Source GNU AGPL v3 Github Bash Python Snakemake Conda

~ ABOUT ~

SHAVE2 is a bioinformatic pipeline used for mosquitoes (Aedes / Anopheles) genome alignments from Illumina short reads, based on GATK Best Practices, except for the BQSR and VQSR steps.

In brief, SHAVE2 remove adatpers, report quality reads, aligns reads to a reference genome, fix incorrect mates, mark duplicates, add indel qualities to BAM files and call variants and genotypes.

Note about Variant and Genotype Calling

Indel realignment was dropped by Broad Institute about three years ago, as they found that this step was no longer useful when the variant calling was done with HaplotypeCaller or Mutect2, which implement a more sophisticated and effective form of realignment. see here. SHAVE2 use HaplotypeCaller in ERC mode to call variants and obtain the genotype likelihoods. Then, GenomicsDBImport import single-sample GVCFs into GenomicsDB before GenotypeGVCFs perform a joint genotyping on samples pre-called with HaplotypeCaller. The last major step is a fully customizable hard-filtering, using GATK VariantFiltration instead of Variant Qualtity Score Recalibration (VQSR).

Note about BQSR and VQSR

Base Quality Score Recalibration step needs as input a known variation VCF file, refering to the Ensembl-Variation database or dbSNP database who stores areas of genome that differ between individual genomes ("variants"). However, we do not have any prior list of known variants for our mosquito species, that's why we cannot do BQSR.

But the value of BQSR is increasingly being questioned as mappers and callers are typically updated. As an example: using HaplotypeCaller instead of UnifiedGenotyper greatly improves the handling of indels.

Variant Quality Score recalibration is probably the hardest part of the Best Practices to get right according to Broad Institute. In a nutshell, it is a sophisticated filtering technique applied on the variant callset that uses machine learning to model the technical profile of variants in a training set and uses that to filter out probable artifacts from the callset.

The key point is that it use known, highly validated variant resources (omni, 1000 Genomes, hapmap) to select a subset of variants within our callset that we’re really confident are probably true positives (the training set). Unfortunately, no highly validated variant resource is available for mosquitoes at this time, so we decided to apply hard-filtering and leave the choice of parameters to the user.

Written for MOVE-ADAPT project.

Features

  • Control reads quality (multiQC html report) and clean it
  • Align reads (bam files),
  • Mark duplicates to BAM files,
  • Add MD and NM tag to BAM files,
  • Fix mates,
  • BAM file validation according to SAM/BAM specifications,
  • Variants calling (vcf files),
  • Genotyping,
  • VCF compression,
  • VCF indexing

Version

V2.2023.07.13

Directed Acyclic Graph

~ INSTALLATIONS ~

SHAVE2

Clone (HTTPS) the SHAVE2 repository on github:

git clone git@github.com:ltalignani/SHAVE2.git

Difference between Download and Clone:

  • To create a copy of a remote repository’s files on your computer, you can either Download or Clone the repository
  • If you download it, you cannot sync the repository with the remote repository on GitHub
  • Cloning a repository is the same as downloading, except it preserves the Git connection with the remote repository
  • You can then modify the files locally and upload the changes to the remote repository on GitHub
  • You can then update the files locally and download the changes from the remote repository on GitHub
git pull --verbose

Then, upload the SHAVE2/ archive on your slurm cluster.

~ USAGE ~

  1. Copy your paired-end reads in .fastq.gz format files into: .raw/ directory
  2. Execute the sbatch Start_shave2_slurm.sh bash script to run the SHAVE2 pipeline
sbatch Start_shave2_slurm.sh

Yours analyzes will start with default configuration settings

Option-1: Edit config.yaml file in ./config/ directory
Option-2: Edit fastq-screen.conf file in ./config/ directory

~ RESULTS ~

Yours results are available in ./results/ directory, as follow:
(file names keep track which tools / params was used: <sample>_<aligner>_<mincov>)

00_Quality_Control

File Object
fastq-screen raw reads putative contaminations reports for each samples, in html, png and txt formats
fastqc raw reads quality reports for each samples, in html and zip formats
multiqc fastq-screen and fastqc results agregation report for all samples, in html format
qualimap facilitate the quality control of alignment sequencing data and its derivatives like feature counts, in html format
validatesamfile This tool reports on the validity of a SAM or BAM file relative to the SAM format specification, in txt format

01_Trimming

File Object
trimmomatic/ directory paired reads, without adapters and quality trimmed, in fastq.gz format

02_Mapping

File Object
mark-dup.bam read alignments, in bam format (can be visualized in, i.e. IGV)
mark-dup.bai bam indexes bai use in i.e. IGV with ./resources/genomes/AalbF3.fasta
markdup_metrics.txt bam metrics created by Picard MarkDuplicates

04_Polishing

File Object
md_fixed.bam polished bam files , in bam format
md_fixed.bam.bai polished bam indexes, in bai format

05_Variants

File Object
variant-call.g.vcf SNVs and Indels calling in gvcf format
variant-call.g.vcf.idx SNV and Indels calling in idx format
genomicdb g.vcfs merged by GenomicsDBImport for GenotypeGVCFs
genotyped.vcf SNVs genotyped in vcf format
genotyped.vcf.idx genotype index in idx format (automatically generated by GenotypeGVCFs)
combinedGVCF.vcf.gz alternative to GenomicsDBImport, is a merging of GVCFs that can eventually be input into GenoTypeGVCFs
GenotypeGVCFs.hf.vcf.gz genotyped SNVs hard-filtered, in vcf format (default config: tempdir, removed, save disk usage)_

10_graphs

File Object
dag directed acyclic graph of jobs, in pdf and png formats
rulegraph dependency graph of rules, in pdf and png formats (less crowded than above DAG of jobs, but also show less information)
filegraph dependency graph of rules with their input and output files in the dot language, in pdf and png formats (an intermediate solution between above DAG of jobs and the rule graph)

11_Reports

  • All non-empty log for each tool and each sample
  • files_summary.txt: summary of all files created by the workflow, in txt format
    (columns: filename, modification time, rule version, status, plan)

~ CONFIGURATION ~

If needed, see or edit default settings in config.yaml file in ./config/ directory

RESOURCES

samples and units aren't necessary for the moment (update in progress)

MODULES

Modules that will be loaded for each tool. Change the path to correspond to your cluster.

REFERENCE

  • ref_name: reference name used for genome mapping (default config: 'Anopheles-gambiae-PEST_CHROMOSOMES_AgamP4')
  • path: path to genome reference
  • reference: reference sequence path used for genome mapping (default config: 'AGAMP4')
  • index: reference sequence index used for genome mapping (default config: 'Anopheles-gambiae-PEST_CHROMOSOMES_AgamP4.fa.fai')
  • dictionary: reference sequence dictionary used by GATK's tools (default config: 'Anopheles-gambiae-PEST_CHROMOSOMES_AgamP4.dict')

PROCESSING

  • tmpdir: temporary directory (default config: '$TMPDIR')

GATK

  • genomicsdbimport: genomicsdbimport flags (default config: "")
  • genotypegvcfs: genotypegvcfs flags (default config: '-include-non-variant-sites')
  • haplotypecaller: haplotypecaller flags (default config: '-ERC GVCF --output-mode EMIT_ALL_CONFIDENT_SITES')

KNOWN VARIATION SITES

  • alleles_target: path to VCF containing known varition sites (default config: "--alleles xxxx"). Replace "xxxx" by path to the VCF containing known variation site.

BWA###

  • path: path the BWA indexes (default config: 'resources/indexes/bwa/')

FASTQ-SCREEN

  • config: path to the fastq-screen configuration file (default config: ./config/fastq-screen.conf)
  • subset: do not use the whole sequence file, but create a temporary dataset of this specified number of read (default config: '1000')
  • aligner: specify the aligner to use for the mapping. Valid arguments are 'bowtie', bowtie2' or 'bwa' (default config: 'bwa')

VARANTFILTRATION

  • vqsr: not coded at this time
  • hard: settings for VCFs hardfiltering (default config:myfilter": "QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0)

Glossary

  • BAM: Binary Alignment Map
  • BAI: BAM Indexes
  • FASTA: Fast-All
  • FASTQ: FASTA with Quality
  • FAI: FASTA Indexes
  • SAM: Sequence Alignment Map

Directories tree structure

πŸ–₯️️ Start_shave2_slurm.sh
πŸ“š README.md
πŸ“š LICENSE
πŸ“‚ Cluster_logs/
πŸ“‚ visuals/
 └── πŸ“ˆ DAG.png
πŸ“‚ config/
 β”œβ”€β”€ βš™οΈ config.yaml
 └── βš™οΈ fastq-screen.conf
πŸ“‚ raw/
 β”‚    β”œβ”€β”€ πŸ›‘οΈ .gitkeep
 β”‚    β”œβ”€β”€ πŸ“¦ ERR3343471_R1.fastq.gz
 β”‚    └── πŸ“¦ ERR3343471_R2.fastq.gz
πŸ“‚ resources/
 β”œβ”€β”€ πŸ“‚ genomes/
 β”‚    └── 🧬 AalbF3.fasta
 β”œβ”€β”€ πŸ“‚ indexes/
 β”‚    β”œβ”€β”€ πŸ“‚ bwa/
 β”‚    β”‚    └── πŸ—‚οΈ AGAMP4
 β”œβ”€β”€ πŸ“‚ Adapters
 β”‚    β”œβ”€β”€ 🧬 nextera
 β”‚    β”œβ”€β”€ 🧬 truseq2-pe
 β”‚    β”œβ”€β”€ 🧬 truseq2-se
 β”‚    β”œβ”€β”€ 🧬 truseq3-pe
 β”‚    β”œβ”€β”€ 🧬 truseq3-pe2
 β”‚    └── 🧬 truseq3-se  
 β”‚
 β”œβ”€β”€ πŸ“‚ results/
 β”œβ”€β”€ πŸ“‚ slurm/
 β”‚    β”œβ”€β”€ βš™οΈ config.yaml
 β”‚    └── πŸ–₯️️ status-sacct.sh   
πŸ“‚ workflow/
 β”œβ”€β”€ πŸ“‚ envs/
 β”‚    β”œβ”€β”€ 🍜 bcftools-1.14.yaml
 β”‚    β”œβ”€β”€ 🍜 bedtools-2.30.0.yaml
 β”‚    β”œβ”€β”€ 🍜 bowtie2-2.4.4.yaml
 β”‚    β”œβ”€β”€ 🍜 bwa-0.7.17.yaml
 β”‚    β”œβ”€β”€ 🍜 cutadapt-3.5.yaml
 β”‚    β”œβ”€β”€ 🍜 fastq-screen-0.14.0.yam
 β”‚    β”œβ”€β”€ 🍜 fastqc-0.11.9.yaml
 β”‚    β”œβ”€β”€ 🍜 gatk-3.8.yaml
 β”‚    β”œβ”€β”€ 🍜 gatk-4.3.0.0.yaml
 β”‚    β”œβ”€β”€ 🍜 gawk-5.1.0.yaml
 β”‚    β”œβ”€β”€ 🍜 lofreq-2.1.5.yaml
 β”‚    β”œβ”€β”€ 🍜 multiqc-1.11.yaml
 β”‚    β”œβ”€β”€ 🍜 picard-2.27.4.yaml
 β”‚    β”œβ”€β”€ 🍜 samtools-1.14.yaml
 β”‚    └── 🍜 sickle-trim-1.33.yaml
 β”œβ”€β”€ πŸ“‚ report/
 β”œβ”€β”€ πŸ“‚ rules/
 β”‚    β”œβ”€β”€ πŸ“œ calling.smk
 β”‚    β”œβ”€β”€ πŸ“œ common.smk
 β”‚    β”œβ”€β”€ πŸ“œ filtering.smk
 β”‚    β”œβ”€β”€ πŸ“œ mapping.smk
 β”‚    β”œβ”€β”€ πŸ“œ polishing.smk
 β”‚    β”œβ”€β”€ πŸ“œ stats.smk
 β”‚    └── πŸ“œ vcf_stats.smk
 β”œβ”€β”€ πŸ“‚ schemas/     
 β”œβ”€β”€ πŸ“‚ scripts/
 β”‚    β”œβ”€β”€ 🍜 report_vcf.rmd
 β”‚    └── 🍜 report.rmd
 └── πŸ“œ snakefile.smk

~ SUPPORT ~

  1. Read The Fabulous Manual!
  2. Read de Awsome Wiki! (todo...)
  3. Create a new issue: Issues > New issue > Describe your issue
  4. Send an email to loic.talignani@ird.fr

~ ROADMAP ~

  • Add a wiki!

~ AUTHORS & ACKNOWLEDGMENTS ~

  • LoΓ―c TALIGNANI (Developer and Maintener)
  • I would like to thanks Sebastien Ravel, CIRAD ingeneer, for his great help for the variant calling part.

~ CONTRIBUTING ~

Open to contributions!
Testing code, finding issues, asking for update, proposing new features...
Use Git tools to share!

~ PROJECT STATUS ~

This project is regularly updated and actively maintened
However, you can be volunteer to step in as developer or maintainer

For information about main git roles:

  • Guests are not active contributors in private projects, they can only see, and leave comments and issues
  • Reporters are read-only contributors, they can't write to the repository, but can on issues
  • Developers are direct contributors, they have access to everything to go from idea to production
    Unless something has been explicitly restricted
  • Maintainers are super-developers, they are able to push to master, deploy to production
    This role is often held by maintainers and engineering managers
  • Owners are essentially group-admins, they can give access to groups and have destructive capabilities

~ LICENSE ~

GPLv3

~ REFERENCES ~

Sustainable data analysis with Snakemake
Felix MΓΆlder, Kim Philipp Jablonski, Brice Letcher, Michael B. Hall, Christopher H. Tomkins-Tinch, Vanessa Sochat, Jan Forster, Soohyun Lee, Sven O. Twardziok, Alexander Kanitz, Andreas Wilm, Manuel Holtgrewe, Sven Rahmann, Sven Nahnsen, Johannes KΓΆster
F1000Research (2021)
DOI: https://doi.org/10.12688/f1000research.29032.2
Publication: https://f1000research.com/articles/10-33/v1
Source code: https://github.com/snakemake/snakemake
Documentation: https://snakemake.readthedocs.io/en/stable/index.html

Tabix: fast retrieval of sequence features from generic TAB-delimited files
Heng Li
Bioinformatics, Volume 27, Issue 5 (2011)
DOI: https://doi.org/10.1093/bioinformatics/btq671
Publication: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3042176/
Source code: https://github.com/samtools/samtools
Documentation: http://samtools.sourceforge.net

GATK: A MapReduce framework for analyzing next-generation DNA sequencing data Genome Research, Volume 20: 1297-1303 (2010) DOI: https://doi.org/10.1101/gr.107524.110 Publication: https://genome.cshlp.org/content/20/9/1297 Source code:https://github.com/broadinstitute/gatk Documentation:https://gatk.broadinstitute.org/hc/en-us

**Picard-tools: ** Broad Institute, GitHub repository (2019) DOI: Publication: Source code:https://github.com/broadinstitute/picard](https://github.com/broadinstitute/picard) Documentation:https://broadinstitute.github.io/picard/

The AWK Programming Language
Al Aho, Brian Kernighan and Peter Weinberger
Addison-Wesley (1988)
ISBN: https://www.biblio.com/9780201079814
Publication:
Source code: https://github.com/onetrueawk/awk
Documentation: https://www.gnu.org/software/gawk/manual/gawk.html

Twelve years of SAMtools and BCFtools
Petr Danecek, James K Bonfield, Jennifer Liddle, John Marshall, Valeriu Ohan, Martin O Pollard, Andrew Whitwham, Thomas Keane, Shane A McCarthy, Robert M Davies and Heng Li
GigaScience, Volume 10, Issue 2 (2021)
DOI: https://doi.org/10.1093/gigascience/giab008
Publication: https://academic.oup.com/gigascience/article/10/2/giab008/6137722
Source code: https://github.com/samtools/samtools
Documentation: http://samtools.sourceforge.net

Fast and accurate short read alignment with Burrows-Wheeler Transform
Heng Li and Richard Durbin
Bioinformatics, Volume 25, Aricle 1754-60 (2009)
DOI: https://doi.org/10.1093/bioinformatics/btp324
Publication: https://pubmed.ncbi.nlm.nih.gov/19451168@
Source code: https://github.com/lh3/bwa
Documentation: http://bio-bwa.sourceforge.net

Trimmomatic: A sliding-window, adaptive, quality-based trimming tool for FastQ files
Joshi NA and Fass JN
_(2011)
DOI: https://doi.org/10.1093/bioinformatics/btu170
Publication: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4103590/pdf/btu170.pdf
Source code: https://github.com/usadellab/Trimmomatic
Documentation: https://github.com/usadellab/Trimmomatic

MultiQC: summarize analysis results for multiple tools and samples in a single report
Philip Ewels, MΓ₯ns Magnusson, Sverker Lundin and Max KΓ€ller
Bioinformatics, Volume 32, Issue 19 (2016)
DOI: https://doi.org/10.1093/bioinformatics/btw354
Publication: https://academic.oup.com/bioinformatics/article/32/19/3047/2196507
Source code: https://github.com/ewels/MultiQC
Documentation: https://multiqc.info

FastQ Screen: A tool for multi-genome mapping and quality control
Wingett SW and Andrews S
F1000Research (2018)
DOI: https://doi.org/10.12688/f1000research.15931.2
Publication: https://f1000research.com/articles/7-1338/v2
Source code: https://github.com/StevenWingett/FastQ-Screen
Documentation: https://www.bioinformatics.babraham.ac.uk/projects/fastq_screen

FastQC: A quality control tool for high throughput sequence data
Simon Andrews
Online (2010)
DOI: https://doi.org/
Publication:
Source code: https://github.com/s-andrews/FastQC
Documentation: https://www.bioinformatics.babraham.ac.uk/projects/fastqc