A nexflow pipeline for short variant calling from NGS data
Whole genome sequencing data are used for various purposes including molecular typing, classification and phylogenetic analysis. The tuber
pipeline processes NGS data to obtain short variants (SNVs and INDELs). This pipeline implements tools for raw data quality control and pre-processing, read mapping, variant calling and filtering to produce a ready to use vcf and multiple sequence alignment files. The tuber pipeline includes workflow of GATK Best Practices for germline short variant discovery, link, which allows users to work on large datasets. In addition to the GATK Best Practices, tuber includes a step to parse the SNV table into a multiple sequence alignment in fasta format. (tuber
runs well under Linux OS and MacOSX; for Windows users tuber
also runs well under linux in a virtual machine or Windows subsystem for linux)
The pipeline needs the following files as the inputs
1. Raw reads; file name must match the following scheme: _{1,2}.fastq.gz
2. Reference genome sequence (fasta format)
3. (optional) Interval lists (BED format); formatted as 3-column format
The pipeline will produce outputs into 5 separate directories including aln, bam, multiqc_fastqc, vcf and vcf_joint under the directory /user_defined_path/results. Here is a brief description of output files:
File | Description |
---|---|
before.html | quality control check before trimming- html summary |
after.html | results quality control check after trimming using Trimmomatic with defined parameters |
bam | binary alignment map - binary version of SAM (sequence alignment/map) |
coverage.tsv | summary table of mean depth and genome coverage (numreads, covbases, coverage, meandepth, meanbaseq) |
*.g.vcf.gz | per-sample vcf |
joint_filtered.vcf | joint-called vcf filtered with default or user defined filter options |
aln.fasta | multiple sequence alignment in fasta format (concatenated snp alignment) |
pos.txt | position of polymorphic sites corresponding positions in the reference genome sequence |
-
Install dependencies listed in
table 1
. -
Download the
tuber
pipeline scripts.$ git clone https://github.com/b600a/tuber.git $ cd tuber
-
Specify paths and parameters for
nextflow
(listed intable 2
) in thetuber.sh
file using text editor.
Example:nextflow run tuber.sh -resume --ref '/home/bb/ref/ref.fasta' --reads '/home/bb/reads/*_{1,2}.fastq.gz' --results /home/bb/output_here/ --ref_genome 'NC_000962.3' --stop 'true'
-
Lauch the pipeline execution with the following command:
$ ./tuber.sh or $ sh tuber.sh
Paramters that you have specified, and and processes and its progress will be shown as figure below. For more details on howtuber
works, please find next section.N E X T F L O W ~ version 20.07.1 Launching `main.nf` [ridiculous_lorenz] - revision: f29238c695 tuber v0.2 ==================================================== Location of reference genome file: /home/bb_20/ref/ref.fasta Reference regenome name: NC_000962.3 Location of raw read files (fastq): /home/bb_20/reads/*_{1,2}.fastq.gz Location of results: /home/bb_20/output Trimming option (for trimmomatic): SLIDINGWINDOW:4:30 MINLEN:70 Mapping option (for BWA MEM): -c 100 -M -T 50 Option for HaplotypeCaller (per-sample variant calling): -ploidy 1 -mbq 20 Option for variant filtering (using VariantFilatration): QD < 2.0 || MQ < 40.0 (Optional) Stop after finishing step 3 (per-sample variant calling by HaplotypeCaller): false ===================================================== executor > local (15) [0a/10fa43] process > PREPARE_GENOME_SAMTOOLS (ref) [100%] 1 of 1 ✔ [63/a74c11] process > PREPARE_GENOME_PICARD (ref) [100%] 1 of 1 ✔ [0c/2dbb31] process > PREPARE_GENOME_BWA (ref) [100%] 1 of 1 ✔ [71/d4bb14] process > FASTQC_BEFORE_TRIM (ERR1034689) [ 80%] 4 of 5 [- ] process > MULTIQC_FASTQC_BEFORE_TRIM - [f7/9ffdf3] process > TRIM (ERR1034687) [ 80%] 4 of 5 [99/ee416e] process > FASTQC_AFTER_TRIM (ERR1034686) [ 0%] 0 of 4 [- ] process > MULTIQC_FASTQC_AFTER_TRIM - [5c/7882e7] process > READ_MAPPING_BWA (ERR1034686) [ 0%] 0 of 4 [- ] process > COVERAGE_OUTPUT - [- ] process > CALL_VARIANTS - [- ] process > CREATE_SAMPLE_MAP - [- ] process > JOINT_GENOTYPING - [- ] process > FILTER_VARIANTS - [- ] process > VCF_TO_FASTA - `
Once the run has finised,tuber
will report duration time that was used (Example shown below). The pipeline will generate cache files in directory/tuber/work/
, you can delete this directory.N E X T F L O W ~ version 20.07.1 Launching `main.nf` [ridiculous_lorenz] - revision: f29238c695 tuber v0.2 ==================================================== Location of reference genome file: /home/bb_20/ref/ref.fasta Reference regenome name: NC_000962.3 Location of raw read files (fastq): /home/bb_20/reads/*_{1,2}.fastq.gz Location of results: /home/bb_20/output Trimming option (for trimmomatic): SLIDINGWINDOW:4:30 MINLEN:70 Mapping option (for BWA MEM); -c 100 -M -T 50 Option for HaplotypeCaller (per-sample variant calling); -ploidy 1 -mbq 20 Option for variant filtering (using VariantFilatration); QD < 2.0 || MQ < 40.0 (Optional) Stop after finishing step 4 (per-sample variant calling by HaplotypeCaller): false ===================================================== executor > local (35) [0a/10fa43] process > PREPARE_GENOME_SAMTOOLS (ref) [100%] 1 of 1 ✔ [63/a74c11] process > PREPARE_GENOME_PICARD (ref) [100%] 1 of 1 ✔ [0c/2dbb31] process > PREPARE_GENOME_BWA (ref) [100%] 1 of 1 ✔ [c5/ecda54] process > FASTQC_BEFORE_TRIM (ERR1034688) [100%] 5 of 5 ✔ [66/1b5299] process > MULTIQC_FASTQC_BEFORE_TRIM [100%] 1 of 1 ✔ [9d/5fbaae] process > TRIM (ERR1034688) [100%] 5 of 5 ✔ [66/4a654e] process > FASTQC_AFTER_TRIM (ERR1034688) [100%] 5 of 5 ✔ [ba/b8909b] process > MULTIQC_FASTQC_AFTER_TRIM [100%] 1 of 1 ✔ [c6/d877d1] process > READ_MAPPING_BWA (ERR1034688) [100%] 5 of 5 ✔ [01/b30c49] process > COVERAGE_OUTPUT [100%] 1 of 1 ✔ [be/aa02d7] process > CALL_VARIANTS (ERR1034685) [100%] 5 of 5 ✔ [1c/ebed9d] process > CREATE_SAMPLE_MAP [100%] 1 of 1 ✔ [b2/6117d7] process > JOINT_GENOTYPING (1) [100%] 1 of 1 ✔ [3b/806024] process > FILTER_VARIANTS (1) [100%] 1 of 1 ✔ [b0/03d1d3] process > VCF_TO_FASTA (1) [100%] 1 of 1 ✔ Completed at: 02-Nov-2020 12:36:48 Duration : 11m 11s CPU hours : 0.5 Succeeded : 35 Time used: 00:00:00
Trimmomatic
will be used for clipping Illumina adapters and trimming low quality (sequences in the) reads. Reads that are shorter than the defined length will be dropped. By default it is set as: -PE -phred33, ILLUMINACLIP:TruSeq3-PE.fa:2:30:10, SLIDINGWINDOW:4:30, MINLEN:70
Fastqc
will be used for QC checking sequencing reads before and after processing with ``Trimmomatic. Then MultiQC will parse summary stats from fastqc results of each sample to produce summary tables (multiqc_fasqc.txt and multiqc_general_stats.txt and HTML reports in multiqc_fastqc/ in output directory.
Processed reads will be aligned to the reference genome sequence with BWA MEM
. Default is set as: -c 100 -M -T 50. SAM files will be converted into BAM files by Samtools
and duplicate reads will be marked by Picard
. Average depth and genome coverage of each sample are calculated from BAM using Samtools, results will be parsed into coverage.tsv file in bam/ in output directory.
GATK HaplotypeCaller
will use BAM files as input for calling short variants (SNVs and INDELs) in GVCF mode. Default options are set as minimum base quality 20 and ploidy 1. This step will generate an intermediate file (GVCF) per-sample, which will be used for joint genotyping to produce a final multi-sample VCF file. A list of GVCF files generated will be parsed into tab-delimited text file (named as sample_map.txt) in the input directory. By default tuber will proceed to the next step, all samples will be imported into the workspace for joint genotyping. You can also choose to stop tuber at this step by specifying --stop true in tuber.sh file. You can check the results of MultiQC (HTML reports) and Samtools (coverage.tsv) to select only samples that meet the criteria (e.g. minimum read quality, minimum mean depth, minimum genome coverage) before joint-genotyping. If this parameter is left as default setting, tuber will proceed to step 4.
GATK GenomicsDBImport
will create a GenomicsDB workspace and then import single-sample GVCFs into this workspace. GATK GenotypeGVCFs then will perform joint genotyping on GenomicsDB workspace to produce a multi-sample VCF. By default, GenomicsDBImport
will read sample_map.txt to import all samples into the workspace. In case users have stopped the run after finishing step 3 for checking the results from steps 1 and 2 to exclude some problematic or low quality samples before proceeding to next steps. If you want to proceed to the next step without any change, use command ./sh tuber.sh to do so. But if users want to include only some samples, users will need to provide a tab-delimited text file in input directory (formatted as shown below and file must be named sample_map_usr.txt)
Sample_1 Sample_1.g.vcf.gz
Sample_2 Sample_2.g.vcf.gz
Sample_3 Sample_3.g.vcf.gz
. .
. .
Sample_n Sample_n.g.vcf.gz
GenomicsDBImport
then will automatically read sample_map_usr.txt file and import only samples that are listed into the workspace. After saving sample_map_usr.txt file to the input directory, execute command ./sh tuber.sh to start joint genotyping.
By default, INDELS will be filtered with GATK SelectVariants
(-select-type SNP). Remaining SNVs will be filtered by GATK VariantsFiltration
with filter settings: QD < 2.0 || MQ < 40.0 (Filtered VCF file will be generated in vcf_joint directory in output directory (named as joint_filtered.vcf.gz)
Filtered VCF will be parsed to produce a SNV table using GATK VariantsToTable
. SNV table will be parsed into FASTA format (file named aln.fasta) in aln directory in output directory. Position of all SNVs (in reference coordiates) will also be parsed into a separate file named as pos.txt in the same directory.
Name |
---|
Nextflow 20.07 (or later) |
Java 8 |
Fastqc |
Multiqc |
Trimmomatic |
BWA |
Samtools |
Bcftools |
Gatk 4.1.8.1 (or later) |
Datamash |
Paths/Paramter | Paramter name | Default |
---|---|---|
params.reads |
the location of the read fastq files | /home/reads/*_{1,2}.fastq.gz |
params.ref_genome |
the location of the read fastq files | /home/ref/reference.fasta |
params.results |
the location where the results will be stored | /home/bb_20/out/ |
params.genome_name |
genome name | NC000962.3 |
params.trimming_option |
parameter for trimming (Trimmomatic) | SLIDINGWINDOW:4:30 MINLEN:70 |
params.mapping_option |
parameter for read mapping (BWA) | -c 100 -M -T 50 |
params.haplotypecaller_option |
parameter for per-sample variant calling (HaplotypeCaller) | -ploidy 1 -mbq 20 |
params.genomicsdbimport_option |
parameter for joint (GenomicsDBImport) | test |
params.genotypegvcfs_option |
parameter for joint (GenotypeGVCFs) | test |
params.variant_filter |
parameter for variant filtering | QD < 2.0 MQ < 40.0 |
params.variant_filter_name |
parameter for variant filtering (VariantsFiltration) | qd-mq |
params.selectvariant_option |
parameter for variant filtering (SelectVariants) | --exclude-filtered -select-type SNP |
params.stop |
optional stop after step 3 (per-sample variant call) | false |
1.1 Install curl
and git
:
$ sudo apt-get update
1.2 Install homebrew
:
$ /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install.sh)"
1.2.1 Set PATH:
$ echo 'eval $(/home/linuxbrew/.linuxbrew/bin/brew shellenv)' >> /home/bb_18/.profile
1.2.2 Evaluate .profile
:
$ eval $(/home/linuxbrew/.linuxbrew/bin/brew shellenv)
1.3 Install build-essential
:
$ sudo apt-get install build-essential
1.4 Install gcc
:
$ brew install gcc
1.5 Tap brewsci/bio
:
$ brew tap brewsci/bio
$ brew install nextflow \
> gatk \
> fastqc \
> samtools \
> bwa \
> datamash \
> bcftools \
> trimmomatic \
> samtools
2.1 Launch nexflow
to install its dependencies by running $ nextflow
$ pip3 install multiqc