COVID-19 Genome Analysis Pipeline, for whole genome sequencing and annotation of SARS-CoV-2 {basic version, more on the way}
COVGAP processes raw, demultiplexed short reads (Illumina) produced by an amplicon sequencing approach targeting SARS-CoV-2 (e.g. ARTIC). It quality-filters the reads and maps them against the SARS-CoV-2 reference genome, reports the variants in VCF format, and includes them in a draft consensus genome. Only variants showing both high coverage and high alternative allele frequency (primary allele variants) are included in the consensus. Alleles below the coverage or frequency thresholds (secondary allele variants), and alleles occurring as alternatives to primary allele variants (alternative allele variants), are stored and annotated separately for the user's reference, but are NOT part of the consensus. Loci without enough coverage for a confident variant call are masked with N instead of being called as the reference. Each consensus is labelled according to a user-defined threshold on the ambiguous base count (N): LowCov (Ns above the threshold) or HighCov (Ns below the threshold).
- git >= 1.8.3
- python >= 3.7
- conda >= 4.9 (for mac only)
- snakemake >= 5.26
If you don't have snakemake already, we recommend installing it via mamba:
$ conda install -c conda-forge mamba
Then, to install the full package, run:
$ mamba create -c conda-forge -c bioconda -n snakemake snakemake
More options can be found on the Snakemake page. Finally, activate the environment:
$ conda activate snakemake
Portability expansion is on its way. For the moment, we recommend cloning the repository into a directory of your choice:
$ git clone https://github.com/appliedmicrobiologyresearch/covgap/ path/to/covgap
The initial reads need to be already demultiplexed, paired-end, and tagged with _R1.fastq.gz and _R2.fastq.gz. Anything preceding this tag is interpreted as the sample name. For the moment, COVGAP runs only on Linux or macOS; choose the Snakefile matching your system when running snakemake. To run the pipeline with default options, simply type the command for your system:
$ snakemake -s path/to/covgap/repo/covgap_linux.smk --use-conda --cores [number of cores reserved]
$ snakemake -s path/to/covgap/repo/covgap_mac.smk --use-conda --cores [number of cores reserved]
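
The read naming convention described above can be checked before launching a run. The following is a minimal Python sketch and not part of COVGAP: the helper names `find_read_pairs` and `find_unpaired` are hypothetical, and the sketch assumes that all reads sit directly inside a single directory.

```python
from pathlib import Path

R1_TAG = "_R1.fastq.gz"
R2_TAG = "_R2.fastq.gz"


def find_read_pairs(read_dir):
    """Return {sample_name: (r1_path, r2_path)} for complete pairs in read_dir."""
    read_dir = Path(read_dir)
    pairs = {}
    for r1 in sorted(read_dir.glob("*" + R1_TAG)):
        # Everything before the _R1.fastq.gz tag is treated as the sample name.
        sample = r1.name[: -len(R1_TAG)]
        r2 = read_dir / (sample + R2_TAG)
        if r2.exists():
            pairs[sample] = (r1, r2)
    return pairs


def find_unpaired(read_dir):
    """Return the sorted sample names for which only one mate file exists."""
    read_dir = Path(read_dir)
    r1 = {p.name[: -len(R1_TAG)] for p in read_dir.glob("*" + R1_TAG)}
    r2 = {p.name[: -len(R2_TAG)] for p in read_dir.glob("*" + R2_TAG)}
    # Symmetric difference: names present in exactly one of the two sets.
    return sorted(r1 ^ r2)
```

Calling `find_unpaired("reads/")` before a run gives a quick list of samples whose mate file is missing or misnamed.
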
Parameters can be customised by adding the flag --config followed by the parameter to be changed. For example, to change the directory containing the reads from the default to ../my_reads/, the command is:
$ snakemake -s path/to/covgap/repo/covgap_mac.smk --use-conda --cores 4 --config read_dir=../my_reads/
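
When several parameters change at once, it can help to assemble the command programmatically; snakemake accepts multiple key=value pairs after a single --config flag. The sketch below is illustrative only: `build_covgap_command` is a hypothetical helper, and `variant_depth=100` is an arbitrary example value rather than a recommendation.

```python
import shlex


def build_covgap_command(snakefile, cores, **config):
    """Assemble the snakemake argument list for a COVGAP run."""
    cmd = ["snakemake", "-s", str(snakefile), "--use-conda", "--cores", str(cores)]
    if config:
        # All overrides follow one --config flag as key=value pairs.
        cmd.append("--config")
        cmd.extend(f"{key}={value}" for key, value in config.items())
    return cmd


cmd = build_covgap_command(
    "path/to/covgap/repo/covgap_linux.smk",
    cores=4,
    read_dir="../my_reads/",
    variant_depth=100,  # arbitrary example value
)
# Print a copy-pastable shell command.
print(" ".join(shlex.quote(arg) for arg in cmd))
```

The returned list can also be passed directly to `subprocess.run(cmd, check=True)` to launch the run from a script.
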
A full list of the parameters follows:
Parameter | Description | Default |
---|---|---|
read_dir | [path] Directory containing the demultiplexed raw reads | reads/ |
QC_sliding_windows | [numeric] Sliding window size used to scan the read from the 5' end. The read is clipped once the average quality within the window falls below a threshold | 4 |
QC_phred_score | [numeric] Minimum quality threshold per base (phred score) used to trim the read (see QC_sliding_windows above) | 20 |
adapters | [file.fasta] The sequence of adapters to be trimmed | Nextera Flex adapters |
primers | [file.fasta] The sequences of forward and reverse primers used for the amplicon sequence | ARTIC primers V3 |
ref_genome | [file.fasta] The full SARS-CoV-2 genome used as reference, NOTE: if using a custom reference, make sure it is indexed before the run | Wuhan-Hu-1 (NC_045512.2) |
mapr | [file.json] Mapping criteria to consider a read as mapping | map, mate_mapped |
unmapr | [file.json] Unmapping criteria to consider a read as NOT mapping | unmapped, mate_unmapped |
uptrim_threshold | [numeric] Upper bound used to prune the read pileup where it exceeds the threshold. NOTE: this trimming is used for plot display only. It does NOT affect the variant call | 1000 |
variant_freq | [numeric] Minimum alternative allele frequency in the read pileup to consider a variant as primary. We recommend values of 0.7 or higher. NOTE: only variants passing both the variant_freq AND the variant_depth thresholds are considered primary | 0.7 |
variant_depth | [numeric] Minimum locus depth in the read pileup to consider a variant as primary. NOTE: only variants passing both the variant_freq AND the variant_depth thresholds are considered primary | 50 |
n_threshold | [numeric] Percentage of ambiguous base calls (N) across the consensus above which the genome is labelled LowCov instead of HighCov | 10 |
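
The interplay between variant_freq and variant_depth can be illustrated with a short Python sketch. This is not COVGAP's actual implementation: `Call` and `classify_calls` are hypothetical names, and treating a value exactly equal to a threshold as passing is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Call:
    pos: int      # 1-based position on the reference
    alt: str      # alternative allele
    freq: float   # alternative allele frequency in the pileup (0 to 1)
    depth: int    # locus depth in the pileup


def classify_calls(calls, variant_freq=0.7, variant_depth=50):
    """Label each call as 'primary', 'alternative' or 'secondary'."""

    def is_primary(call):
        # A primary variant must pass BOTH thresholds (>= is an assumption).
        return call.freq >= variant_freq and call.depth >= variant_depth

    primary_positions = {c.pos for c in calls if is_primary(c)}
    labelled = []
    for call in calls:
        if is_primary(call):
            label = "primary"      # goes into the consensus
        elif call.pos in primary_positions:
            label = "alternative"  # competes with a primary allele at this locus
        else:
            label = "secondary"    # fails the depth or frequency threshold
        labelled.append((call, label))
    return labelled
```

With the default thresholds, a call at frequency 0.95 but depth 20 is labelled secondary, because passing only one of the two thresholds is not enough.
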
File | Description |
---|---|
result/[sample]/Mapping/[sample].alignment.removed.duplicates.mapped.reads.only.sorted.[High/Low]Cov.bam | final alignment containing only the reads mapping to the reference |
result/[sample]/Mapping/[sample].alignment.stats.tab | stats reporting the number of mapped vs unmapped reads |
result/[sample]/Mapping/[sample].average.coverage.tab | stats reporting the average coverage for the sample and its standard deviation |
result/[sample]/variantcall/[sample].final.vcf | the final vcf including all the primary alleles found |
result/[sample]/variantcall/[sample].minority_alleles.vcf | the final vcf including all the minority alleles found |
result/[sample]/variantcall/[sample].minority_alleles_report.tech | additional report on minority alleles, featuring their denomination, frequency score, and depth |
result/[sample]/variantcall/[sample].consensus.[High/Low]Cov.fasta | the consensus genome featuring all the primary allele variants |
result/[sample]/variantcall/[sample].classification.tab | a statement file reporting the classification of the sample and its percentage of ambiguous base calls |
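
The HighCov/LowCov labelling driven by n_threshold can be sketched in Python as follows. This is an illustration and not the code COVGAP runs: `classify_consensus` is a hypothetical name, and labelling a consensus whose N percentage equals the threshold exactly as HighCov is an assumption.

```python
def classify_consensus(sequence, n_threshold=10.0):
    """Return (label, n_percent) for a consensus sequence given as a string."""
    # Drop line breaks and other whitespace, and normalise the case.
    seq = "".join(sequence.split()).upper()
    if not seq:
        raise ValueError("empty consensus sequence")
    # Percentage of ambiguous base calls (N) over the whole consensus.
    n_percent = 100.0 * seq.count("N") / len(seq)
    # Assumption: only a percentage strictly above the threshold gives LowCov.
    label = "LowCov" if n_percent > n_threshold else "HighCov"
    return label, n_percent
```

For instance, a 100-base consensus containing 20 Ns is labelled LowCov under the default threshold of 10.
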
Alfredo Mari: alfredo.mari@unibas.ch
Please refer to the following publications to properly cite COVGAP:
- Stange M, Mari A, Roloff T, Seth-Smith HM, Schweitzer M, Brunner M, et al. (2021) SARS-CoV-2 outbreak in a tri-national urban area is dominated by a B.1 lineage variant linked to a mass gathering event. PLoS Pathog 17(3): e1009374. https://doi.org/10.1371/journal.ppat.1009374
- Mari A, Roloff T, Stange M, Søgaard KK, Asllanaj E, Tauriello G, Alexander LT, Schweitzer M, Leuzinger K, Gensch A, Martinez AE, Bielicki J, Pargger H, Siegemund M, Nickel CH, Bingisser R, Osthoff M, Bassetti S, Sendi P, Battegay M, Marzolini C, Seth-Smith HMB, Schwede T, Hirsch HH, Egli A. Global Genomic Analysis of SARS-CoV-2 RNA Dependent RNA Polymerase Evolution and Antiviral Drug Resistance. Microorganisms. 2021; 9(5):1094. https://doi.org/10.3390/microorganisms9051094