
Bioinformatic pipeline for SARS-CoV-2 sequence analysis


FHI's SARS-CoV-2 Illumina Pipeline V13

Linux Docker

Version 13 of the bioinformatic pipeline for SARS-CoV-2 sequence analysis used at Folkehelseinstituttet (FHI, the Norwegian Institute of Public Health).

⚠️ V12 is still accessible via Docker Hub: garcianacho/fhisc2:IlluminaV12

Description

Docker-based solution for sequence analysis of SARS-CoV-2 Illumina samples

Primer schemes supported

ArticV3
ArticV4

Installation

git clone https://github.com/folkehelseinstituttet/FHI_SC2_Pipeline_Illumina/
cd FHI_SC2_Pipeline_Illumina
docker build -t garcianacho/fhisc2:Illumina .

Running the pipeline

ArticV4:
docker run -it --rm -v $(pwd):/home/docker/Fastq garcianacho/fhisc2:Illumina SARS-CoV-2_Illumina_Docker_V13.sh ArticV4

ArticV3:
docker run -it --rm -v $(pwd):/home/docker/Fastq garcianacho/fhisc2:Illumina SARS-CoV-2_Illumina_Docker_V13.sh ArticV3

Note that older versions of Docker might require the flag --privileged, and multi-user systems might require the flag -u 1000 to run.

The script expects the following folder structure, in which the fastq.gz files of each sample are placed in their own folder:
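
The two invocations above differ only in the primer scheme, and the optional flags from the note go before the image name. The following is a minimal sketch of a helper that assembles the command and prints it for review. The function name `fhi_sc2_cmd` and the environment variables `FHI_PRIVILEGED` and `FHI_UID` are hypothetical names chosen for illustration; they are not part of the pipeline.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the pipeline): prints the docker command
# for a given primer scheme so it can be reviewed before it is run.
fhi_sc2_cmd() {
  local scheme="$1"
  local extra=""

  # Only the two primer schemes listed in this README are accepted.
  case "$scheme" in
    ArticV3|ArticV4) ;;
    *)
      echo "Unsupported primer scheme: '$scheme' (expected ArticV3 or ArticV4)" >&2
      return 1
      ;;
  esac

  # Optional flags from the note above (variable names are assumptions).
  if [ "${FHI_PRIVILEGED:-0}" = "1" ]; then
    extra="$extra --privileged"
  fi
  if [ -n "${FHI_UID:-}" ]; then
    extra="$extra -u ${FHI_UID}"
  fi

  # $(pwd) is printed literally so the output matches the documented command.
  printf 'docker run -it --rm%s -v $(pwd):/home/docker/Fastq garcianacho/fhisc2:Illumina SARS-CoV-2_Illumina_Docker_V13.sh %s\n' "$extra" "$scheme"
}

# Example:
#   FHI_UID=1000 fhi_sc2_cmd ArticV4
```

The printed line can be pasted into a shell opened in the experiment folder. Printing rather than executing keeps the helper safe to try on machines where Docker is not yet set up.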

./ExpXX    
  |-ExperimentXX.xlsx      
  |-Sample1     
      |-Sample1_SX_LXXXX_R1.fastq.gz       
      |-Sample1_SX_LXXXX_R2.fastq.gz      
  |-Sample2      
      |-Sample2_SX_LXXXX_R1.fastq.gz   
      |-Sample2_SX_LXXXX_R2.fastq.gz   
  |-Sample3   
      |-Sample3_SX_LXXXX_R1.fastq.gz
      |-Sample3_SX_LXXXX_R2.fastq.gz
  |-...   
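
The layout above can be checked before a run is started. The sketch below is a hypothetical pre-flight check, not something shipped with the pipeline: `check_experiment_layout` is an illustrative name, and the assumption that each sample folder holds exactly one R1 and one R2 file is inferred from the tree shown above.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check (not part of the pipeline).
# Verifies that every sample folder contains exactly one R1 and one R2 file.
check_experiment_layout() {
  local exp_dir="${1:-.}"
  local status=0 samples=0
  local sample_dir r1_count r2_count

  for sample_dir in "$exp_dir"/*/; do
    # If there are no sub-folders the glob stays unexpanded; skip it.
    [ -d "$sample_dir" ] || continue
    samples=$((samples + 1))

    # The patterns also match the common Illumina suffix _R1_001.fastq.gz.
    r1_count=$(find "$sample_dir" -maxdepth 1 -type f -name '*_R1*.fastq.gz' | wc -l | tr -d '[:space:]')
    r2_count=$(find "$sample_dir" -maxdepth 1 -type f -name '*_R2*.fastq.gz' | wc -l | tr -d '[:space:]')

    if [ "$r1_count" -ne 1 ] || [ "$r2_count" -ne 1 ]; then
      echo "WARN: ${sample_dir%/} has $r1_count R1 and $r2_count R2 fastq.gz files (expected 1 each)" >&2
      status=1
    fi
  done

  if [ "$samples" -eq 0 ]; then
    echo "WARN: no sample folders found in $exp_dir" >&2
    status=1
  fi
  return "$status"
}

# Example, run from inside the experiment folder:
#   check_experiment_layout . && echo "Layout looks fine"
```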

The script also expects an .xlsx file containing the position of each sample on the 96-well plate and its DNA concentration (alternatively, this column can hold the Ct values). If the file is not properly formatted, the script will still run without errors, but the quality-control plot will either not be generated or contain errors. The script takes the name of the experiment from the name of the xlsx file, so if the file is not found the names of the output files might be incorrect. It is possible to download a template of the xlsx file here
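
Because the experiment name comes from the xlsx file name, it can help to confirm that exactly one such file is present before launching the run. The sketch below is not the pipeline's own logic: it assumes the experiment name is simply the file name without the `.xlsx` extension, and `experiment_name` is a hypothetical helper introduced for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the pipeline): prints the experiment name
# derived from the single .xlsx file in the experiment folder.
experiment_name() {
  local exp_dir="${1:-.}"
  local count=0 found="" f

  for f in "$exp_dir"/*.xlsx; do
    # If no file matches, the glob stays unexpanded; skip it.
    [ -f "$f" ] || continue
    count=$((count + 1))
    found="$f"
  done

  if [ "$count" -ne 1 ]; then
    echo "Expected exactly one .xlsx file in $exp_dir, found $count" >&2
    return 1
  fi

  # Assumption: the experiment name is the file name minus its extension.
  basename "$found" .xlsx
}

# Example, run from inside the experiment folder:
#   experiment_name .
```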

Outputs

👉 (V13)-Identification of recombinants (see Precfinder for details)
👉 (V13)-Identification of contaminants (see Precfinder for details)
-Summary including mutations found, pangolin lineage, number of reads, coverage, depth, etc.
-Bam files
-Consensus sequences
-Aligned consensus sequences
-Consensus nucleotide sequence for gene S
-Indels and frameshift identification
-Quality-control plot for the plate to detect possible contaminations
-Phylogenetic-tree plot of the samples
-Noise during variant calling across the genome
-Quality-control for contaminations/low-quality samples
-Amplicon efficacy of the selected primer-set for all the samples

Under the hood

This pipeline is based on FHI's base Docker image, which bundles all Linux packages required by the bioinformatic tools plus R v4.1.1. On top of the base image lies a second Docker image containing all the bioinformatic tools required (e.g. Tanoti, nextclade, ivar). The final Docker image builds on the bioinformatic image and adds the Scripts and CommonFiles required to run.
If you want, you can rebuild the two images using the Dockerfiles located in the fhibase and fhibaseillumina folders.
Note that rebuilding the images can lead to broken dependencies, since they are built from public repositories.
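
When rebuilding layered images, the tag given to each rebuilt image has to match the name on the `FROM` line of the image built on top of it. The sketch below shows one way to read that name from a Dockerfile. `dockerfile_base_image` is a hypothetical helper, and the commented build commands assume that each folder contains a file named `Dockerfile` and that each image is based directly on the previous one; neither assumption has been checked against the repository.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the pipeline): prints the image named on
# the first FROM line of a Dockerfile.
dockerfile_base_image() {
  awk 'toupper($1) == "FROM" {
         i = 2
         # Skip options such as --platform=linux/amd64
         while (i <= NF && $i ~ /^--/) i++
         print $i
         exit
       }' "$1"
}

# Assumed build order, from the description above:
# base image -> bioinformatic image -> final image.
#   docker build -t "$(dockerfile_base_image fhibaseillumina/Dockerfile)" fhibase
#   docker build -t "$(dockerfile_base_image Dockerfile)" fhibaseillumina
#   docker build -t garcianacho/fhisc2:Illumina .
```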