SARS-CoV-2 wastewater surveillance using Nanopore long reads. Version 1.1
This pipeline aims to identify and quantify SARS-CoV-2 lineages at the read-level.
The recommended way to install and use the pipeline is by using the pre-built docker image
docker pull ghcr.io/garcia-nacho/wastewater:v1.1
However, you could download and edit this repo to create a custom analysis yourself:
git clone https://github.com/garcia-nacho/Wastewater_SARS-CoV-2/
docker build -t wastewater Wastewater_SARS-CoV-2
The pipeline accepts several parameters that are parsed to the image via the flag -e parameter=value
Quality qual=value
The pipeline will filter the reads based on a quality cut-off.
The default quality threshold is 15. But a quality limit of 10 also gives good results.
A value of -1 will prevent any filtering.
Minimum Size m=value
The pipeline will filter reads with a size lower than the minimum value.
The default minimum size is 500nt
Maximum Size M=value
The pipeline will filter reads with a size higher than the maximum value.
The default minimum size is 1300nt
Noise noise=value
The pipeline will analyze the positions with a noise higher than the value provided by this parameter.
The default noise level is 0.15 but values between 0.05 and 0.10 could give good results.
A very high noise parameter will cause fewer sites analyzed and therefore some mutations might be ignored.
A very low noise parameter will cause the inclusion of more sites, which would cause a higher running time and might reduce the performance of the analysis.
Start start=value
This parameter defines where the sequenced region into the Spike gene starts.
The default start position is 1250
End end=value
This parameter defines where the sequenced region into the Spike gene ends.
The default start position is 2250
Trimming trim=value
This parameter specifies if trimming is needed to remove the primers and how many nucleotides should be trimmed.
The default value is no trimming (trim=0)
Positions of Interest poi="pos1,pos2,pos3"
This parameter defines which positions should be analyzed disregarding the noise level on them. The accept the standard mutation format considering the entire SARS-CoV-2 genome (e.g. "G22895C,T22896A,G22898A,A22910G')
The default value is empty.
Kmer kmer="kmer1,kmer2,kmer3
This parameter allows you to screen for kmers in the reference and the reads.
The default value is empty
The pipeline expects this folder structure:
./ExpXX |-Sample1 |-File_XXXXX_1.fastq.gz |-File_XXXXX_2.fastq.gz |-File_XXXXX_3.fastq.gz |-... |-Sample2 |-File_XXXXX_1.fastq.gz |-File_XXXXX_2.fastq.gz |-File_XXXXX_3.fastq.gz |-... |-...
The filename of the .fastq.gz files are irrelevant and the samples are named using the folder that containes them as name
Note that older versions of docker might require the flag --privileged
to run properly.
It is possible to integrate previous results of the pipeline to be analyzed again using the information of the new samples.
To integrate the results of old analyses, a directory containing the files with the suffix _b2f.tsv.gz should be mapped into the /Previous folder of the docker container by using the following flag.
-v Path/to/database:/Previous
Basic run using default settings:
docker run -it --rm -v $(pwd):/Data ghcr.io/garcia-nacho/wastewater:v1.1
Read filter by a quality of 10:
docker run -it --rm -e qual=10 -v $(pwd):/Data ghcr.io/garcia-nacho/wastewater:v1.1
Noise cut-off change to 0.1:
docker run -it --rm -e noise=0.1 -v $(pwd):/Data ghcr.io/garcia-nacho/wastewater:v1.1
Change the read size to allow reads between 100 and 2000nt
docker run -it --rm -e m=100 -e M=2000 -v $(pwd):/Data wastewater
Look into mutations of the BA.2.86 variant and use the previous results included in the directory WWDB
docker run -it --rm -v WWDB:/Previous -v $(pwd):/Data \
-e poi='G22895C,T22896A,G22898A,A22910G,C22916T,G23012A,C23013A,T23018C,T23019C,C23271T,C23423T,A23604G' \
ghcr.io/garcia-nacho/wastewater:v1.1
Look into a kmer specific from the BA.2.86 variant and use the previous results included in the directory WWDB
docker run -it --rm -v WWDB:/Previous -v $(pwd):/Data -e kmer='TAAGCATAGTG' ghcr.io/garcia-nacho/wastewater:v1.1
The pipeline generates several folders: analysis, bam, QC, sequences and WWDB
analysis
Here you will find:
-Sankey plots showing the most abundant combinations of mutations at nucleotide and amino acid levels
-Barplots showing the relative abundance of the different combinations. Different granularity levels are included (amino acid sequence level, pangolin level, mutation-based-lineage)
-Excel files with the raw results
-A barplot showing the relative abundance of all the single mutations found in the samples.
-html widgets showing the distribution of the different lineages found in the different samples and in the entire dataset.
bam
Here you will find the bam files containing the reads aligned against the Spike gene of the wuhan-hu-1 strain
QC
Here you will find:
-Plots showing the coverage of the samples over the spike gene
-Noise at the different positions (Noise is defined here as ratio of bases not included in the consensus)
-A noise.tsv file that contains the raw data regarding coverage and noise
-consensus_qual.txt. Quality of the different bases called in the consensus
sequences
Here you find the fasta sequences of the consensus and variants found in each sample
WWDB Here you will find the files required to integrate the results into future analyses.
The pipeline filters the reads according to a quality cut-off using seqkit. Reads are mapped against the reference Spike using minimap2 and filtered and sorted using samtools. The resulting bam file is indexed using samtools. The positions showing a mix of bases are identified using noisefinder and the bases connected with their read-ids are retreived using bbasereader. All positions are merged using the read-id as pivot column and the different variants are identified and saved in a fasta file. The pangolin lineages are assigned using the mutations on the references and reads. If there is no discrepancy between the mutations found in a read and the mutations found in a lineage, the read is assigned to that lineage. If there are discrepancies the reads are assigned to the closer lineage and the discrepancies are displayed. If there are several lineages close enough to the read in terms of mutations, all of them are displayed (Note that not all mutations are analyzed, only those dynamically decided and those defined with the poi flag). The plots and analyses are generated in R