This repository contains the code to generate clusters of identical SARS-CoV-2 sequences for a subset of the overall GISAID dataset, where the subset is defined by a user-provided list of filtering commands. To speed up computations, we only compare sequences within their Nextclade-assigned Pango lineage.
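To make the idea of comparing sequences only within a lineage concrete, here is a minimal Python sketch. It is not the repository's implementation: the function name `cluster_identical_sequences`, the `(name, lineage, sequence)` record layout, and the use of exact string equality as the definition of "identical" are all assumptions made for illustration.

```python
from collections import defaultdict


def cluster_identical_sequences(records):
    """Group sequence names into clusters of identical sequences, per lineage.

    `records` is an iterable of (name, lineage, sequence) tuples.
    This layout is hypothetical and only serves the illustration.
    """
    # Step 1: partition the records by lineage so that sequences are
    # only ever compared with others from the same lineage.
    by_lineage = defaultdict(list)
    for name, lineage, sequence in records:
        by_lineage[lineage].append((name, sequence))

    # Step 2: within each lineage, group names by exact sequence string.
    # Assumption: "identical" means exact string equality here.
    clusters = {}
    for lineage, members in by_lineage.items():
        groups = defaultdict(list)
        for name, sequence in members:
            groups[sequence].append(name)
        clusters[lineage] = sorted(sorted(names) for names in groups.values())
    return clusters


# Toy input: seqA and seqB are identical and share a lineage; seqD has
# the same sequence but a different lineage, so it is clustered separately.
example = [
    ("seqA", "BA.2", "ACGT"),
    ("seqB", "BA.2", "ACGT"),
    ("seqC", "BA.2", "ACGA"),
    ("seqD", "XBB.1.5", "ACGT"),
]
result = cluster_identical_sequences(example)
```

Partitioning by lineage first keeps each comparison set small, which is the speed-up described above. The trade-off is that two identical sequences assigned to different lineages would never end up in the same cluster.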
GISAID data were downloaded from the nextstrain-ncov-private S3 bucket on May 3rd, 2024 using the following commands:
aws configure
aws s3 ls s3://nextstrain-ncov-private
aws s3 cp s3://nextstrain-ncov-private/metadata.tsv.zst data
unzstd data/metadata.tsv.zst
aws s3 cp s3://nextstrain-ncov-private/aligned.fasta.zst data
We use conda to define an environment containing all the required dependencies. On rhino, this requires the following steps:
# Load Anaconda
ml Anaconda3/2023.09-0
# Ensure that we have the correct conda channels (this only needs to be done once in a session)
conda config --add channels bioconda
conda config --add channels conda-forge
conda config --set channel_priority strict
We can then use conda to install the required dependencies (~ 15 minutes). The environment can be created and activated using the following commands:
# Install
conda env create -f envs/idseq.yaml
# Activate
source activate idseq
If needed, the environment can be deactivated or deleted using the following commands:
# Deactivate
conda deactivate
# Remove
conda env remove --name idseq
The workflow is split into two parts:
- generate_config, which lists all the Pango lineages present in the metadata.
- id_seq, which generates clusters of identical sequences for the config defined in config/config_idseq.yaml, using the Pango lineages summarised by the generate_config part of the workflow.
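As a rough illustration of what the first part does conceptually, the sketch below counts the lineages found in a tab-separated metadata table. It is not the workflow's actual code: the function name `count_lineages` is hypothetical, and the column name `Nextclade_pango` is an assumption about the metadata header that should be checked against data/metadata.tsv.

```python
import csv
import io
from collections import Counter


def count_lineages(handle, column="Nextclade_pango"):
    """Count how many rows carry each lineage in a TSV file handle.

    The default column name is an assumption, not taken from the workflow.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    counts = Counter()
    for row in reader:
        lineage = row.get(column)
        # Skip rows where the lineage is missing or empty.
        if lineage:
            counts[lineage] += 1
    return counts


# Small in-memory table standing in for data/metadata.tsv.
demo = io.StringIO(
    "strain\tNextclade_pango\n"
    "s1\tBA.2\n"
    "s2\tBA.2\n"
    "s3\tXBB.1.5\n"
    "s4\t\n"
)
lineage_counts = count_lineages(demo)
lineages = sorted(lineage_counts)
```

On the real file, the same function could be called as `count_lineages(open("data/metadata.tsv", newline=""))`, assuming the file has been decompressed as described above.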
The generate_config part of the workflow can be launched directly from the command line using:
# Change the working directory
cd generate_config/
# Activate the environment
source activate idseq
# Ensure snakemake is loaded
ml snakemake/7.18.2-foss-2021b
# Launch workflow
snakemake --use-conda -j 10 --profile ../profiles/
The id_seq part of the workflow can be launched directly from the command line using:
# Change the working directory
cd id_seq/
# Activate the environment
source activate idseq
# Ensure snakemake is loaded
ml snakemake/7.18.2-foss-2021b
# Launch workflow
snakemake --use-conda -j 10 --profile ../profiles/
or using a bash file to launch a job that will initiate the scheduling (here on the Hutch cluster):
sbatch launch_schedule.sh