Sequence2Branches is a Snakemake pipeline that builds a species-level phylogenetic tree from paired-end FASTQ reads. It trims the reads, creates and annotates the assembly, identifies the species of the isolate, and builds the phylogenetic tree.
After identifying the taxonomy of the isolate, Sequence2Branches builds a phylogenetic tree whose leaves are bacterial genomes from NCBI's GenBank database that share the isolate's species, plus an outgroup that shares the genus of the isolate and leaves. The first tree, called the "big tree", contains a user-selected number of leaves (default 1000). A second, more readable tree, called the "little tree", also has a user-selected number of leaves (default 50; fewer than 100 is recommended), but its leaves are chosen from the leaves of the big tree that are most closely related to the isolate. If you know neither the taxonomic identity of your isolate nor how many GenBank genomes belong to that species, building the big tree first and then taking a "sub-tree" from it in the form of the little tree allows for the most accurate and efficient classification of the isolate.
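The little-tree selection described above can be illustrated with a small sketch. This is not the pipeline's actual implementation: it assumes the big tree is available as a list of (node, node, branch length) edges and that "most closely related" means the smallest path-length distance to the isolate. The names leaf_distances, closest_leaves and toy_edges are hypothetical.

```python
from collections import defaultdict
import heapq


def leaf_distances(edges, start):
    """Return the path-length distance from `start` to every other leaf.

    `edges` is a list of (node_a, node_b, branch_length) tuples describing
    a tree (an assumed representation, chosen for illustration).
    """
    graph = defaultdict(list)
    for a, b, length in edges:
        graph[a].append((b, length))
        graph[b].append((a, length))

    distances = {start: 0.0}
    stack = [start]
    while stack:  # depth-first walk; a tree has exactly one path to each node
        node = stack.pop()
        for neighbour, length in graph[node]:
            if neighbour not in distances:
                distances[neighbour] = distances[node] + length
                stack.append(neighbour)

    # Leaves are nodes with a single neighbour; drop the isolate itself.
    return {n: d for n, d in distances.items()
            if len(graph[n]) == 1 and n != start}


def closest_leaves(edges, isolate, n_leaves=50):
    """Pick up to `n_leaves` leaves nearest to the isolate, nearest first."""
    distances = leaf_distances(edges, isolate)
    return heapq.nsmallest(n_leaves, distances, key=distances.get)


# Toy big tree: n1..n3 are internal nodes, A-D are GenBank genomes.
toy_edges = [
    ("isolate", "n1", 0.01),
    ("A", "n1", 0.02),
    ("n1", "n2", 0.05),
    ("B", "n2", 0.03),
    ("n2", "n3", 0.10),
    ("C", "n3", 0.04),
    ("D", "n3", 0.20),
]
print(closest_leaves(toy_edges, "isolate", n_leaves=2))  # ['A', 'B']
```

Asking for more leaves than the big tree contains simply returns every leaf, which mirrors the "maximum number of leaves" wording of the -lt option.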
The steps taken to create a phylogenetic tree from an isolate also provide the building blocks for a pangenome, using the assemblies of both the isolate and the tree leaves from GenBank. This optional step can be run during the initial Sequence2Branches run, or at any time afterwards using the same output folder as the original run (Snakemake will not re-run any tool that already has an output folder and will skip straight to the pangenome step).
- Must be run on Linux
- Miniconda installed and conda set up
- Mamba installed
- Install GTDB-tk as a conda environment for taxonomic identification
- Download GenBank bacteria genomes locally:
#download all genbank genomes
mkdir /path/to/genbank
cd /path/to/genbank
wget ftp://ftp.ncbi.nih.gov/genomes/genbank/bacteria/assembly_summary.txt
#parse summary file (column 20 is the FTP path; skip header lines and entries without a path)
awk -F '\t' '!/^#/ && $20 != "na" {print $20}' assembly_summary.txt > assembly_summary_allftp_genomes.txt
sed -e "s/https\:\/\//ftp:\/\//g" -i assembly_summary_allftp_genomes.txt
#download all
for next in $(cat /path/to/genbank/assembly_summary_allftp_genomes.txt); do wget -P references "$next"/*genomic.fna.gz; done
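For readers who prefer to do the parsing step in Python, the awk and sed commands above can be approximated as follows. This is a minimal sketch, not part of the pipeline: the function name ftp_paths and the toy rows are hypothetical, and it assumes the FTP path is the 20th tab-separated column unless a header line names an ftp_path column.

```python
def ftp_paths(summary_lines):
    """Yield the FTP directory of every assembly listed in the summary."""
    column = 19  # assumption: ftp_path is the 20th tab-separated column
    for line in summary_lines:
        line = line.rstrip("\n")
        if line.startswith("#"):
            # Header lines start with '#'; if one names the columns,
            # use it to locate ftp_path instead of the assumed index.
            names = [name.strip() for name in line.lstrip("#").split("\t")]
            if "ftp_path" in names:
                column = names.index("ftp_path")
            continue
        fields = line.split("\t")
        if len(fields) <= column or fields[column] in ("", "na"):
            continue  # no download location recorded for this assembly
        # Same rewrite as the sed command: https:// becomes ftp://
        yield fields[column].replace("https://", "ftp://", 1)


# Toy rows with placeholder values (not real GenBank entries).
header = "#" + "\t".join(["col%d" % i for i in range(19)] + ["ftp_path"])
sample = [
    header + "\n",
    "\t".join(["x"] * 19 + ["https://example.org/genomes/GCA_toy_1"]) + "\n",
    "\t".join(["x"] * 19 + ["na"]) + "\n",
]
print(list(ftp_paths(sample)))
```

With a real file, the same function can be fed an open file handle, for example ftp_paths(open("assembly_summary.txt")).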
git clone https://github.com/nali321/Sequence2Branches
Run conda_installer.py to create the conda environments for the pipeline.
python /path/to/conda_installer.py -o /path/to/envs
conda activate /path/to/envs/sequence2branches
This pipeline was originally developed on an HPC server whose cluster nodes have no internet access, so the conda environments and dependencies are installed before the Snakemake pipeline is run. The result is the same as using Snakemake's own conda integration: the environments are created in a separate folder instead of your main envs folder.
python /path/to/sequence2branches.py -r1 /path/to/read_1 -r2 /path/to/read_2 -i /path/to/illuminaclip -s /path/to/genbank/assembly_summary.txt -r /path/to/genbank/references -g /path/to/gtdbtk -e /path/to/envs -o /path/to/outdir
options:
-h, --help show this help message and exit
-r1 READ1, --read1 READ1
Filepath of first half of read pair
-r2 READ2, --read2 READ2
Filepath of second half of read pair
-i ILLUMINACLIP, --illuminaclip ILLUMINACLIP
Illuminaclip used in Trimmomatic
-s SUMMARY, --summary SUMMARY
Filepath to assembly_summary.txt
-r REFERENCES, --references REFERENCES
Filepath to references folder
-g GTDBTK, --gtdbtk GTDBTK
Filepath to GTDB-tk conda environment
-e ENVS, --envs ENVS Filepath to envs folder
-o OUTDIR, --outdir OUTDIR
Directory where output will go
###OPTIONAL PARAMETERS:
-bt BIG_TREE, --big_tree BIG_TREE
Maximum number of leaves for the big tree. Default is 1000
-lt LITTLE_TREE, --little_tree LITTLE_TREE
Maximum number of leaves for the little tree. Default is 50
-pg PANGENOME_SIZE, --pangenome_size PANGENOME_SIZE
Maximum size of the pangenome
-sc SNAKEMAKE_CORES, --snakemake_cores SNAKEMAKE_CORES
Number of cores for Snakemake to use. Default is 6
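When several isolates need to be processed, the command line can be assembled programmatically. The sketch below is only an illustration: build_command is a hypothetical helper, not something shipped with Sequence2Branches, and it uses only the flags listed in the help text above. Optional parameters are appended only when a value is given, so the pipeline's own defaults apply otherwise.

```python
import shlex


def build_command(script, read1, read2, illuminaclip, summary, references,
                  gtdbtk, envs, outdir, big_tree=None, little_tree=None,
                  pangenome_size=None, snakemake_cores=None):
    """Return the sequence2branches.py invocation as an argument list."""
    command = [
        "python", script,
        "-r1", read1, "-r2", read2,
        "-i", illuminaclip,
        "-s", summary,
        "-r", references,
        "-g", gtdbtk,
        "-e", envs,
        "-o", outdir,
    ]
    optional = {
        "-bt": big_tree,
        "-lt": little_tree,
        "-pg": pangenome_size,
        "-sc": snakemake_cores,
    }
    for flag, value in optional.items():
        if value is not None:  # leave unset options to the pipeline defaults
            command += [flag, str(value)]
    return command


cmd = build_command(
    "/path/to/sequence2branches.py",
    "/path/to/read_1", "/path/to/read_2",
    "/path/to/illuminaclip",
    "/path/to/genbank/assembly_summary.txt",
    "/path/to/genbank/references",
    "/path/to/gtdbtk", "/path/to/envs", "/path/to/outdir",
    little_tree=30, snakemake_cores=12,
)
print(shlex.join(cmd))  # shell-quoted form, handy for logging
# To launch the run: subprocess.run(cmd, check=True)
```

Passing the list directly to subprocess.run avoids shell quoting problems with paths that contain spaces.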