Authors: | J. Blackwell, M. McDonald |
---|---|
Creation Date: | 25-02-2021 |
Last updated: | 10-02-2022 |
License: | Apache 2.0 with commons clause v1.0 |
Contact: | j.blackwell@warwick.ac.uk, m.c.mcdonald@bham.ac.uk |
Overview: Workflow tool designed identify and analyse genes of interest within small haplotype genomes. The workflow will assemble haploid genomes de novo from short paired-end reads, create a local BLAST database and query genes of interest. Where genes of interest are detected, a multiple alignment is performed for each gene.
Citation: This workflow has been adapted and improved from this publication: McDonald MC, Renkin M, Spackman M, et al. Rapid Parallel Evolution of Azole Fungicide Resistance in Australian Populations of the Wheat Pathogen Zymoseptoria tritici. Appl Environ Microbiol. 2019;85(4):e01908-18. Published 2019 Feb 6. doi:10.1128/AEM.01908-18
- Method
- Inputs
- Steps
- Outputs
- Example Workflow
- File Structure
- Usage/Tutorial
- Required Software
- Miniconda3
- Mamba
- Snakemake
- Input Data
- Paired-end Reads
- Adapters fasta file
- Target Genes fasta file
- Running the Workflow
- Setup
- Executing the Workflow
- Specifying Target Files
- Parameters
- Required Software
- Rules List
- Short-read paired end fastq files.
- Adapter fasta file.
- Query gene fasta files.
- Trimmed fastq files.
- Sample denovo genome assembly.
- Local genome blast database.
- Extracted BLAST hits for query genes.
- Multiple alignment between samples for query genes.
- Trimming of raw fastq files [trimmomatic]
- Requires Inputs 1 and 2
- Creates Output 1
- Quality control [Not yet implemented]
- Denovo assembly [SPAdes]
- Output 1 is automatically input into this step
- Output 2, contigs.fasta is created in Spades output folder
- Create local BLAST databases for each de novo assembly [BLAST+]
- Output 2 is automatically used as input for this step
- Creates Output 3
- Query each assembly BLAST database with gene of interest [BLAST+]
- Requires Input 3
- Uses Input 3 and Output 3 to BLAST gene of interest
- Extract sequence from the database [BLASTtoGFF]
- Parses BLAST output file with custom Python script called BLASTtoGFF.py
- Creates Output 4 in fasta format
- Perform multiple alignment for each identified gene [MAFFT]
- Takes Output 4 results from all isolates and creates multiple sequence alignment
For a full list of Snakemake rules and directed graph of workflow, see the Rules List section.
Included in this repository is a simplified set of example data that is analysed according to the directed acyclic graph shown below. This example includes two isolates, one isolate was sequenced on a single lane, the other isolate was sequenced accross two lanes. As input, there are two reduced sets of paired-end sequences (one of which is split over multiple lanes), an example adapter file and an example gene query.
After running the workflow to completeion you should expect to see the folowing file structure. If you plan on using an intermediate file such as an assembled genome as an input, run the example workflow once to generate the full file structure.
SAHDA
├── README.md # Workflow users guide
├── config # Config file for customising the workflow
├── images # Images for the README.md document
├── logs # Lists of errors and standard output from the workflow
├── resources
│ ├── adapters # Adapter files
│ ├── query # Query sequences
│ └── reads # Sequencing libraries
├── results
│ ├── 00_append # Appended multi-lane libraries
│ ├── 01_trim # Trimmed libraries
│ ├── 02_assemble # Assembled Genomes
│ ├── 03_blastdb # Indexed Blast databases of assemblies
│ ├── 04_search # BLAST search results
│ ├── 05_retrieve # Retreived sequences of BLAST hits
│ ├── 07_combine # Combined retreived sequences
│ └── 08_align # Multiple sequence alignments
└── workflow
├── envs # Specifications for software downloads
└── scripts # Custom Python scripts
You can download this entire repository as a zip folder (clone dropdown menu) or by using GIT commands (if you know what those are).
This workflow requires the installation of miniconda3 and snakemake. It is recommended that mamba is also installed. The workflow will then utilise conda to manage any additional software requirements.
These are the methods, as recommended by the snakemake documentation for installing the software required for this workflow.
Follow the miniconda3 installation instructions. This will install the latest version of the Conda package, environment management software and it's minimal dependancies.
-
Install miniconda3 and not miniconda2
-
Answer YES when asked if conda should be put into and manage your PATH
Install the lastest version of conda, however, if any issues are encountered revert to conda version 4.9.2 used to build this workflow. Conda can manage it's own installation version.
Conda is python based and can sometimes run slow or select older software versions. mamba is a c++ replacement for conda's package management capabilities. Install mamba using conda via the commandline.
conda install -c conda-forge mamba
Again, install the lastest version. If issues are encountered this workflow was created using mamba v0.7.8
Snakemake is a python based workflow management language for reproducible and scalable data analysis. It is recommended to recreate the snakemake environment used to develop this workflow.
To do so, navigate using the commandline, to the top level of the workflow directory (containing this README.md file). Use the following command, subsituting the file path with your own.
cd <path>/<to>/SAHDA/
Recreate the snakemake environment from the environment.yml "env_snakemake_mgw.yml"
file in the config folder.
mamba env create -f ./config/env_snakemake_mgw.yml
Activate the recreated snakemake environment.
conda activate env_snakemake_mgw
In order to return to the base conda environment input the following command. But be sure to re-activate the env_snakemake_mgw before runnnig the workflow.
conda activate base
Paired end files should be added to the resources/reads
directory, and the sample name added to the config file described later.
This workflow will accept paired-end short read sequences with filenames of the following formats.
<sample_name>_<R1/R2>.fastq.gz
<sample_name>_<lane_no>_<R1/R2>.fastq.gz
SRRXXXXXXX_R1.fastq.gz
AAAXXXX_L001_R1.fastq.gz
Note: The workflow only supports samples spread over two lanes. i.e. only L001 and L002 are currently supported.
The adapter file, in .fasta
format, should be added to the resources/adapters
file and the full filename specified in the config/config.yml
file.
Query genes should be added as individual .fasta
files to the resources/query
folder with the format <gene_query>.fasta
.
The workflow configuration file config/config.yml
can be edited to specify parameters for the workflow. Including samples, query genes and trimming options.
Sample names can contain additional underscores, so long are present before
_L00X_RX.fastq.gz
or_RX.fastq.gz
and are specified in the config.
To run the workflow, navigate using the commandline to the top level directory (containing this README.md file).
cd /path/to/SAHDA/
Activate the workflow's snakemake environment.
conda activate env_snakemake_mgw
If the snakemake enviroment cannot be activated, ensure that the environment was initially created during initialisation with the
mamba env create -f ./config/env_snakemake_mgw.yml
command.
Add the data required for the analysis to the appropriate resources folders.
Edit the config.yml file to specify the names of samples and query genes.
Sample names can contain additional underscores, so long are present before
_L00X_RX.fastq.gz
or_RX.fastq.gz
and are specified in the config.
To execute the default workflow run the following command.
snakemake --cores all --use-conda
Note this will use the maximum cores available on your local machine. To specify the number of cores use
--cores N
where N is the maximum number of cores to be assigned.
Snakemake will automatically determine which jobs need to be run to generate all target files. Meaning that output files that are already present will not be regenerated.
To specify a specific set of output files, follow the snakemake command with a target rule to specify what steps to perform. For example:
snakemake --cores all --use-conda target_trim
The following is the list of target rules:
target_trim # Trim fastq files
target_assemble # Assemble denovo genome
target_blastdb # Create local blast database
target_search # Search local database for gene query
target_retrieve # Retrieve identified hit sequences
target_condense # Combine retrieved sequence files
target_align # Align retrieved sequences
For a full list of snakemake parameters see the snakemake documentation.
[--use-conda
] Required This workflow should always be run with with --use-conda
flag to enable each rule to download and utilise a specific conda environment. This ensures a reproducible workflow, but also ensures that required software will not conflict.
[-j --jobs --cores
] Required Specify the maximum number of CPU cores to use in parallel. Specify all
to use all cores available.
[-n --dryrun
] Dry run the Snakefile
with -n
prior to performing an analysis to check for potential errors.
[-p --printshellcmds
] Print the shell commands to be executed to the terminal.
[-r --reason
] Print snakemake's reason for executing each job.
[-R --forcerun
] List the rules that should be re-run. Snakemake does not re-perform analysis if new files are added prior to a step that combines outputs from previous steps. For example if new data has been introduced prior to a multiple alignment. In these instances -R mafft-align
would need to be specified. A full list of snakemake rules can be seen in the Rules List.
--list-input-changes
gives a list of output files that are affected by the inclusion of new data. This can be used to automatically trigger a --forceun
with the following code.
snakemake -cores all --forcerun $(snakemake --list-input-changes) --use-conda
file_append #00 Append .fasta.gz files split over multiple lanes
trimmomatic_trim #01a Trim .fasta.gz files
trimmomatic_trim_lanes #01b Variant rule to trim appended files
spades_assembly #02 Construct contig assembly
blast_mkblastdb #03 Make BLAST database for sample assembly
blast_search #04 Search BLAST database for query gene
blasttogff #05 Retrieve identified BLAST sequences
helper_condense #06 (temp dir) Helper rule to specify which files to condense.
fasta_condense #07 Concatenate retrieved sequences from samples for each query
mafft_align #08 Align retrieved sequences for each query