virID

Viral Identification and Discovery - A viral characterization pipeline built in Nextflow. https://doi.org/10.1182/bloodadvances.2019001260

Pipeline schematic

Quickstart

The only software requirements for running this pipeline are the Conda package manager and Nextflow version 19.10.0 or higher. When the pipeline is first initiated, Nextflow will create a conda virtual environment containing all required additional software, and all processes will run in this virtual environment. The only other things you need to take care of are building the DIAMOND and megablast databases, and changing the executor if the cluster infrastructure you plan to run this pipeline on is not SLURM. These steps are described in more detail below.

  1. Download Nextflow. The following command downloads the Nextflow executable to your current directory. Put it somewhere in your PATH.
    curl -s https://get.nextflow.io | bash

  2. Prepare DIAMOND and megablast databases.

  3. Configure input parameters in nextflow.config.

  4. Change executor in nextflow.config if not using a SLURM cluster.

  5. Run the pipeline: nextflow run virID.nf --reads "path/to/reads/*fastq" --out_dir "path/to/output/dir". If you need to restart the pipeline for any reason, add the -resume switch and any finished processes won't be rerun!
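
For example, a restarted run might look like this (paths are hypothetical):

    nextflow run virID.nf --reads "raw_reads/*fastq" --out_dir "output" -resume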

Description

The purpose of virID is to assemble and classify microorganisms present in next-generation sequencing data. This pipeline can be run in "assembly mode" or "read mode", as determined by the assembly_pipeline and reads_pipeline parameters. The main difference is whether contigs are assembled with SPAdes or the reads are queried directly without assembly. The pipeline steps are listed below, but note that many steps will run in parallel.

Assembly Mode:

  1. Input fastqs are split into a paired file containing interleaved paired reads, and an unpaired file containing unpaired reads. Thus, each sample should be added to this pipeline as a single fastq containing both paired and unpaired reads.
  2. Reads are assembled with the SPAdes assembler - SPAdes, metaSPAdes, or rnaSPAdes can be specified.
  3. bwa mem is used to map the input reads back to the assembled contigs (see the sketch after this list).
  4. The DIAMOND and megablast aligners are used for taxonomic assignment of assembled contigs.
  5. DIAMOND and megablast output files are translated to a taxonomic output following last-common-ancestor (LCA) calculation for each query contig.
  6. Contigs are queried with megablast against a nonredundant database of common cloning vectors. Contigs that are assigned to these sequences are flagged.
  7. DIAMOND and megablast taxonomic outputs and contig count information are merged to a comprehensive taxonomic output, and unassigned contigs are flagged. Counts outputs, one for each DIAMOND and megablast, are generated which display the counts at every taxonomic level. There is a section below that describes the output files in more detail.
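
As an illustration of the read-mapping step (step 3), a minimal standalone sketch might look like the following; filenames are hypothetical, and the pipeline's actual flags may differ:

    # index the contigs, map the interleaved paired reads back, then sort and index the alignment
    bwa index contigs.fasta
    bwa mem -p contigs.fasta paired_interleaved.fastq | samtools sort -o mapped.bam -
    samtools index mapped.bam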

Read Mode:

  1. Input fastqs are split into a paired file containing interleaved paired reads, and an unpaired file containing unpaired reads. Thus, each sample should be added to this pipeline as a single fastq containing both paired and unpaired reads.
  2. Reads are converted to .fasta format (see the one-liner after this list).
  3. Each read is queried by megablast and DIAMOND.
  4. The megablast and DIAMOND output files are translated to a taxonomic output following last-common-ancestor (LCA) calculation for each query read.
  5. A counts output file that lists the number of reads assigned to each taxon at every level is generated from the translated megablast and translated DIAMOND output files. These outfiles are described in depth later in this readme.
  6. Reads are queried with megablast against a nonredundant database of common cloning vectors. Reads that are assigned to these sequences are marked in an output file.
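
The fastq-to-fasta conversion in step 2 can be reproduced on its own with seqtk, one of the pipeline's dependencies; filenames are hypothetical:

    # convert fastq reads to fasta
    seqtk seq -a reads.fastq > reads.fasta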

While this pipeline is organized in Nextflow, every process is capable of being used and tested independently of Nextflow as a bash or python script. Execute any python script (with -h) or bash script (without arguments) for detailed instructions on how to run them. The conda environment with all dependencies can be installed with conda env create -f resources/virID_environment.yml.
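
For example (the environment name and script path below are assumptions, not confirmed by this README):

    conda env create -f resources/virID_environment.yml
    conda activate virID          # environment name is an assumption
    python bin/get_LCA.py -h      # prints usage; script location is an assumption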

Prepare databases

This workflow requires you to generate two databases: 1) a DIAMOND protein database for search, and 2) a nucleotide blast database for search. You should work inside the pipeline conda environment (installed with conda env create -f resources/virID_environment.yml) so that you are making the DIAMOND and blast databases with the same versions of DIAMOND and blast used by this pipeline.

A DIAMOND database can be generated by following the DIAMOND manual. I recommend generating a database from the RefSeq nonredundant protein fastas available at ftp://ftp.ncbi.nlm.nih.gov/refseq/release/. When making the DIAMOND database with diamond makedb you must use the --taxonmap and --taxonnodes switches:

    diamond makedb \
      --in <FASTA> \
      -d <path to output database> \
      --taxonmap <path to prot.accession2taxid.gz from ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/prot.accession2taxid.gz> \
      --taxonnodes <path to nodes.dmp from ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdmp.zip>

For the blast search, you must download the BLAST v5 nucleotide database using the same version of blast+ used in this pipeline:

    update_blastdb.pl --blastdb_version 5 nt_v5 --decompress

There is sometimes a bug where this database will not work - you may have to delete the last line of nt_v5.nal, which specifies a volume of the blast database that is not actually present.
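
One way to drop that last line, assuming GNU sed, is:

    sed -i '$ d' nt_v5.nal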

In addition to the blast and DIAMOND databases you need to generate, a nucleotide contaminant blast database is included with this pipeline. This database was generated from UniVec_Core, a nonredundant list of common laboratory cloning vectors, plus some additional vectors I have added. Note: you need to specify the path to this database in nextflow.config.

Finally, the get_LCA.py script uses ETE3 to look up taxonomy information from a downloaded taxonomy database. When this script is first executed, it will automatically download a ~300MB taxonomy database to ~/.etetoolkit/taxa.sqlite. One thing to note is that if your home directory is somewhere with slow I/O, you should make a symbolic link to somewhere faster prior to downloading the database. In my case, I make a symbolic link from my home directory to my cluster's scratch directory. For reference, get_LCA.py should run in about 40s on a BLAST output file of 500K lines.
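
For example, before the first run of get_LCA.py (the target directory is hypothetical):

    mkdir -p /faster/directory/etetoolkit
    ln -s /faster/directory/etetoolkit ~/.etetoolkit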

Configure executor and resources

Executor: The Nextflow executor, explained in the Nextflow documentation, dictates how Nextflow will run each process. virID is currently set up to use a SLURM cluster, but you can easily change this by altering the executor in nextflow.config; Nextflow takes care of all cluster submissions and automatically parallelizes everything. If you are using a different cluster infrastructure, change the "executor" value from 'slurm' to the appropriate infrastructure. In addition, each Nextflow module (in bin/modules/) contains a beforeScript line that dictates code to run prior to running the module. Here, I have added module load gcc conda2, which loads the GCC compiler and the conda package manager on my SLURM cluster. If this is not relevant to you, remove this code.
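
A minimal sketch of what this could look like in nextflow.config (values are placeholders, not the shipped defaults, and in this repo beforeScript actually lives in the module files):

    process {
        executor = 'slurm'                        // change to your scheduler, e.g. 'sge' or 'local'
        beforeScript = 'module load gcc conda2'   // remove or adapt for your cluster
    }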

Resources: I have set up virID with dynamic resources! Each process will request a different amount of resources depending on the size of the input files. In addition, upon failure a process will restart with increased resources, up to three times (configurable with the maxRetries setting in nextflow.config).
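
The shape of that retry behaviour in Nextflow config looks roughly like this (a sketch, not the pipeline's exact values):

    process {
        errorStrategy = 'retry'
        maxRetries = 3
        memory = { 4.GB * task.attempt }   // hypothetical: request more memory on each retry
    }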

Inputs and parameters

In general, all input values and parameters for this pipeline must be entered in the nextflow.config file. Most of the parameters are already filled out for you, but you may want to change some of them.

Input and output

params.out_dir: Desired output directory. "output"
params.reads: A glob detailing the location of reads to be processed through this pipeline. NOTE: all input reads for one sample should be in a single fastq containing both paired and unpaired reads; this script will split it into an interleaved (paired) fastq and a separate unpaired fastq. A value of "raw_reads/*fastq" is an appropriate input for this parameter, and will select as input all fastq files in the directory raw_reads. Make sure to wrap the glob in quotes! "raw_data/*fastq"

Specify workflow strategy

params.assembly_pipeline: Options are "T" or "F". If marked "T", the pipeline will run in assembly mode. NOTE - only one of params.assembly_pipeline and params.reads_pipeline may be marked "T". "T"
params.reads_pipeline: Options are "T" or "F". If marked "T", the pipeline will run in read mode without assembly. NOTE - only one of params.assembly_pipeline and params.reads_pipeline may be marked "T". "F"
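
In nextflow.config, selecting assembly mode therefore looks like this (a sketch):

    params.assembly_pipeline = "T"
    params.reads_pipeline    = "F"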

Reads pipeline specific parameters

params.reads_pipeline_no_diamond: Options are "T" or "F". If marked "T", and using the reads pipeline, diamond will not be run - only blast. "F"
params.reads_pipeline_no_blast: Options are "T" or "F". If marked "T", and using the reads pipeline, blast will not be run - only diamond. "F"

Conda

params.conda_env_location: Location where the conda virtual environment will be saved. Change this to somewhere convenient for you; it lets you avoid downloading the conda environment multiple times.

SPAdes

params.spades_type: Options are 'meta', 'rna', and 'regular'. This specifies whether to use metaspades.py, rnaspades.py, or spades.py. "meta"
params.temp_dir: This specifies the location of the SPAdes temporary directory. "temp"
params.spades_min_length: The minimum contig length that is output from the SPAdes process. Contigs below this length will be filtered out. 300

DIAMOND

params.diamond_database: Path to the diamond database. Wrap all paths in quotes!
params.diamond_evalue: The maximum evalue of a match to be reported as an alignment by DIAMOND. In general I am a fan of setting this at 10, which is quite high. Lower values, such as 0.001, are more stringent and have fewer false positives. "10"
params.diamond_outfmt: This dictates the output format from DIAMOND. This is fairly flexible, but generally staxids and bitscore are required. I recommend leaving it at default, though you can add to it no problem. "6 qseqid stitle sseqid staxids evalue bitscore pident length"
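
For reference, a standalone DIAMOND search using these parameters might look like this (paths are hypothetical; --top 1 mirrors the pipeline's setting described in the conversion section below):

    diamond blastx \
      -d /path/to/diamond_database \
      -q contigs.fasta \
      -e 10 \
      --top 1 \
      -f 6 qseqid stitle sseqid staxids evalue bitscore pident length \
      -o diamond_out.tsv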

BLAST

params.blast_database: Path to the blast nt database. Wrap all paths in quotes!
params.blast_evalue: The maximum evalue of a match to be reported as an alignment by blast. In general I am a fan of setting this at 10, which is quite high. Lower values, such as 0.001, are more stringent and have fewer false positives. 10
params.blast_type: This controls the 'task' switch of blast, and specifies whether to run 'megablast', 'dc-megablast', or 'blastn'. I recommend 'megablast', else it will be quite slow. megablast
params.blast_outfmt: This dictates the output format from BLAST. This is fairly flexible, but generally staxid and bitscore are required. I recommend leaving it at default, though you can add to it no problem. "6 qseqid stitle sseqid staxid evalue bitscore pident length"
params.blast_log_file: This is a file that will contain log information from the blast bash script. "blast.log"
params.blast_max_hsphs: This is the maximum number of alignments per query-subject pair. I recommend setting this at 1. 1
params.blast_max_targets: This is the maximum number of separate hits that are allowed for each query. 30
params.blast_restrict_to_taxids: This parameter lets you limit the blast search to a specified taxonID. This causes an extreme speedup, and is helpful for testing this pipeline. Not compatible with params.blast_ignore_taxids. "no"
params.blast_ignore_taxids: This parameter lets you ignore all hits of a particular taxonID. "no"
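
Taken together, a standalone megablast call with these parameters might look like this (paths are hypothetical):

    blastn -task megablast \
      -db /path/to/nt_v5 \
      -query contigs.fasta \
      -evalue 10 \
      -max_hsps 1 \
      -max_target_seqs 30 \
      -outfmt "6 qseqid stitle sseqid staxid evalue bitscore pident length" \
      -out blast_out.tsv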

BLAST/DIAMOND conversion

params.within_percent_of_top_score: When finding the LCA of all matches for a given query sequence, this details how close to the maximum bitscore a match must be to be considered in the LCA classification. If this is set at 1, for example, all potential alignments within 1 percent of the highest bitscore for a query sequence will be considered in the LCA classification. NOTE: This is limited intrinsically by the DIAMOND -top parameter, which is set at 1. Thus, DIAMOND will only output assignments within 1% of the top bitscore anyway. I will add a switch to change the DIAMOND -top parameter in a future release. 1
params.taxid_blacklist: Path to a file containing taxonIDs to be blacklisted. I have included a file in this github repository. Assignments containing one of these taxonIDs will be discarded before LCA calculation. "$VID/resources/2019-08-09_blacklist.tsv"
params.diamond_readable_colnames: These are the more-readable column names that will be reported in the output from DIAMOND. If you change the outfmt, change this line accordingly. "query_ID seq_title seq_ID taxonID evalue bitscore pident length"
params.blast_readable_colnames: These are the more-readable column names that will be reported in the output from BLAST. If you change the outfmt, change this line accordingly. "query_ID seq_title seq_ID taxonID evalue bitscore pident length"
params.taxonomy_column: This details which of the colnames contains the taxonID. "taxonID"
params.score_column: This details which of the colnames should be used for calculating the top score. I use bitscore, but you could technically set this as pident or length or evalue to sort by one of those parameters instead. "bitscore"

Contaminant blast search

params.blast_contaminant_database: Path to the contaminant database. This database is included at resources/vector_contaminant_database, but you need to specify the path here!
params.blast_contaminant_evalue: Matches with evalues above this will not be reported. I recommend setting this to be pretty stringent. 0.001
params.blast_contaminant_outfmt: Output format. The only critical requirement is keeping qseqid as the first column. "6 qseqid stitle sseqid staxid evalue bitscore pident length"
params.blast_contaminant_log_file: Path to the log file to be used for this step. "blast_contaminant.log"
params.blast_contaminant_type: The blast type. I recommend megablast, because that is sensitive enough for this. megablast
params.blast_contaminant_max_hsphs: The maximum number of alignments per query-subject pair. 1
params.blast_contaminant_max_targets: The maximum number of separate hits allowed for each query. 1

Description of output files

  1. Each process will copy its output files to params.out_dir. You can disable this behavior by removing the publishDir line from each module file.
  2. The out_dir/results directory will contain the main pipeline outputs:

sampleID_merged.tsv: This is a tab-delimited file containing assignment information for each contig. Each contig takes up two lines, where one line details the BLAST assignment information and the other details the DIAMOND assignment information. Columns from seq_title to length are comma-delimited lists detailing information from all DIAMOND or blast hits that were considered in the LCA calculation. Starting at LCA_taxonID, the taxonomic information is that of the LCA of these hits.

Column names include:
query_ID: Contig name
seq_title: Title of hits (comma-delimited list)
seq_ID: Accession IDs of hits (comma-delimited list)
taxonID: taxonID of hits (comma-delimited list)
evalue: Evalues of hits (comma-delimited list)
bitscore: Bitscores of hits (comma-delimited list)
pident: Percent identity of each hit (comma-delimited list)
length: Length of the alignment for each hit (comma-delimited list)
LCA_taxonID: The taxonID of the LCA of the considered hits.
superkingdom: LCA superkingdom
kingdom: LCA kingdom
phylum: LCA phylum
class: LCA class
order: LCA order
family: LCA family
genus: LCA genus
species: LCA species
strain: LCA strain
read_count: Number of input reads that map to the contig.
average_fold: Average fold coverage of the input reads on the contig.
covered_percent: Percentage of the contig covered by input reads.
potential_contaminant: 1 if assigned to the contaminant vector database, 0 otherwise. (Column excluded if no contaminants found).

sampleID_blast_counts.tsv and sampleID_diamond_counts.tsv: Here, the read_count for each contig is distributed to each taxonomic level of the LCA of that contig. For example, if a contig has an LCA of sk__superkingdom/k__kingdom/f__polyomaviridae and has a read_count of 10, 10 reads are assigned to each of f__polyomaviridae, k__kingdom, and sk__superkingdom. These outputs detail the results based on the blast or DIAMOND assignments, respectively. Note that if a contig is assigned by megablast to the vector contaminant database, the counts from that contig will NOT be included in this output. The columns include:
taxonID: The taxonID.
lineage: The name of each taxon in the lineage of the taxonID.
superkingdom: The superkingdom of the taxonID.
taxon: The name of the taxon.
level: The level of the taxon (i.e. kingdom, family, etc).
count: The total number of reads assigned to that taxon.

Citation

Please cite https://doi.org/10.1182/bloodadvances.2019001260

This pipeline is nothing without the excellent programs virID makes use of. Please cite these programs in addition to virID if you use virID in any publication! I'll list them below, in no particular order:
SPAdes
seqtk
BWA
Samtools
bbtools
Blast+
DIAMOND
ETE3
Anytree