ASPIRE (ASsembly Pipeline with Iterative REfinement) is a pipeline for constructing virus-sized genomes out of NGS reads (short reads).
This is release 1.0. Copyright © 2021 Lee Sau Dan (sdlee@cse.cuhk.edu.hk)
See LICENSE.
ASPIRE depends on the following software:
- Perl 5 (recommended: v5.32.1 or later), with the following modules:
  - App::Cmd::Setup
  - Bio::DB::Sam
  - Bio::Seq
  - Bio::SeqIO
  - Cwd
  - File::Path
  - File::Slurp
  - File::Spec
  - IPC::Run
  - List::Util
  - Math::Round
  - Statistics::Descriptive::Full
- samtools (v1.13 or later)
- bcftools (v1.13 or later)
- cutadapt (v3.4 or later)
- SPAdes (v3.13.1 or later)
- alternative to SPAdes: SGA (v0.10.15 or later)
- MUMmer (v3.23 or later; version 4.x also supported)
- Bowtie2 (v2.4.2 or later)
- BWA (v0.7.17 or later)
- GapFiller (v1-10 or later; PubMed: 22731987)
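
Before running the pipeline, it can help to confirm that the tools listed above are reachable. The following is a minimal sketch and not part of ASPIRE: the helper names `missing_tools` and `missing_perl_modules` are hypothetical, and the executable names (for example `spades.py` for SPAdes and `nucmer` for MUMmer) are assumptions about a typical installation. Only a subset of the tools and modules is checked. Each name that cannot be found is printed on its own line.

```shell
#!/bin/sh
# Print every command from the argument list that is not found on $PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Print every Perl module from the argument list that fails to load.
missing_perl_modules() {
    for module in "$@"; do
        perl -M"$module" -e 1 >/dev/null 2>&1 || echo "$module"
    done
}

# Executable names are assumptions; adjust them to your installation.
missing_tools perl samtools bcftools cutadapt spades.py nucmer bowtie2 bwa
missing_perl_modules Bio::DB::Sam Bio::SeqIO File::Slurp IPC::Run Math::Round
```

No output means that everything in the two lists was found.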
No installation is needed.
The file aspire is a Perl script.
Just copy it to a directory listed in `$PATH`
and give it execute permission (`chmod +x aspire`).
Then you can run it with the `aspire` command.
Alternatively, you may invoke it via `perl aspire`.
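
The copy-and-chmod step can be wrapped in a small helper. This is only a sketch: the function name `install_aspire` and the destination directory in the example are hypothetical. It also warns when the chosen directory is not listed in `$PATH`, since the `aspire` command would then not be found.

```shell
#!/bin/sh
# Copy the aspire script into a directory and make it executable.
# Warn if that directory is not listed in $PATH.
install_aspire() {
    src=$1
    dest_dir=$2
    mkdir -p "$dest_dir" &&
    cp "$src" "$dest_dir/aspire" &&
    chmod +x "$dest_dir/aspire" || return 1
    case ":$PATH:" in
        *":$dest_dir:"*) ;;
        *) echo "warning: $dest_dir is not in \$PATH" >&2 ;;
    esac
}

# Example (the destination directory is an assumption):
#   install_aspire ./aspire "$HOME/bin"
```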
`aspire` is a command-line tool with a dozen subcommands.
Running `aspire --help` displays a copyright message
and a list of global options and commands.
The first non-option argument is treated as a subcommand,
which invokes the corresponding part of the ASPIRE pipeline.
Each subcommand has its own help page, which can be obtained with `aspire --help <subcommand>`.
Output of `aspire --help`:

    ASPIRE 1.0 --- ASsembly Pipeline with Iterative REfinement
    Copyright (C) 2021 LEE Sau Dan <sdlee@cse.cuhk.edu.hk>
    This program comes with ABSOLUTELY NO WARRANTY. This is free
    software, and you are welcome to redistribute it under certain
    conditions; see LICENSE, or see <https://www.gnu.org/licenses/>
    aspire [-jmt] [long options...] <command> [opts ...] <args> ...
        -j STR --job-directory STR  job directory
        -t INT --threads INT        max number of threads
                                    (default value: 4)
        -m INT --memory INT         max memory to use (GB)
                                    (default value: 64)
    Available commands:
      commands: list the application's commands
          help: display a command's help screen
         align: Align trimmed reads to constructed genome and compute statistics.
      assemble: Denovo assembly of trimmed reads.
       correct: Perform the 'correcting' step of the iterative loop.
          fill: Perform the 'gap-filling' step of the iterative loop.
           new: Create a new job.
        result: Extract constructed genome.
           run: Run the ASPIRE pipeline.
      run-pass: Run a single iteration of the iterative loop.
         stats: Statistics gathered from the 'align' command.
          tile: Perform the 'tiling' step of the iterative loop.
          trim: Trim input raw reads.
       version: display an app's version
          wrap: Try to wrap a gap-filled genome around (for circular genomes)

Suppose you start from gzipped FASTQ files,
`reads_1.fastq.gz` and `reads_2.fastq.gz`,
containing paired-end WGS reads for a sample.
You would like to construct the virus genome for this sample,
based on a reference genome for the virus in the file `virus_ref.fasta`.
You begin with the `new` subcommand:

    aspire --job-dir job1 new virus_ref.fasta reads_1.fastq.gz reads_2.fastq.gz

This creates a new directory `job1` and
initializes it with symbolic links pointing to the given files.
Next, use the `run` command to run the ASPIRE pipeline:

    aspire --job-dir job1 run 2

Here, the argument `2` tells ASPIRE to stop after 2 iterative refinements.
The `run` command does many things:
- It invokes the `trim` subcommand to trim the reads and remove adapters.
- Then it invokes the `assemble` command to construct scaffolds out of the trimmed reads by de novo assembly. By default, SPAdes is used. You may use the `--sga` option of the `run` subcommand to override this default and use SGA instead.
- Next, it invokes the `run-pass` command twice (or however many times were specified) with the appropriate arguments. The `run-pass` command, in turn, invokes the `tile`, `correct` and `fill` commands to carry out one iterative refinement pass.
- If the `run` command was given the `--wrap` option, then an additional round of refinement is done by invoking the `wrap` subcommand, followed by `correct` and `fill`.
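
The order of steps described in the list above can be summarized with a small dry-run sketch. It does not call ASPIRE and shows no real arguments; the function name `run_plan` and its `wrap` argument are hypothetical and only illustrate which subcommands follow one another.

```shell
#!/bin/sh
# Print the sequence of subcommands that `aspire run N` goes through,
# following the description above. Subcommand names only; the real
# arguments passed between the steps are not shown.
run_plan() {
    passes=$1
    wrap=${2:-no}    # pass "wrap" to mimic the --wrap option
    echo "trim"
    echo "assemble"
    i=1
    while [ "$i" -le "$passes" ]; do
        echo "run-pass $i: tile correct fill"
        i=$((i + 1))
    done
    if [ "$wrap" = "wrap" ]; then
        echo "wrap correct fill"
    fi
}

run_plan 2 wrap
```

For `run_plan 2 wrap`, this lists `trim`, `assemble`, two `run-pass` lines, and a final `wrap correct fill` line.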
Finally, to retrieve the constructed virus genome,
use the `result` subcommand:

    aspire --job-dir job1 result --id vg01 --description "My new genome" > vg01.fasta

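
The three steps of this walkthrough (`new`, `run`, `result`) can be chained so that a failure stops the sequence early. The following is a sketch under stated assumptions: the function name `construct_genome` and the `ASPIRE` variable are hypothetical, and `--id` is assumed to be an option of the `result` subcommand, placed after it.

```shell
#!/bin/sh
# Command used to start ASPIRE; override with e.g. ASPIRE="perl ./aspire".
ASPIRE=${ASPIRE:-aspire}

# Usage: construct_genome JOB_DIR REF_FASTA READS_1 READS_2 PASSES ID
# Writes the constructed genome to ID.fasta and stops at the first failure.
# Assumption: --id is an option of the result subcommand.
construct_genome() {
    job=$1 ref=$2 reads1=$3 reads2=$4 passes=$5 id=$6
    $ASPIRE --job-dir "$job" new "$ref" "$reads1" "$reads2" &&
    $ASPIRE --job-dir "$job" run "$passes" &&
    $ASPIRE --job-dir "$job" result --id "$id" > "$id.fasta"
}

# Example:
#   construct_genome job1 virus_ref.fasta reads_1.fastq.gz reads_2.fastq.gz 2 vg01
```

Setting `ASPIRE=echo` before calling the function prints the commands instead of running them, which is a cheap way to check the arguments.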
Note that the subcommands `run` and `run-pass` are simply convenience
wrappers for invoking many other subcommands.
If a step in ASPIRE fails,
you may manually fix the problems
and continue from where it broke
by invoking the other subcommands manually.
After constructing the new genome, you may want to gather various statistics, such as alignment rates and mapping qualities. This is done in 2 steps.
- Use the `align` command to align the reads to the constructed genomes:

      aspire align --pass all

- Use the `stats` command to gather the results:

      aspire stats alignment-rates > aln-rates.tsv
      aspire stats mapq > mapq.tsv

That's all folks.
2021-09-12 Lee Sau Dan