ASPIRE

ASPIRE (ASsembly Pipeline with Iterative REfinement) is a pipeline for constructing virus-sized genomes out of NGS reads (short reads).

License

See LICENSE.

Prerequisites

Perl 5 (recommended: v5.32.1 or later) with the following modules:
- App::Cmd::Setup
- Bio::DB::Sam
- Bio::Seq
- Bio::SeqIO
- Cwd
- File::Path
- File::Slurp
- File::Spec
- IPC::Run
- List::Util
- Math::Round
- Statistics::Descriptive::Full
samtools (v1.13 or later)
bcftools (v1.13 or later)
cutadapt (v3.4 or later)
SPAdes (v3.13.1 or later)
- alternative to SPAdes: SGA (v0.10.15 or later)
MUMmer (v3.23 or later; version 4.x also supported)
Bowtie2 (v2.4.2 or later)
BWA (v0.7.17 or later)
GapFiller (v1-10 or later; PubMed: 22731987)
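
A quick way to see which prerequisites are still missing is a small shell check like the sketch below. The executable names (spades.py for SPAdes, nucmer for MUMmer, and so on) are assumptions about a typical installation, not something ASPIRE documents. GapFiller and SGA are left out because their executable names depend on how they were installed. The function names are hypothetical. Statistics::Descriptive::Full is checked by loading Statistics::Descriptive, and the core modules (Cwd, File::Path, File::Spec, List::Util) are skipped.

```shell
#!/bin/sh
# Sketch: list prerequisites that cannot be found on this machine.
# Executable names below are assumptions; adjust them to your installation.

# Print every command name given as an argument that is not on $PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Print every Perl module given as an argument that fails to load.
# "perl -mModule" loads the module without calling its import method.
missing_perl_modules() {
    for mod in "$@"; do
        perl -m"$mod" -e 1 >/dev/null 2>&1 || printf '%s\n' "$mod"
    done
}

missing_tools perl samtools bcftools cutadapt spades.py nucmer bowtie2 bwa
missing_perl_modules App::Cmd::Setup Bio::DB::Sam Bio::Seq Bio::SeqIO \
    File::Slurp IPC::Run Math::Round Statistics::Descriptive
```

An empty output means everything that was checked could be found. The sketch does not verify version numbers.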

Installation

No installation is needed. The file aspire is a Perl script. Just copy it to a directory listed in $PATH and give it execute permission (chmod +x aspire). You can then run it with the aspire command. Alternatively, you may invoke it via: perl aspire.

Running ASPIRE

aspire is a command-line tool with about a dozen subcommands. Running aspire --help displays a copyright message, the global options and the list of commands. The first non-option argument is taken as the subcommand, which selects the part of the ASPIRE pipeline to run. Each subcommand has its own help page, which can be obtained with:

aspire --help <subcommand>

For reference, the output of aspire --help is:

ASPIRE 1.0 --- ASsembly Pipeline with Iterative REfinement
Copyright (C) 2021 LEE Sau Dan <sdlee@cse.cuhk.edu.hk>

This program comes with ABSOLUTELY NO WARRANTY. This is free
software, and you are welcome to redistribute it under certain
conditions; see LICENSE, or see <https://www.gnu.org/licenses/>

aspire [-jmt] [long options...] <command> [opts ...] <args> ...
        -j STR --job-directory STR  job directory
        -t INT --threads INT        max number of threads
                                    (default value: 4)
        -m INT --memory INT         max memory to use (GB)
                                    (default value: 64)

Available commands:

  commands: list the application's commands
      help: display a command's help screen

     align: Align trimmed reads to constructed genome and compute statistics.
  assemble: Denovo assembly of trimmed reads.
   correct: Perform the 'correcting' step of the iterative loop.
      fill: Perform the 'gap-filling' step of the iterative loop.
       new: Create a new job.
    result: Extract constructed genome.
       run: Run the ASPIRE pipeline.
  run-pass: Run a single iteration of the iterative loop.
     stats: Statistics gathered from the 'align' command.
      tile: Perform the 'tiling' step of the iterative loop.
      trim: Trim input raw reads.
   version: display an app's version
      wrap: Try to wrap a gap-filled genome around (for circular genomes)

Suppose you start from gzipped FASTQ files, reads_1.fastq.gz and reads_2.fastq.gz, containing paired-end WGS reads for a sample. You would like to construct the virus genome for this sample, based on a reference genome for the virus in the file virus_ref.fasta. You begin with the new subcommand:

aspire --job-dir job1 new virus_ref.fasta reads_1.fastq.gz reads_2.fastq.gz

This creates a new directory job1 and initializes it with symbolic links pointing to the given files.

Next, use the run command to run the ASPIRE pipeline:

aspire --job-dir job1 run 2

Here, the argument 2 tells ASPIRE to stop after 2 iterative refinements. The run command does many things:

It invokes the trim subcommand to trim the reads to remove adapters.
Then it invokes the assemble command to construct scaffolds out of the trimmed reads by de novo assembly. By default, SPAdes is used. You may use the --sga option of the run subcommand to override this default and use SGA instead.
Next, it invokes the run-pass command twice (or as many times as specified) with the appropriate arguments. The run-pass command, in turn, invokes the tile, correct and fill commands to carry out one iterative refinement pass.
If the run command was given the --wrap option, an additional round of refinement is done by invoking the wrap subcommand, followed by correct and fill.

Finally, to retrieve the constructed virus genome, use the result subcommand:

aspire --job-dir job1 result --id vg01 --description "My new genome" > vg01.fasta

Note that run and run-pass are simply convenience wrappers that invoke other subcommands in sequence. If a step in ASPIRE fails, you may fix the problem and continue from where it broke by invoking the remaining subcommands manually.
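
The new, run and result steps described above can be chained in a small shell function, sketched below. The function name build_genome and the ASPIRE variable are hypothetical. The argument order for new (reference first, then the two read files) is an assumption. The chain stops at the first failing step, so the job directory is left in place for manual repair.

```shell
#!/bin/sh
# Sketch: run the documented new -> run -> result sequence for one sample.
# ASPIRE names the program to call; set ASPIRE=echo to print the commands
# instead of running them (a dry run).
ASPIRE="${ASPIRE:-aspire}"

# usage: build_genome JOB_DIR REF_FASTA READS_1 READS_2 PASSES ID DESCRIPTION OUT_FASTA
build_genome() {
    job=$1 ref=$2 r1=$3 r2=$4 passes=$5 id=$6 desc=$7 out=$8
    # Each step runs only if the previous one succeeded.
    "$ASPIRE" --job-dir "$job" new "$ref" "$r1" "$r2" &&
    "$ASPIRE" --job-dir "$job" run "$passes" &&
    "$ASPIRE" --job-dir "$job" result --id "$id" --description "$desc" > "$out"
}

# Example call, matching the walkthrough above:
# build_genome job1 virus_ref.fasta reads_1.fastq.gz reads_2.fastq.gz 2 \
#     vg01 "My new genome" vg01.fasta
```

Only the output of result is redirected to the FASTA file, so anything the earlier steps print stays on the terminal.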

Collecting statistics

After constructing the new genome, you may want to gather various statistics, such as alignment rates and mapping qualities. This is done in 2 steps.

Use the align command to align the reads to the constructed genomes:
```
 aspire align --pass all
```

Use the stats command to gather the results:

```
 aspire stats alignment-rates > aln-rates.tsv
 aspire stats mapq > mapq.tsv
```
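
The two steps can be bundled into one function, sketched below. The name collect_stats and the ASPIRE variable are hypothetical. The commands are reproduced as shown above, without a job-directory option. If your job lives elsewhere, adding the global -j option to each call is probably what you want, but that is an assumption.

```shell
#!/bin/sh
# Sketch: align the reads, then write both statistics tables into one directory.
# ASPIRE names the program to call; set ASPIRE=echo for a dry run.
ASPIRE="${ASPIRE:-aspire}"

# usage: collect_stats OUT_DIR
collect_stats() {
    out=$1
    mkdir -p "$out" &&
    "$ASPIRE" align --pass all &&
    "$ASPIRE" stats alignment-rates > "$out/aln-rates.tsv" &&
    "$ASPIRE" stats mapq > "$out/mapq.tsv"
}

# Example call:
# collect_stats stats
```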

That's all folks.
2021-09-12 Lee Sau Dan