Darth

Darth is a container for running VADR and other tools to annotate novel coronavirus genomes that have no near neighbor.

Installation

Note: For VADR to run smoothly, give the container at least 64 GB of virtual memory (RAM plus swap). If you don't have that much RAM to spare, consider creating a swapfile on SSD-based instance storage.
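
As a rough illustration of the note above, the following sketch reports how much extra swap a Linux host would need to reach 64 GiB of RAM plus swap. The function name swap_shortfall_gib and the path /mnt/ssd/swapfile are hypothetical, and the printed commands are only one common way to add swap; they are not taken from the Darth repository.

```shell
#!/bin/sh
# Target from the note above: 64 GiB of virtual memory (RAM + swap), in kB.
TARGET_KB=$((64 * 1024 * 1024))

# Print how many whole GiB of extra swap are needed, given the current
# RAM total ($1) and swap total ($2), both in kB. Hypothetical helper.
swap_shortfall_gib() {
    have_kb=$(( $1 + $2 ))
    if [ "$have_kb" -ge "$TARGET_KB" ]; then
        echo 0
    else
        # Round up to the next whole GiB (1 GiB = 1048576 kB).
        echo $(( (TARGET_KB - have_kb + 1048575) / 1048576 ))
    fi
}

# Read the current totals on Linux; stay at 0 where /proc/meminfo is absent.
mem_kb=0
swap_kb=0
if [ -r /proc/meminfo ]; then
    mem_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
    swap_kb=$(awk '/^SwapTotal:/ {print $2}' /proc/meminfo)
fi

need=$(swap_shortfall_gib "$mem_kb" "$swap_kb")
if [ "$need" -gt 0 ]; then
    # /mnt/ssd/swapfile is a hypothetical location on SSD-backed storage.
    echo "Short by ${need} GiB. One way to add swap (run as root):"
    echo "  fallocate -l ${need}G /mnt/ssd/swapfile && chmod 600 /mnt/ssd/swapfile"
    echo "  mkswap /mnt/ssd/swapfile && swapon /mnt/ssd/swapfile"
else
    echo "At least 64 GiB of RAM plus swap is already available."
fi
```

The script only prints the suggested commands and never changes the system. On filesystems where fallocate-created swapfiles are not supported, writing the file with dd is the usual alternative.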

You can fetch the image from DockerHub (for example, with docker pull): taltman/darth:maul

Running

For an example of running Darth against Frankie, run the following:

make test-docker-frankie

Arguments to darth.sh

SRA accession
Path to input genome FASTA file
Path to a single (compressed) FASTQ file containing all of the reads for the SRA accession, or "none" if there are no reads
Data directory (leave this as "/root/data")
Top-level output directory path. VADR creates its own output directory inside this one, so this directory must already exist. Docker mounts it so that the image can access it in read/write mode. In the Makefile example, this is also where the genome and FASTQ files are placed.
Number of CPUs that the programs within Darth may use
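
The constraints in the argument list above can be checked before a long run starts. The sketch below is a hypothetical pre-flight helper, not part of Darth. It assumes the six arguments are positional and given in the order listed, and that it runs in the same environment as darth.sh so that the paths resolve. The function name check_darth_args is invented for illustration.

```shell
#!/bin/sh
# Hypothetical pre-flight check for the six darth.sh arguments listed above.
# Assumption: the arguments are positional and in the listed order.
check_darth_args() {
    [ "$#" -eq 6 ] || { echo "expected 6 arguments, got $#" >&2; return 1; }
    accession=$1
    fasta=$2
    fastq=$3
    data_dir=$4
    out_dir=$5
    cpus=$6

    [ -f "$fasta" ] || { echo "genome FASTA not found: $fasta" >&2; return 1; }

    # "none" is the documented placeholder when there are no reads.
    [ "$fastq" = "none" ] || [ -f "$fastq" ] || {
        echo "FASTQ not found: $fastq" >&2; return 1; }

    # The output directory must already exist; VADR creates its own
    # output directory inside it.
    [ -d "$out_dir" ] || { echo "output directory does not exist: $out_dir" >&2; return 1; }

    # The CPU count must be a positive integer.
    case $cpus in
        ''|*[!0-9]*) echo "CPU count must be a positive integer: $cpus" >&2; return 1 ;;
    esac
    [ "$cpus" -gt 0 ] || { echo "CPU count must be greater than zero" >&2; return 1; }

    # Print the command line that would be run; nothing is executed here.
    echo "darth.sh $accession $fasta $fastq $data_dir $out_dir $cpus"
}
```

A call such as check_darth_args SRR0000000 /out/genome.fna none /root/data /out 4 (the accession and paths are placeholders) prints the assembled command line on success and returns a non-zero status with a message on the first failed check. The Makefile remains the authoritative example of how the container is actually invoked.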

Arguments to canonicalize_contigs.sh

Path to input genome FASTA file
Top-level output directory path. A 'transeq' subdirectory is created inside it
Data directory (leave this as "/root/data")

Output:

A directory called transeq containing the file canonical.fna, which holds the assembly with its contigs reordered and, where needed, reverse-complemented. The alignments.fasta file in this directory should be used for tree-building.

Algorithm used by canonicalize_contigs.sh

First, build a model from a trusted sequence, such as one from RefSeq or GenBank (for now, this is the RefSeq SARS-CoV-2 genome, but more models can easily be built):

Obtain sequence
Use transeq to get a 6-frame translation of the whole genome
Use Pfam to annotate translations
Sort Pfam hits by alignment start base
Create a two-column file associating each Pfam model name with its sort order
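
The last two steps above (sorting hits by alignment start and writing the model-to-order file) can be sketched with standard tools. This is a minimal illustration, not the code used by canonicalize_contigs.sh. It assumes the Pfam hits have already been reduced to two whitespace-separated columns, model name and alignment start base. The model names and coordinates below are placeholders, and real Pfam output would first need the relevant fields extracted.

```shell
#!/bin/sh
# Toy Pfam hits: model name and alignment start base.
# Both the values and the two-column layout are hypothetical.
hits='Model_C 21500
Model_A 300
Model_B 13400'

# Read "model start" lines on stdin, sort numerically by start base,
# and write "model<TAB>rank", where rank is the 1-based sort order.
model_order() {
    sort -k2,2n | awk '{ printf "%s\t%d\n", $1, NR }'
}

printf '%s\n' "$hits" | model_order
```

With the toy input, Model_A receives order 1, Model_B order 2, and Model_C order 3, which gives the expected left-to-right arrangement of models along the reference genome.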

Now, analyze the assembly: