Gitter

SIDEARM - Your weapon for viral discovery in the NCBI SRA database

Sidearm searches the SRA database for viruses using the NCBI magicBLAST tool. It generates a table describing the number of alignments to each virus and various metrics such as the sequence coverage and average depth. The reads aligning to virus are assembled into viral contigs to attempt to generate complete viral genomes.

Installation

Clone the repository

git clone https://github.com/NCBI-Hackathons/Virus_Detection_SRA

Dependencies

Install the following:

Example Workflow

This example uses an RNA-seq dataset (SRR1553459) from an Ebola virus outbreak. After cloning this repository, do the following:

cd Virus_Detection_SRA/cwl/tools
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/848/505/GCF_000848505.1_ViralProj14703/GCF_000848505.1_ViralProj14703_genomic.fna.gz
gunzip GCF_000848505.1_ViralProj14703_genomic.fna.gz
makeblastdb -dbtype nucl -in GCF_000848505.1_ViralProj14703_genomic.fna -out ebolazaire -parse_seqids
export BLASTDB=$BLASTDB:`pwd`

These steps downloaded the Ebola virus genome and uncompressed it. Using the Ebola virus genome, a BLAST database was created with makeblastdb. Then your local directory was added to the BLASTDB environmental variable.

sidearm.cwl sidearm.SRR1553459.ebola.yml

This steps runs Sidearm and generates the following primary output files.

  • SRR1553459.ebolazaire.bam - magicblast alignments to Ebola virus (sorted BAM file)
  • SRR1553459.ebolazaire.bam.summarize.tsv - aggregate information about the alignments to each reference sequence (in this case it is only Ebola virus)
  • SRR1553459.ebolazaire.bam.fa.trim.fa.assembly.fa - the contigs generated by the assembly of all sequences that aligned to the reference sequences.

The log files for the trim and assembly modules are also created

  • SRR1553459.ebolazaire.bam.fa.trim.log - the trim module logfile
  • SRR1553459.ebolazaire.bam.fa.trim.fa.assembly.log - the assembly module logfile

Expected Results

Open the summarize.tsv file in a spreadsheet program. The number of alignments to Ebola virus should be ~15,000 (column 'aligns'), sequence coverage ~98% (column 'seqcov'), and average depth ~75 (column 'avgdepth'). The longest contig is ~12,500bp and a BLASTN search shows that it is Ebola virus.

The avg (average) fields are the average MAPQ (or Score or EditDist) across the alignments for each subject in the BAM file. MAPQ, Score and EditDist are taken from field 5, the AS flag, and the NM flag in the BAM file, respectively.

Example workflow for all NCBI RefSeq viruses

cd Virus_Detection_SRA/cwl/tools
wget ftp://ftp.ncbi.nlm.nih.gov/refseq/release/viral/viral.1.1.genomic.fna.gz
gunzip viral.1.1.genomic.fna.gz
makeblastdb -dbtype nucl -in viral.1.1.genomic.fna -out viral.1.1.genomic -parse_seqids
export BLASTDB=$BLASTDB:`pwd`

Edit sidearm.SRR1553459.ebola.yml with the following changes and save it as sidearm.SRR073726.viral11genomic.yml:

  • srr: SRR073726
  • blastdb: viral.1.1.genomic
  • path: viral.1.1.genomic.fna

Then execute Sidearm with the command line below. Depending on your computer, this will take about 1 hour.

sidearm.cwl sidearm.SRR073726.viral11genomic.yml

Expected results

Report of alignments (summarize.tsv)

id vname vlen seqcov avgdepth aligns avgMAPQ avgScore avgEditDist
NC_032111.1 BeAn 58058 virus 163005 0.78 6.7 51,012 255 22.7 0.34
NC_001357.1 HPV18 7857 22.2 130.7 26,067 255 39.1 0.05

The longest contig should be ~600bp. However, obtaining this result depends on the software and reference sequences versions that you used. A BLASTN search with the longest contig sequence should show that it is Human Papillomavirus 18.

Troubleshooting

  • Please submit an issue if you run into any problems installing or running this software.